HPC AI500: Representative, Repeatable and Simple HPC AI Benchmarking
Zihan Jiang, Wanling Gao, Fei Tang, Xingwang Xiong, Lei Wang, Chuanxin Lan, Chunjie Luo, Hongxiao Li, Jianfeng Zhan
Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
International Open Benchmark Council (BenchCouncil)
{jiangzihan, gaowanling, tangfei, wanglei2011, xiongxingwang, lanchuanxin, luochunjie, lihongxiao, zhanjianfeng}@ict.ac.cn

Abstract.
Recent years witness a trend of applying large-scale distributed deep learning algorithms (HPC AI) in both business and scientific computing areas, whose goal is to speed up the training time to achieve a state-of-the-art quality. HPC AI benchmarks accelerate this process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges. This paper presents a representative, repeatable and simple HPC AI benchmarking methodology. Among the seventeen AI workloads of AIBench Training, by far the most comprehensive AI training benchmark suite, we choose two representative and repeatable AI workloads. The selected HPC AI benchmarks cover both business and scientific computing: Image Classification and Extreme Weather Analytics. To rank HPC AI systems, we present a new metric named Valid FLOPS, emphasizing both throughput performance and a target quality. The specification, source code, datasets, and HPC AI500 ranking numbers are publicly available. Keywords:
HPC AI, Distributed Deep Learning, Benchmarking, Metric
The massive success of AlexNet [1] in the ImageNet [2] competition marks the booming success of deep learning (DL) in a wide range of commercial application areas. In fields like image recognition and natural language processing, DL achieves unprecedented accuracy, even outperforming ordinary human beings' capability. Though it is much more challenging to obtain high-quality labeled scientific data sets, there is an increasing trend of applying DL in scientific computing areas [3-5]. With massive training data available, recent years witness a trend of applying distributed DL algorithms at scale in commercial and scientific computing areas.

§ Jianfeng Zhan is the corresponding author.
Motivated by these emerging HPC AI workloads, the HPC community is interested in building HPC AI systems to reduce time-to-quality, which depicts the training time needed to achieve a target quality (e.g., accuracy). For example, several state-of-the-practice HPC AI systems [3, 5] are built to tackle enormous AI challenges. Benchmarks accelerate this process [6, 7], as they provide not only design inputs but also evaluation and optimization metrics and methodology [7-9]. However, there are several challenges in benchmarking HPC AI systems.

The first challenge is how to be both representative and simple, two essential properties that past successful benchmarking practices establish. On the one hand, the SPEC CPU [10], PARSEC [11], and TPC benchmarks, like TPC-DS [12], emphasize the paramount importance [8] of the benchmarks being representative and diverse, as no single benchmark or metric measures the performance of computer systems on all applications [13]. On the other hand, TOP500 [14] establishes the de facto supercomputer benchmark standard in terms of simplicity. Simplicity has three implications: first, the benchmark is easy to port to a new system or architecture; second, the benchmarking cost is affordable for measuring systems at scale; third, the metric is not only linear, orthogonal, and monotonic [14], but also easily interpretable and understandable.

In the AI domain, there are massive AI tasks and models with different performance metrics. For example, by far the most comprehensive and representative AI benchmark suite, AIBench [8, 9], contains seventeen AI tasks. It is not affordable to implement so many benchmarks and further perform benchmarking at scale. What are the criteria for deciding the representative and simple benchmarks that can measure HPC AI systems fairly and objectively?

Second, a benchmark mandates being repeatable, while AI's nature is stochastic, allowing multiple different but equally valid solutions [7].
Previous work manifests HPC AI's uncertainty by run-to-run variation in terms of epochs-to-quality and the effect of scaling training on time-to-quality [7]. None of the previous HPC AI benchmarks simultaneously achieves representativeness, repeatability, and simplicity. They either are not representative [15] or even irrelevant to HPC AI workloads in terms of kernel functions [16], or overlook the differences of HPC AI workloads between scientific and business computing [7].

This paper presents HPC AI500, a comprehensive HPC AI benchmarking methodology with tools and metrics. Compared to our previous position paper [17], this paper proposes a brand-new benchmarking methodology that simultaneously achieves representativeness, repeatability, and simplicity. We quantify the characteristics of AI models and micro-architecture and perform further randomness analysis. Among the seventeen AI workloads of AIBench Training, we choose two representative and repeatable benchmarks, Image Classification (business computing) and Extreme Weather Analytics (EWA, scientific computing), to measure HPC AI systems. Image Classification and EWA achieve state-of-the-art quality on the ImageNet dataset (business computing) and the EWA dataset (scientific computing), respectively. These two benchmarks represent two clusters of AI benchmarks of AIBench Training from the perspectives of computing areas, diversity of
model complexity, computational cost, and micro-architecture characteristics. To rank HPC AI systems, we propose Valid FLOPS, emphasizing the vital importance of achieving a state-of-the-art quality, together with an additional metric, time-to-quality. Our benchmarks simultaneously achieve representativeness, repeatability, and simplicity.
Fig. 1:
Against the FP32 implementation, the mixed-precision version speeds up the FLOPS of two micro benchmarks (Convolution and GEMM) and ResNet-50 by more than 2x. Still, it incurs an accuracy loss for ResNet-50 when the system scale increases: 0.12% at one node and about 1% at both 4 and 8 nodes.
Fig. 2:
The kernel function breakdown of the 17 representative AI workloads from AIBench Training [8]; it indicates that LU factorization is not a representative kernel.
TOP500 [14] defines two distinctive characteristics of the de facto supercomputer benchmark standard: simple and scalable. We have discussed the implications of simplicity in the previous section. Scalable means the benchmarks can measure systems of different scales. In the AI domain, there are massive AI tasks and models with different performance metrics. For example, AIBench Training [8] contains seventeen representative AI tasks, covering a diversity of AI problem domains. It is not affordable for HPC AI benchmarking to implement so many benchmarks and further
perform benchmarking at scale. The traditional micro or kernel benchmarking methodology, widely used in the HPC community, can lead to misleading conclusions, as mixed-precision optimizations indeed improve the FLOPS of a micro benchmark like convolution while significantly impacting the time-to-quality of an AI task like Image Classification. Fig. 1 shows that the mixed-precision implementation increases the FLOPS of both micro and component benchmarks but incurs an accuracy loss as the system scale increases.

The representativeness of a benchmark indicates that it must measure the peak performance and price/performance of systems when performing typical operations within that problem domain [18]. A micro benchmark like HPL-AI [16], which only contains LU decomposition, is affordable and permits a fair comparison of competing systems by isolating hardware and software from statistical optimizations [7]. However, we found it cannot represent most of the AI workloads in AIBench. As shown in Fig. 2, the dominant kernel functions are convolution and matrix multiplication.
This section introduces our benchmarking methodology. We first conduct a series of experiments to show why our methodology guarantees representativeness, simplicity, and repeatability (Sec. 3.1 and Sec. 3.2). Then we finalize the HPC AI500 benchmark decision considering these analyses and the additional requirements of the HPC field (Sec. 3.5).
We choose AIBench Training [8, 9], the most comprehensive AI benchmark by far, as the starting point for the design and implementation of HPC AI benchmarks. The experimental results of AIBench Training [8] have demonstrated that the seventeen AI tasks are diverse in terms of model complexity, computational cost, convergence rate, and microarchitecture characteristics, covering the most typical AI scenarios. To achieve representativeness, we identify the most typical workloads in AIBench Training from both microarchitecture-independent and microarchitecture-dependent perspectives.

From the microarchitecture-dependent perspective, we choose five micro-architectural metrics to profile the computation and memory access patterns of AIBench Training, including achieved occupancy, ipc efficiency, gld efficiency, gst efficiency, and dram utilization [19]. A GPU contains multiple streaming multiprocessors (SMs); each SM has a certain number of CUDA cores, registers, caches, warp schedulers, etc. Achieved occupancy represents the ratio of the average active warps per active cycle to the maximum number of warps provided by a multiprocessor. Ipc efficiency indicates the ratio of the executed instructions per cycle to the theoretical number. Gld efficiency and gst efficiency represent the ratio of requested global memory load/store throughput to required global memory load/store throughput, respectively.
(a) The microarchitecture-dependent clustering of AIBench Training.
(b) The microarchitecture-independent clustering of AIBench Training. Image-to-Image and Image Generation are not clustered due to the lack of widely accepted metrics to determine the end of a training session. NAS [21] probabilistically searches the network structure in each training session, resulting in an unstable computational cost (FLOPs); therefore, NAS is not included in the clustering result.
Fig. 3:
The microarchitecture-dependent (Fig. 3a) and microarchitecture-independent (Fig. 3b) clustering of AIBench Training. The x-axis and y-axis are the positions in the Euclidean space after using the t-SNE technique for visualization.
We profile the above five metrics and perform K-means clustering on all seventeen benchmarks to explore their similarities through experiments on our TITAN XP GPUs. Note that the operating system is Ubuntu 16.04 with Linux kernel 4.4, and the other software versions are PyTorch 1.10, Python 3.7, and CUDA 10. We further use t-SNE [20], a dimensionality reduction technique that embeds high-dimensional data in a low-dimensional space, for visualization. Fig. 3a shows the result. The x-axis and y-axis are the positions in the Euclidean space after using t-SNE to process the above five metrics. We find that these seventeen benchmarks are clustered into three classes.

From the microarchitecture-independent perspective, we analyze the algorithm behaviors, including model complexity (parameter size) and convergence rate (epochs to achieve the state-of-the-art quality), and system-level behaviors, including computational cost (FLOPs), for all seventeen workloads in AIBench Training. Further, we conduct a clustering analysis using these microarchitecture-independent performance data as input. Fig. 3b shows the clustering result. Combining Fig. 3a and Fig. 3b, we conclude that the AIBench Training workloads consistently cluster into three classes using both microarchitecture-dependent and microarchitecture-independent approaches.
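The clustering pipeline described above (five profiled metrics per workload, K-means with three clusters, t-SNE for 2-D visualization) can be sketched as follows. The metric values here are random placeholders standing in for the profiled data, not our measurements:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Placeholder profiles for the 17 AIBench workloads: one row per workload,
# columns are (achieved_occupancy, ipc_efficiency, gld_efficiency,
# gst_efficiency, dram_utilization).
rng = np.random.default_rng(0)
profiles = rng.random((17, 5))

# Standardize so no single metric dominates the Euclidean distance.
X = StandardScaler().fit_transform(profiles)

# K-means with k=3, matching the three classes found in the paper.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# t-SNE embeds the 5-D metric space into 2-D for the scatter plots of Fig. 3.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
```

With real profiles, plotting `embedding` colored by `labels` reproduces the style of Fig. 3a; the same pipeline applied to the microarchitecture-independent features yields Fig. 3b.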
Repeatability [22] refers to the variation in repeated measurements (different runs rather than different epochs, using the same benchmark implementation under identical configurations) made on the same system under test. A good benchmark must be repeatable. Thus, repeatability is another critical criterion for selecting workloads for the HPC AI500 benchmarks. However, AI's nature is stochastic [7] due to the random seed, random data traversal, the non-commutative nature of floating-point addition, etc., which is hard to avoid. Thus, most AI benchmarks exhibit run-to-run variation, even using the same benchmark implementation on the same system. Therefore, we ensure repeatability by choosing relatively stable workloads among the various AI tasks. We perform a repeatability analysis using all workloads of AIBench Training on TITAN RTX GPUs. The rest of the experimental environment is the same as in Section 3.1.

To eliminate the influence of randomness as much as possible, we fix the hyperparameters for each benchmark, i.e., batch size, learning rate, optimizer, and weight decay, and repeat each benchmark at least four times (at most ten times) to measure the run-to-run variation. Note that our evaluation uses a random seed and does not fix the initial seed, except for Speech Recognition. We use the coefficient of variation (the ratio of the standard deviation to the mean) of the training epochs needed to achieve a target quality to represent the run-to-run variation. Table 1 presents the run-to-run variation of the seventeen workloads of AIBench. As we see, the AI benchmarks vary wildly in terms of run-to-run variation. According to Table 1, the most random workloads are Video Prediction, Text Summarization, and Image-to-Text, whose variations reach 38.46%, 24.72%, and 23.52%, respectively. For Speech Recognition, even sharing the same initial seed, the run-to-run variation still reaches 12.08%.
In contrast, Object Detection, Image Classification, and Learning-to-Rank are the three most repeatable workloads, with variations of 0%, 1.12%, and 1.90%, respectively. In Section 3.1, these three workloads are consistently classified into the three classes using both the microarchitecture-dependent approach (Fig. 3a) and the microarchitecture-independent approach (Fig. 3b). Overall, Image Classification, Learning-to-Rank, and Object Detection achieve both representativeness and repeatability.
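The run-to-run variation measure used above, the coefficient of variation of epochs-to-quality, translates directly into code. The run counts and epoch numbers below are illustrative, not the paper's measurements:

```python
import statistics

def run_to_run_variation(epochs_to_quality):
    """Coefficient of variation: stdev / mean of the epochs needed to
    reach the target quality across repeated runs of one benchmark.
    (Population stdev over the observed runs is assumed here.)"""
    mean = statistics.mean(epochs_to_quality)
    stdev = statistics.pstdev(epochs_to_quality)
    return stdev / mean

# Illustrative: a stable workload (five runs) vs. an unstable one (four runs).
print(f"{run_to_run_variation([89, 90, 90, 91, 90]):.2%}")  # small variation
print(f"{run_to_run_variation([40, 55, 70, 95]):.2%}")      # large variation
```

A workload whose runs always converge in the same number of epochs scores 0%, matching the Object Detection entry in Table 1.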
Simplicity is another important criterion for benchmarking. However, benchmarking an entire training session of all seventeen workloads in AIBench Training is extremely expensive, taking up to 10 days according to Tang et al. [8]. We emphasize that Image Classification, Object Detection, and Learning-to-Rank achieve not only representativeness and repeatability, but also simplicity.
Compared with AI benchmarks for other domains, there are two unique differences in HPC AI benchmarking. First, the challenges of HPC AI benchmarking
Table 1:
Run-to-run Variation of the Seventeen Benchmarks of AIBench Training. Note that the Image-to-Image and Image Generation variations are not reported due to the lack of a widely accepted metric to terminate an entire training session.

No.        Component Benchmark          Variation       Runs
DC-AI-C1   Image classification         1.12%           5
DC-AI-C2   Image generation             Not available   N/A
DC-AI-C3   Text-to-Text translation     9.38%           6
DC-AI-C4   Image-to-Text                23.53%          5
DC-AI-C5   Image-to-Image               Not available   N/A
DC-AI-C6   Speech recognition           12.08%          4
DC-AI-C7   Face embedding               5.73%           8
DC-AI-C8   3D Face Recognition          38.46%          4
DC-AI-C9   Object detection             0%              10
DC-AI-C10  Recommendation               9.95%           5
DC-AI-C11  Video prediction             11.83%          4
DC-AI-C12  Image compression            22.49%          4
DC-AI-C13  3D object reconstruction     16.07%          4
DC-AI-C14  Text summarization           24.72%          5
DC-AI-C15  Spatial transformer          7.29%           4
DC-AI-C16  Learning to rank             1.90%           4
DC-AI-C17  Neural Architecture Search   6.15%           6

inherit from the complexity of benchmarking scalable hardware and software systems at scale, i.e., tens of thousands of nodes, which is significantly different from IoT [23] or datacenter [24] benchmarking. On this point, we need to make the benchmark as simple as possible, as discussed in detail above. Second, the HPC AI domain covers both commercial and high-performance scientific computing. Currently, business applications are pervasive. Because of the difficulty of recruiting qualified scientists to label scientific data, AI for science applications lag behind but are promising. In general, scientific data are often more complicated than the MNIST or ImageNet data: the shape of scientific data can be 2D images or higher-dimensional structures with hundreds of channels, while popular commercial image data like ImageNet often consist of only RGB channels [17]. So we should include scientific data in the HPC AI benchmarks.
Computation complexity
A benchmark with a small amount of computation cannot fully utilize the performance of an HPC AI system. Therefore, we exclude Learning-to-Rank because it has the lowest computation complexity, only 0.08 MFLOPs for a single forward computation. According to [8], Image Classification and Object Detection are more complicated than that by one and two orders of magnitude, respectively.
Based on the existing analysis, we conclude that Image Classification and Object Detection are the final candidates for constructing the HPC AI500 benchmark. We investigate the broad applications of Image Classification and Object Detection in both the HPC [3, 4, 25, 26] and commercial [27-30] fields. We choose the most representative workloads and data sets from these two fields. The details of the datasets and adopted models are introduced in Sec. 4.
EWA is one of the pioneering works that use a deep learning algorithm to replace rules predefined by human experts, and it achieves excellent results [4]. Most importantly, EWA's goal is to identify various extreme weather patterns (e.g., tropical depression), which is essentially Object Detection. In 2018, a deep learning-based EWA implementation [3] won the Gordon Bell Prize, the first AI application to win this award.
Image Classification, a fundamental task in AI research, is widely used in many commercial applications. With the development of large-scale deep learning, Image Classification has become a well-known showcase for optimizing HPC AI systems [27, 28, 30].
In this section, we introduce the details of the HPC AI500 benchmarks, including the models, datasets, and target qualities (Sec. 4.1), the reference implementation (Sec. 4.2), and the proposed VFLOPS metric (Sec. 4.3).
The EWA dataset [4] is made up of 26 years of climate data. The data of each year is available as one HDF5 file. Each HDF5 file contains two data sets: images and boxes. The images data set has 1460 example images (4 per day, 365 days per year) with 16 channels. Each channel is 768 * 1152, corresponding to one measurement per 25 square km on earth. The boxes data set records the coordinates of the four kinds of extreme weather events in the corresponding images: tropical depression, tropical cyclone, extratropical cyclone, and atmospheric river.
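Reading one of these yearly HDF5 files might look like the following sketch with h5py. To keep the snippet self-contained, a tiny synthetic stand-in file with shrunken shapes is created first; real files hold 1460 images of shape 16 x 768 x 1152, and the box layout shown is an assumption for illustration:

```python
import numpy as np
import h5py

# Synthetic stand-in for one year of EWA data (shapes shrunk for brevity).
with h5py.File("ewa_sample.h5", "w") as f:
    f.create_dataset("images", data=np.zeros((4, 16, 48, 72), dtype=np.float32))
    # One placeholder box per image: (class, ymin, xmin, ymax, xmax) assumed.
    f.create_dataset("boxes", data=np.zeros((4, 1, 5), dtype=np.float32))

# The two data sets named in the text: "images" and "boxes".
with h5py.File("ewa_sample.h5", "r") as f:
    images = f["images"][...]   # (examples, channels, height, width)
    boxes = f["boxes"][...]     # extreme-weather event bounding boxes

print(images.shape)
```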
Model.
Faster-RCNN targets real-time Object Detection [31]. As one of the latest models in the RCNN family [32, 33], it deprecates the selective search used in previous RCNN versions. Instead, Faster-RCNN proposes a dedicated convolutional neural network, named the region proposal network (RPN), to achieve nearly cost-free region proposals. With this design, Object Detection is much faster. As a result, Faster-RCNN won the 1st-place entries in ILSVRC'15 (the ImageNet Large Scale Visual Recognition Competition).
Target Quality.
The target quality is mAP@[IoU=0.5] = 0.35, which is our best training result. mAP means mean average precision, a dedicated metric for object detection. IoU means intersection over union, which measures how much the predicted boundary overlaps with the ground truth.
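As an illustration of the IoU part of this metric, a minimal computation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Under mAP@[IoU=0.5], a prediction counts as correct only if it overlaps
# the ground-truth box with IoU >= 0.5; the boxes below overlap with IoU 1/3.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```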
Image Classification Dataset.
ImageNet [2] is a large visual database designed for use in visual object recognition research. More than 14 million images have been hand-annotated according to the WordNet hierarchy. Both the original images and bounding boxes are provided. The data size is more than 100 GB.
Model.
ResNet is a milestone in Image Classification [34], marking the ability of AI to identify images beyond humans in a particular domain. The spirit of ResNet is its success in reducing the negative impact of the degradation problem: in a very deep neural network, the gradient gradually vanishes during back-propagation, leading to poor performance. With ResNet, it is therefore possible to build a deeper convolutional neural network and achieve higher accuracy. Researchers successfully built a ResNet with 152 layers. This ultra-deep model won all the awards in ILSVRC'15.
Target Quality.
The target quality is Top-1 Accuracy = 0.763.

Table 2:
The Datasets Summary of the HPC AI500 Benchmarks

Dataset                          Channels   Resolution   Size
The Extreme Weather Dataset [4]  16         768*1152     558 GB
ImageNet Dataset [2]             3          256*256      137 GB
The reference implementation of the HPC AI500 benchmark is summarized in Table 3. At present, we provide implementations using TensorFlow [35], a popular deep learning framework in the HPC community. For communication, we adopt Horovod [36] instead of TensorFlow's default gRPC protocol, which does not scale to large clusters [37] due to the limitations of its master-slave architecture and socket-based communication. Horovod is a library originally designed for scalable distributed deep learning using TensorFlow. It implements all-reduce operations using ring-based algorithms [38] and other highly efficient communication algorithms widely used in the traditional HPC community.
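To illustrate the ring-based all-reduce that Horovod relies on, here is a pure-Python simulation (a didactic sketch, not the NCCL implementation) of its two phases over scalar chunks:

```python
def ring_allreduce(workers):
    """Simulate ring all-reduce among n workers, each holding n chunks.

    Reduce-scatter: in step s, rank r sends chunk (r - s) mod n to rank
    r + 1, which accumulates it; after n - 1 steps, rank r owns the fully
    reduced chunk (r + 1) mod n. All-gather: the reduced chunks circulate
    around the ring until every rank holds all of them. Per worker, only
    2 * (n - 1) / n of the data crosses the wire, independent of n, which
    is what makes the ring algorithm scalable.
    """
    n = len(workers)
    data = [list(w) for w in workers]
    # Reduce-scatter phase: snapshot the sends, then apply them, so all
    # ranks logically communicate simultaneously within a step.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, idx, val in sends:
            data[(r + 1) % n][idx] += val
    # All-gather phase: forward the already-reduced chunks.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, val in sends:
            data[(r + 1) % n][idx] = val
    return data

# Three workers, each with three gradient chunks; all end with the sums.
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```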
We propose Valid FLOPS (in short, VFLOPS) to quantify the valid performance, considering both the system throughput and the model quality. The goal of this metric is to impose a penalty on failing to achieve a target quality. VFLOPS is calculated as follows:

VFLOPS = FLOPS * penalty_coefficient    (1)

The penalty coefficient is used to penalize or reward the FLOPS if the achieved quality is lower or higher than the target quality. It is defined as follows:

penalty_coefficient = (achieved_quality / target_quality)^n    (2)

Here, achieved_quality represents the actual model quality achieved in the evaluation, and target_quality is the state-of-the-art model quality that we predefine in our benchmarks. The value of n is a positive integer, which we use to define the sensitivity to model quality: the higher n, the greater the loss for a quality drop. EWA has a much more stringent quality requirement than Image Classification; we set n to 10 for EWA and 5 for Image Classification by default.

Previous work [7, 8] shows most AI tasks are stochastic in terms of the training epochs needed to achieve a specified target quality. However, for training on a given system, the FLOPS is fixed. According to Equation 1 and Equation 2, for an AI training workload with fixed epochs, VFLOPS is only related to the achieved quality. So the variance of the achieved quality decides the repeatability of VFLOPS. We have conducted a thorough analysis of the run-to-run variation of AIBench in Sec. 3.2 and further selected the most repeatable workloads to assure the repeatability of VFLOPS.

Table 3:
HPC AI500 Benchmark Suite.

Problem Domains       Models         Datasets  Target Quality        AI Frameworks  Comm Lib  AI Acc Lib          Epochs
EWA                   Faster-RCNN    EWA       mAP@[IoU=0.5]=0.35    TensorFlow     Horovod   CUDA, cuDNN, NCCL   50
Image Classification  ResNet50 v1.5  ImageNet  Top-1 Accuracy=0.763  TensorFlow     Horovod   CUDA, cuDNN, NCCL   90

Comm Lib refers to the communication libraries. AI Acc Lib refers to the AI accelerator libraries.

This section presents a case study using the HPC AI500 benchmark. We perform a series of scaling experiments on an HPC AI system using the HPC AI500 benchmark suite to show the scalability of the reference implementation (Sec. 5.2). We provide an analysis to illustrate why EWA and Image Classification have different parallel efficiencies (Sec. 5.3). Finally, we publish a VFLOPS ranking list using Image Classification to show the simplicity of this metric (Sec. 5.4).
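Equations 1 and 2 translate directly into code; the example numbers below are illustrative:

```python
def vflops(flops, achieved_quality, target_quality, n):
    """Valid FLOPS (Eq. 1 and Eq. 2): raw FLOPS scaled by a penalty
    coefficient that punishes missing the target quality and rewards
    beating it; a larger n makes the metric more quality-sensitive."""
    penalty = (achieved_quality / target_quality) ** n
    return flops * penalty

# Image Classification uses n=5; a hypothetical run reaching 0.755 Top-1
# accuracy instead of the 0.763 target loses about 5% of its measured FLOPS.
print(vflops(939e12, 0.755, 0.763, n=5) / 1e12)
```

Exactly meeting the target gives a penalty coefficient of 1, so VFLOPS then equals the measured FLOPS.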
The experiments are conducted on an HPC AI system consisting of eight nodes, each of which is equipped with one Intel(R) Xeon(R) Platinum 8268 CPU and eight NVIDIA Tesla V100 GPUs. Each GPU in the same node has 32 GB
HBM memory, connected by NVIDIA NVLink, a high-speed GPU interconnect whose theoretical peak bi-directional bandwidth is 300 GB/s. The nodes are connected by Ethernet with a bandwidth of 10 Gb/s. Each node has 1.5 TB of system memory and an 8 TB NVMe SSD.

The details of the architecture of each NVIDIA Tesla V100 GPU, the NVIDIA Volta architecture, are as follows. The NVIDIA Volta architecture is equipped with 640 Tensor Cores to accelerate GEMM and convolution operations. Each Tensor Core performs 64 floating-point fused-multiply-add (FMA) operations per clock, delivering up to 125 TFLOPS of theoretical peak performance. When performing mixed-precision training with Tensor Cores, we use FP16 for calculation and FP32 for accumulation.

We use TensorFlow v1.14, compiled with the CUDA v10.1 and cuDNN v7.6.2 backend. We use Horovod v0.16.4 for synchronous distributed training, compiled with OpenMPI v3.1.4 and NCCL v2.4.8. NCCL is short for the NVIDIA Collective Communications Library. We use NVProf [19] to measure the FLOPs.
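The quoted 125 TFLOPS peak can be recovered from these architectural numbers, assuming a boost clock of about 1.53 GHz (our assumption; the clock rate is not stated above):

```python
tensor_cores = 640
fma_per_core_per_clock = 64   # each FMA counts as 2 floating-point operations
boost_clock_hz = 1.53e9       # assumed V100 boost clock, not given in the text

peak_flops = tensor_cores * fma_per_core_per_clock * 2 * boost_clock_hz
print(peak_flops / 1e12)  # roughly 125 TFLOPS
```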
Fig. 4:
The scaling experiments of Extreme Weather Analytics (EWA) and Image Classification (IC): (a) IC (FP32), (b) IC (Mixed), (c) EWA (FP32), (d) IC (FP32+Compression), (e) IC (Mixed+Compression), (f) EWA (FP32+Compression).
Both the EWA and Image Classification experiments are scaled out from 8 GPUs to 64 GPUs. In addition to the original FP32 version, we also evaluate the performance of the mixed-precision and model compression versions, two frequently used optimizations in HPC AI applications [3, 28]. We take the 8-GPU experiments (single node) as the baseline. Our communication topology is the double binary tree, implemented by NCCL 2.4. We report the performance numbers of these experiments and perform further analysis. The scaling results are shown in Fig. 4.
Image Classification
For the FP32 implementation of Image Classification, the parallel efficiency is 0.91, 0.85, and 0.71 on 16, 32, and 64 GPUs, respectively. For the mixed-precision implementation, the parallel efficiency is slightly lower: 0.89, 0.82, and 0.67, respectively. There is a significant loss of parallel efficiency at a system scale of 64 GPUs. The reason is that when the system scales up to 64 GPUs (8 nodes), more data needs to be transmitted over the low-speed Ethernet, which reduces the parallel efficiency.

We also notice that communication compression does not improve the performance at system scales of 32 GPUs or less, because communication is not the bottleneck in those situations. However, at 64 GPUs, it contributes a lot. For the FP32 version, the performance improves from 345 to 414 TFLOPS. For the mixed-precision version, the performance improves from 718 to 939 TFLOPS. The highest performance of Image Classification that we achieve is 939 TFLOPS, through both mixed-precision optimization and communication compression, as shown in Fig. 4e.
EWA
For the FP32 implementation of EWA, the parallel efficiency is 0.50, 0.37, and 0.36 at system scales of 16, 32, and 64 GPUs, respectively. As a communication-intensive workload, EWA benefits considerably from communication compression. When communication compression is used, the performance gain persists as the scale increases from 8 to 16, 32, and 64 GPUs, with speedups of 1.2, 1.4, 1.6, and 1.5, respectively. The highest performance of EWA achieved through communication compression is 109 TFLOPS.
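Parallel efficiency as used in this section is the speedup over the single-node baseline divided by the ideal speedup; a sketch with illustrative throughput numbers:

```python
def parallel_efficiency(throughput, baseline_throughput, scale_factor):
    """Measured speedup over the baseline, divided by the ideal speedup
    (the factor by which the GPU count grew)."""
    return (throughput / baseline_throughput) / scale_factor

# Illustrative: baseline throughput on 8 GPUs, measured again on 64 GPUs
# (numbers chosen to reproduce the 0.71 efficiency reported for IC FP32).
baseline = 100.0    # e.g. TFLOPS on 8 GPUs
measured = 568.0    # TFLOPS on 64 GPUs
eff = parallel_efficiency(measured, baseline, 64 / 8)  # close to 0.71
```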
We find that EWA and Image Classification have different parallel efficiencies due to their distinct communication bandwidth consumption. As shown in Fig. 5, we measure the communication bandwidth consumption of the FP32 implementations of EWA and Image Classification. EWA consumes much more communication bandwidth than Image Classification, showing that EWA is a communication-intensive workload. In contrast, Image Classification is a computation-intensive workload.
The VFLOPS metric emphasizes both performance and quality. We have published the VFLOPS ranking of Image Classification on the HPC AI500 ranking website. Fujitsu won first place by achieving 31.41 VPFLOPS. Meanwhile, an additional metric, time-to-quality, is also reported. Generally, our metric is intuitive and straightforward.
Fig. 5:
The distinctive communication bandwidth consumption of the FP32 implementations of EWA and Image Classification. The bandwidth refers to the aggregated communication bandwidth in the system, including Ethernet communication between nodes and NVLink communication between GPUs within a node.
We review the recent efforts in HPC AI benchmarking in chronological order.

HPC AI500 (V1.0) (2018) [17] is the first HPC AI benchmark based on real-world scientific datasets, covering high energy physics, cosmology, and extreme weather analytics. It directly extracts three benchmarking scenarios from representative HPC AI applications, without a systematic benchmarking methodology to produce a representative, repeatable, and simple benchmark suite.

HPL-AI (2019) [16] is designed for 32-bit and even lower floating-point precision AI computing. It uses mixed-precision LU decomposition as the core algorithm. As a micro benchmark, HPL-AI enables repeatable evaluation and is easily ported to different systems. However, it cannot provide model-level evaluation of an entire training session and lacks relevance to the AI field. As discussed in Sec. 2, its benchmark results can be misleading.

Deep500 (2019) [15] is a reproducible, customizable benchmarking infrastructure for high-performance deep learning. It has four abstraction levels to provide a full-stack evaluation, and it provides a simple reference implementation based on small datasets and models. Deep500 can customize workloads but lacks benchmark specifications. It is more a framework than a concrete benchmark.
This paper proposes a representative, repeatable, and simple HPC AI benchmarking methodology. We analyze the representativeness and repeatability of AIBench. After further considering the HPC field's additional requirements, we build the HPC AI500 benchmark suite, containing the two most representative and repeatable workloads, namely Extreme Weather Analytics (EWA) and Image Classification. We propose Valid FLOPS to rank the performance of HPC AI systems. We evaluate an HPC AI system using the HPC AI500 benchmarks, showing the reference implementation's scalability. Also, we publish a VFLOPS-based ranking list. The specification, source code, and HPC AI500 ranking numbers are publicly available. A full technical report is available from [39].
References
1. A. Krizhevsky et al., "Imagenet classification with deep convolutional neural networks," Communications of the ACM (CACM), vol. 60, no. 6, pp. 84–90, 2017.
2. J. Deng et al., "Imagenet: A large-scale hierarchical image database," in , pp. 248–255, 2009.
3. T. Kurth et al., "Exascale deep learning for climate analytics," in SC, pp. 649–660, 2018.
4. E. Racah et al., "Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events," Advances in Neural Information Processing Systems (NIPS), pp. 3402–3413, 2017.
5. W. Jia et al., "Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning," in SC, 2020.
6. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
7. P. Mattson et al., "Mlperf training benchmark," Proceedings of Machine Learning and Systems (SysML), vol. 2, pp. 336–349, 2020.
8. F. Tang et al., "AIBench Training: Balanced Industry-Standard AI Training Benchmarking," in , IEEE Computer Society, 2021.
9. W. Gao et al., "Aibench: towards scalable and comprehensive datacenter ai benchmarking," in Bench, 2018.
10. .
11. C. Bienia et al., "The parsec benchmark suite: Characterization and architectural implications," in PACT, pp. 72–81, 2008.
12. .
13. J. Gray, "Database and transaction processing performance handbook," 1993.
14. J. Dongarra et al., "Top500 supercomputer sites," Supercomputer, vol. 13, 1997.
15. T. Ben-Nun et al., "A modular benchmarking infrastructure for high-performance and reproducible deep learning," in IPDPS, 2019.
16. https://icl.bitbucket.io/hpl-ai/.
17. Z. Jiang et al., "Hpc ai500: a benchmark suite for hpc ai systems," in International Symposium on Benchmarking, Measuring and Optimization (Bench), pp. 10–22, Springer, 2018.
18. J. Gray, "The benchmark handbook for database and transaction systems," Morgan Kaufmann, San Mateo, 1993.
19. https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
20. N. Rogovschi et al., "t-distributed stochastic neighbor embedding spectral clustering," in International Joint Conference on Neural Networks, pp. 1628–1632, IEEE, 2017.
21. B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
22. C. Drummond, "Replicability is not reproducibility: nor is it good science," 2009.
23. C. Luo et al., "Comparison and benchmarking of ai models and frameworks on mobile devices," arXiv preprint arXiv:2005.05085, 2020.
24. W. Gao et al., "Aibench: Scenario-distilling ai benchmarking," arXiv preprint arXiv:2005.03459, 2020.
25. T. Kurth et al., "Deep learning at 15pf: supervised and semi-supervised classification for scientific data," in SC, pp. 1–11, 2017.
26. A. Mathuriya et al., "Cosmoflow: Using deep learning to learn the universe at scale," in SC, pp. 819–829, IEEE, 2018.
27. H. Mikami et al., "Imagenet/resnet-50 training in 224 seconds," arXiv preprint arXiv:1811.05233, 2018.
28. X. Jia et al., "Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
29. C. Ying et al., "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
30. T. Akiba et al., "Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes," arXiv preprint arXiv:1711.04325, 2017.
31. S. Ren et al., "Faster r-cnn: Towards real-time object detection with region proposal networks," IEEE TPAMI, no. 6, pp. 1137–1149, 2016.
32. R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
33. R. Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, pp. 580–587, 2014.
34. K. He et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
35. M. Abadi et al., "Tensorflow: A system for large-scale machine learning," in , pp. 265–283, 2016.
36. A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in tensorflow," arXiv preprint arXiv:1802.05799, 2018.
37. A. Mathuriya et al., "Scaling grpc tensorflow on 512 nodes of cori supercomputer," arXiv preprint arXiv:1712.09388, 2017.
38. http://research.baidu.com/bringing-hpc-techniques-deep-learning.
39. Z. Jiang et al., "Hpc ai500: The methodology, tools, roofline performance models, and metrics for benchmarking hpc ai systems," arXiv preprint arXiv:2007.00279, 2020.