Making Convolutions Resilient via Algorithm-Based Error Detection Techniques
Siva Kumar Sastry Hari, Michael B. Sullivan, Timothy Tsai, Stephen W. Keckler
NVIDIA Corporation
Abstract—The ability of Convolutional Neural Networks (CNNs) to accurately process real-time telemetry has boosted their use in safety-critical and high-performance computing systems. As such systems require high levels of resilience to errors, CNNs must execute correctly in the presence of hardware faults. Full duplication provides the needed assurance but incurs a prohibitive 100% overhead. Algorithmic techniques are known to offer low-cost solutions, but the practical feasibility and performance of such techniques have never been studied for CNN deployment platforms (e.g., TensorFlow or TensorRT on GPUs). In this paper, we focus on algorithmically verifying convolutions, which are the most resource-demanding operations in CNNs. We use checksums to verify convolutions, adding a small amount of redundancy, far less than full duplication. We first identify the challenges that arise in employing Algorithm-Based Error Detection (ABED) for convolutions in optimized inference platforms that fuse multiple network layers and use reduced-precision operations, and demonstrate how to overcome them. We propose and evaluate variations of ABED techniques that offer implementation complexity, runtime overhead, and coverage trade-offs. Results show that ABED can detect all transient hardware errors that might otherwise corrupt output and does so while incurring low runtime overheads (6-23%), offering at least 1.6× throughput to workloads compared to full duplication.

1 INTRODUCTION
Following recent improvements in the ability of Convolutional Neural Networks (CNNs) to perform complex tasks with high efficiency and accuracy, CNNs have made their way into safety-critical and High Performance Computing (HPC) systems. For example, autonomous vehicles (AVs) employ CNNs to perform complex tasks such as vehicle, cyclist, pedestrian, lane, road-sign, and free-space detection [4], [38]. HPC systems also employ CNNs for object classification and detection, image segmentation, and video analytics in application domains such as healthcare, climate analysis, and surveillance, often in real-time settings [22], [33]. As the compute throughput and power efficiency demands of CNN-based safety-critical and HPC systems are high, efficient platforms are being designed to meet the throughput demands within limited power budgets [26], [45]. For example, the recently released NVIDIA DRIVE AGX Xavier System-on-Chip and T4 GPU deliver up to 32 and 130 TOPS while consuming just 30 and 70 watts of power, respectively [33], [38].

Safety-critical and HPC systems must be designed to tolerate hardware errors such as those originating from hardware transient, intermittent, and permanent faults. Some market segments require systems to meet strict safety standards, such as the ISO 26262 functional safety standard for AVs [18]. This standard requires the system to be robust to single-point transient, intermittent, and permanent faults either by design or by coverage from safety procedures (such as ECC and parity). The level of robustness a hardware component must attain is determined by its Automotive Safety Integrity Level (ASIL). For ASIL D (the highest safety level), the system is required to be robust to ≥99% of faults; the requirements for ASIL C and B are 97% and 90%, respectively [30]. The rate of residual failures, measured in Failures In Time or FITs (where 1 FIT refers to 1 failure per billion hours), must also be ≤100 and ≤10 FIT for ASIL B/C and D, respectively. While the HPC market also demands high resilience, the requirements are not as rigorous as those of ISO 26262 [44].

With the increasing prevalence of CNNs in safety-critical and HPC systems, and given the resilience requirements of such systems, the correctness of CNNs must be assured in the presence of hardware faults for safety and standards compliance. Prior studies have analyzed the effects of hardware errors on CNN outputs and observed noticeable corruptions that must be mitigated to ensure safe operation [23], [40]. Processors deployed in such systems employ ECC and/or parity in major SRAM structures. This protection is typically not sufficient to meet the requirements of ASIL B, C, or D for all hardware error sources. Aggressive employment of ECC/parity on flip-flops and small SRAM buffers comes at an area cost and may still be insufficient for all error sources due to the error rate contribution from non-storage elements. For intermittent and permanent faults, non-storage elements contribute significantly to the total error rates in GPUs and DNN accelerators that dedicate significant chip area to logic [9]. Full hardware redundancy can provide the needed safety [2], [19], [46], but it reduces throughput by 2× or more, which is prohibitive for resource-constrained systems. The goal of this paper is to develop a low-cost CNN-specific resilience solution that allows the full system to meet the target markets' requirements while incurring far lower overheads than full duplication.

Over 90% of the computation during CNN inference and training is in convolutions [24]. Algorithmic methods have been devised to speed up the convolution operation. These methods include using General Matrix Multiplication (GEMM), Fast Fourier Transformation (FFT), and Winograd [45].
Fault tolerance approaches that leverage algorithm knowledge, known as algorithm-based fault tolerance (ABFT), have been shown to provide lower overheads for GEMM and FFT than full duplication [15], [17]. These techniques leverage the fact that these operations are linear and verify their correctness using a checksum-based approach. They compute checksums for input data, store them with the original data, perform the original and redundant computation, verify outputs, and possibly correct errors. The number of extra compute operations they introduce is a small fraction of the original computation. While prior ABFT implementations have achieved runtime overheads of about 20% for square matrices [13], our analysis shows that the overheads can be much higher for the non-square matrices commonly used in CNNs. In contrast, the ABED techniques we study incur low runtime overheads, offering throughput improvements to workloads compared to full duplication.

2 BACKGROUND
Full Hardware/Software Redundancy:
Traditional business-class systems (e.g., IBM Z Series machines [5]) employ expensive hardware-managed dual- or triple-modular redundancy schemes. In safety-critical systems, similar techniques are being employed to meet the highest levels of safety integrity requirements [2], [19], [46]. Software techniques have been explored that introduce redundancy at different granularities, including the process, GPU kernel, thread, and assembly-instruction level [12], [27], [47], but they all incur overheads that are too high for resource-limited real-time systems.
Algorithm-Based Fault Tolerance (ABFT):
ABFT techniques leverage algorithmic knowledge to detect and correct errors with very low overheads, and they have been heavily studied in the literature for GEMM and FFT [13], [15], [17], [25]. For GEMM, these approaches introduce row checksums for one of the matrices and column checksums for the other. When the matrix multiplication is performed with the checksums, an output matrix is produced with an extra row and column. Each of these extra values is expected to match the checksums of the rows and columns of the output matrix. In the event of an error, these techniques can localize the error and use redundant information to correct certain types of errors. Several algorithmic techniques have been proposed to increase the capability of correcting output matrix cell corruptions, with little emphasis on studying how low-level hardware faults manifest at that level [7], [48]. Hence it is unclear whether the benefits of the additional correction capability outweigh the costs of higher overhead.

A recent study showed that protecting GEMMs in CNNs via ABFT can provide high protection [10]. While no runtime overhead analysis was included, the paper acknowledged that GEMM kernels are tuned to fully use caches and registers, and adding an extra dimension for checksum storage would not only compromise execution time but also significantly increase data movement and memory latency. Our experiments confirm this hypothesis and show that ABFT's overheads for the non-square matrices commonly used in CNNs are high.
[Figure 1 shows the input feature maps I (N × C × H × W), the filters F (K × C × S × R), and the output feature maps O (N × K × P × Q), with one filter and one same-sized input patch highlighted.]

Convolution: O(n,k,p,q) = Σ_{s=1}^{S} Σ_{r=1}^{R} Σ_{c=1}^{C} F(k,c,s,r) × I(n,c,p+r,q+s), ∀ n ∈ N, k ∈ K, p ∈ P, q ∈ Q

Fig. 1. A typical convolution operation used by most CNNs.

The effectiveness of ABFT's correction capability is also not established for real hardware errors, because they may not manifest as correctable (single-cell output) corruptions. Based on similar observations, researchers proposed using a checksum-based ABED technique [28] to detect errors during a convolution. Since convolution is a linear operation like GEMM, a checksum-based error detection technique can be extended to it as well. This work used checksums for both the filters and input feature maps to verify the output. The goal of the paper was to overclock a CNN accelerator and detect incorrectly computed convolutions. It extended the hardware accelerator to include the detection technique. Several challenges arise while applying ABED to convolutions in optimized CNN inference platforms (e.g., TensorRT, TensorFlow, PyTorch) commonly used in safety-critical and HPC systems. (1) The use of reduced-precision operations is common during inference, and without proper care, the checksum arithmetic can overflow. (2) A convolution operation is often fused with a non-linear activation layer to reduce data movement and improve performance [33]. Checksum-based techniques apply only to linear operations; separating the linear and non-linear computations into different operations can incur high overheads due to additional data movement, introducing an additional implementation challenge.
(3) Without customizing ABED to the CNN inference frameworks, online checksum storage management and generation for the output and both inputs on every convolution introduces avoidable runtime overheads (explained further in Sections 5.3 and 6.3). We not only address these challenges in this paper, but also identify and explore two other ABED variants (not previously studied) that use checksums either for filters or for input feature maps (explained in Section 3). These variants offer interesting error coverage and performance trade-offs, important in selecting an optimal solution for a target safety-critical system.
Convolution: A convolution operation takes two input tensors and produces one output tensor. One of the input tensors holds the input feature maps (or fmaps); these fmaps are either the output of the previous layer or the input to the network. Input fmaps are represented as a 4-D tensor in most CNNs. Each feature map is a 2-D tensor with height (H) and width (W). Typically, many feature maps are stacked to form a 3-D tensor. The number of channels (C) defines the number of feature maps in the stack. Many 3-D input fmaps are batched (batch size = N) together to form the 4-D feature map tensor. The other input tensor is the set of filters, which consists of weights that are computed during the training process. Each filter is a 3-D tensor of weights with dimensions height (S), width (R), and channels (C). Each convolution layer has multiple filters (number of filters = K), adding an extra dimension to produce a 4-D tensor. Each output feature map value is produced by performing a dot-product between a filter and a same-sized portion of the input fmap tensor. An example is shown in the highlighted cells in Figure 1, along with the formula to compute each of the output fmap values. As one filter produces one output feature map, the number of channels (feature maps) in the output is the same as the number of filters (K). The number of output fmaps is the same as the batch size (N).
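The formula in Fig. 1 can be written down directly. The following NumPy sketch (our own illustration, not code from the paper) implements the four-loop formula for a stride-1, no-padding convolution and cross-checks it against a vectorized equivalent:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Direct convolution following the Fig. 1 formula; dimension names
# N, C, H, W, K, S, R, P, Q follow the paper's notation.
def conv2d_direct(I, F):
    N, C, H, W = I.shape
    K, _, S, R = F.shape
    P, Q = H - S + 1, W - R + 1
    O = np.zeros((N, K, P, Q), dtype=np.int64)
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    # dot-product of one filter with a same-sized input patch
                    O[n, k, p, q] = np.sum(F[k].astype(np.int64) *
                                           I[n, :, p:p+S, q:q+R])
    return O

rng = np.random.default_rng(0)
I = rng.integers(-128, 128, size=(2, 3, 8, 8), dtype=np.int8)   # input fmaps
F = rng.integers(-128, 128, size=(4, 3, 3, 3), dtype=np.int8)   # filters
O = conv2d_direct(I, F)

# Cross-check against a vectorized formulation of the same sums.
patches = sliding_window_view(I.astype(np.int64), (3, 3), axis=(2, 3))
assert np.array_equal(O, np.einsum('ncpqsr,kcsr->nkpq', patches,
                                   F.astype(np.int64)))
print(O.shape)  # (2, 4, 6, 6)
```

Note that one filter (axis k) yields one output channel and one batch entry (axis n) yields one output fmap, matching the K and N relationships described above.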
3 CONVOLUTION ABED APPROACH
Verifying every output value of a convolution would require duplicating the entire operation. Instead, the focus of this work is on verifying just the reduced output, i.e., the sum of all the output elements. This reduced output can be computed from the inputs directly with far fewer computations. We essentially use a different sequence of sums and products. Since integer sum and product operations are commutative and associative, changing the order of the operations is not a concern. Based on this key insight, we explore the following three schemes to verify a convolution, which are summarized in Figure 2.
Filter Checksum-based Error Detection (FC): In this scheme, a 3-D filter checksum tensor is computed by performing an element-wise sum (using sum as the checksum function) across all the 3-D filter tensors (❶ in Figure 2(a)). This new checksum filter is convolved with the input fmaps to compute an extra output fmap, which is used to verify the original output fmaps' values. The original output fmaps' values are reduced across the channel dimension to generate a reduced fmap, which is compared element-wise for equality with the extra output fmap for verification (Figure 2(a)). This method protects the computation involved in the convolution and the data storage and transportation of filters (between DRAM/L2 and registers) and output fmaps, but not the storage and transportation of input fmaps. This scheme increases the number of operations in the convolution by a factor of 1/K and introduces PQNK operations for output verification. Here PQNK refers to P × Q × N × K; we omit the multiplication symbols in products of convolution parameters for brevity.
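The FC check can be sketched in a few lines. This is our own illustration: the `conv2d` helper is an assumed stand-in for the layer's stride-1, no-padding convolution, and in a real deployment the checksum filter would be appended so that a single larger convolution produces K+1 output fmaps rather than running a second convolution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(I, F):
    # stride-1, no-padding convolution accumulated in int64 (assumed helper)
    S, R = F.shape[2], F.shape[3]
    patches = sliding_window_view(I.astype(np.int64), (S, R), axis=(2, 3))
    return np.einsum('ncpqsr,kcsr->nkpq', patches, F.astype(np.int64))

rng = np.random.default_rng(1)
I = rng.integers(-128, 128, size=(2, 3, 8, 8), dtype=np.int8)
F = rng.integers(-128, 128, size=(4, 3, 3, 3), dtype=np.int8)

Fc = F.astype(np.int64).sum(axis=0, keepdims=True)  # checksum filter (1,C,S,R)
O  = conv2d(I, F)                                   # original K output fmaps
Oc = conv2d(I, Fc)[:, 0]                            # extra output fmap

assert np.array_equal(O.sum(axis=1), Oc)            # element-wise verification

O_bad = O.copy()
O_bad[0, 2, 1, 1] += 1                              # simulated transient error
assert not np.array_equal(O_bad.sum(axis=1), Oc)    # mismatch -> error detected
```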
[Figure 2 depicts the three schemes. The checksum-generation and verification formulae are:
(a) FC — ❶ Filter checksum generation: F′(c,s,r) = Σ_{k=1}^{K} F(k,c,s,r), ∀ c ∈ C, s ∈ S, r ∈ R; ❸ Output verification: O(n,K+1,p,q) == Σ_{k=1}^{K} O(n,k,p,q), ∀ n ∈ N, p ∈ P, q ∈ Q.
(b) IC — ❶ Input fmap checksums: Ic(c,s,r) = Σ_{n=1}^{N} Σ_{p=1}^{P} Σ_{q=1}^{Q} I(n,c,p+r,q+s), ∀ c ∈ C, s ∈ S, r ∈ R; ❷, ❹, ❺ Output verification: Σ_{i=1}^{C×S×R} F(k,i) × Ic(i) == Σ_{j=1}^{N×P×Q} O(k,j), ∀ k ∈ K.
(c) FIC — ❶ Filter checksums: Fc(c,s,r) = Σ_{k=1}^{K} F(k,c,s,r), ∀ c ∈ C, s ∈ S, r ∈ R; ❷ Input fmap checksums: Ic(c,s,r) = Σ_{n=1}^{N} Σ_{p=1}^{P} Σ_{q=1}^{Q} I(n,c,p+r,q+s), ∀ c ∈ C, s ∈ S, r ∈ R; ❸, ❺, ❻ Output verification: Σ_{i=1}^{C×S×R} Fc(i) × Ic(i) == Σ_{j=1}^{N×K×P×Q} O(j).]

Fig. 2. A depiction of the filter-only (FC), input fmap-only (IC), and filter and input fmap (FIC) checksum-based error detection schemes. Formulae to generate the checksums and verify output fmap values are also shown.

Input Fmap Checksum-based Error Detection (IC): This scheme uses checksums of input fmaps, which can be computed in one of two ways: (1) summing input fmaps' values element-wise across batches to add a new checksum batch, and (2) summing the elements of the portions of the input fmaps that are used to perform the dot-product with the filters during the convolution, producing a tensor that is the same size as a filter.

The first option effectively increases the batch size by one. The original output fmaps' values are reduced across all the original batches to generate a batch of checksum output fmaps, which is compared for element-wise equality with the extra batch of output fmaps generated using the checksum input fmaps. This option is attractive if the batch size is large because the effective overhead of running the larger convolution would be small. However, for small batch sizes, which are common in safety-critical systems [33], [42], this option can result in high overheads.

The second option reduces the input fmaps into a separate checksum tensor, which is the same size as a filter for the layer (❶ in Figure 2(b)). This checksum tensor is then convolved with the K filters to produce exactly K values. The output fmaps are reduced across the height, width, and batch dimensions and then compared with these K values element-wise for equality (Figure 2(b)). This method protects the computation involved in the convolution and the data storage and transportation of input and output fmaps, but not the storage and transportation of the filters. The number of additional computations needed for the convolution is CRSK, for input fmap checksum generation PQNCRS, and for output fmap checksum generation PQNK.

Filter and Input Fmap Checksum-based Error Detection (FIC): This scheme, similar to the one proposed by prior work [28], creates checksums for both the filters and the input fmaps (❶ and ❷ in Figure 2(c)). Using the two checksums, we perform an extra convolution (❸ in Figure 2(c)).
This operation can also be implemented as a vector-vector dot-product because the filter checksum size is the same as the input fmap checksum size. This operation produces a single value, which is used to verify the original computation. The original convolution is run with the original parameters, and the output is reduced to a single value, which is verified against the value generated by the dot-product (Figure 2(c)). This method protects the computation involved in the convolution and the data storage and transportation of the filters and the input and output fmaps. The number of additional computations needed for the dot-product is CRS, for input fmap checksum generation PQNCRS, and for output fmap checksum generation PQNK.

Table 1 summarizes the trade-offs offered by the FC, IC, and FIC techniques in terms of the number of tasks that must be performed online and the protection they provide. The table shows that FIC offers better protection than FC and IC by protecting the storage and data movement of both the filters and inputs. The filter checksums can be generated offline because the weights are known before a CNN is deployed. However, the input and output checksum generation and verification tasks must be performed online. Since the online tasks needed for the FIC and IC techniques are similar, their runtime overheads are also expected to be similar. Given that FIC offers superior protection compared to IC while the runtimes are expected to be similar, we do not investigate IC further. Since FC must run a larger convolution, its overhead can be higher than FIC's. However, FC can be faster if the larger convolution adds minimal overhead, which is possible with the use of network pruning techniques [16], [29].
Network pruning improves network performance by identifying and removing filters that contribute minimally to the accuracy of the network. With the use of pruned networks, the number of filters per layer may diminish so that adding the checksum filter introduces minimal overhead (explored further in Section 6).
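The FIC identity — the dot-product of the filter and input-fmap checksums equals the sum of all output fmap values — can be checked numerically. A sketch with illustrative tensor sizes (our own code; stride-1, no-padding convolution):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(2)
I = rng.integers(-128, 128, size=(2, 3, 8, 8), dtype=np.int8)  # N,C,H,W
F = rng.integers(-128, 128, size=(4, 3, 3, 3), dtype=np.int8)  # K,C,S,R
S, R = F.shape[2], F.shape[3]

# Convolution via sliding windows, accumulated in int64 to avoid overflow.
patches = sliding_window_view(I.astype(np.int64), (S, R), axis=(2, 3))
O = np.einsum('ncpqsr,kcsr->nkpq', patches, F.astype(np.int64))

Fc = F.astype(np.int64).sum(axis=0)   # filter checksum, shape (C,S,R)
Ic = patches.sum(axis=(0, 2, 3))      # input fmap checksum over n,p,q -> (C,S,R)

# Dot-product of the two checksums == reduced output (sum of all O values).
assert np.sum(Fc * Ic) == O.sum()

O[1, 3, 0, 5] ^= 0x10                 # simulated bit-flip in one output value
assert np.sum(Fc * Ic) != O.sum()     # the reduced output no longer matches
```

The equality holds because summing O over n, k, p, q factors into Σ_k F times Σ_{n,p,q} I per (c,s,r) position, which is exactly the reordering of sums and products described above.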
4 IMPLEMENTATION
This section addresses challenges that arise while implementing ABED on a GPU-based system. Specifically, we explain (1) how to maintain high coverage by avoiding overflow while using reduced-precision operations and storage, the use of which is prevalent in inference platforms, (2) the task-fusion-based optimizations/modifications we propose to highly-optimized inference platforms to minimize the overheads introduced by ABED, and (3) the modifications needed to inference deployment frameworks for seamless integration with ABED.
TABLE 1
Trade-offs between the FC, IC, and FIC techniques. Entries marked Yes/Offline and No/Online are favorable and unfavorable, respectively.

Criteria                                      | FC      | IC     | FIC
Additional work:
  Filter checksum generation                  | Offline | -      | Offline
  Input fmap checksum generation              | -       | Online | Online
  Avoids running a larger convolution         | No      | Yes    | Yes
Protects:
  Computation                                 | Yes     | Yes    | Yes
  Storage and transportation of filters       | Yes     | No     | Yes
  Storage and transportation of input fmaps   | No      | Yes    | Yes
The use of reduced-precision fixed-point data types has been explored both in research and in many commercial products. For example, 8-bit integer arithmetic is supported in Google's Tensor Processing Unit, NVIDIA's Volta and Turing GPUs, Intel Xeon Scalable Processors, and ARM CPUs [3], [21], [32], [41]. Fixed-point arithmetic suffers from overflow if the result does not have sufficient bits to represent the full value. Here we describe a method to ensure full error coverage while using reduced-precision fixed-point data types.

Convolutions that use 8-bit integers (int8) store the filters and input fmaps using int8 data types. Each output fmap value is obtained by performing a dot-product using one filter (of size CRS) with a same-sized portion of the input fmap, as illustrated by the highlighted portion in Figure 1. In this operation, CRS 16-bit values, each of which is a product of two int8 values, are summed together. Making no assumption about the values, the result can be accurately represented using ⌈16 + log2(CRS)⌉-bit integers. For most practical values of C, R, and S (CRS ≤ 2^16), int32 is sufficient to avoid overflow during convolutions.

We use two's complement integer summation as the checksum function. To avoid overflow during checksum generation, we use int32 operations. For the FC technique, we store each int32 checksum as a tuple of up to four int8 values, creating up to four checksum filters. No information is lost with this scheme. The extra output fmap values are shifted left by 24, 16, 8, and 0 bits, respectively, and added together across the channel dimension. These values are then compared with the output fmap checksums, which are obtained by summing the original output fmap values across the channel dimension (K additions).
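The tuple representation can be illustrated with a byte decomposition. This is our own reconstruction (the shift amounts 24, 16, 8, and 0 correspond to the four bytes of an int32): each int32 checksum value is stored as four int8 "planes", and the planes recombine losslessly under two's complement arithmetic.

```python
import numpy as np

# Store int32 checksum values as four int8 planes (byte decomposition),
# e.g., so checksum filters use the same int8 format as regular filters.
def split_int32(x):
    u = x.astype(np.uint32)
    # most-significant byte first; each byte is reinterpreted as int8
    return [((u >> s) & 0xFF).astype(np.uint8).view(np.int8)
            for s in (24, 16, 8, 0)]

def recombine(planes):
    acc = np.zeros(planes[0].shape, dtype=np.int64)
    for plane, s in zip(planes, (24, 16, 8, 0)):
        acc += plane.view(np.uint8).astype(np.int64) << s
    # reinterpret the reassembled 32-bit pattern as a signed int32 again
    return acc.astype(np.uint32).view(np.int32)

checksums = np.array([0, 1, -1, 123456789, -123456789,
                      2**31 - 1, -2**31], dtype=np.int32)
assert np.array_equal(recombine(split_int32(checksums)), checksums)
```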
The reduced result can be accurately represented using ⌈16 + log2(CRSK)⌉-bit integers. For most practical values of C, R, S, and K, 64 bits are sufficient. For the FIC technique, the filter checksum is obtained by summing K int8 filters and can be accurately represented by ⌈8 + log2(K)⌉-bit integers. Each input fmap checksum value is computed by summing PQN int8 input fmap values, requiring up to ⌈8 + log2(PQN)⌉ bits. The result of the convolution of the checksums requires up to ⌈16 + log2(PQNKCRS)⌉ bits. For most practical purposes, int32 and int64 are sufficient to store the checksums and convolution results, respectively.

Table 2 summarizes the maximum number of bits needed to accurately represent the values at different points during the convolution operation for the FC and FIC techniques, based on worst-case overflow analysis. We assume unsigned integers for this analysis; the requirements can be slightly lower for signed integers. For example, the result of multiplying
TABLE 2
The bit requirements to accurately represent the results while verifying intb (e.g., int8 for b=8) convolutions.

Value                | FIC                   | FC
Input fmaps          | b                     | b
Input fmap checksum  | b + log2(PQN)         | -
Filters              | b                     | b
Filter checksum      | b + log2(K)           | b
Output fmap          | 2b + log2(CRS)        | 2b + log2(CRS)
Reduced output fmap  | 2b + log2(PQNKCRS)    | 2b + log2(CRSK)
Dot-product output   | 2b + log2(PQNKCRS)    | -

two signed 8-bit integers (with 1 sign bit) can be accurately represented using 15-bit signed integers (with 1 sign bit). Since all parameters are known prior to neural network deployment, the ABED precision requirements can also be determined ahead of time. For the networks used in this paper, int64 checksums are sufficient to avoid overflow. If the precision requirements grow beyond int64, modular arithmetic can be used to limit the highest-precision operation used by the verification kernel (to reduce runtime overhead) at some loss in coverage, which we do not explore in this paper.

Floating-point arithmetic suffers from both overflow and rounding. While we explored ways to address these issues, the discussion of maintaining high error coverage with floating-point data types is deferred to Section 7, as commercial implementations increasingly use fixed-point arithmetic due to its performance and energy advantages.

Once a neural network is trained, it is optimized and prepared for deployment using a platform for high-performance inference (e.g., TensorRT). The optimizations involve pruning, weight and activation precision calibration (also known as quantization), layer and tensor fusion, and kernel auto-tuning. These optimizations are typically performed once to create an inference engine, which is then serialized to avoid preparation overheads. We perform the following during the optimization step. (1) We create checksum filters and store them along with the filter tensor (for FC) or in separate storage (for FIC). (2) We introduce all the additional online tasks that should be executed during inference as part of the ABED scheme (e.g., input and output checksum generation).
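Because all convolution parameters are fixed before deployment, the worst-case bounds of Table 2 can be evaluated when the engine is built. A sketch for an illustrative int8 layer (all dimension values below are assumed, not taken from the paper; unsigned worst-case analysis):

```python
import math

b = 8                                  # intb convolution (int8)
N, C, H, W = 4, 64, 56, 56             # illustrative layer dimensions
K, S, R = 64, 3, 3
P, Q = H - S + 1, W - R + 1            # stride 1, no padding

lg = lambda x: math.ceil(math.log2(x))
bits = {
    'input fmap checksum (FIC)': b + lg(P * Q * N),
    'filter checksum (FIC)':     b + lg(K),
    'output fmap':               2 * b + lg(C * S * R),
    'reduced output (FC)':       2 * b + lg(C * S * R * K),
    'dot-product output (FIC)':  2 * b + lg(P * Q * N * K * C * S * R),
}
for name, nbits in bits.items():
    print(f'{name}: {nbits} bits')

# int32 suffices for the convolution itself, int64 for the reduced outputs.
assert bits['output fmap'] <= 32 and bits['dot-product output (FIC)'] <= 64
```

For this layer the convolution output needs 26 bits (int32 is safe) and the FIC dot-product output needs 45 bits (int64 is safe), consistent with the choices described above.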
Figure 3 summarizes the proposed changes for seamless integration of ABED into a tool-chain used to deploy trained models for inference. ABED is independent of all the optimizations described above except layer and kernel fusion. We next describe how ABED can be applied to highly-optimized convolutions that are fused with subsequent layers in CNNs.
As described in Section 2, common convolution operations take two 4-D tensors as inputs, one each for the input fmaps and filters (I and F), and produce a 4-D tensor of output fmaps (O). Convolution, bias, and activation operations are typically fused together for performance. Such fused operations perform O = activation(conv(I, F) + bias). For int8 convolutions, I and F use int8, and O uses either int8 or fp32.

[Figure 3 shows two flows. Inference from a network definition: Pre-trained Neural Network → Import Model → Build and Optimize Engine (example optimizations: pruning, quantization, layer and tensor fusion, kernel auto-tuning, plus ABED) → Serialize Engine (with ABED) → Perform Inference (with ABED). Inference from a serialized engine, which avoids rebuilding and reoptimizing the engine: Deserialize Engine (with ABED) → Perform Inference (with ABED).]

Fig. 3. Inference deployment steps and where ABED can be included for seamless integration.

Figure 4 explains the logical flow of computation within such fused kernels.

[Figure 4 shows the fused kernel: int8 inputs and int8 filters feed the convolution + bias stage, producing an int32 ConvOut; a scaling factor (generated during calibration) yields fp32 ScaledOut; the activation yields fp32 ActOut; a cast/clamp/truncate step produces the int8 output.]

Fig. 4. The logical computation flow in a fused convolution, bias, and activation kernel.

For int8 convolutions, the output of the convolution operation is an int32 result (ConvOut in the figure). This intermediate result is then scaled using a scaling factor that is generated during the calibration step, which produces an fp32 result (ScaledOut in the figure). This step assumes that the scaled int32 result can be accurately represented using an fp32 data type. Bias is added to ScaledOut. The activation function is then applied to it to produce ActOut (another fp32 value). If this value (ActOut) is too large to be accurately represented using the int8 data type, it is clamped and truncated to produce an int8 output value. We refer to all the operations after the convolution operation as the epilog.

The ABED techniques verify just the result of the convolution. Hence, the intermediate output (ConvOut in Figure 4) must be verified before the epilog is applied, which can be done either by using un-fused kernels or by fusing part of the output fmap checksum generation task with the fused convolution + epilog kernel. Figure 5 lists some of the options for the FIC technique. For each option (one row in the table), we list the tasks that must be performed (in columns) and show the unfused/fused kernels that the option executes. For each kernel, we show the data types and sizes of its inputs and outputs.
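The epilog of Fig. 4 can be mimicked in a few lines. In this sketch (our own illustration), the scale and bias values are arbitrary and ReLU stands in for the activation function:

```python
import numpy as np

def epilog(conv_out, scale, bias):
    # int32 ConvOut -> fp32 ScaledOut (assumes the scaled value fits in fp32)
    scaled = conv_out.astype(np.float32) * np.float32(scale)
    act = np.maximum(scaled + np.float32(bias), np.float32(0))  # bias + ReLU
    # clamp/truncate ActOut into the int8 output range
    return np.clip(np.rint(act), -128, 127).astype(np.int8)

conv_out = np.array([-1000, 50, 400000], dtype=np.int32)  # sample ConvOut values
out = epilog(conv_out, scale=0.05, bias=1.0)
print(out)  # [  0   4 127] -- the too-large value is clamped to 127
```

Because the clamp and activation are non-linear, the checksum comparison must happen on `conv_out` itself, before this function runs, which is exactly why ConvOut is the value ABED verifies.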
FIC Technique:
The following seven tasks (one more than listed in Figure 2(c)) must be performed for the FIC technique to detect errors in the fused convolution + epilog kernel: (1) filter checksum generation, (2) input fmap checksum generation (ICG), (3) dot-product of the filter and input fmap checksums, (4) the convolution operation, (5) output fmap checksum generation (OCG), (6) the epilog, and (7) verifying the output fmap checksum against the output of step (3). The first task is performed offline, as described above. All other tasks are performed online. Since task (7) involves comparing just two values for bit-wise equality, it can be performed on the host CPU. There are several ways to perform the remaining online tasks on a GPU. A simple option is to use a different GPU kernel for each task, shown as option Unfused in Figure 5. The input checksum generation kernel reads the int8 input fmaps (size NCHW) and generates an int32 checksum vector of size CRS. The convolution kernel reads the int8 filters and input fmaps of sizes KCRS and NCHW, respectively, and writes an int32 output of size NKPQ (ConvOut in Figure 4). The next kernel reads ConvOut and applies the epilog to produce an int8 output of the same size. The output fmap checksum generation kernel reads the convolution output again to generate a single int64 checksum value. Lastly, the dot-product kernel reads the filter and input fmap checksums, two int32 vectors of size CRS, and produces a single int64 value, which is compared with the output fmap checksum for equality. This implementation does not protect the epilog output and introduces several additional data transfers, including the convolution output stored in int32 (4× the size of an equivalent int8 structure).

Task fusion can significantly reduce data movement. While we explored multiple options to (partially or fully) fuse tasks, we describe only two options here for brevity.
(1) The convolution, epilog, and output fmap checksum generation kernels can be fused into a single kernel to limit the data movement introduced by ABED (FusedOCG in Figure 5). It generates an int64 output fmap checksum value along with the output of the fused convolution + epilog tasks, i.e., an int8 output fmap tensor of size NKPQ. (2) To further reduce data movement and provide coverage for the epilog's output, the output checksum generation task can be modified to also produce the input fmap checksum for the subsequent layer, if that layer is a convolution layer (FusedIOCG in Figure 5). This essentially fuses the input checksum generation task with the fused convolution + epilog + output checksum generation kernel. This optimization duplicates the epilog, but we expect the data movement savings to improve runtime more than the overhead introduced by duplicating the compute-bound epilog. The FusedOCG and FusedIOCG options require changes to the existing fused convolution + epilog kernels offered by frameworks such as cuDNN and TensorRT.

FC Technique:
The following tasks are needed for the FC technique (one more than listed in Figure 2(a)): (1) filter checksum generation, (2) the convolution operation, (3) the epilog, and (4) output fmap reduction and verification. The first task is done offline. The verification task involves computing checksums from the original and extra output feature maps across the channel dimension separately and comparing them for bit-wise equality. This task must be performed before the epilog and can be implemented in multiple ways. The first option is to not fuse the operations and launch separate kernels for the convolution, epilog, and output fmap checksum generation and verification; this option is similar to Unfused for FIC. Fusing the output fmap checksum generation and verification task with the already-fused convolution + epilog kernel reduces data movement; this option is similar to FusedOCG for FIC.

Since this technique adds checksum filters, the runtime of the larger convolution can be higher. The increase can be super-linear in the number of filters for certain architectures and convolution parameters. The convolution implementation is often tiled: the runtime may not increase if a tile boundary is not crossed, but can increase significantly if it is. Coordination with pruning techniques may help in reducing this overhead. When the network is being prepared for deployment, filter pruning is commonly employed to optimize the network's performance and storage [16], [29]. Reducing the number of filters at this step such that the checksum filters can be included without introducing a new
Options and the inputs/outputs of each task (Figure 5). Tasks: Input Fmap Checksum Generation (ICG), Convolution, Epilog, Output Fmap Checksum Generation (OCG), and the dot-product of the input and filter checksums.

Baseline (no ABED):
  Fused convolution + epilog. Input: int8 KCRS; int8 NCHW. Output: int8 NKPQ.

Unfused (no task/kernel fusion):
  ICG. Input: int8 NCHW. Output: int32 CRS.
  Convolution. Input: int8 KCRS; int8 NCHW. Output: int32 NKPQ.
  Epilog. Input: int32 NKPQ. Output: int8 NKPQ.
  OCG. Input: int32 NKPQ. Output: int64 (1 value).
  Dot-product. Input: int32 CRS; int32 CRS. Output: int64 (1 value).

FusedOCG (fused convolution + epilog kernel generates the output checksum):
  ICG. Input: int8 NCHW. Output: int32 CRS.
  Fused convolution + epilog + OCG. Input: int8 KCRS; int8 NCHW. Output: int8 NKPQ; int64 (1 value).
  Dot-product. Input: int32 CRS; int32 CRS. Output: int64 (1 value).

FusedIOCG (fused convolution + epilog kernel generates the input and output fmap checksums):
  Fused convolution + epilog + OCG + next-layer ICG. Input: int8 KCRS; int8 NCHW. Output: int8 NKPQ; int32 CRS; int64 (1 value).
  Dot-product. Input: int32 CRS; int32 CRS. Output: int64 (1 value).
tile of work can reduce the overhead significantly.

The output of the fused operation must be trimmed such that the extra feature maps are ignored by the subsequent layer. Trimming can be fused with the output verification task. We do not explicitly study the effects of trimming because implementing it simply requires skipping some writes.

Fig. 5. Implementation options for the FIC technique. Each colored box represents a GPU kernel and shows the data type and size of the inputs and outputs. Inputs/outputs for which the data transportation is not protected are shown in red.
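To make the FC check concrete, it can be sketched in a few lines of numpy. This is an illustrative model, not the cuDNN/TensorRT implementation: `conv2d` is a naive stride-1, unpadded direct convolution, and the function and variable names are ours.

```python
import numpy as np

def conv2d(x, w):
    """Naive direct convolution: x is (C, H, W), w is (K, C, R, S); stride 1, no padding."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    y = np.zeros((K, P, Q), dtype=np.int64)
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                y[k, p, q] = int(np.sum(x[:, p:p + R, q:q + S].astype(np.int64) * w[k]))
    return y

def fc_conv(x, w):
    """FC technique: append one checksum filter (the element-wise sum of all K filters)
    and run the larger convolution; the extra output fmap is the expected channel sum."""
    w_ext = np.concatenate([w, w.sum(axis=0, keepdims=True)], axis=0)
    return conv2d(x, w_ext)

def fc_verify(y_ext):
    """Reduce the K original output fmaps across the channel dimension and compare
    bitwise against the extra fmap produced by the checksum filter."""
    return np.array_equal(y_ext[:-1].sum(axis=0), y_ext[-1])
```

A fault-free run passes `fc_verify`, while corrupting any element of the original or extra output fmaps (as a transient error during the convolution would) makes the channel sums disagree. Trimming amounts to dropping `y_ext[-1]` before the next layer.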
EVALUATION METHODOLOGY
We evaluate the overheads introduced by the ABED techniques to the convolutional layers from different CNNs. We use VGG16, ResNet18, and ResNet50 for analysis [14], [43]. We evaluate using two different image sizes: 224x224, the size of the images in the ImageNet dataset [11], and 1080x1920, the resolution of the images in a full-HD or 1080p video.
We first analytically evaluate the increase in the compute and data movement operations when we apply the FC and FIC ABED techniques to the networks. In this analysis, we abstract away implementation details and consider only the arithmetic operations such as multiplication, addition, fused multiply-add (FMA), activation, and type-casting. Similarly, we also count the bytes of data that form the inputs and outputs of the different implementations of the FC and FIC techniques, as listed in Section 4.3.
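This style of operation counting can be sketched as follows. The layer dimensions follow the NKPQ/CRS notation used in Figure 5; the formulas are an illustrative reconstruction of the model (the input-checksum count, in particular, is approximated as one add per input element):

```python
def conv_macs(N, K, C, H, W, R, S, P, Q):
    # Multiply-accumulate operations in the direct convolution.
    return N * K * P * Q * C * R * S

def fic_extra_ops(N, K, C, H, W, R, S, P, Q):
    # FIC adds: input fmap checksum generation, output fmap checksum
    # generation (one add per output element), and a CRS-sized dot-product.
    icg = N * C * H * W          # approximate: one add per input element
    ocg = N * K * P * Q          # one add per output element
    dot = C * R * S              # one FMA per checksum element
    return icg + ocg + dot

# A ResNet-style 3x3 layer: N=1, K=C=64, 56x56 fmaps, stride 1, "same" padding.
base = conv_macs(1, 64, 64, 56, 56, 3, 3, 56, 56)
extra = fic_extra_ops(1, 64, 64, 56, 56, 3, 3, 56, 56)
overhead = extra / base
```

For this layer the extra compute is well under 1% of the convolution's MACs, consistent with the small FIC increases reported in the results.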
We experimentally evaluate the runtime of convolutions by creating a cuDNN-based workload that sets up, initializes, and runs convolutions in a loop. We compile this workload using CUDA 10 and use cuDNN 7.3 [37] on both a Jetson AGX Xavier system and an x86-based desktop with a V100-based GPU (Titan V) [32], [34]. For performance analysis, we lock the CPU, GPU, and memory clocks on the Jetson board and lock the application clocks on the V100 GPU. We run the convolution (and the other operations needed by the ABED techniques) 200 times and record the average. Since real-time applications (including safety-critical systems) use small batch sizes, we use a batch size of two on Jetson and eight on the V100-based system [33], [42]. We ignore the first layer in each network because it is not well optimized by cuDNN. We use the NHWC memory layout for tensor storage, as int8 cuDNN convolutions are optimized for this layout.

For a baseline, we invoke the fused convolution and epilog kernel provided by cuDNN (called cudnnConvolutionBiasActivationForward). We implement versions that are similar to Unfused for the FIC and FC techniques. As cuDNN does not offer a convolution kernel that takes two int8 tensors and produces an int32 tensor as output, we employ a version that produces fp32 output. The epilog, when performed separately, invokes two GPU kernels (one each for adding the bias and applying the activation) and generates an fp32 tensor as output (instead of an int8 tensor, as mentioned in Section 4.3, due to cuDNN API limitations). For analyzing the Unfused implementation options for ABED, we also collect results with just the unfused convolution and epilog kernels (without ABED). This version launches one kernel each to perform the convolution operation, add the bias, and apply the activation.
We refer to it as the unfused baseline. The checksum generation kernels are written in CUDA. Since checksum generation is similar to a reduction operation, we use previously established optimizations. Specifically, we use the functions and primitives (such as DeviceReduce and WarpReduce) provided by the optimized CUB library implementation [35]. We try to minimize the use of atomics and leverage faster memory (e.g., registers and shared memory) as much as possible. We ensure that global loads are coalesced across warps and use wide loads per thread (e.g., LD.128). We avoid control flow and use the dot-product instruction (DP4A) whenever possible to avoid a compute bottleneck [36]. We specialize kernels for filter sizes 1 and 3, strides 1 and 2, and data types int8 and int32. For the FC technique, we add 8 filters (4 for checksums and 4 with zeros for int8) because the cuDNN version we use on our target device chooses efficient kernels when the number of filters is a multiple of 8. We obtain the aggregate runtime per network by adding all of the GPU kernel runtimes and show the runtime relative to one of the baselines mentioned above.
Effect of Task Fusion-based Optimization:
Since we could not modify the closed-source cuDNN kernels to fuse the checksum generation and output verification tasks with the convolution for the FusedOCG versions, we model the runtimes. We write separate CUDA kernels to capture the overheads associated with performing the additional work that would be fused with the convolution + epilog kernel. For FIC-FusedOCG, the new kernel fully reduces the output and writes out just one int64 value. Since using atomic operations for reduction can be a performance bottleneck, we hierarchically reduce the output, similar to prior optimized reduction implementations [35]. For FC-FusedOCG, the new kernel reduces the values across the channel dimension for verification and sets a flag on error detection. We measure the runtimes of these kernels by running them on silicon and add them to Baseline-Fused as an estimate of Kernel 3's runtime (from Figure 5).
As convolutions can be implemented using GEMM, ABFT can be employed at the GEMM level to provide protection. While prior work explained the resilience benefits, no performance assessment was included [10]. We quantify the overheads associated with the ABFT technique for GEMM and highlight the benefits of our ABED solutions for protecting convolutions. An ABFT approach performs the following tasks: (1) allocate larger input and output matrices with space for checksums, (2) copy data from the original matrices to the new locations, (3) generate checksums for both input matrices, (4) run the larger GEMM instead of the original GEMM, (5) generate both row and column checksums for the output (by reading the output matrix twice) and compare them with the extra row and column values generated by the GEMM, and (6) copy the output matrix back to the original location. We measure the overheads of tasks (2)-(6) by developing CUDA kernels. We compare the runtime of our implementation of the checksum generation task with the reduction operation provided by CUB [35] and use whichever is faster (assuming checksum generation can be as fast as a reduction). ABFT can instead be implemented to store the input checksums separately to avoid copying large matrices. Such an implementation launches separate kernels to perform the original GEMM and to generate the extra output row and column via vector-matrix products, which can be as slow as the original GEMM for some matrix sizes (used in CNNs). Due to such overheads, we do not explore this option in this paper.
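Tasks (3)-(5) above can be sketched with a classic checksum-encoded GEMM in the style of Huang and Abraham [15]. This is a simplified numpy model of the idea, not our CUDA implementation; integer matrices allow bitwise-equal comparison, whereas floating-point data would need a threshold:

```python
import numpy as np

def abft_gemm(A, B):
    """Multiply checksum-augmented matrices: A gains a column-checksum row,
    B gains a row-checksum column; the product then carries both checksums."""
    Ac = np.vstack([A, A.sum(axis=0, keepdims=True)])   # (m+1, k)
    Bc = np.hstack([B, B.sum(axis=1, keepdims=True)])   # (k, n+1)
    return Ac @ Bc                                      # (m+1, n+1)

def abft_verify(Cc):
    """Recompute the row/column checksums of the data block and compare them
    with the checksum row/column produced by the larger GEMM."""
    data = Cc[:-1, :-1]
    col_ok = np.array_equal(data.sum(axis=0), Cc[-1, :-1])
    row_ok = np.array_equal(data.sum(axis=1), Cc[:-1, -1])
    return col_ok and row_ok
```

Note that `abft_verify` reads the output block twice (once per checksum direction), which is exactly the step our detection-only ABED replaces with a single output checksum.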
We evaluate the resilience improvements offered by the ABED techniques using three methods: analytical modeling, input-output error injections, and accelerated-particle beam experiments. The first method uses the same model used above to analyze the increase in compute and data movement operations; here we analyze the fraction of compute and data movement operations that are protected by the ABED techniques. For the second method, we run the second convolution layer from ResNet18 with the input (fmap and filter) values initialized to 1. We perform three error injection campaigns to study the effect of injecting single bit-flips into input fmaps, filters, and output fmaps. For each experiment in a campaign, we randomly flip a bit in a randomly chosen location in the tensor and study whether the ABED approach detects the error. We conducted two accelerated-particle beam experiments, at ChipIr at Rutherford Appleton Laboratory and at ICE-II at LANSCE [6], [31], using our implementations of the Unfused option for the FIC and FC techniques and the baseline. We excluded the epilog in these experiments. The input tensors were initialized to 1. After each convolution we verified the output against the expected golden values collected during fault-free runs. Any mismatch was recorded as an SDC. Our investigation suggests that the output tensor was corrupted for some runs after the ABED checks were completed, likely
when the output was being reduced for comparison against the golden checksum to determine the SDC. These output corruptions are outside the coverage scope of ABED, and hence we do not consider them as SDCs in this study. We also recorded whether our ABED scheme was able to detect the error. We tested NVIDIA Quadro GV100 GPUs with HBM2 ECC always On, and with on-chip ECC both On and Off.

Fig. 6. The increase in the number of logical compute operations for the two ABED techniques relative to the baselines.
RESULTS
We first study the increase in the logical compute and data movement operations based on the model described in Section 5.1. Figure 6 shows a breakdown of the number of arithmetic operations in the convolution, epilog, checksum generation, and dot-product of the checksums for the baseline and the FC and FIC techniques. The average increase in the number of operations is small: <7% for FC and <1% for FIC for the studied networks, compared to the respective baselines. The increase is relatively higher for the FC technique because it increases the size of the convolution, unlike the FIC technique. Results show that the extra computations added for checksum generation and for the dot-product of the checksums amount to significantly less than 1%.

The different implementations listed in Section 4.3 perform the same logical compute tasks but differ in data movement. Figure 7 shows the bytes of data that form the inputs and outputs of the different implementation options for ResNet18 using two input image sizes. Results for other networks and input sizes show similar trends (not shown for brevity). The figure shows that the fused versions transport significantly less data than the versions that do not fuse tasks. Introducing separate kernels for checksum generation and verification introduces more data movement, as is the case for FIC Unfused and FC Unfused. Results also show that FC FusedOCG requires less data movement than FIC FusedOCG, but FIC FusedOCG offers better data movement protection by using input and weight checksums, while FC FusedOCG protects just the weight storage and movement.
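For reference, the checksum identity that the FIC technique verifies can be written out for a stride-1, unpadded convolution: the sum of all output values equals the dot-product of a CRS-shaped input checksum with the CRS-shaped filter checksum. The numpy sketch below is illustrative (names are ours; the real kernels operate on int8/int32 NHWC tensors):

```python
import numpy as np

def fic_checksums(x, w, y):
    """x: (C, H, W) input fmap, w: (K, C, R, S) filters, y: (K, P, Q) output of
    a stride-1, no-padding convolution. Returns (output checksum, dot-product
    of the input and filter checksums); the two are equal in a fault-free run."""
    K, C, R, S = w.shape
    P, Q = y.shape[1:]
    # Filter checksum: sum the filters across the K dimension (done offline).
    w_chk = w.sum(axis=0)                                # (C, R, S)
    # Input checksum: for each (c, r, s), sum the input values that are
    # multiplied by w[:, c, r, s] somewhere in the sliding window.
    x_chk = np.zeros((C, R, S), dtype=np.int64)
    for r in range(R):
        for s in range(S):
            x_chk[:, r, s] = x[:, r:r + P, s:s + Q].sum(axis=(1, 2))
    return int(y.sum()), int((x_chk * w_chk).sum())
```

Comparing these two scalars is the dot-product verification step; it covers the convolution's inputs as well as its output, which is why FIC offers broader data movement protection than FC.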
Figure 8 shows the average runtimes of the Unfused options of the baseline and the FC and FIC techniques. Results for the three networks using 1080p inputs are shown here.
The results are normalized to the Fused baseline. The runtime overheads for the FC technique stem from running a larger convolution with additional checksum filters and from output checksum generation. For the FIC technique, the overheads stem from running the input and output fmap checksum generation tasks and the dot-product of the filter and input fmap checksums.

Fig. 7. The relative increase in the data that forms the input and output of all the kernels for the different implementations of the ABED techniques.
FC vs. FIC:
Results show that the runtime overhead introduced by output checksum generation is similar between the FC and FIC techniques. The dot-product kernel used by the FIC technique introduces negligible overhead across all the studied networks and architectures. The difference in overheads between the FC and FIC techniques is mainly due to running the larger convolution versus generating input fmap checksums online; the former introduces higher overheads for all the networks and architectures we studied (Figure 8). The results show that the overheads introduced by the checksum generation and verification tasks are small (4-20%). The overhead introduced by the separate data-movement-heavy epilog is high; it can be avoided by the task-fusion-based implementations discussed below.
Model-specific Sensitivities:
The overhead of output checksum generation for ResNet50 is higher than for VGG16 and ResNet18. A primary reason for this difference is that the overhead for verifying 1x1 convolutions is much higher than for 3x3 convolutions, and ResNet50 uses many 1x1 convolutions while ResNet18 and VGG16 do not use any. The fraction of the work (and data movement) performed for checksum generation relative to the baseline convolution is much higher for 1x1 convolutions than for 3x3 convolutions. Fusing the output checksum generation task with the convolution operation can help reduce these overheads.
Architecture-specific Sensitivities:
We show the results obtained from Jetson AGX Xavier (with a batch size of two) and a V100-based GPU (with a batch size of eight) in Figure 8. The V100-based GPU offers approximately 10× the compute throughput and 5× the memory bandwidth of the Xavier GPU. As expected, the baseline work is significantly faster on the V100-based GPU. Since the throughput-to-bandwidth ratio is higher for the V100-based GPU, the overheads of memory-bound tasks such as checksum generation and the epilog are higher. In fact, tasks that are not memory-bound on Xavier become memory bandwidth/latency limited on V100. The runtime overhead of generating input checksums is significantly lower than that of generating output checksums.
Fig. 8. The runtimes of the Unfused versions of the baseline and the FC and FIC techniques for different neural networks.
The main contributing factor is that the input fmaps are 4× smaller than the non-scaled 32-bit integer output fmap values. One exception to this finding is ResNet-18 with an image size of 1080x1920 on the V100-based GPU: input checksum generation for ResNet-18 incurs high overhead because it becomes memory-bandwidth limited on this GPU.

Fig. 9. The runtime overheads of the FusedOCG options for the FC and FIC techniques relative to the fused convolution + epilog kernel (Baseline).

Input-specific Sensitivities:
Figure 8 shows the relative runtime for VGG16 with different input sizes (224x224 vs. 1088x1920). Since no significant difference is observed, we do not analyze results with the smaller image size for the other networks.
Effect of Task Fusion-based Optimization:
The runtime overhead results for the FusedOCG optimization of the FC and FIC techniques, obtained using the methodology described in Section 5.2, are shown in Figure 9. These experiments were run on Jetson Xavier. The results suggest that task fusion is highly effective in reducing the overheads by reducing the memory traffic associated with the epilog and the additional ABED tasks. The inference-level overheads for the FIC technique (6-23%) are far lower than full duplication. The overheads for the FC technique are higher, mainly due to running the larger convolution (with additional checksum filters).
Reducing overheads for the FC technique:
In our FC implementations, we increase the filter counts by 8 (as described in Section 5.2). The runtime of the convolution, however, increases disproportionately for some layers. We illustrate this behavior by executing a convolutional layer with a varying number of filters. Specifically, we ran an int8 convolution with a 112x112 input fmap, 3x3 filters, 64 input channels, and stride and
Fig. 10. Convolution runtimes with varying filter counts.

Fig. 11. The runtimes for VGG16 and two pruned versions for the Unfused baseline and the FC technique.
padding of 1. We vary the filter counts and show that adding just eight filters can introduce up to 2× overhead. Figure 10 shows these results. Since cuDNN uses GEMM as a method to perform the int8 convolutions and GEMMs use tiling, the sharp increase in runtime is likely due to the use of an additional tile. While such tiling effects are not a concern for GEMMs with large dimensions, they can be problematic for commonly used convolutional layer dimensions (where K is small). Instead of increasing the filter counts of the baseline network, creating space for the filter checksums can eliminate this overhead. As mentioned in Section 3.4, network pruning techniques, which are being adopted as a way to improve network performance, may create space for filter checksums. These techniques identify and remove filters that contribute minimally to the accuracy of the network. With pruned networks, the number of filters per layer may be reduced such that, even after adding the checksum filters, the sharp increase in the convolution runtime is avoided. To test this hypothesis, we conducted an experiment for the FC technique using VGG16 and two pruned versions of the network. We obtained the number of pruned filters per layer from a previously published result: Huang et al. studied two methods to prune the network [16]. The first approach ranks filters on a per-layer basis, while the second ranks them across the whole network. Our results in Figure 11 demonstrate that the overheads from running the larger convolution become small, 2% or 10% for the two pruned versions, respectively, compared to the 42% overhead for the non-pruned version.

Fig. 12. Runtime breakdown for an ABFT implementation.

As convolutions can be implemented using GEMM, ABFT for GEMM (a well-known technique) can provide high protection to CNNs. A recent study explained the resilience benefits of such an approach [10].
In this section, we quantify the overheads associated with the ABFT technique for GEMM, as described in Section 5.3, and highlight the benefits of our ABED solutions for protecting convolutions in CNNs. Results for the three networks with 1080p input are shown in Figure 12. These results show that running a larger GEMM incurs high overhead (similar to the FC technique). This overhead can be reduced using pruned networks (not explored here). As our ABFT implementation embeds the row and column checksums with the input matrices to perform a larger GEMM, online data management (i.e., copying inputs to the larger matrices) introduces significant overheads. This overhead can be avoided by allocating the larger matrices in the first place for the inputs and output, given broader application knowledge and framework support, which is what we propose for the FC technique. The FIC technique avoids running the larger convolution altogether, simplifying data management. Lastly, processing the output matrix twice to generate both the row and column checksums for error correction capability can also introduce high overheads. Optimized implementations that process the output just once would reduce this overhead. By focusing on error detection alone, ABED significantly speeds up this step by generating a single checksum.
Scope of protection:
Figures 6 and 7 show the scope of protection offered by the FC and FIC techniques. The ABED techniques protect the computation in the convolution, the input and output checksum generation, and the dot-product kernel. Other than convolutions, CNNs include activation, pooling, and fully-connected layers. Activation layers are typically merged with the convolution layer, and they constitute a small fraction of the total compute (0.6% for ResNet18 and 1.1% for ResNet50, as shown in Figure 6). Only a few pooling layers are typically used in CNNs (just two each in ResNet18 and ResNet50, for example) and their compute requirement is also small.

The amount of protection offered by ABED for data storage and movement is important for architectures that do not sufficiently protect (with ECC/parity) storage and transportation structures against transient, intermittent, and permanent errors. We show the levels of protection the ABED techniques offer for data movement in Figure 7. Since the FIC technique protects the input data to the original convolution kernel, it provides better data storage and movement protection than the FC technique. As the input fmap size increases, the coverage offered by the FC technique decreases (compare the FC FusedOCG results between ResNet18-ImageNet and ResNet18-1080p). While the coverage also decreases for the FIC technique, the reduction is much smaller.
Fig. 13. The SDC FIT rate improvement with the FC and FIC techniques without on-chip ECC.
Results show that FIC FusedIOCG offers the highest data storage and movement coverage among all the FIC options. ECC/parity deployed in architectures used in HPC and safety-critical systems provides no protection to computational units, one of the major sources of intermittent and permanent errors. The ABED techniques provide very high protection for computational units along with storage and data transportation protection.
Error injections:
We perform error injections into the input and output tensors as described in Section 5.4. Our results for the FC technique show that all single-bit injections into non-zero filters and output fmaps are detected by the ABED technique, while no single-bit injections into input fmaps are detected, as expected. A similar experiment for the FIC technique shows that errors in the input fmaps, filters, and output fmaps are all detected.
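An injection campaign of this kind can be modeled in a few lines of numpy. This is an illustrative sketch, not our actual injection harness, and it is simplified to compare filter checksums directly rather than running the convolution:

```python
import numpy as np

def flip_random_bit(t, rng):
    """Flip one random bit of one randomly chosen element of an int8 tensor, in place."""
    bits = t.view(np.uint8).reshape(-1)   # reinterpret the int8 storage as raw bytes
    idx = int(rng.integers(bits.size))
    bit = int(rng.integers(8))
    bits[idx] ^= np.uint8(1 << bit)
    return idx, bit

def campaign(n_trials=100, seed=0):
    """Inject one single-bit flip into the filter tensor per trial and count
    how many injections the channel-sum checksum detects."""
    rng = np.random.default_rng(seed)
    detected = 0
    for _ in range(n_trials):
        w = np.ones((8, 4, 3, 3), dtype=np.int8)          # filters initialized to 1
        w_chk = w.astype(np.int64).sum(axis=0)            # golden filter checksum
        flip_random_bit(w, rng)                           # single-bit transient fault
        if not np.array_equal(w.astype(np.int64).sum(axis=0), w_chk):
            detected += 1
    return detected, n_trials
```

Since every bit flip changes the affected element, every injection perturbs exactly one checksum entry and is detected, mirroring the 100% detection observed for filter injections.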
Beam testing:
To accurately quantify the vulnerability improvement, we conducted accelerated-particle beam experiments as outlined in Section 5.4. For the on-chip ECC Off experiments, the results show a clear SDC FIT rate reduction trend for the FC and FIC techniques. We observed some SDCs when the FC technique was employed. Upon inspecting the SDCs that were not detected by the FC technique, we found the output to be corrupted such that the values in the original fmaps, when reduced for comparison to the checksums, resulted in no error detection (both the original output and checksum values were corrupted). Such manifestations indicate corruption of the input fmap, which is not covered by this technique. With on-chip SRAM (register file, L1/L2 cache) ECC/parity protection enabled, the likelihood of such errors will be low. We observed no SDCs while running convolutions with the FIC technique. The ABED techniques detected a few errors when no SDC was observed; these extra errors could be the result of a fault in the verification kernel. Figure 13 shows the FIT rate improvement results with on-chip ECC Off. With on-chip ECC On, we expect both the FC and FIC techniques to offer high and comparable protection. Due to the cost of experimentation and the limited availability of beam time, we verify this only for FIC. In this experiment, the FIC technique detected all observed SDCs. The error bars for this experiment were relatively large, however, due to the limited beam time and the relatively lower SDC rate of the GPU (compared to on-chip ECC Off).
DISCUSSION: MANAGING ROUNDING ERROR
While commercial systems increasingly use fixed-point data types during inference, the use of the half-precision floating-point data type (fp16) is also common. All the techniques described in this paper are applicable to fp16 convolutions. Due to the non-associative nature of floating-point operations, the final comparison cannot be exact. We can instead test whether the two reduced values, computed in different ways, are close enough using a threshold. Corruptions are detected if the error changes the values such that the difference is greater than the threshold; the lower the threshold, the higher the coverage. The threshold depends on the rounding error introduced by the ABED tasks (e.g., checksum generation) and the baseline convolution. We explored methods to significantly reduce the error introduced by the ABED tasks. Since the filter checksums are generated offline, very high precision operations can be used to reduce their rounding error. Most architectures support accumulators that use higher precision than the inputs (e.g., fp32 accumulation for fp16 inputs is common); leveraging such hardware features, the error in online input fmap checksum generation can be reduced. For the FC technique, the resulting checksums are stored with the filters in fp16 format. The checksums can be stored as multiple fp16 values (or filters) such that the error introduced by rounding to fp16 is eliminated. For the FIC technique, the checksum can be stored in an fp32 data type. The output verification step can also use a higher-precision accumulator to reduce the rounding error. We estimated the error introduced by the checksum generation and verification steps (not shown here) and found it to be much smaller than the average error introduced by the convolution. Since the error introduced by the baseline convolution is challenging to bound (due to varying implementations, algorithms, and input value distributions), we rely on heuristics to estimate the average rounding error.
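A threshold-based check of this kind can be sketched as follows (illustrative numpy; the threshold value in the usage below is arbitrary, chosen for the example rather than derived from a rounding-error model):

```python
import numpy as np

def verify_with_threshold(y_fp16, checksum_fp32, threshold):
    """Reduce the fp16 output in an fp32 accumulator and compare it against the
    independently computed fp32 checksum, tolerating rounding up to threshold."""
    observed = y_fp16.astype(np.float32).sum(dtype=np.float32)
    return abs(float(observed) - float(checksum_fp32)) <= threshold
```

A fault-free output passes the check, while a corruption that perturbs the sum by more than the threshold fails it; errors smaller than the threshold go undetected, which is the coverage trade-off described above.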
For some uncommon input values, the observed error can be higher than the threshold used for error detection, causing check failures in fault-free runs (we call these false positives). The system must recover from false positives to guarantee forward progress. If the rate is low, false positives can be handled using a combination of techniques: (1) rerun the layer on a different component (CPU/GPU/DLA), incurring higher latency, or (2) notify a higher layer in the system to determine whether it can tolerate skipping the inference. A recent study found that many low-level errors are tolerable at the system level [20]. If the false positive rate is high, a diagnostic module can be invoked to tune the threshold or switch to the low-throughput duplication mode.
CONCLUSIONS
CNNs have made their way into safety-critical and HPC systems. GPU- and accelerator-based systems are preferred platforms for CNNs, with the convolution consuming most of their execution time. Since safety is paramount for such systems, it is important to ensure the correctness of convolutions in the presence of hardware errors. This paper proposes an algorithm-based error detection (ABED) solution for convolutions, providing a much lower overhead approach than full duplication. We demonstrate how this solution can be employed during highly optimized CNN inference executions that fuse multiple layers and use reduced-precision operations. Results show that ABED eliminates convolution output corruptions for all studied hardware errors with low (6-23%) runtime overhead, at least 4× lower than full duplication.

REFERENCES

[1] Al-Yamani et al. Performance Evaluation of Checksum-Based ABFT. In
Proc. of the Int. Symp. on Defect and Fault Tolerance in VLSI Systems(DFT) , 2001. [2] S. Alcaide, L. Kosmidis, C. Hernandez, and J. Abella. High-IntegrityGPU Designs for Critical Real-Time Automotive Systems. In Proc.of the Design, Automation & Test in Europe Conference (DATE) , 2019.[3] ARM. Arm A64 Instruction Set Architecture: Armv8, forArmv8-A architecture profile. https://developer.arm.com/docs/ddi0596/latest/a64-sve-instructions-alphabetic-order, 2018.[4] Baidu. Apollo Open Platform. http://apollo.auto, 2019.[5] W. Bartlett et al. Commercial Fault Tolerance: A Tale of Two Systems.
Trans. on Dependable and Secure Computing , pages 87–96, 2004.[6] C. Cazzaniga et al. First Tests of a New Facility for Device-Level,Board-Level and System-Level Neutron Irradiation of Microelectron-ics.
IEEE Trans. on Emerging Topics in Computing , pages 1–1, 2018.[7] J. Chen et al. GPU-ABFT: Optimizing Algorithm-Based FaultTolerance for Heterogeneous Systems with GPUs. In
Proceedingsof the IEEE International Conference on Networking, Architecture andStorage (NAS) , 2016.[8] J. Chung et al. Containment Domains: A Scalable, Efficient, andFlexible Resilience Scheme for Exascale Systems. In
Proc. of the Int.Conf. on High Performance Computing, Networking, Storage and Analysis(SC) , 2012.[9] C. Constantinescu. Intermittent Faults and Effects on Reliability ofIntegrated Circuits. In
Proc. of the Annual Reliability and MaintainabilitySymp. , 2008.[10] F. F. d. Santos et al. Analyzing and Increasing the Reliability ofConvolutional Neural Networks on GPUs.
IEEE Trans. on Reliability ,68(2):663–677, 2019.[11] J. Deng et al. ImageNet: A Large-Scale Hierarchical Image Database.In
Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR) ,2009.[12] M. Dimitrov et al. Understanding Software Approaches for GPGPUReliability. In
Workshop on General Purpose Processing on GraphicsProcessing Units (GPGPU) , 2009.[13] C. Ding et al. Matrix Multiplication on GPUs with On-Line FaultTolerance. In
Proc. of the Int. Symp. on Parallel and DistributedProcessing with Applications (ISPA) , 2011.[14] K. He et al. Deep Residual Learning for Image Recognition.
CoRR ,abs/1512.03385, 2015.[15] K.-H. Huang et al. Algorithm-Based Fault Tolerance for MatrixOperations.
IEEE Trans. on Computers , C-33(6):518–528, 1984.[16] Q. Huang et al. Learning to Prune Filters in Convolutional NeuralNetworks. In
Proc. of the IEEE Winter Conference on Application ofComputer Vision , 2018.[17] Z. Hui and other. Optimized Software-Based Hardening Strategiesfor Matrix Multiplication and Fast Fourier Transform. In
Proc. of the Int. Conf. on Algorithms, Computing and Systems.
[18] ISO 26262: Road Vehicles – Functional Safety. International Organization for Standardization.
[19] X. Iturbe et al. A Triple Core Lock-Step (TCLS) ARM® Cortex®-R5 Processor for Safety-Critical and Ultra-Reliable Applications. In Proc. of the Int. Conf. on Dependable Systems and Networks Workshop (DSN-W), 2016.
[20] S. Jha et al. ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection. In Proc. of the Int. Conf. on Dependable Systems and Networks (DSN), 2019.
[21] N. P. Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In
Proc. of the Int. Symp. on Computer Architecture (ISCA), 2017.
[22] T. Kurth et al. Exascale Deep Learning for Climate Analytics. In Proc. of the Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2018.
[23] G. Li et al. Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications. In Proc. of the Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2017.
[24] S. Li et al. Enabling Sparse Winograd Convolution by Native Pruning. CoRR, abs/1702.08597, 2017.
[25] X. Liang et al. Correcting Soft Errors Online in Fast Fourier Transform. In Proc. of the Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2017.
[26] S.-C. Lin et al. The Architectural Implications of Autonomous Driving: Constraints and Acceleration. In Proc. of the Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018.
[27] A. Mahmoud et al. Optimizing Software-Directed Instruction Replication for GPU Error Detection. In Proc. of the Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2018.
[28] T. Marty et al. Enabling Overclocking through Algorithm-Level Error Detection. In Proc. of the Int. Conf. on Field-Programmable Technology (FPT), 2018.
[29] P. Molchanov et al. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR, abs/1611.06440, 2016.
[30] A. Nardi et al. Functional Safety Methodologies for Automotive Applications. In Proc. of the Int. Conf. on Computer-Aided Design (ICCAD), 2017.
[31] S. F. Nowicki et al. The Los Alamos Neutron Science Center Spallation Neutron Sources.
Physics Procedia
Proc. of the Design Automation Conference (DAC).
[42] S. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. CoRR, abs/1510.00149, 2015.
[43] K. Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition.
CoRR, abs/1409.1556, 2014.
[44] M. Snir et al. Addressing Failures in Exascale Computing. The Int. Journal of High Performance Computing Applications, 28(2):129–173, 2014.
[45] V. Sze et al. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. of the IEEE, 105(12):2295–2329, 2017.
[46] Tesla. Tesla Autonomy Investor Day — Tesla, Inc. https://ir.tesla.com/events/event-details/tesla-autonomy-investor-day, 2019.
[47] J. Wadden et al. Real-World Design and Evaluation of Compiler-Managed GPU Redundant Multithreading. In Proc. of the Int. Symp. on Computer Architecture (ISCA), 2014.
[48] P. Wu et al. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. In