AccSS3D: Accelerator for Spatially Sparse 3D DNNs

Om Ji Omer∗§, Prashant Laddha∗, Gurpreet S Kalsi∗, Anirud Thyagharajan∗, Kamlesh R Pillai∗, Abhimanyu Kulkarni†, Anbang Yao‡, Yurong Chen‡, Sreenivas Subramoney∗

∗Processor Architecture Research Lab, Intel Labs, Bangalore
†University of Wisconsin - Madison, work done while at Intel
‡Intel Labs, China
§Corresponding Author: Om Ji Omer, [email protected]
Abstract—Semantic understanding and completion of real-world scenes is a foundational primitive of 3D visual perception widely used in high-level applications such as robotics, medical imaging, autonomous driving and navigation. Due to the curse of dimensionality, compute and memory requirements for 3D scene understanding grow in cubic complexity with voxel resolution, posing a huge impediment to realizing real-time energy-efficient deployments. The inherent spatial sparsity present in the 3D world due to free space is fundamentally different from the channel-wise sparsity that has been extensively studied. We present ACCELERATOR FOR SPATIALLY SPARSE 3D DNNs (AccSS3D), the first end-to-end solution for accelerating 3D scene understanding by exploiting the ample spatial sparsity. As an algorithm-dataflow-architecture co-designed system specialized for spatially-sparse 3D scene understanding, AccSS3D includes novel spatial locality-aware metadata structures, a near-zero-latency and spatial sparsity-aware dataflow optimizer, a surface-orientation-aware pointcloud reordering algorithm and a co-designed hardware accelerator for spatial sparsity that exploits data reuse through systolic and multicast interconnects. The SSpNNA accelerator core together with the 64 KB of L1 memory requires 0.92 mm2 of area in a 16nm process at 1 GHz. Overall, AccSS3D achieves a 16.8x speedup and a 2232x energy-efficiency improvement for 3D sparse convolution compared to an Intel i7-8700K 4-core CPU, which translates to an 11.8x end-to-end 3D semantic segmentation speedup and a 24.8x energy-efficiency improvement (iso technology node).
I. INTRODUCTION
Understanding the 3D geometry and semantics of a scene is essential to many real-world systems including, but not limited to, autonomous driving, robotics, remote sensing, AR/VR and medical treatment [6], [16], [31], [34], [62]. With rapid growth in 3D acquisition technologies, 3D sensors are becoming increasingly available and affordable for rich data generation through various types of 3D scanners, LiDARs and RGB-D cameras. 3D data is usually represented in various formats like pointclouds, meshes, depth maps and volumetric grids. Deep Learning techniques have disrupted traditional methods in domains such as computer vision, speech processing and machine translation that operate over images, videos, audio, text and other forms of data. However, DL methods face severe challenges in processing 3D visual data due to the high dimensionality and the unstructured nature of 3D data. Several methods (Figure 1) have been proposed for various 3D visual AI applications such as shape classification,
Fig. 1. Taxonomy of 3D visual data processing (3D shape classification; 3D object detection and tracking; 3D segmentation and completion, covering semantic, instance and part segmentation and scene completion), courtesy: [20]

Fig. 2. 3D Semantic segmentation, annotating voxels with labels [47]

object detection, tracking, and scene segmentation. 3D scene segmentation requires extraction of the global geometric structure and the intricate details of each point in the 3D pointcloud. Semantic segmentation (Figure 2) aims to divide a pointcloud into several subsets based on real-world semantics. Instance segmentation identifies instances of each semantic object, while part segmentation further breaks each instance into different parts. Scene completion predicts and labels missing regions arising from sensing inaccuracies or occlusions. Point-based networks (Table I) directly operate over irregular, orderless and unstructured pointclouds, making it infeasible to directly apply standard CNNs. The performance of multi-view 2D-projection based methods is highly sensitive to viewpoint selection, and these methods also do not fully exploit the underlying 3D geometry. Since the volumetric representation preserves the neighbourhood structure of the pointcloud and regular data formats allow direct application of standard 3D convolutions, volumetric-projection based methods provide higher accuracy. However, voxelization introduces discretization artifacts and causes information loss. A low-resolution voxel representation can degrade accuracy, whereas high-resolution pointclouds cubically grow compute and memory requirements (Figure 5). This poses a huge impediment to achieving real-time, real-world deployments of visual AI that fundamentally require low-latency and low-energy 3D scene understanding capability. Thankfully, the ample free space in 3D scenes provides significant opportunity to exploit the inherent spatial sparsity. For dense DNNs, multiple accelerator designs [17], [22],
TABLE I
COMPARISON OF SEMANTIC SEGMENTATION METHODS ON SCANNET [11]

Point based:
  Pointwise MLP      | PointNet++ [43]     | 33.9
                     | PointSIFT [28]      | 41.5
  Point Convolution  | PointCNN [32]       | 45.8
                     | KPConv [51]         | 68.4
  Graph based        | HPEIN [27]          | 61.8
Projection based:
  Volumetric ⋆       | SparseConvNet [18] ⋆
                     | MinkowskiNet [10] ⋆
  Multiview          | TangentConv [49]    | 40.9
  Perm. Lattice      | LatticeNet [44]     | 64.0
  Hybrid             | UPB [9]             | 63.4

[25], [41], [63] have been proposed to exploit channel-wise sparsity in weights and inputs. They focus on identifying valid pairs of non-zero weights and input activations while operating over a compressed data structure. We observe two fundamental attributes of spatial sparsity as present in 3D visual analytics that differentiate it from channel-wise sparsity: 1) the selection of a non-zero pair of weights and inputs depends solely on local 3D spatial sparsity, and 2) the operation per selected pair is of the matrix-to-vector type, as compared to the scalar-to-scalar type in channel-wise sparsity. The first motivates a different hardware architecture to enable input-weight pair selection based on the variable local sparsity per voxel. The second further offers the opportunity to exploit higher efficiency through coarser dispatches of work to the compute units, quite unlike the prior accelerators that hunt for fine-grained sparsity across channels.
We will show in Section III that a custom-built sparse engine is required for efficient acceleration of spatially-sparse applications such as 3D visual analytics, as they have distinct hardware requirements.
Optimized dataflows, as explored in prior art [7], [8], [15], [21], [30], [58], focus on dense DNNs and enable tiled execution with optimal tile sizes and loop orders to exploit data locality in on-die memories. It is worth noting that the execution attributes of dense DNNs do not carry over to spatially sparse workloads. Hence specialized dataflow optimizers that explicitly exploit the spatial sparsity structure are warranted. In addition, they must achieve compute efficiencies similar to dense DNNs without the latency overheads incurred by on-the-fly processing.
A. Our Contribution
Driven by a detailed end-to-end performance characterization of the Sparse Convolution Network (SCN), we present ACCELERATOR FOR SPATIALLY SPARSE 3D DNNs (AccSS3D), the first end-to-end solution for accelerating 3D scene segmentation by exploiting spatial sparsity. In brief, our key contributions are as follows:

1. We introduce a novel locality-aware metadata structure, COIR, for voxel data that stores spatial locality information by encoding active receptive fields at each location. We also present a novel pointcloud reordering technique, SOAR, that maximizes spatial data locality and reuse across all DNN layers for multi-level memory architectures.

2. We present (to the best of our knowledge) the first-ever sparsity-aware dataflow optimizer, SPADE, that analyzes the local geometry structure of a pointcloud and uses the sparsity attributes to maximize data reuse across all DNN layers. To meet real-time requirements, we enable latency-critical dataflow optimizer routines to be run in offline mode while retaining the benefits of the sparsity-aware dataflow.

3. We present a novel sparse accelerator for spatially-sparse DNNs, SSpNNA, whose sparse scheduler maximizes weight reuse. Systolic and multicast interconnects in the PE arrays maximize input feature-map reuse. We also propose AdMAC, an accelerator for fast 3D neighbourhood search and COIR metadata creation, and CAROM, a joint dataflow optimization technique for multi-level memory hierarchies. We show how a scaled-up architecture based on AccSS3D accelerates 3D sparse convolution by 16.8x compared to a 4-core CPU baseline with a 2232x improvement in energy efficiency.

II. BACKGROUND
Recent algorithmic advances [13], [36], [55] extended 2D CNNs to 3D to recognize 3D objects and enabled semantic segmentation of real-world 3D scenes. However, due to the curse of dimensionality, compute and memory requirements grow in cubic complexity with voxel resolution. Free space in 3D scenes is a source of inherent spatial sparsity, offering high opportunity for efficient processing. Recent algorithms [18], [42], [43] exploit this spatial sparsity by processing only those voxels (active voxels) that are in the vicinity of a surface boundary. They transform the 3D sparse data into a list of active voxels for compressed storage and use a map [26], [50] to retrieve lists of indices based on 3D coordinates. Graham et al. [19] introduced the Sparse Convolution Network (SCN) using the novel Valid Sparse Convolution (VSC), which provides a significant reduction in compute and memory costs by considering output voxels as active only if the corresponding input voxel is active. This retains the sparsity structure of feature maps across the network and paves the way for achieving real-time complex 3D scene understanding.
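As a rough illustration of the VSC rule described above, the sketch below keeps outputs active only at input-active coordinates and looks up neighbours through a coordinate hash map. This is a minimal reference in Python, not the paper's implementation; the function name, the dict-based hash map and the dense per-offset weight layout are all illustrative assumptions.

```python
import numpy as np

def valid_sparse_conv(coords, feats, weights, kernel=3):
    """Valid Sparse Convolution (VSC) sketch: an output voxel is active
    only if the voxel at the same coordinate is active in the input, so
    the sparsity pattern is preserved across layers.
    coords : list of (z, y, x) active voxel coordinates
    feats  : (V, C) input features, one row per active voxel
    weights: (kernel**3, C, N) one CxN matrix per kernel offset
    returns: (V, N) output features at the same active coordinates
    """
    V, C = feats.shape
    N = weights.shape[2]
    # hash map: 3D coordinate -> row index in the compressed feature list
    index = {tuple(c): i for i, c in enumerate(coords)}
    r = kernel // 2
    out = np.zeros((V, N))
    for i, c in enumerate(coords):              # output sites == input sites
        for w, (dz, dy, dx) in enumerate(np.ndindex(kernel, kernel, kernel)):
            nb = (c[0] + dz - r, c[1] + dy - r, c[2] + dx - r)
            j = index.get(nb)                   # only active neighbours contribute
            if j is not None:
                out[i] += feats[j] @ weights[w]  # matrix-vector work per pair
    return out
```

Note that the work per selected (input, weight) pair is a matrix-vector product over the channels, which is exactly the coarse-grained dispatch opportunity highlighted in Section I.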
Zhang et al. [61] extended SCN to use Spatial Group Convolution, dividing the pointcloud into groups, further reducing compute without accuracy loss. Several systems employ encoder-decoder U-net topologies [12], [14], [33], [45], [46], [53], [54], [57], [59], [60], with layers that reduce spatial resolution (strided convolutions) followed by upsampling layers (deconvolutions), interspersed with VSC layers.

III. MOTIVATION
A. Increasing complexity of 3D Visual processing
Over the years, compute/memory requirements for 3D visual processing have increased multi-fold (Table II). While 3D volumetric representations enable higher accuracy (Table I), a low-resolution representation can hurt accuracy (Figure 5).

Fig. 3. Processing flow for SCN workload
Though compute and memory requirements grow in cubic order with resolution, the free space in 3D scenes, a source of exploitable 'spatial sparsity', motivates a hardware-software co-designed approach to accelerating 3D visual processing.
B. Sparse Convolution Network (SCN) workload profile
Figure 4 shows the runtime CPU profile of SCN [18] as a representative workload. The middle layers take much lower execution time than the initial and last few layers since, as in the U-net, the middle layers operate on the lowest resolution, needing lower compute. As the order of voxel processing can be different from the memory layout, a gather operation for inputs and a scattered write for outputs is required. Overall, Input Gather and Output Write dominate across the layers, due to the weight-stationary dataflow, where input-output index pairs are created and processed for each weight plane independently. These components are especially dominant for the initial and last few layers, where the voxel resolution sizes are higher. Lack of tiled execution and the inability of CPU caches to hold the entire large pointcloud significantly hurt data reuse. We observe that the performance of the workload scales poorly with the number of cores and flattens beyond 4 cores (Figure 4-c). Further analysis shows that thread synchronization (spin locks) and thread creation events (both of which scale with the number of cores) limit scaling beyond this point.
TABLE II
A CHRONOLOGICAL OVERVIEW OF TYPICAL RGB-D/3D DATASETS

Dataset            | Size (classes)       | I/O FP (grid) [18]
NYU Depth V2 [39]  | 1.5k images (894)    | ∼
ModelNet [56]      | 50k models (660)     | ∼
ShapeNet [5]       | 3M models (3135)     | ∼
Matterport3D [4]   | 194k images (40)     | -
ScanNet [11]       | 2.5M frames (40)     | ∼
Waymo [48]         | 113K models (210K)   | ∼

C. The need for Sparsity-aware Dataflow Optimizers
Several dataflow optimizers have been proposed for dense DNNs [7], [8], [15], [21], [30], [58]. Since dense DNNs operate over regular data structures, memory-size requirements and data accesses for each data type can be estimated, and an optimal dataflow can be chosen offline without requiring input data. Since network parameters are usually known prior to input-data arrival, an optimal dataflow can be chosen in offline mode per layer. In contrast, since 3D scenes are typically spatially sparse, a uniform 3D tiling strategy would result in extremely inefficient execution due to excessive memory consumption and uneven work distribution. Though tiling the 1D compressed data structure through a spatial hash map is a concise representation, it has the following drawbacks: 1) the size of the compressed data structure varies per input pointcloud and across different regions within a pointcloud, 2) memory requirements and data accesses cannot be formulated as mathematical expressions, and 3) storing 3D data in an unordered 1D compressed format results in irregular data accesses, as convolution operations need to be performed on spatially proximate points in 3D space.
In Sections IV-A, IV-B and IV-C, we show how specialized sparsity-aware dataflow optimizers can effectively address these challenges.

D. Accelerator Options for 3D Spatial Sparsity
As discussed in Section III-C, storing 3D spatially-sparse data in a 1D compressed format leads to inefficient execution due to irregular data accesses, as convolution operations need to be performed on spatially proximate points in 3D space. We discuss two potential ways to accelerate spatially sparse 3D processing:
1) Generic GEMM-based Acceleration:
Implement 3D sparse convolution as a dense GEMM, using efficient gather/scatter operations and performing GEMM over the regularized data structure, similar to the reference CPU implementation (Figure 3).

Challenges: a) Since the 3D convolution is mapped onto a GEMM engine, input/output features need to be re-fetched as many times as the local receptive field, requiring prohibitively large data transfers and bandwidth. b) Due to input-dependent sparsity, data-specific source/destination rulebook tables need to be created by the host to enable a gather/scatter engine to perform explicit data copies. When local sparsity is medium-to-high and the pointcloud is large, the rulebook can significantly increase size and bandwidth requirements.
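The gather/GEMM/scatter flow of Figure 3 can be sketched as below. This is an illustrative Python rendering of the idea, not the reference code: for every kernel offset the host builds a "rulebook" of (input-row, output-row) pairs, gathers the matching rows, runs a dense GEMM against that offset's CxN weight matrix, and scatter-accumulates the partials. The function name and data layout are assumptions.

```python
import numpy as np

def rulebook_sparse_conv(coords, feats, weights, kernel=3):
    """GEMM-based sparse convolution: per-weight-plane rulebook,
    explicit gather, dense GEMM, scattered accumulate."""
    index = {tuple(c): i for i, c in enumerate(coords)}
    r = kernel // 2
    V, C = feats.shape
    N = weights.shape[2]
    out = np.zeros((V, N))
    for w, (dz, dy, dx) in enumerate(np.ndindex(kernel, kernel, kernel)):
        # rulebook for this weight plane: (input row, output row) pairs
        rules = []
        for i, c in enumerate(coords):
            nb = (c[0] + dz - r, c[1] + dy - r, c[2] + dx - r)
            j = index.get(nb)
            if j is not None:
                rules.append((j, i))
        if not rules:
            continue
        src, dst = zip(*rules)
        gathered = feats[list(src)]           # explicit gather (data copy)
        partial = gathered @ weights[w]       # dense GEMM per weight plane
        np.add.at(out, list(dst), partial)    # scattered accumulate
    return out
```

The challenge listed under a) is visible here: a voxel's features are gathered (copied) once per kernel offset in which it appears, so the traffic grows with the local receptive-field size, and the rulebook itself grows with the pointcloud.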
2) Specialized Acceleration for 3D sparsity: Build a specialized implementation to locate input/output data on-the-fly while performing 3D sparse convolution on a pool of MACs.
Challenges: a) While custom acceleration for 3D sparsity can entirely eliminate the need for explicit metadata/rulebook generation, re-computation of input/output data addresses would be required for each DNN layer with the same resolution, and at every accelerator core in the system. b) The number of lookups into the sparse hash-map required per unit compute is input-sparsity-dependent. So either the convolution engine will suffer significant utilization loss due to the high variability in the local receptive field, or, to cover the variable latency and bandwidth requirement, the local buffering would need to be prohibitively large.

Fig. 4. Performance characterization of SCN [18] inference for a typical ScanNet [11] dataset on an Intel i7-8700K CPU with single-core execution @ 3.7 GHz: (a) Layer-wise runtime and break-up into constituent functions (b) Runtime break-up across all layers: Rulebook Create 8%, Input Gather 24%, Convolution 43%, Output Write (incl. partials) 25% (c) Performance scaling with multi-core execution

Fig. 5. Correlation between compute needs and accuracy (mIoU) as a function of resolution, obtained by downsampling input pointclouds in ScanNet [11].

Fig. 6. SCN performance characterization on an Nvidia 1080Ti @ 1.48 GHz. Though the GPU is processing for most of the time, average GPU core utilization is significantly low at 21% due to high register usage.

We propose in Section IV-D a semi-specialized accelerator for 3D sparsity which can process a compressed metadata structure while adopting a tile-based sparsity-aware dataflow for efficient data movement.

E. Custom Accelerator Requirements for 3D Spatial Sparsity
State-of-the-art sparse accelerators [22], [29], [41], [63] have been proposed for sparse matrix multiply or sparse DNN models. In general, these sparse accelerators are composed of two major blocks: 1) a Front-end, which processes metadata of the input operands (feature maps and weights), locating pairs of non-zero input operands, and 2) a Back-end for performing valid operations (MAC, activation, pooling etc.) through PEs. Proposals such as SparTen [17] also schedule work efficiently such that imbalance between compute engines can be mitigated. We discuss the challenges in adopting prior sparse accelerators for 3D spatially sparse data processing:

A) Front-end: for 3D spatially sparse convolution, valid input pair selection depends solely on whether the voxel is active in the receptive field of the input pointcloud, and not on the weights. Further, the selection logic (Figure 7) significantly
TABLE III
MICRO-OPS AND DATA ACCESSES SAVINGS WITH COARSE LEVEL OF WORK-SCHEDULING

Layer | Type  | Total Ops | Tile Size (∆O, ∆C, ∆N) | uOps   | uOps Savings | Data-access Savings
L2    | SCN   | 1.3e+9    | (28, 16, 32)           | 2.5e+6 | 512x         | 1.93x
L12   | Conv  | 4.4e+8    | (12, 16, 32)           | 8.6e+5 | 512x         | 1.94x
L24   | Dconv | 1.7e+8    | (860, 8, 8)            | 2.7e+6 | 64x          | 1.75x
L35   | SCN   | 5.4e+9    | (212, 8, 16)           | 4.2e+7 | 128x         | 1.88x

differs from the logic in any of these prior sparse accelerators.

B) Back-end: the back-end in these accelerators receives a list of uops, each representing a scalar multiply-and-accumulate operation (MAC). In 3D spatially sparse convolution, each valid input pair selection requires either a) a vector dot product (V-V) or b) a matrix-vector (M-V) multiply, maximizing data reuse when there are multiple output channels (N) in a tile. Hence, scheduling the work at the granularity of M-V is more optimal than dispatching the work per MAC.
Table III shows how an M-V granularity of dispatch achieves a huge reduction in dispatched uops and in data accesses between compute and on-chip memory for a few select layers in the U-net [18].
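Under the assumption that one M-V dispatch covers a full ∆C x ∆N weight sub-matrix (replacing ∆C·∆N scalar MAC uops), the "uOps Savings" column of Table III follows directly from the tile sizes. A quick check, with an illustrative helper name:

```python
def uop_savings(dC, dN):
    # One matrix-vector dispatch covers a dC x dN weight sub-matrix,
    # replacing dC * dN scalar MAC uops, so the uop reduction is dC * dN.
    return dC * dN

# Reproducing the 'uOps Savings' column of Table III from the tile sizes:
assert uop_savings(16, 32) == 512   # L2  (28, 16, 32)
assert uop_savings(16, 32) == 512   # L12 (12, 16, 32)
assert uop_savings(8, 8) == 64      # L24 (860, 8, 8)
assert uop_savings(8, 16) == 128    # L35 (212, 8, 16)
```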
We use 64 KB of on-chip memory with optimal tile-size and dataflow selection to minimize data accesses between on-chip and off-chip memory. We thus present a sparse accelerator for spatially-sparse DNNs that consists of a) a sparse-scheduler-based Front-end to process custom-compressed 3D metadata and b) a Back-end that dispatches matrix-vector operations, which maximizes data reuse and minimizes the dispatched uops.

IV. HARDWARE-SOFTWARE CO-DESIGNED ACCELERATION FOR 3D SPATIAL SPARSITY
A. Locality aware Data-structure (COIR)
Workload analysis in Section III-B provided the following insights: 1) performing convolution for all the voxels together in the receptive field (i.e. neighbourhood points) can reduce data accesses for input/output feature maps, and 2) storing all input point indices per output point (or vice versa) can achieve metadata compression compared to the weight-wise listing of in-out pairs [18]. Metadata size savings can be significantly higher for denser pointclouds. Inspired by these insights, we propose a novel metadata structure (COIR) with two flavors: 1) Compressed Output Response Field (CORF) and 2) Compressed Input Receptive Field (CIRF) (Figure 7). In CORF, each metadata entry corresponds to one unique point in the input space and consists of 1) the index of the input point and 2) the indices of all the points in the output space in the response field of the input point. In addition, the relative location of each neighbour is required to select the appropriate weight index for the convolution operation. For this, a weight bit-mask is stored for each entry, with '1's indicating valid neighbours and bit locations indicating the corresponding weight indices. Similarly, each CIRF entry contains the index of a unique output point, the indices of all the input points in the receptive field required to compute the feature map for the output point, and bit-masks for weight index selection. The two flavours of metadata are motivated by the observation that for layers with a resolution change (input and output space having different resolutions), the input-to-output mapping is not one-to-one. Hence, based on the sparsity in the pointcloud and the type of convolution (upscaling or downscaling), picking the right flavor can provide higher compression and data savings over the other.

Fig. 7. Two-dimensional illustration of the two flavors of the COIR metadata structure for the submanifold convolution operator. With 3D visual data, input/output activations and weights would be in the form of cuboids (instead of planes).

Fig. 8. An illustration of dividing the points of a pointcloud into size-constrained chunks and reordering the points in each chunk through SOAR.
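The CIRF flavor described above can be sketched as a small record type. This is an illustrative Python rendering, not the hardware layout; the class name and the bit-numbering convention (bit k of the mask corresponds to the k-th of the 27 offsets of a 3x3x3 kernel) are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CIRFEntry:
    """One COIR entry in the CIRF flavor: a unique output point, the
    compressed-row indices of all active input points in its receptive
    field, and a 27-bit mask whose set bits give each neighbour's
    weight index (3x3x3 kernel -> 27 possible offsets)."""
    out_index: int
    in_indices: List[int]   # one entry per '1' bit, in bit order
    weight_mask: int        # bit k set => k-th kernel offset is active

    def weight_indices(self) -> List[int]:
        # Decode the bit-mask back into kernel-offset (weight) indices.
        return [k for k in range(27) if (self.weight_mask >> k) & 1]

# A voxel with two active neighbours: the centre offset (bit 13) and the
# +x offset (bit 14) of a 3x3x3 kernel.
e = CIRFEntry(out_index=7, in_indices=[7, 8],
              weight_mask=(1 << 13) | (1 << 14))
assert e.weight_indices() == [13, 14]
```

The compression relative to weight-wise (in, out) pair lists comes from storing each output index once per entry plus one bit per kernel offset, rather than one full index pair per active (offset, output) combination.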
B. Point-cloud Reordering (SOAR)

COIR improves data reuse of input points with CORF (or output points with CIRF) by performing convolution across multiple neighbours per element access, but it does not ensure data reuse of neighbours across multiple input or output points. To maximize data reuse for neighbours as well, entries in the metadata need to be ordered such that entries with shared neighbours are co-located in the metadata structure and processed in close temporal vicinity.
To achieve this, we propose Surface Orientation Aware Reordering (SOAR) of the pointcloud. We first create an Adjacency Map with the voxel index as key and a list of indices to all its neighbours as value. Given as input the maximum number of voxels for which data can fit in the on-chip memory, SOAR divides the entire pointcloud into multiple chunks, each chunk composed of ordered voxels obeying the maximum-voxels constraint. As shown in Figure 8, SOAR constructs an m-ary tree with each node representing a voxel and an edge connecting two neighbouring voxels, such that each child node connects to only one parent even if there is more than one neighbour at the parent's level, though a parent node can connect to one or more neighbours as child nodes. To begin, the voxel with the minimum number of neighbours is selected as the root node, representing a corner in the pointcloud. SOAR pushes all its neighbours to a Neighbour Queue and pops out voxels one by one. If a voxel was already selected in a previous chunk, it is dropped; otherwise it is inserted into the m-ary tree as a child node of its first neighbour in breadth-first order. The inserted voxel is then tagged as selected and all its neighbours are added to the Neighbour Queue. This process continues until the number of voxels in the m-ary tree reaches the provided threshold, upon which the m-ary tree is emitted as the desired chunk, with voxels in breadth-first order. The root node for the next chunk is selected from the voxels in the Neighbour Queue with the minimal number of neighbours, and then the queue is flushed. The process completes when all the voxels in the pointcloud have been divided into chunks.

Fig. 9. Tiling the 1D compressed data structure for 3D spatially sparse data
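The chunk-construction procedure can be sketched as follows. This is a simplified illustration rather than the paper's exact algorithm: it seeds each new chunk with the globally least-connected unselected voxel, whereas SOAR draws the next root from the previous chunk's Neighbour Queue; the function name is an assumption.

```python
from collections import deque

def soar_chunks(adjacency, max_voxels):
    """Divide a voxel graph into BFS-ordered chunks of at most
    max_voxels voxels, growing each chunk from a 'corner' voxel
    (the one with the fewest neighbours). `adjacency` maps a voxel
    index to the list of its active neighbours."""
    selected, chunks = set(), []
    remaining = set(adjacency)
    while remaining:
        # root = unselected voxel with the fewest neighbours (a corner)
        root = min(remaining, key=lambda v: len(adjacency[v]))
        chunk, queue = [root], deque(adjacency[root])
        selected.add(root)
        while queue and len(chunk) < max_voxels:
            v = queue.popleft()
            if v in selected:
                continue            # already placed in an earlier chunk
            chunk.append(v)         # child of its first-seen BFS parent
            selected.add(v)
            queue.extend(adjacency[v])
        chunks.append(chunk)
        remaining -= selected
    return chunks
```

Because each chunk is a breadth-first traversal of neighbouring voxels, entries that share neighbours end up adjacent in the emitted order, which is the temporal co-location the metadata ordering needs.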
C. Sparsity Aware Dataflow (SPADE)

Section III-C highlighted two key challenges of tiling 3D spatially sparse data: 1) the memory requirement varies highly due to input-dependent spatial sparsity, and 2) data accesses cannot be estimated through mathematical expressions due to the irregular data structures. In this section, we propose a framework for dataflow exploration for spatially sparse data processing.

For a layer L, let the total number of input voxels, output voxels, filter size, input channels, output channels and metadata size be represented as I, O, K, C, N and M respectively. If the ∆'s correspond to the respective values in a given tile T (Figure 9), then the tile size ∆T can be expressed as:

∆T = (∆I · ∆C) + (∆O · ∆N) + (∆K · ∆C · ∆N) + ∆M    (1)

For a number of output voxels (∆O) in the tile, the required number of input voxels and the metadata size can be obtained per region (R_i^∆O) as:

∆I = f_I(R_i^∆O, ∆O),   ∆M = f_MO(R_i^∆O, ∆O)    (2)

The tile size ∆T, therefore, can be formulated as a complex function of ∆O, ∆N and ∆C. There are two possible ways of tiling: 1) dynamic tiling and 2) static tiling. In dynamic tiling, the data-fetch module keeps fetching additional input feature maps as required for computing every new ∆O. Note that to enable this, ∆N and ∆C need to be known prior to execution and, therefore, a joint optimization of all tiling parameters cannot be performed. A solution would be to process the metadata before the DNN execution, extracting the region information (f_I, f_MO) for every region, and empirically choose the optimal tiling candidates by iterating over all possible (∆C, ∆N) pairs. This would explode the dataflow exploration space and require multiple passes of metadata processing, significantly degrading latency in real-time scenarios. Thus, we propose to decouple the extraction of region information in the form of sparsity attributes and then perform dataflow exploration using these attributes via an analytical framework. We compute the sparsity attributes, SA_I and SA_MO, as functions of ∆O over all regions (Eqn. 3). Note that SA_MO represents the Average number of voxels in the Receptive Field (ARF) for a given region, while SA_I takes the form (1 + β), where β reflects the fraction of voxels at the region boundary.

SA_I(R_i^∆O, ∆O) = f_I(R_i^∆O, ∆O) / ∆O,
SA_MO(R_i^∆O, ∆O) = f_MO(R_i^∆O, ∆O) / ∆O    (3)

Fig. 10. SPADE Dataflow Exploration Framework: Sparse Hash Creation, Metadata Generation, Point Cloud Reordering and Sparsity Attribute Generation feed Tile Candidates Generation and the Optimal Dataflow Solver (data-access minimization), which takes the input pointcloud and layer parameters and produces the optimal dataflow.

We adopt static tiling with two methods: 1) Strict Static Tiling (SST) and 2) Relaxed Static Tiling (RST). In SST, the SA_I and SA_MO with the highest value across all regions are picked for each ∆O to ensure worst-case tile-size allocation, while under-utilizing the on-chip memory for quite a few tiles. In RST, to improve memory utilization, we use the n-th quantile of SA_I and SA_MO such that the majority of tiles fit within the on-chip memory. The tiles overshooting the on-chip memory are split into two (or the next power of two) such that each sub-tile fits well into the on-chip memory. To calculate data transfers and operations, we use SA_I^Avg and SA_MO^Avg (Eqn. 4), averaging the sparsity attributes over all the regions (R^∆O) for each ∆O.

SA_I^Avg(∆O) = ( Σ_{R_i^∆O ∈ R^∆O} SA_I(R_i^∆O, ∆O) ) / |R^∆O|,
SA_MO^Avg(∆O) = ( Σ_{R_i^∆O ∈ R^∆O} SA_MO(R_i^∆O, ∆O) ) / |R^∆O|    (4)

Similar to dense DNNs, we choose between three types of walk patterns (WP): 1) Input Stationary (IS), 2) Output Stationary (OS) and 3) Weight Stationary (WS). Note that Eqns. 2, 3 and 4 can also be expressed as functions of ∆I, computing ∆O and ∆M with the CORF metadata structure and ARF representing the Average Response Field. Using the analytical framework (Figure 10), SPADE explores the entire dataflow design space to arrive at the best tile size, walk pattern and metadata structure (MD) for each layer in the network, given an input pointcloud. SPADE's analytical framework minimizes the data accesses DA (Eqn. 5) between the on-chip memory and the off-chip memory over the entire dataflow design space D = {(T, WP, MD)}:

DA(D) = F_WS(WP, ⌈O/∆O⌉) · (C · N · K)
      + F_IS(WP, ⌈N/∆N⌉) · ( SA_I^Avg(∆O) · O · C )
      + F_OS(WP, ⌈C/∆C⌉) · ( O · N + SA_MO^Avg(∆O) · O )
where F_X(Y, Z) = 1 if (Y = X), else Z    (5)

D. 3D-Sparse-NN-Core Micro-architecture: SSpNNA
Section III-E described the distinctive requisites of a customaccelerator for 3D spatial sparsity, especially contrasted withprior sparse accelerators. We describe in this section, themicroarchitecture of the S S p NNA ( S patially SP arse N eural N etwork A ccelerator) core which lies at the heart of ourAccSS3D solution. The primary requisite for the hardwareaccelerator (HWA) is to process a tile (Figure 9), convert fromsparse to dense representation by restructuring the tile data inthe Front-end and perform dense compute in the Back-end. Top-level Overview:
The S S p NNA
HWA is comprised oftwo major blocks (a)
WAVES Front-end - W eight plane based A ctive V oxel E xecution S cheduler (b) SyMAC Back-end - S ystolic and M ulticast based MAC C omputation in Figure11. Metadata (MT) is partitioned into header and feature data,where header stores C
OIR bit-masks and feature data stores thefeature indices and its data. HWA also has a memory arbiter, aconfiguration and a control block. The Global event controllerinitiates execution after loading the L1 and configuring theHWA. Upon start, the WAVES scheduler formats the metadataand then triggers SyMAC for channel-wise computation andoutput element accumulation, during which, WAVES startsworking on formatting the next set of data.The
WAVES Front-end rearranges the spatially distributedvoxels along weight plane (Figure 11-a). Its subblock MTHDR Processor fetches weight mask and corresponding IFMand OFM indices. By using smart-lookup, it finds weight indexfor 4 active neighbor voxels per cycle. The HDR Format blockforms tuples by grouping 4 features per weight plane, using 27blocks to manage all weight planes together. Link-List basedbuffer is used to provide dynamic memory space allocationfor weight planes, enabling more storage for weights withmore active voxels. This design choice was motivated by theobservation that fixed allocation of resources per weight planeleads to significant under-utilization, since it depends on
ARF .By allocating 1 FIFO per weight plane for tuple storage, higheroccupancy for all the FIFOs cannot be achieved. This under-utilization can be visualized as a wavy line touching the topelement in each FIFO. The Link-List design helps increase theutilization by dynamic allocation of more resources to planeswith higher number of active neighboring voxels, allowing usto accommodate 1.5X-2X more metadata lines in the samesize of memory internal to the S S p NNA .The
The SyMAC back-end increases data reuse within the HWA by connecting multiple compute blocks systolically, multicasting input features to a set of PEs, and accumulating data within the PEs according to the channel length. Figure 11(b) shows three options for systolic groups, Ⓐ, Ⓑ and Ⓒ, which can be dynamically selected for weight-plane grouping in WAVES. The ACC OFM block (Figure 11(b), ⑥) has local buffering with cache-lookup capability, which reduces memory transactions for elements that hit in the cache. Figure 11(c) shows a DeNN block in which input features are buffered and reused for all output channels by convolving them with different weights. The PEs are implemented as a tree structure; they perform dot-products on IEEE 754 full floating-point numbers and accumulate locally along the input channels.

Fig. 11. Micro-architecture block diagram of SSpNNA: (a) WAVES groups features per weight plane and stores tuples into the link-list index queue; (b) SyMAC connects DeNNs as groups of systolic arrays along the weight data, and ⑥ performs caching and local accumulation on hits; (c) a DeNN block in which IC-data is buffered, reused among multiple output channels, and multicast to all the PEs.

Fig. 12. Micro-architecture block diagram of AdMAC: Ⓐ is the common block streaming voxels from memory, Ⓑ prepares the adjacency map to help Ⓒ create metadata with minimal memory reads.

The 4-DeNN configuration with 4 PEs per DeNN computes 4 elements per PE per cycle, allowing SSpNNA to support 64 multiply operations per cycle. Changing the SSpNNA configuration to 8 DeNNs, working as two systolic groups of 4 DeNNs each, doubles throughput to 128 multiply operations per cycle without requiring any additional memory ports for weights.
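The peak-throughput figures above follow directly from the configuration parameters; a small sanity-check sketch (function and parameter names are ours, not from the paper):

```python
def muls_per_cycle(num_denns, pes_per_denn=4, elems_per_pe=4):
    # Peak multiply throughput: each PE computes `elems_per_pe` elements per
    # cycle, and each DeNN carries `pes_per_denn` PEs.
    return num_denns * pes_per_denn * elems_per_pe
```

A 4-DeNN configuration gives 4 x 4 x 4 = 64 multiplies per cycle; doubling to 8 DeNNs gives 128, matching the numbers quoted above.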
Through conscious design choices we reduce the bandwidth requirement by 68%: (a) local input-buffer sharing and the systolic weight connection reduce the bandwidth requirement by 37% and 25% respectively, and (b) local accumulation within the PEs for ∆C and ∆N greater than 15 provides another 6% reduction.
E. Adjacency Map Accelerator Micro-Architecture: AdMAC
Section IV-B described the requirements for reordering the pointcloud, and Section IV-C (Figure 9) described the metadata structure for tiling. In this section we describe the micro-architecture of AdMAC, the Adjacency map and Metadata Accelerator core, shown in Figure 12. Input voxels (x, y, z) are fetched serially by block Ⓐ in Figure 12. Block Ⓑ creates a lookup table for all the active voxels. For faster lookup, AdMAC maintains a hierarchical lookup table: at the first level it encodes active voxels at a coarser granularity (voxel 3D groups), and at the second level it stores the active information and the corresponding memory address per voxel. Voxel information is stored in an 8-banked memory, where the bank ID is encoded using {y[2], z[1:0]}. Within a bank, voxels are hashed so that each 64-byte memory read provides information for 16 voxels as per {y[1:0], x[1:0]} addressing. This specific hashing helps in reading the 26 neighboring voxels in a single cycle, with the exception of boundary voxels. Block Ⓒ creates the adjacency list for the voxels in memory using the sparse hash from Ⓑ. Its sub-blocks ①, ② and ③ compute the addresses of neighbors, read the voxel information, and write to memory after packing the data as per the metadata structure.
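A hypothetical software model of this voxel hashing (helper names are ours; boundary handling is simplified): the 3-bit bank ID is {y[2], z[1:0]}, and the 4-bit in-line offset {y[1:0], x[1:0]} means one 64-byte read returns a 4x4 (x, y) patch of voxel entries.

```python
def bank_id(x, y, z):
    # 8 banks selected by {y[2], z[1:0]}.
    return (((y >> 2) & 1) << 2) | (z & 0b11)

def line_offset(x, y, z):
    # 16 voxels per 64-byte line, addressed by {y[1:0], x[1:0]}.
    return ((y & 0b11) << 2) | (x & 0b11)
```

Because z[1:0] feeds the bank ID, the three z-neighbors of a voxel always land in three different banks, which is part of what lets the 26-neighborhood be served in one cycle away from tile boundaries.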
V. SCALE-UP ARCHITECTURE FOR 3D SPATIAL SPARSITY
In this section, we present ACCELERATOR FOR SPATIALLY SPARSE 3D DNNs, an architecture (Figure 13) targeting 3D spatially sparse DNN applications.
A. Overall Architecture

1) On-die memory architecture: Multi-level memory hierarchies have been widely adopted, as they allow high-bandwidth short-distance reuse at the smaller inner memory levels with lower latencies, while longer-distance reuse is captured at higher capacity by the outer memory levels. The two levels of memory in the AccSS3D architecture are managed as scratchpads by software to orchestrate the 3D sparse-CNN execution following the optimal directives of SPADE. A scratchpad-based architecture with multiple levels of memory requires compute and data transfers to be effectively synchronized to maximize compute and bandwidth utilization. As a result of the locality-aware tile-based execution described in Section IV-C, there is significantly higher space sensitivity at L1 than at L2: when processing a similar number of voxels, the unique-OFM to unique-IFM ratio is typically highly skewed (away from 1) at L1 compared to L2, making the reuse opportunity at L1 much more sensitive to memory size. This precludes double-buffering at L1.
Hence, we choose distinct compute and data-exchange phases as in [52]. All data transfers between the shared L2 and an SSpNNA core's L1 are blocked while the core is active, and the SSpNNA core is idled while data is being exchanged between its L1 and the shared L2.

2) Scaled-up multi-sparse-NN core architecture: To compensate for the increased latency due to the sequential phases, we adopt an overlapped tile-execution model across the multiple SSpNNA cores in our scaled-up architecture (Figure 14-a). Using a shared bus at the L1-L2 interface, tile data is transferred between the shared L2 and one SSpNNA core's L1 memory while the other cores are in their compute phases, and this continues in round-robin fashion as each core enters its data-exchange phase.

Fig. 13. Proposed architecture for spatial sparsity: ACCELERATOR FOR SPATIALLY SPARSE 3D DNNs.

3) Data transfers between L1, L2 and DRAM: We employ a DMA-based architecture to accomplish data transfers in the data-exchange phases. DMAs are triggered through a global hardware-based event controller, with software specifying data-movement details such as source/destination addresses and the number of bytes through the DMA engines' tables. We provision two DMA engines, one each for the L1-L2 and L2-DRAM interfaces. Based on the selected metadata structure (CORF or CIRF), one of the datatypes between IFM and OFM is accessed in order, while the other can be unordered (see Figure 9). For the datatype with ordered accesses, we use a block transfer programmed as a single DMA table entry for the entire tile; for the other datatype, we use DMA table entries at per-voxel granularity. Since weights are dense, a block transfer is sufficient. DMA transfers for all tiles of a layer are chained in a pre-selected order, limiting CPU/software intervention to once per layer.

4) Load balancing with multiple cores: Though SPADE ensures uniform tiling (equal ∆N, ∆C, and ∆O or ∆I), operations per tile can differ due to region-dependent sparsity (varying SA_MO). Asymmetric core-execution times and data-transfer periods can introduce idle periods for both the processing cores and the DMA engine (Figure 14b, left). To address this, we sort all the spatial tiles by 'Ops-per-tile' in descending order. Based on the parallelism along N, cores are grouped, with each group processing different N-tiles. The sorted spatial tiles are scheduled in order and in round-robin fashion on the cores of each group, maximizing concurrent execution (Figure 14-b, right).
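The scheduling policy above can be sketched as a tiny helper (a hypothetical software model; the hardware dispatch is not specified at this level of detail): tiles sorted by operation count in descending order are dealt round-robin to the cores of a group, so the heaviest tiles are spread across cores first.

```python
def schedule_tiles(tile_ops, num_cores):
    # Sort tile indices by 'Ops-per-tile', heaviest first.
    order = sorted(range(len(tile_ops)), key=lambda t: tile_ops[t], reverse=True)
    # Deal tiles round-robin to per-core queues.
    queues = [[] for _ in range(num_cores)]
    for i, t in enumerate(order):
        queues[i % num_cores].append(t)
    return queues
```

Dealing from the sorted order bounds the imbalance contributed by any single heavy tile, which is what keeps the cores and the shared DMA engine busy concurrently.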
Fig. 14. (a) Execution flow orchestrating asynchronous scheduling of sparse-NN cores with serialized DMA transfers. (b) An illustration of improved performance with smart work scheduling. Wx and Dx represent the processing work and the data transfer, respectively, associated with tile Tx.

B. Dataflow for Multi-level Memory Hierarchy

Multi-level memory hierarchies pose significant challenges for dataflow optimization, as data transfers must be minimized at every level of the hierarchy to achieve the best performance at the lowest hardware and energy cost. Inspired by dense DNN dataflow optimizers [8], [21], [58], we adopt a hierarchical dataflow search proceeding from the outermost to the innermost level of on-chip memory. One drawback of hierarchical search is the likelihood of picking a globally suboptimal dataflow: data-access minimization at the $L_q \leftrightarrow L_{q+1}$ memory interface may choose tile candidates with lower spatial data reuse and higher temporal data reuse, which could increase data accesses at the inner interface $L_q \leftrightarrow L_{q-1}$. To address this, we propose Constrained Access based Reuse Opportunity Maximization (CAROM), which maximizes spatial reuse while keeping the number of data accesses at the $L_q \leftrightarrow L_{q+1}$ interface below a maximum threshold ($DA^{L_q}_{th}$) without becoming bandwidth constrained. For this, CAROM first identifies a set of dataflow candidates $D^{L_q}$ such that:

$$D^{L_q} = \left\{ D_i : DA^{L_q}_{D_i} \le DA^{L_q}_{th} \right\} \cup \left\{ \operatorname*{argmin}_{D_i} \left( DA^{L_q}_{D_i} \right) \right\} \quad (6)$$

where $DA^{L_q}_{th}$ is computed from the number of operations to be performed on the working set at $L_{q+1}$, the total compute (Ops/sec) available to an instance of memory level $L_q$, and the bandwidth at the $L_q \leftrightarrow L_{q+1}$ interface:

$$DA^{L_q}_{th} = \left( Ops^{L_q} \times BW^{L_q} \right) / TotalComp^{L_q} \quad (7)$$

where

$$Ops^{L_q} = SA_{AvgMO}\left( O^{L_q} \right) \cdot O^{L_q} \cdot N^{L_q} \cdot C^{L_q} \quad (8)$$

CAROM then picks the optimal dataflow over the set $D^{L_q}$ by maximizing the reuse opportunity ($RO^{L_{q-1}}$) for level $L_{q-1}$:

$$D^{L_q}_{opt} = \operatorname*{argmax}_{D_i \in D^{L_q}} \left( RO^{L_{q-1}}_{i} \right) \quad (9)$$

Since, for a given working set, reuse opportunity is proportional to the total number of operations performed on that working set, $Ops^{L_{q-1}}$ is used as the proxy for reuse-opportunity maximization. The optimal tile candidate at memory level $L_q$ acts as the working set for level $L_{q-1}$; CAROM therefore continues to pick optimal dataflows from the outer levels to the inner levels by the above criterion, except at the innermost level, where it selects the optimal dataflow by minimizing data accesses ($\operatorname*{argmin}_{D_i}(DA_i)$).

To improve data locality across all memory levels through pointcloud reordering, SOAR (Section IV-B) is extended to perform hierarchical pointcloud ordering from the innermost to the outermost level. Given optimal tiling from SPADE, SOAR groups the entire pointcloud into chunks and finds the optimal ordering of points in each chunk based on the tiling parameters of the innermost memory. Reinterpreting each chunk as a point, SOAR is reapplied to recursively group these chunks into super-chunks with optimal ordering based on the tile parameters of the outer memory level, until the outermost level is reached.
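The selection rule of Eqs. 6-9 reduces to a few lines. This is a minimal sketch, assuming each dataflow candidate is modeled as a tuple (name, data accesses at the Lq interface, operations on the Lq-1 working set); the tuple layout and names are ours, not the paper's implementation.

```python
def carom_select(candidates, da_threshold):
    # Eq. 6: keep candidates under the access threshold, always including the
    # access-minimal candidate so the feasible set is never empty.
    feasible = [c for c in candidates if c[1] <= da_threshold]
    min_da = min(candidates, key=lambda c: c[1])
    if min_da not in feasible:
        feasible.append(min_da)
    # Eq. 9: maximize reuse opportunity for the inner level, proxied by the
    # operation count on the candidate's working set (Ops at L_{q-1}).
    return max(feasible, key=lambda c: c[2])
```

The threshold itself would come from Eqs. 7-8: the operations on the working set times the interface bandwidth, divided by the total compute available at that level.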
C. Minimizing SPADE Latency Overhead

As described in Sections IV-C and V-B, SPADE requires pre-processing of the input pointcloud to extract sparsity attributes (SAs) for dataflow exploration. Extracting the SAs could add significant overhead to end-to-end latency. To minimize this, we explore: 1) whether the sparsity attributes can be categorized into two sets, (a) common attributes that are consistent across pointclouds, referred to as Meta Sparsity Attributes (MSA), and (b) Input Specific Attributes (ISA), which vary highly across pointclouds; and 2) whether, using MSA, optimal dataflow candidates can be pre-computed for selected binned values of ISA. Thus, we correlate SA_Avg over randomly picked pointclouds (Figure 15) and observe that:

1. SA_AvgI(∆O) exhibits high variance across values of ∆O but follows a similar pattern across pointclouds. Also, SA_AvgI(v) shows a high correlation with ($\alpha_m / \sqrt[m]{v}$), the surface-area-to-volume ratio of an m-dimensional cube with volume v and faces $\alpha_m$.
2. SA_AvgMO(∆O), which represents the ARF, remains constant with ∆O for a given pointcloud but varies across pointclouds.

Fig. 15. Sparsity attributes for different pointclouds.

Based on the above observations, we propose an offline version of SPADE that uses MSA_I as the meta sparsity attribute and the ARF as the ISA. MSA_I is computed over a representative set of pointclouds P using Eq. 10. We generate tables of optimal dataflow candidates, with the ARF as the table index, for all network layers. For tile allocation, we use a quantile of SA_AvgI(∆O) along with RST as described in Section IV-C. Note that the proposed semi-offline mode of dataflow exploration strikes a conscious balance between DNN execution performance and the runtime latency overhead of sparse dataflow selection.

$$MSA_{AvgI}(\Delta O) = \left( \textstyle\sum_{p \in P} SA_{AvgI,p}(\Delta O) \right) / |P| \quad (10)$$

To minimize latency overhead, SPADE is deployed as two components (Figure 16): 1) offline SPADE, which generates a table of optimal dataflows for multiple selected ARF values, and 2) on-the-fly SPADE (OTF-SPADE), which reorders a given input pointcloud and selects the optimal dataflow for that input. Though the Adjacency Map and the COIR metadata share a similar data structure, due to tiling and reordering, entries in the Adjacency Map are re-grouped to create tiled metadata (∆M) based on the tiling parameters of the innermost memory level. To transfer data between memory levels, the required DMA tables (Section V-A3) are generated prior to DNN execution. To further hide the latency of OTF-SPADE, DNN execution is kicked off just after OTF-SPADE completes processing for the first layer. Since OTF-SPADE processing for the remaining layers does not depend on the DNN execution of previous layers, the two threads, OTF-SPADE and DNN execution, proceed independently without further need for synchronization.
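The runtime half of this split can be sketched as a table lookup. This is a hedged illustration of the semi-offline flow: offline SPADE fills a per-layer table keyed by binned ARF values, and OTF-SPADE computes the input's ARF and snaps it to the nearest precomputed bin. The bin values and table layout below are illustrative assumptions, not taken from the paper.

```python
def select_dataflow(arf_table, arf):
    # Snap the measured ARF to the nearest precomputed bin and return the
    # dataflow that offline SPADE stored for that bin.
    nearest_bin = min(arf_table, key=lambda b: abs(b - arf))
    return arf_table[nearest_bin]
```

For example, with bins at 4.0, 8.0 and 12.0, a measured ARF of 7.1 selects the dataflow precomputed for the 8.0 bin.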
VI. EVALUATION
A. Setup, Methodology and Workloads
We evaluate AccSS3D using a whole-chip performance and energy model. We design and synthesize the SSpNNA core in a 16nm process at 1 GHz; energy for SSpNNA and AdMAC is estimated using a micro-architecture-accurate model and by counting data-access events. DRAM power is taken from the Micron power calculator [37] for DDR4-2660.

Fig. 16. Processing flow for the offline mode and the on-the-fly mode of SPADE.
Performance: To measure AccSS3D performance, we obtain execution cycles for the SSpNNA core for each tile processed in every network layer through SystemVerilog simulation. For end-to-end latency estimation, we simulate the multi-core asynchronous execution model through a detailed analytical framework, feeding per-tile execution times from the SV simulation and including the processing time for OTF-SPADE (Section V-C). CPU (Intel i7-8700K @ 3.7 GHz) performance is measured through VTune [3] and power is estimated using the PSST tool [38]. GPU (Nvidia GeForce GTX-1080 @ 1.6 GHz) performance is obtained from the visual profiler [40] and power is reported by nvidia-smi. We denote the single-core CPU, 4-core CPU, and GPU software baseline performance as 1-CPU, 4-CPU, and GPU.

Workloads and Datasets: We evaluate AccSS3D on three applications of 3D visual analytics, 1) 3D semantic segmentation, 2) 3D object detection, and 3) 3D scene completion, picking three state-of-the-art workloads, 1) SCN [18], 2) PV-RCNN [46], and 3) SGNN [12], one from each application. We choose 3D pointclouds from ScanNet [11] and the Waymo Open Dataset [48] (Table II, rows 5 and 6) for indoor and outdoor scenarios.

B. Scaled-up Configuration for AccSS3D
We describe a 1024-MAC AccSS3D architecture that achieves a 50x execution-time speedup at an operating frequency of 1 GHz compared to 1-CPU. Given the prohibitively large architecture configuration space, we adopt a hierarchical framework to arrive at the optimal configuration. First, the optimal L2 memory size is selected such that total DRAM accesses do not exceed a fixed multiple of the total data footprint accumulated across layers, while avoiding bandwidth bottlenecks at the DRAM interface for a given DRAM bandwidth. A joint optimization is then performed over the L1 memory size, the L1↔L2 bandwidth, and the number of SSpNNA cores. Figure 17 shows performance sensitivity to these parameters for a chosen pair of L2 memory size and DRAM bandwidth. Performance scales with core count up to a point, owing to a reduction in idle time during the data-transfer phase; with further increases in core count, data replication for the shared datatype dominates, degrading performance. At higher L1↔L2 bandwidth and larger L1 sizes, data-transfer time reduces significantly, favoring lower core counts. Similarly, we obtain the optimal architecture configuration for each DRAM bandwidth point (Figure 18) and select the bandwidth that maximizes performance. Since the L2 is double-buffered, we provision 2x the required size. Figure 20 (right) lists the optimized architecture parameters.

Fig. 17. Performance sensitivity to L1 memory size, L1↔L2 bandwidth, and core count at an L2 size of 2 MB and a DRAM bandwidth of 48 Bytes/clk.

Fig. 18. Relative performance, on-chip and DRAM data transfers with the optimal architecture configuration for selected DRAM memory bandwidth points.
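The two-stage search just described can be sketched as follows. This is a hypothetical helper: `dram_traffic` and `perf` stand in for the paper's simulation-derived models, and the configuration tuples are illustrative.

```python
def pick_config(l2_sizes, dram_traffic, traffic_cap, joint_space, perf):
    # Stage 1: smallest L2 whose modeled DRAM traffic stays under the
    # footprint-based cap (avoiding a DRAM bandwidth bottleneck).
    l2 = min(sz for sz in l2_sizes if dram_traffic(sz) <= traffic_cap)
    # Stage 2: joint sweep over (L1 size, L1<->L2 bandwidth, core count),
    # keeping the configuration with the best modeled performance.
    best = max(joint_space, key=lambda cfg: perf(l2, *cfg))
    return l2, best
```

Splitting the search this way prunes the configuration space: the L2/DRAM trade-off is settled once, and only the inner parameters are swept jointly.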
C. Power, Performance and Area Analysis

As described in Section IV-D, the SSpNNA core can be scaled by the number of internal DeNN instances. The design parameters, the area breakup of the major blocks in a typical 16nm process, and the local-buffering details of the SSpNNA core for both configurations are shown in Figure 20. With dual 8KB local buffers, we were able to hide the execution latency of WAVES MT formatting. Figure 20 also shows the physical placement of the SSpNNA core, where we achieved a high utilization of 72.6% by placing the logic blocks as dictated by the internal dataflow. The energy consumption of the SSpNNA compute core and other local storage contributes ∼50% of the total energy; the remaining 50% is attributed to SRAM accesses. 70% of the logic power is consumed by the clock network, while sequential and combinational cells consume 5% and 25% respectively. For AdMAC, energy is dominated by DRAM reads; local-buffer accesses and logic contribute only 2% of total energy.

TABLE IV
SPEED-UP & ENERGY SAVINGS WITH ACCSS3D ON SCN/SCANNET

                   DNN Only             End-to-end
                   1-CPU     4-CPU      1-CPU    4-CPU
  Speed-up         36.6x     16.8x      23.7x    11.8x
  Energy Savings   2079.0x   2232.0x    23.2x    24.8x

Figure 19 shows the layer-wise speedup and power reduction for the 3D sparse convolution operation in the SCN network with ScanNet. Speedup for the initial and last few layers reaches its highest values over 1-CPU. As SPADE adopts spatial-sparsity-aware tiled execution, data accesses are reduced significantly. With tiled metadata, ordered data transfers through DMAs, and the asynchronous execution model, data-transfer latencies are overlapped with the accelerator's compute, in contrast to CPU execution, where Input Gather and Output Write incur high sequential latency. In the middle layers, reuse opportunity is significantly higher and the impact of dataflow optimization is therefore lower, yet speedups close to 20x are still achieved over 1-CPU. AccSS3D achieves a substantial average power reduction across all layers compared to both 1-CPU and 4-CPU. Since the middle layers are convolution heavy, the CPU achieves higher instructions-per-cycle there, resulting in higher power consumption; power reduction with AccSS3D is therefore relatively higher for the middle layers.

Table IV summarizes the speedup and energy reduction for both the 3D sparse convolution operation and end-to-end scene segmentation over the CPU baselines for SCN on ScanNet. Figure 21 shows power and performance for two additional workloads (PV-RCNN, SGNN), including Waymo's outdoor dataset, where AccSS3D is compared with 1-CPU, 4-CPU, and GPU. For the PV-RCNN and SGNN networks, acceleration of Adjacency Map creation provides significant speedup and energy savings, as the pointcloud size is relatively large compared to the number of channels (N, C) in these network topologies.
D. AccSS3D Feature Analysis

We evaluate the benefit of the AccSS3D features described in Sections IV and V (Figure 22). We collect performance metrics for each feature by disabling it in the fully-featured AccSS3D. As the reference software implements a weight-stationary dataflow with no spatial tiling, it was infeasible to implement for a few layers due to the large weight size, and it is also likely to be unoptimized for a system with limited memory such as the proposed AccSS3D. Hence, we picked an input-stationary dataflow as a reasonable baseline, which tiles over the available on-chip memory, equally distributes tiles along the output channels (N) onto the many cores, and supports on-chip partial accumulation to minimize DRAM accesses. The optimal tiling and walk pattern selected by SPADE provide a significant reduction in both on-chip and DRAM data accesses compared to the baseline dataflow. Minimizing data accesses at L2 alone results in lower DRAM accesses but increases on-chip data transfers, and performance drops significantly due to on-chip bandwidth bottlenecks. CAROM alleviates this issue by striking a balance between DRAM and on-chip data transfers without becoming bandwidth limited at the DRAM interface. Compared with the input-pointcloud-dependent SA-based dataflow (ISA), AccSS3D with MSA marginally loses performance for a few pointclouds while providing a significant reduction in end-to-end latency through the offline dataflow. Gains with SOAR-based pointcloud reordering vary across pointclouds, since the scope for reordering depends on the input geometry and the scan order used during data acquisition. Figure 23 shows the relative data-access savings with SOAR compared with three different scan orders along the x, y and z directions.

Fig. 19. Layer-wise power reduction and speedup for SCN/ScanNet with AccSS3D for 3D sparse convolution over 1-CPU and 4-CPU [18] (setting OpenMP threads to 1 and 4). Hyper-threading was disabled during measurement. Idle power was subtracted from total power to obtain workload power.

Fig. 20. Left: physical placement and area utilization of SSpNNA (overall utilization 72.6%). Right-top: 16nm SSpNNA design parameters. Right-bottom: architecture parameters. SSpNNA design parameters: 8 DeNNs/SyMAC, 4 multipliers/PE, 1 IC-data buffer/DeNN, 16 KB WAVES buffer, 4 WT blocks/DeNN, 268 B DeNN buffer, 4 PEs/DeNN, 64 KB L1 memory. Floor-plan area (mm2): SSpNNA + L1 memory 0.9232, SSpNNA 0.7901, WAVES 0.3706, SyMAC 0.4195, DeNN 0.0467, L1 memory 0.1331, Adjacency Map Accelerator (AdMAC) 0.0571. Architecture parameters: 8 SSpNNA cores, L2 size 2 x 1 MB, L1-L2 bandwidth 128 B/clk, DRAM bandwidth 48 B/clk.

Fig. 21. Power, performance and energy savings (including DRAM) across workloads and datasets with AccSS3D over CPU and GPU baselines.
E. CPU Performance with SPADE

To evaluate the performance impact of SPADE on a CPU-only system, we implement its tiling and loop order in the reference SCN CPU baseline [18]. Performing the 3D sparse convolution as per the COIR metadata structure requires irregular data accesses interleaved with compute; without explicitly ensuring data residency in the inner-level caches, processing performance will be limited by latency. For efficient processing, the irregular data accesses need to be separated from the convolution. To achieve this, similarly to the reference CPU baseline, we gather inputs and weights into local buffers and, after the convolution, perform a scattered write of the output features. Since the footprint of most layers exceeds the capacity of the CPU's last-level cache (LLC), and given the high latency to DRAM, we optimize for DRAM accesses. The dataflow optimizer assumes that 10% of the LLC's capacity is used for code and miscellaneous data and that the remaining 90% is available for SCN's working set. Figure 24 shows the CPU performance with the optimized tiling and loop order recommended by SPADE; for brevity, we show one layer from each spatial resolution. With SPADE, overall performance improves by 18%. Some layers gain up to 74% in performance, while a few layers lose 21%. These layers have smaller metadata (due to lower resolution) and more channels (C, N); SPADE therefore prefers tiling across channels, requiring the metadata to be accessed and processed repeatedly. We observe that the high CPU overheads of metadata processing and the explicit scatter-gather operations are the reasons for the lower performance.

Fig. 22. Impact of different AccSS3D features (SOAR, SPADE, CAROM, OTF-SPADE) on performance, on-chip energy savings, and DRAM data accesses, evaluated across different input pointclouds in the ScanNet dataset.

Fig. 23. Left: data-access savings with SOAR reordering over raster scans {inner, middle, outer loop}. Right: effect of SOAR on a pointcloud; the thin strips caused by a raster scan would invoke multiple data refetches from neighbouring tiles during convolution.

Fig. 24. CPU performance improvements with SPADE.
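The gather-GEMM-scatter structure of the optimized CPU baseline can be sketched with NumPy. The index arrays below stand in for one weight plane's COIR metadata, and the function name is ours: irregular accesses are isolated into an input gather and an output scatter so that the convolution itself runs as a dense matrix multiply.

```python
import numpy as np

def sparse_conv_plane(features, weights, in_idx, out_idx, out):
    gathered = features[in_idx]       # irregular reads into a contiguous buffer
    partial = gathered @ weights      # dense GEMM over the channel dimension
    np.add.at(out, out_idx, partial)  # scattered accumulation into the OFM
    return out
```

`np.add.at` is used for the scatter because it accumulates correctly even when several gathered inputs map to the same output voxel, which is the common case when neighboring active voxels share an output.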
VII. RELATED WORK
Taichi [23] offers a high-level interface to efficient data structures for spatially sparse data by using index analysis. Eyeriss [7] proposed a row-stationary dataflow with diagonal data feed-forwarding over a 2D systolic array, demonstrating substantial energy savings, while Eyeriss v2 [8] extended it to a scale-up architecture through mesh connections. [58] utilizes a tiling structure obtained by unrolling the nested loops of convolution to maximize reuse at different cache levels. FlexFlow [35] aims to maximize compute utilization by minimizing wastage due to the spatial split of work among PEs. Morph [21] optimizes dataflows for a 3-level-memory scale-up architecture, maximizing reuse at each level of the hierarchy. ExTensor [22] proposes a technique to find non-zero element intersections for effective computation. SparTen [17] defines SparseMap, a two-tuple of bitmask and values, and performs an efficient inner-join to feed the MACs. SMASH [29] compresses sparse matrices in software and performs efficient index finding in a hardware accelerator on the compressed data. [24] proposes a fine-grained channel-gating technique and an accelerator to exploit dynamic sparsity. [63] proposes an accelerator with an Indexing Module to efficiently select and transfer the needed neurons to the PEs.

VIII. CONCLUSION
Understanding of 3D objects and environment is criticalfor many real world applications. To our best of knowledge,this is the first end-to-end solution for accelerating 3D sceneanalysis by exploiting spatial sparsity through sparsity-awaredataflow optimizer, novel micro-architecture and design fora spatially-sparse compute engine and, employing customsoftware-hardware co-designed methodologies.R arXiv preprint arXiv:1709.06158 , 2017.[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li,S. Savarese, M. Savva, S. Song, H. Su et al. , “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012 , 2015.[6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d objectdetection network for autonomous driving,” in
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2017, pp.1907–1915.[7] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture forenergy-efficient dataflow for convolutional neural networks,” in
ACMSIGARCH Computer Architecture News , vol. 44, no. 3. IEEE Press,2016, pp. 367–379.[8] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexibleaccelerator for emerging deep neural networks on mobile devices,” arXivpreprint arXiv:1807.07928 , 2018.[9] H.-Y. Chiang, Y.-L. Lin, Y.-C. Liu, and W. H. Hsu, “A unified point-based framework for 3d segmentation,” in . IEEE, 2019, pp. 155–163.[10] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal con-vnets: Minkowski convolutional neural networks,” arXiv preprintarXiv:1904.08755 , 2019.[11] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, andM. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoorscenes,” in
Proc. Computer Vision and Pattern Recognition (CVPR),IEEE , 2017.[12] A. Dai, C. Diller, and M. Nießner, “Sg-nn: Sparse generative neuralnetworks for self-supervised scene completion of rgb-d scans,” arXivpreprint arXiv:1912.00036 , 2019.[13] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner,“Scancomplete: Large-scale scene completion and semantic segmenta-tion for 3d scans,” in
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , 2018, pp. 4578–4587.[14] R. di Bella, D. Carrera, B. Rossi, P. Fragneto, and G. Boracchi, “Waferdefect map classification using sparse convolutional networks,” in
In-ternational Conference on Image Analysis and Processing . Springer,2019, pp. 125–136. [15] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “Tangram:Optimized coarse-grained dataflow for scalable nn accelerators,” in
Pro-ceedings of the Twenty-Fourth International Conference on ArchitecturalSupport for Programming Languages and Operating Systems . ACM,2019, pp. 807–820.[16] M. Garbade, Y.-T. Chen, J. Sawatzky, and J. Gall, “Two stream 3dsemantic scene completion,” in
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition Workshops , 2019, pp. 0–0.[17] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. Vijaykumar,“Sparten: A sparse tensor accelerator for convolutional neural networks,”in
Proceedings of the 52nd Annual IEEE/ACM International Symposiumon Microarchitecture . ACM, 2019, pp. 151–165.[18] B. Graham, M. Engelcke, and L. van der Maaten, “3d semantic segmen-tation with submanifold sparse convolutional networks,” in
Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition ,2018, pp. 9224–9232.[19] B. Graham and L. van der Maaten, “Submanifold sparse convolutionalnetworks,” arXiv preprint arXiv:1706.01307 , 2017.[20] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun,“Deep learning for 3d point clouds: A survey,” arXiv preprintarXiv:1912.12033 , 2019.[21] K. Hegde, R. Agrawal, Y. Yao, and C. W. Fletcher, “Morph: Flexibleacceleration for 3d cnn-based video understanding,” in .IEEE, 2018, pp. 933–946.[22] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel,E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator forsparse tensor algebra,” in
Proceedings of the 52nd Annual IEEE/ACMInternational Symposium on Microarchitecture . ACM, 2019, pp. 319–333.[23] Y. Hu, T.-M. Li, L. Anderson, J. Ragan-Kelley, and F. Durand, “Taichi:a language for high-performance computation on spatially sparse datastructures,”
ACM Transactions on Graphics (TOG) , vol. 38, no. 6, pp.1–16, 2019.[24] W. Hua, Y. Zhou, C. De Sa, Z. Zhang, and G. E. Suh, “Boostingthe performance of cnn accelerators with dynamic fine-grainedchannel gating,” in
Proceedings of the 52Nd Annual IEEE/ACMInternational Symposium on Microarchitecture , ser. MICRO ’52. NewYork, NY, USA: ACM, 2019, pp. 139–150. [Online]. Available:http://doi.acm.org/10.1145/3352460.3358283[25] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A chiplet-based,hybrid sparse-dense accelerator for personalized recommendations,” in . IEEE, 2020, p. 1.[26] G. Inc, “Google sparse hash.” [Online]. Available: https://github.com/sparsehash/sparsehash[27] L. Jiang, H. Zhao, S. Liu, X. Shen, C.-W. Fu, and J. Jia, “Hierarchicalpoint-edge interaction network for point cloud semantic segmentation,”in
Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10433–10441.
[28] M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu, “PointSIFT: A SIFT-like network module for 3D point cloud semantic segmentation,” arXiv preprint arXiv:1807.00652, 2018.
[29] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi, S. Koppula, N. M. Ghiasi, T. Shahroodi, J. G. Luna, and O. Mutlu, “SMASH: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: ACM, 2019, pp. 600–614. [Online]. Available: http://doi.acm.org/10.1145/3352460.3358286
[30] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’18. New York, NY, USA: ACM, 2018, pp. 461–475.
[31] J. Li, Y. Liu, D. Gong, Q. Shi, X. Yuan, C. Zhao, and I. Reid, “RGBD based dimensional decomposition residual network for 3D semantic scene completion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7693–7702.
[32] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Convolution on X-transformed points,” in Advances in Neural Information Processing Systems, 2018, pp. 820–830.
[33] Z. Li, X. Yan, Q. Wei, X. Gao, S. Wang, and S. Cui, “PointSite: A point cloud segmentation tool for identification of protein ligand binding atoms,” bioRxiv, p. 831131, 2019.
[34] S. Liu, Y. Hu, Y. Zeng, Q. Tang, B. Jin, Y. Han, and X. Li, “See and think: Disentangling semantic scene completion,” in Advances in Neural Information Processing Systems, 2018, pp. 263–274.
[35] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks,” in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 553–564.
[36] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015.
[38] Queue, vol. 16, no. 2, pp. 50–66, 2018. [Online]. Available: https://github.com/intel/psst
[39] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in
ECCV, 2012.
[40] Nvidia, “Nvidia visual profiler,” 2016. [Online]. Available: https://developer.nvidia.com/nvidia-visual-profiler
[41] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 27–40.
[42] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[43] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5099–5108.
[44] R. A. Rosu, P. Schütt, J. Quenzel, and S. Behnke, “LatticeNet: Fast point cloud segmentation using permutohedral lattices,” arXiv preprint arXiv:1912.05905, 2019.
[45] S. Schmohl and U. Sörgel, “Submanifold sparse convolutional networks for semantic segmentation of large-scale ALS point clouds,”
ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 4, 2019.
[46] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “PV-RCNN: Point-voxel feature set abstraction for 3D object detection,” arXiv preprint arXiv:1912.13192, 2019.
[47] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1746–1754.
[48] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” arXiv preprint, 2019.
[49] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou, “Tangent convolutions for dense prediction in 3D,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3887–3896.
[50] M. Teschner, B. Heidelberger, M. Müller, D. Pomerantes, and M. H. Gross, “Optimized spatial hashing for collision detection of deformable objects,” in VMV, vol. 3, 2003, pp. 47–54.
[51] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “KPConv: Flexible and deformable convolution for point clouds,” in
Proceedings of the IEEE International Conference on Computer Vision, 2019.
[53] Sustainable Cities and Society, vol. 54, p. 102002, 2020.
[54] Y. Wu, Y. Pang, B. Gao, and J. Han, “Complementary features with reasonable receptive field for road scene 3D object detection.” IEEE, 2019, pp. 3905–3909.
[55] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.
[56] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[57] Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[58] X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley, A. Pedram, and M. Horowitz, “A systematic approach to blocking convolutional neural networks,” arXiv preprint arXiv:1606.04209, 2016.
[59] Y. Ye, H. Chen, C. Zhang, X. Hao, and Z. Zhang, “SARPNET: Shape attention regional proposal network for LiDAR-based 3D object detection,” Neurocomputing, vol. 379, pp. 53–63, 2020.
[60] Y. Ye, C. Zhang, and X. Hao, “ARPNET: Attention region proposal network for 3D object detection,” Science China Information Sciences, vol. 62, no. 12, p. 220104, 2019.
[61] J. Zhang, H. Zhao, A. Yao, Y. Chen, L. Zhang, and H. Liao, “Efficient semantic scene completion network with spatial group convolution,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 733–749.
[62] P. Zhang, W. Liu, Y. Lei, H. Lu, and X. Yang, “Cascaded context pyramid for full-resolution 3D semantic scene completion,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7801–7810.
[63] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 20.