POIRot: A rotation invariant omni-directional pointnet
A PREPRINT
Liu Yang, Rudrasis Chakraborty and Stella X. Yu
University of California, Berkeley, USA. {liu-yang, rudra, stellayu}@berkeley.edu
October 31, 2019

Abstract

Point-cloud is an efficient way to represent the 3D world. Analysis of point-clouds deals with understanding the underlying 3D geometric structure. But due to the lack of a smooth topology, and hence of a neighborhood structure, standard correlation cannot be applied directly to point-clouds. One popular approach to point correlation is to partition the point-cloud into voxels and extract features using standard 3D correlation. But this approach suffers from the sparsity of point-clouds and hence yields many empty voxels. One possible solution is to learn an MLP that maps a point or its local neighborhood to a high-dimensional feature space. All of these methods require a large number of parameters and are susceptible to random rotations. A popular way to make a model "invariant" to rotations is data augmentation with small rotations, but the potential drawbacks include (a) more training samples and (b) susceptibility to large rotations. In this work, we develop a rotation-invariant point-cloud segmentation and classification scheme based on the omni-directional camera model (dubbed as
POIRot). Our proposed model is rotation invariant and preserves the geometric shape of a 3D point-cloud. Because of the inherent rotation-invariance property, our framework requires fewer parameters (please see [1] and the references therein for motivation for lean models). Several experiments show that our proposed method can beat state-of-the-art algorithms in classification and part-segmentation applications. Furthermore, we have applied our framework to detect the corpus callosum shape in a 3D brain scan represented as a point-cloud, and have empirically shown that our method can detect the corpus callosum shape in the 3D brain point-cloud given only the atlas of the corpus callosum.

1 Introduction

Point-cloud is an efficient way to represent the 3D world [2, 3]. Recent years have witnessed the popularity of 3D computer vision tasks with the advent of 3D sensors and modeling devices. 3D sensors such as depth cameras and LiDAR output 3D point-clouds, a key component in several 3D vision tasks including, but not limited to, virtual/augmented reality [4, 5], 3D scene understanding [6, 7, 8], and autonomous driving [9, 10, 11]. Due to the enormous popularity of correlation neural networks (CNNs) in computer vision tasks [12, 13], an obvious approach is to use CNNs to process point-clouds. Unfortunately, due to the lack of a smooth topology on a point-cloud, standard correlation cannot be applied as is: at a given point in a point-cloud, it is hard to define a grid structure analogous to an image, so applying standard correlation turns out to be a non-trivial and challenging problem. To alleviate this problem, several researchers [14, 15, 16, 17, 18, 19, 20] process 3D point-clouds with CNNs by first mapping the point-cloud onto a smooth topological space where correlation makes sense.
Several popular solutions to overcome the main bottleneck, the lack of a smooth neighborhood topology around a 3D point, include 1. converting the 3D point-cloud into a regular voxel representation [14, 15, 16] or 2. using view projection [17, 18, 19, 20]. Most of the voxel-based methods suffer from the possible sparsity of point-clouds, which results in many empty voxels. One possible solution is to use an MLP to extract features from each point [2] or from a local neighborhood of each point [3]. These models [2, 3, 21, 22] work directly on 3D point-clouds. Similar to CNNs, given a set of points, the
∗ A tribute to Agatha Christie
"point correlation layer" finds a "local patch" around each point using a point affinity matrix, where the affinity matrix is defined as the adjacency matrix of the fully connected graph on the point-cloud. These local patches are fed to a standard correlation operator, and the combined operator is called "point correlation". Such "point correlation" layers are stacked together to extract features. But unlike for images, defining local patches on a point-cloud must account for the geometric structure of the point-cloud. In most methods [2, 3, 23, 24], researchers use nearest neighbors to define the local neighborhood, a.k.a. the "local patch". The rationale behind using an MLP to extract features from each point or a local neighborhood can be thought of as mapping the local neighborhood into a high-dimensional space ($\subset \mathbb{R}^n$ for some $n$) in which extracting features with standard correlation makes sense. This is analogous to kernel methods [25], where the features are mapped into a Hilbert space; it essentially implies that on the feature space we use the topology induced from the Hilbert space. In the context of point-cloud processing, this analogy translates to using the topology induced from the standard smooth topology of $\mathbb{R}^n$. But due to the geometric structure present in a 3D point-cloud, inducing the globally flat topology of $\mathbb{R}^n$ requires $n$ to be very large. Naturally, this increases the complexity of the model in terms of the number of parameters and computational time. To overcome this limitation, one can embed the local structures and geometry of the 3D point-cloud in a "curved" space with known non-Euclidean geometry. One well-known non-Euclidean space is the hypersphere, whose topology we use to induce a topology on the point-cloud. Using the topology induced from the sphere, we define a correlation operator on the point-cloud.
To define the correlation operator, we first put a sphere at each point of the point-cloud and collect the response of the point-cloud on it. This represents the local geometry of the point-cloud captured as a spherical response. We use spherical correlation to extract rotation-equivariant features. Unlike previous methods, we implicitly look at the interaction between points in the point-cloud through the collective spherical responses. After extracting rotation-equivariant local features, we look at the explicit interaction between points in the point-cloud. In Section 2, we give a detailed description of the scheme to extract local and global features for classification and segmentation tasks. Our proposed method has several advantages over previous methods: (a) the induced spherical topology makes our scheme invariant to rotations; (b) due to the presence of intrinsic geometry, our scheme has a much leaner model; (c) the interaction between local features makes the method invariant to permutations; (d) our segmentation scheme outperforms state-of-the-art methods on benchmark datasets; (e) we achieve similar classification accuracy on rotated data without any explicit data augmentation. Before describing our proposed method in detail, we discuss some previous work.

Previous works on 3D point-clouds represent 3D shapes on a 3D grid and use standard correlation networks [14, 26, 27]. These methods mostly suffer from inefficient usage of 3D voxels due to empty voxels and fail to capture 3D geometric shapes. Furthermore, the computational complexity of 3D correlation operations makes them an undesirable choice.
In some recent work [28, 15, 27, 16], researchers proposed techniques that somewhat overcome these limitations, but the partition of the point-cloud into voxels still makes these algorithms unsuitable for capturing 3D geometric shapes.

Another major body of work deals with developing "correlation"-like techniques on 3D point-clouds. As a first work in this genre, PointNet [2] embeds each point coordinate in a high-dimensional space by learning a mapping and then aggregates information by pooling the features. Although it achieves reasonable accuracy, PointNet does not learn any local geometric information of the 3D shape. PointNet++ [3] handles this by hierarchically applying isolated 3D point feature learning to multiple subsets of the point-cloud data; the authors essentially use the single-point processing unit hierarchically on multiple subsets of the point-cloud. Several other researchers proposed techniques to combine local neighborhood information, either by defining a correlation-like operator such as χ-conv [24] or by using a dynamic-graph-based technique [23]. In order to capture geometric shapes, [29] extracts local structure by grouping points based on permutohedral lattices [30] and then applies bilateral correlation [31] for feature learning. Super-point graphs [32] partition the point-cloud into super-points to learn 3D geometric shapes. Though there is a large body of work defining "correlation"-like techniques on point-clouds, none of them defines an equivariant correlation operator on the point-cloud. As stated before, the challenge is mostly due to the lack of a smooth topology on 3D point-clouds, which are naturally equipped with the discrete topology. This motivates us to define a correlation operator via an induced smooth topology on point-clouds. In the rest of the paper, we first describe our proposed classification and segmentation technique for point-clouds in Section 2, with experimental validation in Section 3.
2 Proposed Method

In this section, we first give the motivation for our proposed geometric framework for processing 3D point-clouds. More specifically, we point out that the existing methods have several shortcomings: 1. earlier methods map each point or its neighborhood into a higher-dimensional space in order to extract features and hence require a significantly large number of parameters; 2. none of the previous methods define a correlation operator on point-clouds preserving geometric invariance properties. To avoid these shortcomings, we propose a framework for processing 3D point-clouds that is 1. inherently invariant to rigid transformations, 2. built on a correlation operator defined on point-clouds via the topology induced from the sphere, and 3. leaner than previous methods.

We first propose a rotation-invariant correlation neural network (CNN) using an induced spherical topology on the point-cloud. Though the formulation below can be applied on $\mathbb{R}^n$ for any $n \in \mathbb{Z}^+$, in this work we restrict ourselves to $n = 3$, as our framework is specifically designed for 3D point-clouds. Our framework consists of three basic building blocks, which we describe next.

Figure 1: Response from point-cloud collected on sphere.

At each point in the point-cloud we put a sphere and collect the response from the entire point-cloud. This gives, at each point $x_i \in \mathbb{R}^3$, a function $f_i : S_r(x_i) \to \mathbb{R}$, where $S_r(x_i)$ is the sphere of radius $r > 0$ centered at $x_i$. Thus, given the point-cloud $X = \{x_i\}_{i=1}^N$, we represent it by $\{f_i\}$. Now we collect the combined responses from the entire point-cloud. Before doing that, for each $x_i$, we subtract $x_i$ from $X$ so that $S_r(x_i)$ is centered at the origin of $\mathbb{R}^3$. Without any loss of generality, we denote the sphere centered at the origin by $S_r$. Now, given $y \in S_r$, we compute the response $f_i(y)$ (an example is shown in Fig.
1) as
$$f_i(y) = \sum_{x_j \notin B_r(x_i)} \max\{0,\; y^t (x_j - x_i)\}, \qquad (1)$$
where $B_r(x_i)$ is the ball of radius $r$ centered at $x_i$. The reason for ignoring the negative responses is twofold: 1. Given $y, \tilde{y} \in S_r$ and $x \in \mathbb{R}^3$, if $x^t y$ and $x^t \tilde{y}$ differ in sign (assume $x^t y \ge 0$), then the two points $y, \tilde{y}$ on $S_r$ must lie on the two hemispheres separated by the equator perpendicular to $x$. Thus we can eliminate $\tilde{y}$, as there exists $-\tilde{y} \in S_r$ such that $x^t(-\tilde{y}) \ge 0$; eliminating negative responses therefore reduces the information bottleneck. 2. The underlying hypothesis is that the response from every point in the point-cloud should be captured by exactly one of each pair of antipodal points on $S_r$; eliminating negative responses thus reduces the amount of conflicting information gathered on $S_r$. A schematic diagram depicting a point-cloud and the corresponding collected response is given in Fig. 2.

Figure 2: (Left) The original point-cloud; (Middle) the downsampled point-cloud; (Right) convex hull points.

This representation can be viewed as putting an omni-directional camera at each point and collecting the responses in each viewing direction.
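In code, the response of Eq. (1) on a discretized sphere can be sketched as follows. This is a minimal NumPy sketch; the grid resolution, the radius, and all variable names are our illustrative choices, not the paper's.

```python
import numpy as np

def sphere_grid(K=64, seed=0):
    """Quasi-uniform unit directions y_j on the sphere (random here, for simplicity)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((K, 3))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def responses(X, r=0.1, grid=None):
    """f_i(y_j) = sum over x_k outside B_r(x_i) of max(0, y_j^t (x_k - x_i))  (Eq. 1)."""
    Y = sphere_grid() if grid is None else grid          # (K, 3)
    diff = X[None, :, :] - X[:, None, :]                 # (N, N, 3): x_k - x_i
    outside = np.linalg.norm(diff, axis=-1) > r          # ignore points inside B_r(x_i)
    proj = np.einsum('kd,ijd->ijk', Y, diff)             # y_j^t (x_k - x_i)
    proj = np.maximum(proj, 0.0) * outside[..., None]    # drop negative responses
    return proj.sum(axis=1)                              # (N, K): one spherical signal per point

X = np.random.default_rng(1).standard_normal((32, 3))
F = responses(X)
print(F.shape)  # (32, 64)
```

Each row of `F` is the discretized spherical signal $f_i$ of one point, and it is non-negative by construction.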
This analogy makes one wonder:
Is there a necessity for $N$ cameras, where $N$ is the number of points in the point-cloud? Obviously, for a dense point-cloud the answer is no, and hence we propose a multinomial downsampling strategy as follows.
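Before the step-by-step description of the strategy, a minimal sketch of this downsampling, assuming the per-point spherical responses have already been computed (the function and variable names are ours; whether the draws are with or without replacement is our assumption):

```python
import numpy as np

def multinomial_downsample(F, n, rng=None):
    """F: (N, K) responses f_i(y_j); draw n point indices with probability
    proportional to each point's largest spherical response."""
    rng = np.random.default_rng() if rng is None else rng
    v = F.max(axis=1)                    # v_i: largest response collected on S_r
    p = v / v.sum()                      # normalized sampling probabilities
    # distinct points (replace=False); sampling with replacement is equally plausible
    return rng.choice(len(v), size=n, replace=False, p=p)

F = np.abs(np.random.default_rng(0).standard_normal((100, 16)))
idx = multinomial_downsample(F, n=10, rng=np.random.default_rng(2))
print(len(idx))  # 10
```

Points with large responses (which, by Proposition 1 below, lie on the convex hull) are the most likely to be kept.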
Figure 3: Point-cloud with its convex set of points (shown in orange).
Multinomial downsampling: (a) on each $S_r$ centered at $x_i$, we collect the omni-directional response from the entire point-cloud; (b) for each $i$-th point in the point-cloud, we assign a value $v_i$ equal to the largest response collected on $S_r$; (c) we use the normalized $v_i$ as the sampling probabilities of a multinomial distribution, and draw $n < N$ sub-sampled points from this multinomial distribution. In the following proposition, we claim that the set of points selected according to the largest responses lies on the convex hull of the point-cloud (an example is shown in Fig. 3).

Proposition 1.
Let $J$ be the set of indices corresponding to the points with the largest responses. Then $\{x_j\}_{j \in J}$ lie on the convex hull of the point-cloud.

Proof. We prove it by contrapositive. Let $y \in S_r$. Let $x_k$ be a point that does not lie on the convex hull of the point-cloud, i.e., there exists an $x_i$ such that $y^t(x_i - x_k) < 0$. Let $\tilde{x}_l$ be a point on the convex hull of the point-cloud, i.e., for $x_i$ and $y \in S_r$, $y^t(x_i - \tilde{x}_l) > 0$. Thus $f_l(y) > f_k(y)$, which implies $k \notin J$. By contrapositive, this concludes the proof. ∎

Let $S = \{x_i\}_{i \in I}$ be a given set of points. As mentioned before, we collect the response at each $x_i \in S$ as in Algorithm 1. Observe that 1. the proposed multinomial downsampling strategy can be viewed as a data augmentation technique;

Algorithm 1:
Compute responses at each grid point for a given set of points $S$.
Data:
Input $X = \{x_i\}$, $S = \{x_i\}_{i \in I} \subset X$, $r > 0$
Result:
Responses $\{f_i : S^2 \to \mathbb{R}\}_{i \in I}$
Generate a grid $\{y_j\}_{j=1}^K$ on $S_r$;
For each $x_i \in S$ and each $y_j$, assign $f_i(y_j) = \sum_{x_k \notin B_r(x_i)} \max\{0,\; y_j^t (x_k - x_i)\}$;

2. as a consequence of this downsampling, our proposed model is robust to outliers. Now that we have a set of spherical signals $\{f_i : S^2 \to \mathbb{R}\}_{i \in I}$, the next step is to extract invariant features from the spherical signals.

Partitioning the response based on normal directions:
If normal information is available, we can use it to separate the collected responses. In Algorithm 1, we change the construction of $f_i(y_j)$ as follows:
$$f_i^l(y_j) = \sum_{\substack{x_k \notin B_r(x_i) \\ \cos^{-1}(n_k^t n_i) \in [\,(l-1)\pi/n,\; l\pi/n\,)}} \max\{0,\; y_j^t (x_k - x_i)\}, \qquad (2)$$
where $n$ is the number of partitions we choose based on normal directions and $n_k$ is the normal direction at $x_k$. Thus we essentially partition the responses from $\{x_k\}$, collected on $S_r$ centered at $x_i$, based on the similarity of the normal at $x_i$ (denoted by $n_i$) with the normal at $x_k$ (denoted by $n_k$). An example of partitioning into 3 channels is shown in the adjacent figure.

Given $\{f_i : S^2 \to \mathbb{R}\}_{i \in I}$, we use the spherical correlation network as defined in [33]. For completeness, we first briefly introduce the correlation operator before discussing its usage.

Correlation operators:

Definition 1 ($S^2$ correlation). Given $f : S^2 \to \mathbb{R}$ (the signal) and $w : S^2 \to \mathbb{R}$ (the learnable kernel), we define the correlation operator $f \star w : SO(3) \to \mathbb{R}$ as
$$(f \star w)(g) = \int_{S^2} f(x)\, w(g^{-1} \cdot x)\, \omega(x). \qquad (3)$$
Here, $\omega$ is the chosen volume density on $S^2$. As shown in [33, 34], the above definition of spherical correlation is equivariant to the action of $SO(3)$.
As the output of the spherical correlation operator is a function on $SO(3)$ (the $3 \times 3$ special orthogonal group), we use the correlation operator on $SO(3)$ as defined in [35].

Definition 2 ($SO(3)$ correlation). Given $f : SO(3) \to \mathbb{R}$ (the signal) and $w : SO(3) \to \mathbb{R}$ (the learnable kernel), we define the correlation operator $f \star w : SO(3) \to \mathbb{R}$ as
$$(f \star w)(g) = \int_{SO(3)} f(x)\, w(g^{-1} x)\, \widehat{\omega}(x). \qquad (4)$$
Here, $\widehat{\omega}$ is the Haar measure on $SO(3)$. This definition of $SO(3)$ correlation is equivariant to the action of $SO(3)$.

We will use the rotational equivariance property of the spherical correlation to extract rotation-invariant features from the 3D point set. But in order to do that, we need to address the following questions: 1. what is the effect of rotation on the spherical signal? 2. how do we get invariant features?
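The first question is settled by Proposition 2 below; here is a quick numerical check of that claim, reusing the response of Eq. (1) on a shared set of directions (a sketch of ours, not the paper's code):

```python
import numpy as np

def responses(X, Y, r=0.1):
    """Eq. (1): f_i(y) = sum over x_k outside B_r(x_i) of max(0, y^t (x_k - x_i))."""
    diff = X[None, :, :] - X[:, None, :]
    outside = np.linalg.norm(diff, axis=-1) > r
    proj = np.maximum(np.einsum('kd,ijd->ijk', Y, diff), 0.0)
    return (proj * outside[..., None]).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
Y = rng.standard_normal((50, 3))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

a = np.pi / 5                                   # a rotation about the z-axis
R = np.array([[np.cos(a), -np.sin(a), 0],
              [np.sin(a),  np.cos(a), 0],
              [0, 0, 1.0]])

# Rotating the cloud rotates the signal: f_{RX}(R y) == f_X(y).
lhs = responses(X @ R.T, Y @ R.T)
rhs = responses(X, Y)
print(np.allclose(lhs, rhs))  # True
```

Since $(Ry)^t R(x_k - x_i) = y^t(x_k - x_i)$ and rotations preserve distances, the rotated cloud's response at $Ry$ equals the original response at $y$.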
What is the effect of rotation on the spherical signal:
If we rotate the 3D point-cloud by a rotation matrix $R$, the responses computed using Algorithm 1 are rotated by the same matrix. This is stated in the following proposition.

Proposition 2.
If the point-cloud $X$ is rotated by the matrix $R$, then the corresponding responses $\{f_i\}$ are rotated by the matrix $R$.

Proof. Given $r > 0$ and $x_i \in X$, let $x_j \notin B_r(x_i)$. Let $\widetilde{X} = RX$ be the rotated point set. Let $y \in S_r$; then $y^t(x_j - x_i)$ becomes $y^t R (x_j - x_i)$ after rotation. As $f_i$ in Eq. 1 is the sum of the responses $y^t(x_j - x_i)$ over all $x_j \notin B_r(x_i)$, this concludes the proof. ∎

Thus, by the above proposition, when the point-cloud is rotated, the spherical signal is also rotated. Observe that, as the $S^2$ and $SO(3)$ correlation operators are equivariant to rotations, the output features after cascaded $S^2$ and $SO(3)$ correlation operators are also rotation equivariant. In order to extract rotation-invariant features, we use an invariant layer, defined next. Before that, we define equivariance and invariance for completeness.

Equivariance and Invariance:
Given $f : S^2 \to \mathbb{R}$ and $R \in SO(3)$, an operator on $f$, denoted $F(f) : SO(3) \to \mathbb{R}$, is
(a) equivariant to the action of $SO(3)$ if
$$R \cdot F(f) = F(R \cdot f), \qquad (5)$$
(b) invariant to the action of $SO(3)$ if
$$F(R \cdot f) = F(f), \qquad (6)$$
where $R \cdot f(x) := f(R^{-1} \cdot x)$ for all $x \in S^2$.

How to get invariant features:
As the output of the spherical correlation, $f : SO(3) \to \mathbb{R}$, is equivariant to the action of $SO(3)$, we integrate $f$ over $SO(3)$ with respect to the Haar measure to get an $SO(3)$-invariant feature. This entails "quotienting out" the group $SO(3)$, which yields features invariant with respect to $SO(3)$. In summary, we use an $S^2$ correlation followed by cascaded $SO(3)$ correlations with intermediate ReLU and normalization operators, followed by the invariant layer, to generate rotation-invariant features.

Non-linear operator:
In order to design a deep architecture, it is essential to use non-linearities between correlation operators. As the aforementioned correlation operators are $\mathbb{R}$-valued, we use the ReLU operator as our choice of non-linearity.

Normalization:
It is well known that normalization is crucial for the stability of optimization and even for reaching a better optimum. We resort to two types of normalization schemes, namely
Batch-normalization [36] and
Activation-Normalization [37]. We use the PyTorch implementation of Batchnorm. For Actnorm, given an input tensor $T \in \mathbb{R}^{B \times c \times h \times w}$ ($B, c, h, w$ denote the batch size, number of channels, and spatial resolution, respectively), we learn a scaling tensor $H \in \mathbb{R}^{1 \times c \times 1 \times 1}$ and a bias tensor $B \in \mathbb{R}^{1 \times c \times 1 \times 1}$ to get the normalized output $T \mapsto H \odot T + B$. A schematic depicting the pipeline to extract rotation-invariant features from the spherical signal is shown in Fig. 4.

Before moving forward with the rest of our proposed model architecture, we introduce a leaner variant of the spherical correlation operator, based on the
Tensor Ring. Observe that the correlation operators defined in Eqs. 3 and 4 are computed in the Fourier basis of the respective domain. For example, on $S^2$ we use the Spherical Harmonics basis [38] to compute the spherical correlation operator. Thus, for spherical correlation, we denote the learnable kernel $w : S^2 \to \mathbb{R}$ by the
Figure 4: A schematic from the response on a sphere to the final invariant layer.

Figure 5: An example demonstrating the Tensor Ring in Eq. 7 with $k = 4$, $p = 2$, $n_1 = 3$, $n_2 = 5$, $n_3 = 5$, $n_4 = 3$. After the multiplication of the tensor cores, we apply the trace operator to the result.

corresponding coefficient matrix (with respect to the Spherical Harmonics basis) $W \in \mathbb{R}^{C_{in} \times N \times N \times C_{out}}$, where $C_{in}, C_{out}$ are the numbers of input and output channels and $N \times N$ is the resolution of the parametrization of $S^2$. Any tensor $W \in \mathbb{R}^{n_1 \times \cdots \times n_k}$ can be decomposed using the Tensor Ring [39] as follows:
$$W(i_1, \cdots, i_k) = \mathrm{trace}\big(T_1(i_1) \cdots T_k(i_k)\big), \qquad (7)$$
where $T_j(i_j) \in \mathbb{R}^{p \times p}$ and $i_j \in [1, \cdots, n_j]$ for all $j$. Here $p$ is the size of the core of the tensor ring. It has been shown that, under some assumptions, any tensor can be decomposed in tensor ring form with arbitrarily small approximation error [39]. An example demonstrating the tensor ring is shown in Fig. 5. Notice that this form of decomposition amounts to learning $p^2 \sum_{j=1}^k n_j$ parameters instead of $\prod_{j=1}^k n_j$, a substantial reduction. In our implementation of the spherical correlation, we use the tensor ring form to implement the correlation operator. We use $\Phi^l$ to denote the spherical correlation block that extracts rotation-invariant local features. After extracting these features from each of the selected points $\{x_i\}_{i \in I}$, our algorithm for combining features differs between classification and segmentation. Below we first discuss our proposed scheme for segmentation, followed by classification.

Before describing the scheme to combine the extracted features, we first discuss the necessity of extracting global features from a point-cloud.
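As a concrete illustration of the tensor ring of Eq. (7) and its parameter count ($p^2 \sum_j n_j$ versus $\prod_j n_j$), a minimal sketch whose core sizes follow the example of Fig. 5:

```python
import numpy as np

def tensor_ring_full(cores):
    """Reconstruct W(i1,...,ik) = trace(T1(i1) ... Tk(ik)) from TR cores.
    cores[j] has shape (n_j, p, p)."""
    shape = tuple(c.shape[0] for c in cores)
    W = np.empty(shape)
    for idx in np.ndindex(*shape):
        M = np.eye(cores[0].shape[1])
        for j, i in enumerate(idx):
            M = M @ cores[j][i]       # multiply the selected core slices
        W[idx] = np.trace(M)          # close the ring with a trace
    return W

rng = np.random.default_rng(0)
p, ns = 2, (3, 5, 5, 3)               # the example of Fig. 5
cores = [rng.standard_normal((n, p, p)) for n in ns]
W = tensor_ring_full(cores)
print(W.shape)                        # (3, 5, 5, 3)
n_tr = p * p * sum(ns)                # parameters stored in the cores
n_full = int(np.prod(ns))             # parameters of the dense tensor
print(n_tr, n_full)                   # 64 225
```

Even at this toy size, the cores store 64 numbers against 225 for the dense tensor; for the kernel coefficient tensor $W \in \mathbb{R}^{C_{in} \times N \times N \times C_{out}}$ the gap grows accordingly.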
Why are global features needed?
Global features are obviously helpful for classification: though correlation is a powerful operator for extracting local features, even for standard (Euclidean) correlation operators, aggregating local features (e.g., by pooling) is a necessity. As classification requires a single consensus for the entire point-cloud, aggregating local features into a global response is desired. Moreover, for semantic/part segmentation, the local feature at each point of the point-cloud is not enough, as we need the spatial context of each point with respect to the global structure; e.g., in order to part-segment an earphone into head band and ears, we need information in addition to local features. An example showing the usefulness of global features is given in Fig. 6: the two selected points have similar local features (after quotienting out rotations) but belong to two different classes. Hence the need for global information is justified. Now that we have justified the necessity of global features, we propose our scheme to extract them. We divide the scheme into two subsections, namely (1) extracting and combining features for segmentation, and (2) extracting and combining features for classification.
Figure 6: Local responses across different classes look similar.
Given a non-empty $S = \{x_i\}_{i \in I} \subset X$ as the given convex hull of the point-cloud $X$, we define the global affine coordinates of each $x_i$ as follows: if
$$x_i = \sum_{j \,|\, x_j \in S} c_{ij}\, x_j, \qquad (8)$$
then we call $(c_{i1}, \cdots, c_{i|S|})^t$ the affine coordinates of $x_i$, where $\sum_j c_{ij} = 1$. Let $\Phi^g_S : X \to \mathcal{A}^{|S|} \subset \mathbb{R}^{|S|}$ be the function that returns the affine coordinates of each $x_i$ given the convex hull $S$. Here $\mathcal{A}^n$ is the affine subspace of $\mathbb{R}^n$ and $|S|$ is the cardinality of the set $S$. It is easy to see that the affine coordinates are rotation invariant, as stated formally in the next proposition.

Proposition 3.
Given $S$, $\Phi^g_S(R \cdot x_i) = \Phi^g_S(x_i)$ for all $R \in SO(3)$ and all $x_i \in X$.

Proof. The proof follows from the linearity of the affine coordinates as defined in Eq. 8. ∎
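The paper does not spell out how the coefficients of Eq. (8) are computed; one common choice, sketched below under our own assumptions, is a least-squares solve with the sum-to-one constraint appended as an extra linear equation:

```python
import numpy as np

def affine_coordinates(X, S):
    """For each x_i in X (N,3), coefficients c_i over hull points S (M,3)
    with x_i = sum_j c_ij S_j and sum_j c_ij = 1 (Eq. 8).
    Minimum-norm solve; other conventions (e.g. sparse weights) are possible."""
    A = np.vstack([S.T, np.ones(len(S))])          # (4, M): coordinates + constraint row
    B = np.hstack([X, np.ones((len(X), 1))]).T     # (4, N): targets + ones
    C, *_ = np.linalg.lstsq(A, B, rcond=None)      # (M, N)
    return C.T                                     # (N, M): one coordinate vector per point

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 3))                    # stand-in for the hull points
X = rng.standard_normal((5, 3))
C = affine_coordinates(X, S)
print(np.allclose(C.sum(axis=1), 1.0))             # True: coefficients sum to one
print(np.allclose(C @ S, X))                       # True: Eq. 8 holds

a = 0.7                                            # Proposition 3: rotation invariance
R = np.array([[np.cos(a), -np.sin(a), 0],
              [np.sin(a),  np.cos(a), 0],
              [0, 0, 1.0]])
print(np.allclose(affine_coordinates(X @ R.T, S @ R.T), C))  # True
```

The last check mirrors Proposition 3: rotating both the point and the hull leaves the minimum-norm affine coordinates unchanged.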
Now that we have local and global rotation-invariant features, we combine them to perform segmentation as illustrated in Algorithm 2.
Algorithm 2:
Extract features from X . Data:
Input $X = \{x_i\}$, $S = \{x_i\}_{i \in I} \subset X$, $\Phi^l$, $\Phi^g_S$
Result: $\Phi^{lg}_S$
For each $x_j \in S$, compute $\Phi^l(x_j)$;
For each $x \in X$, interpolate $\Phi^l(x)$ from $\{\Phi^l(x_j)\}_{j \in I}$ according to Algorithm 3;
For each $x \in X$, compute $\Phi^g_S(x)$ according to Eq. 8;
For each $x \in X$, use a self-attention block to compute the probabilities for local and global features (as defined in Algorithm 4). Let the probabilities be denoted by $p^l$ and $p^g$ respectively;
Define $\Phi^{lg}_S$ to be $x \mapsto \big(p^l(x)\Phi^l(x) \,\|\, p^g(x)\Phi^g_S(x)\big)^t$, where $\|$ denotes the concatenation of local and global features;

In the following algorithm, we give the technique used to interpolate the local features from the responses on $S$ (a schematic is shown in Fig. 7). Together with Algorithm 2, this gives rotation-invariant combined local and global features, which we use for segmentation via fully connected (FC) layer(s).

Figure 7: The dark brown points are the nearest points of the light brown point whose features need to be interpolated.

A schematic of our proposed segmentation pipeline is shown in Fig. 8.
CTOBER
31, 2019
Algorithm 3:
Interpolation of rotation invariant local features.
Data:
Input $X = \{x_i\}$, $S = \{x_i\}_{i \in I} \subset X$, $\Phi^l(S)$, $k \ge 1$
Result: $\Phi^l(X)$
For each $x \in X$, find the $k$-nearest neighbors in $S$, denoted $N_x \subset S$;
For each $y \in N_x$, assign the weight $w_{xy} = 1/\|x - y\|$;
For each $x \in X$, interpolate $\Phi^l(x) = \sum_{y \in N_x} w_{xy} \Phi^l(y)$;

Algorithm 4:
Self-attention block to combine local and global features.
Data:
Input $X = \{x_i\}$, $\{\Phi^l(X)\}$
Result: $(p^l, p^g)^t$
For each $x \in X$, apply $\Phi^l$ on the $k$-nearest neighbors $N_x$;
Use a 1D correlation with kernel size 1 to look at the interaction between the neighbors;
Use multiple FC layers to learn features, followed by a softmax layer, to extract $(p^l(x), p^g(x))^t$ for all $x \in X$;

Figure 8: A schematic of the pipeline for the segmentation block. As the affine coordinates are high-dimensional, we use a tSNE 2D embedding for visualization.

After extracting the rotation-invariant spherical local features, we aggregate the local features over the entire point-cloud $X$ in one of the following two ways: (a) maxpooling the local features over $S$, or (b) using a $1 \times 1$ correlation operator over $S$ to combine the local features. The use of maxpooling to extract global features ensures that the extracted features are permutation invariant. A schematic of our proposed classification pipeline is shown in Fig. 9, which illustrates the rotation-invariance property on rotated and non-rotated examples.

In this section, we discuss data augmentation in our proposed framework. As argued so far, our model is rotation invariant by construction, which entails an implicit data augmentation. As pointed out before, the proposed downsampling step amounts to an explicit data augmentation by random sampling. In the following section, we discuss another kind of explicit data augmentation, namely random deformation. Before explaining deformation augmentation in detail, we remind the reader of the
Figure 9: A schematic of the pipeline for the classification block. The vertical blocks denote the $S^2$, $SO(3)$, $SO(3)$, and FC layers, respectively, with filter outputs in between. The response before the FC layer denotes the output after integration.

benefits of these variants of augmentation: (a) implicit rotation augmentation makes the model rotation invariant; (b) random-sampling augmentation makes it robust to noise; (c) random-deformation augmentation makes it invariant to small deformations.

In this section, we propose a scheme to make our model robust to small deformations. Given an input point-cloud $X$, we apply a small deformation to the coordinates using a neural network consisting of fully connected layers and tanh activation functions. After deforming the points, we apply Algorithms 1-3 to extract the spherical features $\Phi^l(X)$.

Figure 10: A schematic for segmentation from a 3D point-cloud extracted from a 3D brain scan (3D brain scan → 3D point cloud; compute geodesic distance; compute affinity matrix; candidate selection; spherical conv.; feature matching against the CC atlas; minimizing entropy).

We then combine these features with the global features. An example showing the deformed point-cloud is given in Fig. 11.
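The deformation step above can be sketched as a small fully connected network with tanh activations (a NumPy stand-in for the learned network; the hidden size and the perturbation scale ε are our illustrative choices):

```python
import numpy as np

def random_deform(X, hidden=16, eps=0.05, rng=None):
    """Perturb point coordinates with a small FC + tanh network: x -> x + eps * g(x)."""
    rng = np.random.default_rng() if rng is None else rng
    W1 = rng.standard_normal((3, hidden)); b1 = rng.standard_normal(hidden)
    W2 = rng.standard_normal((hidden, 3)); b2 = rng.standard_normal(3)
    g = np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2)   # each output is bounded in (-1, 1)
    return X + eps * g

X = np.random.default_rng(0).standard_normal((100, 3))
Xd = random_deform(X, rng=np.random.default_rng(1))
print(np.abs(Xd - X).max() <= 0.05)  # True: the deformation is small by construction
```

Because tanh is bounded, the per-coordinate displacement never exceeds ε, which is what keeps the augmentation in the "small deformation" regime.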
Figure 11: The original point-cloud together with its corresponding deformed coordinates.

Due to the inherent property of rotation invariance, we name our model "Rotation Invariant omni-directional network for pointsets" (or 'POIRot' for short). In the following section, we show an extensive comparison of POIRot with the state-of-the-art methods on benchmark datasets for both classification and segmentation tasks.
3 Experiments

In this section, we present experimental validations in a supervised setting for classification and segmentation tasks. For the classification task, we use three datasets: MNIST, ModelNet40, and the OASIS dataset. For the segmentation task, we perform part segmentation on the ShapeNet dataset. We also present an experiment on an unsupervised object-detection task from a 3D brain shape.
One of the challenging tasks in 3D point-cloud processing is part segmentation, which entails segmenting each point-cloud into separate categorical labels. We evaluate the performance of our proposed model on the ShapeNet part segmentation dataset [40], which consists of 16,881 shapes from 16 types of objects, annotated with 50 parts. We follow an experimental setup similar to that proposed in [2]. Our proposed segmentation scheme outperforms the state-of-the-art methods in terms of the mean intersection-over-union (mIoU) evaluation metric. The comparative results in terms of mIoU for different object categories are given in Table 1. We can see that the proposed method performs consistently well for all objects, with significant performance improvements for some objects, including airplane and table. Some representative part-segmentation results are shown in Fig. 17. We highlight some non-obvious mistakes in the annotations for ground truth and prediction with orange and blue circles, respectively. One can also see some obvious mistakes in the ground-truth annotations, e.g., in the last row, first and fifth columns. In terms of model complexity, we can see from Fig. 12 that our proposed method, POIRot, uses only a small fraction of the parameters of [2], a representative of the most commonly used state-of-the-art algorithms. This clearly indicates the usefulness of a leaner model, which not only makes it suitable for low-memory devices but also makes the optimization simpler, thereby reaching a better optimum.

Avg. airplane bag cap car chair earphone guitar knife lamp laptop motorbike mug pistol rocket skateboard table
Table 1: Part segmentation results in terms of mIoU (%) on the ShapeNet PartSeg dataset.
Figure 12: Number of parameters of POIRot compared to PointNet on the ShapeNet dataset.
Due to noise present in the labels, we relabel some labels of the dataset.
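The per-category mIoU metric reported in Table 1 can be sketched as follows. This is a minimal NumPy version under the commonly used ShapeNet PartSeg convention; the function name and the absent-part convention (IoU of 1 when a part is missing from both prediction and ground truth) are our own illustrative choices, not taken from the paper's code.

```python
import numpy as np

def mean_iou(pred, gt, num_parts):
    """Per-shape mean IoU over part labels.

    pred, gt: integer arrays of per-point part labels for one shape.
    num_parts: number of part labels valid for this shape's category.
    """
    ious = []
    for part in range(num_parts):
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        # If a part is absent from both prediction and ground truth,
        # count its IoU as 1 (a common evaluation convention).
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```

The dataset-level score is then the average of this quantity over all test shapes (or a category-wise average, depending on the protocol).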
In this section, we perform ablation studies in two ways: (a) by removing global features and (b) by including deformation, to show their usefulness. From Table 2, we can see that global features may or may not help. We conjecture that global features may be useful for relatively complicated shapes. In Table 3, we show the results with deformation. We can see that although deformation-based data augmentation makes the model robust to deformations, it affects the performance, which is due to the trade-off between performance and achieving invariance.
Table 2: Ablation of global features on the ShapeNet PartSeg dataset in terms of mIoU (%).
Table 3: With and without deformation on the ShapeNet PartSeg dataset in terms of mIoU (%).
In this section, we present experiments on point-clouds for the classification task. We have performed classification on three datasets: 1. classifying digits in the MNIST data, 2. classifying objects in the ModelNet40 data, and 3. classifying demented vs. non-demented subjects based on the shape of the corpus callosum.
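Several of the experiments below test robustness to test-time rotations drawn uniformly from SO(3). The paper does not specify its sampler, but one standard way to draw such rotations is via QR decomposition of a Gaussian matrix, sketched here:

```python
import numpy as np

def random_rotation(rng):
    """Draw a rotation matrix uniformly (Haar) from SO(3).

    The QR decomposition of a Gaussian matrix gives a Haar-distributed
    orthogonal matrix once the signs of R's diagonal are absorbed into Q;
    negating one column then maps a reflection (det = -1) into SO(3).
    """
    A = rng.standard_normal((3, 3))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))      # absorb sign ambiguity to get Haar measure
    if np.linalg.det(Q) < 0:      # reflect into the rotation group
        Q[:, 0] = -Q[:, 0]
    return Q
```

Applying such a matrix to every point of a cloud produces the "rotated" test sets referred to below.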
Classifying MNIST digits and ModelNet40 shapes:
We use MNIST digits [41] and ModelNet40 shapes [14] for classification. We follow the pre-processing steps given in [24]. We compare the performance of POIRot with PointCNN [24] in terms of model complexity and classification accuracy. We perform two sets of experiments: 1. training and testing both on non-rotated datasets, and 2. training on non-rotated and testing on rotated datasets. For all the experiments, we use sampled points as the point-cloud. We generate the rotated data by randomly drawing rotation matrices from the uniform distribution. Using only a fraction of the parameters of PointCNN, we achieve significantly better performance on the rotated testing data, as given in Table 4. We would like to point out that PointCNN achieves lower classification accuracy on MNIST and ModelNet40 even when using explicit rotated data augmentation. This clearly indicates that implicit data augmentation is more powerful than explicit data augmentation for achieving invariance. In Fig. 15, we show representative filter responses for digits ‘2’ and ‘3’ for the selected points highlighted in orange.
MNIST (251770) 91.1 ( ) (24.0)
ModelNet40 (599340) 82.8 ( ) (7.3)
Table 4: Classification results in terms of accuracy for the MNIST and ModelNet40 datasets. The results are shown as POIRot (PointCNN); ‘NR/X’ denotes non-rotated training and ‘X’ testing, where ‘X’ may be rotated or non-rotated. POIRot approximately retains its accuracy under rotated testing with a fraction of the number of parameters of the PointCNN model. For both models, we use sampled points from the point-cloud.
Classifying demented vs. non-demented subjects based on corpus callosum shapes:
In this section, we use the OASIS data [42] to address the classification of demented vs. non-demented subjects using our proposed framework. This dataset contains at least two MR brain scans per subject, with scans for each patient separated by at least one year. The dataset contains patients of both sexes. In order to avoid gender effects, we take the MR scans of male patients alone from three visits, which results in a dataset containing MR scans of subjects with and without dementia. We first compute an atlas (using the method in [43]) from the MR scans of patients without dementia. After rigidly registering each MR scan to the atlas, we segment out the corpus callosum region from each scan. We represent the shape of the corpus callosum as a 3D point-cloud.
Computing the centroid of a point-cloud:
Given the point-cloud $X = \{x_i\}_{i=1}^N \subset \mathbb{R}^3$, we compute the centroid of the point-cloud as the point of $X$ nearest to the mean of $X$. Formally, let us denote the centroid of $X$ by $m$. Then $m$ is defined as
$$m = \Pi_X\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right),$$
where $\Pi_X(x)$ is the projection of $x$ onto the set $X$.
Extracting the “attention” from a point-cloud:
Given the point-cloud $X$, we extract the region of interest, i.e., the “attention”, as a subset $Y \subset X$ as follows: (a) compute the directional part of the vector from $m$ to each point $x_i$; let this vector be denoted by $v_i$; (b) pass the vector $v_i$ through a FC layer to get the confidence $c_i \in [0, \infty)$ for selecting $x_i$; (c) define a random variable following a multinomial distribution with $c$ as the parameter; (d) draw samples from this random variable to generate $Y$. We call this subset $Y$ our region of interest. We then follow the steps described in Section 2.
We report the classification accuracy together with the sensitivity and specificity. If we remove the “attention” module, the classification accuracy drops, which clearly indicates the usefulness of the “attention” module used in this work. We show the overlaid attention region and the selected convex hull points in Fig. 13. We can see that the attention block focuses on the thinning of the corpus callosum shape in order to classify demented vs. non-demented subjects. In Fig. 14, we show that for rotated and non-rotated CC shapes the integrated responses are similar, which demonstrates the rotational invariance property.
Figure 13: (Top) sample point-clouds; (Middle) with the attention region marked in orange; (Bottom) spheres placed around the convex hull points. (The first and second columns represent samples without and with dementia respectively.)
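The centroid and attention steps above can be sketched together as follows. Here `W` and `b` are hypothetical stand-ins for the learned FC-layer parameters, and the ReLU clipping and the small epsilon guard against zero-length direction vectors are our own illustrative choices.

```python
import numpy as np

def centroid(X):
    """Centroid of a point-cloud X (N x 3): the member of X nearest to the
    arithmetic mean, i.e. the projection Pi_X of the mean onto the set X."""
    mean = X.mean(axis=0)
    return X[np.argmin(np.linalg.norm(X - mean, axis=1))]

def attention_subset(X, W, b, k, rng):
    """Attention step: unit directions from the centroid m to each point are
    scored by a fully connected layer (W, b hypothetical), scores are clipped
    to be non-negative, and k points are drawn from the induced multinomial
    distribution over the point-cloud."""
    m = centroid(X)
    V = X - m
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    V = V / np.maximum(norms, 1e-12)          # directional part v_i
    c = np.maximum(V @ W + b, 0.0)            # confidences c_i >= 0
    p = c / c.sum()                           # multinomial parameters
    idx = rng.choice(len(X), size=k, replace=False, p=p)
    return X[idx]
```

The returned subset plays the role of the region of interest $Y$ fed to the downstream classifier.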
In this section, we investigate whether the proposed method can learn the 3D geometric shape of an object of interest. We demonstrate this in an object detection framework where the task is to detect the corpus callosum (CC) region from a 3D brain scan. One of the major hurdles in dealing with medical images is image registration, which incurs errors that can carry forward to subsequent processing steps. This motivates us to solve the detection problem in an unsupervised way, where we only present an atlas of the corpus callosum shape. The only way to detect the corpus callosum shape correctly during testing is by learning the 3D geometric shape of the corpus callosum region. Hence, in our object detection task, the inputs are a 3D brain scan and an atlas of the CC region. Our proposed framework consists of five steps, as given below.
Extracting point-cloud:
Given a 3D brain scan and an atlas of the CC region, we extract the corresponding 3D point-clouds for the brain scan and the CC atlas, denoted by $X = \{x_i\}_{i=1}^N \subset \mathbb{R}^3$ and $M = \{m_i\}_{i=1}^m \subset \mathbb{R}^3$ respectively.
Figure 14: (Top) non-rotated and (Bottom) rotated point-clouds with their respective invariant feature responses.
Figure 15: Filter responses for MNIST digits ‘2’ and ‘3’.
Constructing the affinity matrix using geodesic distance:
In order to capture the shape of the corpus callosum, one needs to use the geodesic distance instead of the standard $\ell_2$ distance. For each point $x \in X$, we look at its nearest neighbors and calculate their $\ell_2$ distances. In this way, for each $x$, we construct a locally flat neighborhood $N_\epsilon(x)$, which essentially gives us an adjacency matrix $E$. We then construct a graph $G = (X, E)$ where each point is treated as a node of the graph, and there exists an edge between vertices $x_i$ and $x_j$ if $x_i \in N_\epsilon(x_j)$ or $x_j \in N_\epsilon(x_i)$. Given the graph $G$, we use the Floyd–Warshall algorithm to find all-pairs shortest paths and use the weighted length of each path as the geodesic distance. An analogous idea for obtaining geodesic distances has been used in [44]. Using the geodesic distance $d_g: X \times X \to \mathbb{R}$, we construct the affinity matrix $A = \{a_{ij}\}$ with $a_{ij} = d_g(x_i, x_j)$. We choose $\epsilon = 5$ (in terms of pixels) for our purposes.
Candidate selection:
After constructing the affinity matrix $A$, we choose potential candidates for the matching CC shape as follows. Let the candidate pool be denoted by $S = \{S_j\}$, where each $S_j$ is a potential match. For notational simplicity, we denote each potential match $S_j$ by $x_j$ and $m$. Thus, for a given $m$, we call $x_j$ a potential match if $S_j$ contains the $m$ nearest neighbors of $x_j$. At each $x_j \in S$, we construct a sphere around it and collect responses on the sphere from the $m$ nearest neighbors (in a manner analogous to Section 2). Let the spherical responses be denoted by $\{f_j : S^2 \to \mathbb{R}\}_{x_j \in S}$. We capture responses from the atlas as well, denoted by $f_m : S^2 \to \mathbb{R}$.
Spherical feature extractor and feature matching:
We use spherical correlations as described in Section 2 on both $\{f_j\}$ and $f_m$ to get the rotation-invariant features, denoted by $\{F_j \in \mathbb{R}^f\}$ and $M \in \mathbb{R}^f$ respectively, where $f$ is the dimension of the spherical feature. Now, for each $F_j$, we compute its similarity with $M$ as $F_j^T M$. Thus, for each potential match, we get a similarity score, denoted by $s_j$.
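The geodesic affinity construction from the earlier step can be sketched as follows. Here a k-nearest-neighbour rule stands in for the epsilon-ball neighbourhood $N_\epsilon(x)$ of the text, and the vectorized Floyd–Warshall update is our own implementation choice.

```python
import numpy as np

def geodesic_affinity(X, k=3):
    """Pairwise geodesic distances on a point-cloud X (N x 3):
    build a symmetrized k-NN graph with Euclidean (l2) edge weights,
    then run Floyd-Warshall all-pairs shortest paths on it."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # l2 distances
    A = np.full((n, n), np.inf)
    np.fill_diagonal(A, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:               # k nearest neighbours
            A[i, j] = A[j, i] = D[i, j]                   # symmetrize the graph
    for p in range(n):                                    # Floyd-Warshall
        A = np.minimum(A, A[:, p:p + 1] + A[p:p + 1, :])
    return A
```

The entries of the returned matrix play the role of $a_{ij} = d_g(x_i, x_j)$; on curved shapes they can substantially exceed the straight-line $\ell_2$ distances.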
Minimizing the entropy:
After obtaining the similarity scores $\{s_j\}$, we compute the probability of $S_j$ being the chosen segmented region (denoted by $p_j$) as
$$p_j = P(S_j = M) = \frac{\exp(s_j)}{\sum_{i=1}^{|S|} \exp(s_i)}.$$
Finally, we minimize the entropy given by $-\sum_{j=1}^{|S|} p_j \log(p_j)$. We select the candidate with the maximum probability, i.e., the chosen candidate is $j^* = \operatorname{argmax}_j \, p_j$. A schematic of our proposed unsupervised algorithm is given in Fig. 10. The detection results are shown in Fig. 16, which indicates that the proposed method performs well in detecting CC shapes.
Figure 16: Point-clouds from 3D brain scans (Left) with the corresponding segmented corpus callosum regions (Right). The middle subplot shows an atlas of the corpus callosum shape.
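The softmax probabilities, the entropy objective, and the argmax selection above can be sketched as follows; the max-shift for numerical stability is our own addition.

```python
import numpy as np

def candidate_probabilities(s):
    """Softmax over similarity scores s_j, the entropy of the resulting
    distribution (the quantity minimized during training), and the chosen
    candidate j* = argmax_j p_j."""
    e = np.exp(s - np.max(s))          # numerically stabilized softmax
    p = e / e.sum()
    entropy = -np.sum(p * np.log(p))
    return p, entropy, int(np.argmax(p))
```

Minimizing the entropy drives the distribution toward a single confident candidate, after which the argmax picks the detected region.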
Point-clouds help with understanding 3D geometric shapes. In this work, we proposed a definition of correlation/convolution on 3D point-clouds and have shown that our framework is “augmentation-free” and rotation invariant. Unlike previous state-of-the-art methods, our proposed framework uses a much leaner model by utilizing geometric structures in 3D point-clouds. The core of our method is the proposed rotation invariant convolution on point-clouds induced from the topology of the sphere. We achieved significantly better results than state-of-the-art algorithms on the part segmentation task on the ShapeNet dataset. On the classification task, we achieved similar results on rotated test data without explicit data augmentation. We have also tested our framework on an unsupervised object detection task, detecting the corpus callosum shape from an entire 3D brain scan. In the future, we would like to explore point-set completion and object detection on large-scale data, e.g., KITTI and various autonomous driving datasets.
References
[1] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” ArXiv, vol. abs/1602.07360, 2017.
[2] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, vol. 1, no. 2, p. 4, 2017.
[3] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5099–5108.
Figure 17: In each subfigure, we show the ground truth (left) and our segmentation result (right). Subfigures: (a)–(d) airplane, (e)–(f) bag, (g)–(h) cap, (i)–(l) earphone, (m)–(p) rocket, (q)–(t) table, (u)–(x) skateboard.
[4] C.-H. Lin, Y. Chung, B.-Y. Chou, H.-Y. Chen, and C.-Y. Tsai, “A novel campus navigation app with augmented reality and deep learning,” in . IEEE, 2018, pp. 1075–1077.
[5] J. R. Rambach, A. Tewari, A. Pagani, and D. Stricker, “Learning to fuse: A deep learning approach to visual-inertial camera pose estimation,” in Mixed and Augmented Reality (ISMAR), 2016 IEEE International Symposium on. IEEE, 2016, pp. 71–76.
[6] S. Tulsiani, S. Gupta, D. Fouhey, A. A. Efros, and J. Malik, “Factoring shape, pose, and layout from the 2D image of a 3D scene,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 302–310.
[7] S. Vasu, M. M. MR, and A. Rajagopalan, “Occlusion-aware rolling shutter rectification of 3D scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 636–645.
[8] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner, “ScanComplete: Large-scale scene completion and semantic segmentation for 3D scans,” in CVPR, vol. 1, 2018, p. 2.
[9] Y. Chen, J. Wang, J. Li, C. Lu, Z. Luo, H. Xue, and C. Wang, “LiDAR-video driving dataset: Learning driving policies effectively,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5870–5878.
[10] S. Shen, “Stereo vision-based semantic 3D object and ego-motion tracking for autonomous driving,” arXiv preprint arXiv:1807.02062, 2018.
[11] B. Wu, A. Wan, X. Yue, and K. Keutzer, “SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud,” in . IEEE, 2018, pp. 1887–1893.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in . IEEE, 2009, pp. 248–255.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[14] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.
[15] G. Riegler, A. O. Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations at high resolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 3, 2017.
[16] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong, “O-CNN: Octree-based convolutional neural networks for 3D shape analysis,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 72, 2017.
[17] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 945–953.
[18] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5648–5656.
[19] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri, “3D shape segmentation with projective convolutional networks,” in Proc. CVPR, vol. 1, no. 2, 2017, p. 8.
[20] L. Zhou, S. Zhu, Z. Luo, T. Shen, R. Zhang, M. Zhen, T. Fang, and L. Quan, “Learning and matching multi-view descriptors for registration of point clouds,” arXiv preprint arXiv:1807.05653, 2018.
[21] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari, “Fully-convolutional point networks for large-scale point clouds,” arXiv preprint arXiv:1808.06840, 2018.
[22] M. Gadelha, R. Wang, and S. Maji, “Multiresolution tree networks for 3D point cloud processing,” arXiv preprint arXiv:1807.03520, 2018.
[23] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” arXiv preprint arXiv:1801.07829, 2018.
[24] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Convolution on X-transformed points,” in Advances in Neural Information Processing Systems, 2018, pp. 820–830.
[25] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in International Conference on Artificial Neural Networks. Springer, 1997, pp. 583–588.
[26] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in . IEEE, 2015, pp. 922–928.
[27] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2088–2096.
[28] R. Klokov and V. Lempitsky, “Escape from cells: Deep kd-networks for the recognition of 3D point cloud models,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 863–872.
[29] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz, “SPLATNet: Sparse lattice networks for point cloud processing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2530–2539.
[30] A. Adams, J. Baek, and M. A. Davis, “Fast high-dimensional filtering using the permutohedral lattice,” in Computer Graphics Forum, vol. 29, no. 2. Wiley Online Library, 2010, pp. 753–762.
[31] V. Jampani, M. Kiefel, and P. V. Gehler, “Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4452–4461.
[32] L. Landrieu and M. Simonovsky, “Large-scale point cloud semantic segmentation with superpoint graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4558–4567.
[33] T. Cohen, M. Geiger, J. Köhler, and M. Welling, “Convolutional networks for spherical signals,” arXiv preprint arXiv:1709.04893, 2017.
[34] R. Chakraborty, M. Banerjee, and B. C. Vemuri, “A CNN for homogeneous Riemannian manifolds with applications to neuroimaging,” arXiv preprint arXiv:1805.05487, 2018.
[35] S. Helgason, Differential Geometry, Lie Groups, and Symmetric Spaces. Academic Press, 1979, vol. 80.
[36] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[37] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
[38] E. W. Hobson, The Theory of Spherical and Ellipsoidal Harmonics. CUP Archive, 1931.
[39] Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki, “Tensor ring decomposition,” arXiv preprint arXiv:1606.05535, 2016.
[40] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “ShapeNet: An information-rich 3D model repository,” arXiv preprint arXiv:1512.03012, 2015.
[41] L. Deng, “The MNIST database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[42] A. F. Fotenos, A. Snyder, L. Girton, J. Morris, and R. Buckner, “Normative estimates of cross-sectional and longitudinal brain volume decline in aging and AD,” Neurology, vol. 64, no. 6, pp. 1032–1039, 2005.
[43] B. B. Avants, N. Tustison, and G. Song, “Advanced normalization tools (ANTS),” Insight J., vol. 2, pp. 1–35, 2009.
[44] F. Wang, B. C. Vemuri, A. Rangarajan, and S. J. Eisenschenk, “Simultaneous nonrigid registration of multiple point sets and atlas construction,”