Deep Multimodal Subspace Clustering Networks
Mahdi Abavisani, Student Member, IEEE, and Vishal M. Patel, Senior Member, IEEE
Abstract—We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages: a multimodal encoder, a self-expressive layer, and a multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data. The network uses the distance between the decoder's reconstruction and the original input in its training. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for the different spatial fusion-based approaches. In addition to the various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressive layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.
Index Terms—Deep multimodal subspace clustering, subspace clustering, multimodal learning, multi-view subspace clustering.
I. INTRODUCTION

Many practical applications in image processing, computer vision, and speech processing require one to process very high-dimensional data. However, these data often lie in a low-dimensional subspace. For instance, facial images with variation in illumination [1], handwritten digits [2] and trajectories of a rigidly moving object in a video [3] are examples where the high-dimensional data can be represented by low-dimensional subspaces. Subspace clustering algorithms essentially use this fact to find clusters in different subspaces within a dataset [4]. In other words, in a subspace clustering task, given the data from a union of subspaces, the objective is to find the number of subspaces, their dimensions, the segmentation of the data and a basis for each subspace [4]. This problem has numerous applications including motion segmentation [5], unsupervised image segmentation [6], image representation and compression [7] and face clustering [8]. Various subspace clustering methods have been proposed in the literature [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. In particular, methods based on sparse and low-rank representation have gained a lot of attention in recent years [21], [22], [14], [15], [23], [24], [25], [26]. These methods exploit the fact that noiseless data in a union of subspaces are self-expressive, i.e., each data point can be expressed as a sparse linear combination of other data points.
M. Abavisani is with the Department of Electrical and Computer Engineering at Rutgers University, Piscataway, NJ, USA (e-mail: [email protected]). V. M. Patel is with the Department of Electrical and Computer Engineering at Johns Hopkins University, Baltimore, MD, USA (e-mail: [email protected]).

Fig. 1. An overview of the proposed deep multimodal subspace clustering framework. Note that the network consists of three blocks: a multimodal encoder, a self-expressive layer, and a multimodal decoder. The weights in the self-expressive layer, Θ_s, are used to construct the affinity matrix. We present several models for the multimodal encoder.

The self-expressiveness property was also recently investigated in [16] to develop a deep convolutional neural network (CNN) for subspace clustering. This deep learning-based method was shown to significantly outperform the state-of-the-art subspace clustering methods.
In the case where the data consist of multiple modalities or views, multimodal subspace clustering methods can be employed to simultaneously cluster the data in the individual modalities according to their subspaces [27], [28], [29], [30], [31], [32], [33], [34], [35], [36]. Some of the multimodal subspace clustering methods make use of the kernel trick to map the data onto a high-dimensional feature space to achieve better clustering [36].
Motivated by the recent advances in deep subspace clustering [16] as well as multimodal data processing using CNNs [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], in this paper, we propose a different approach to the problem of multimodal subspace clustering. We present a novel CNN-based autoencoder approach in which a fully-connected layer is introduced between the encoder and the decoder which mimics the self-expressiveness property that has been widely used in various subspace clustering algorithms.
Figure 1 gives an overview of the proposed deep multimodal subspace clustering framework. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data from the latent features. The network uses the distance between the decoder's reconstruction and the original input in its training.
For encoding the multimodal data into a latent space, we investigate three different spatial fusion techniques based on late, early and intermediate fusion. These fusion techniques are motivated by the deep multimodal learning methods in supervised learning tasks [47], [48], which provide the representation of modalities across spatial positions. In addition to the spatial fusion methods, we propose an affinity fusion-based network in which the self-expressive layer corresponding to different modalities is enforced to be the same.
For both the spatial and the affinity fusion-based methods, we formulate an end-to-end training objective. Key contributions of our work are as follows:
• A deep learning-based multimodal subspace clustering framework is proposed in which the self-expressiveness property is encoded in the latent space by using a fully connected layer.
• Novel encoder network architectures corresponding to late, early and intermediate fusion are proposed for fusing multimodal data.
• An affinity fusion-based network architecture is proposed in which the self-expressive layer is enforced to have the same weights across the latent representations of all the modalities.
To the best of our knowledge, this is the first attempt that proposes to use deep learning for multimodal subspace clustering. Furthermore, the proposed method obtains the state-of-the-art results on various multimodal subspace clustering datasets. Code is available at: https://github.com/mahdiabavisani/Deep-multimodal-subspace-clustering-networks.
This paper is organized as follows. Related works on subspace clustering and multimodal learning are presented in Section II. The proposed spatial fusion-based and affinity fusion-based multimodal subspace clustering methods are presented in Sections III and IV, respectively. Experimental results are presented in Section V, and finally, Section VI concludes the paper with a brief summary.

II. RELATED WORK
In this section, we review some related works on subspace clustering and multimodal learning.
A. Sparse and Low-rank Representation-based Subspace Clustering
Let X = [x_1, · · · , x_N] ∈ R^{D×N} be a collection of N signals {x_i ∈ R^D}_{i=1}^{N} drawn from a union of n linear subspaces S_1 ∪ S_2 ∪ · · · ∪ S_n of dimensions {d_ℓ}_{ℓ=1}^{n} in R^D. Given X, the task of subspace clustering is to find sub-matrices X_ℓ ∈ R^{D×N_ℓ} that lie in S_ℓ with N_1 + N_2 + · · · + N_n = N. The sparse subspace clustering (SSC) [21] and low-rank representation-based subspace clustering (LRR) [22] algorithms exploit the fact that noiseless data in a union of subspaces are self-expressive. In other words, it is assumed that each data point can be represented as a linear combination of other data points. Hence, these algorithms aim to find the sparse or low-rank matrix C by solving the following optimization problem

min_C ‖C‖_p + λ ‖X − XC‖_F^2,   (1)

where ‖·‖_p is the ℓ1-norm in the case of SSC [21] and the nuclear norm in the case of LRR [22]. Here, λ is a regularization parameter. In addition, to prevent the trivial solution C = I, an additional constraint of diag(C) = 0 is added to the above optimization problem in the case of SSC. Once C is found, spectral clustering methods [49] are applied on the affinity matrix W = |C| + |C|^T to obtain the segmentation of the data X. Non-linear versions of the SSC and LRR algorithms have also been proposed in the literature [23], [24].

Fig. 2. An overview of the DSC framework proposed in [16] for unimodal subspace clustering.
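To make the self-expressiveness step concrete, the sketch below approximates problem (1) column by column with a lasso solver and then applies spectral clustering to the resulting affinity matrix. This is only an illustrative NumPy/scikit-learn sketch, not the solvers used in [21] or [22]; the regularization strength `alpha` and the number of clusters are assumed inputs.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_sketch(X, n_clusters, alpha=0.01):
    """X: (D, N) data matrix with one sample per column."""
    D, N = X.shape
    C = np.zeros((N, N))
    for i in range(N):
        # Express x_i as a sparse combination of the remaining columns;
        # excluding column i enforces the diag(C) = 0 constraint of SSC.
        idx = [j for j in range(N) if j != i]
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[:, idx], X[:, i])
        C[idx, i] = lasso.coef_
    W = np.abs(C) + np.abs(C).T          # affinity matrix W = |C| + |C|^T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(W)
    return labels
```

In the deep networks discussed next, the coefficient matrix C is instead learned by a trainable layer inside the network.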
B. Deep Subspace Clustering

The deep subspace clustering network (DSC) [16] explores the self-expressiveness property by embedding the data into a latent space using an encoder-decoder type network. Figure 2 gives an overview of the DSC method for unimodal subspace clustering. The method optimizes an objective similar to that of (1), but the matrix C is approximated using a trainable dense layer embedded within the network. Let us denote the parameters of the self-expressive layer as Θ_s. Note that these parameters are essentially the elements of C in (1). The following loss function is used to train the network

min_{Θ̃} ‖Θ_s‖_p + λ1 ‖Z_{Θ_e} − Z_{Θ_e} Θ_s‖_F^2 + λ2 ‖X − X̂_{Θ̃}‖_F^2   s.t.  diag(Θ_s) = 0,   (2)

where Z_{Θ_e} denotes the output of the encoder, and X̂_{Θ̃} is the reconstructed signal at the output of the decoder. Here, the network parameters Θ̃ consist of the encoder parameters Θ_e, the decoder parameters Θ_d and the self-expressive layer parameters Θ_s, and λ1 and λ2 are two regularization parameters.
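The trainable dense layer that plays the role of C in (2) can be sketched as follows. This is a hypothetical TensorFlow 2 layer written for illustration only (the released implementation uses TensorFlow 1.4); the initializer scale and the row-wise convention for the latent codes are assumptions.

```python
import tensorflow as tf

class SelfExpressiveLayer(tf.keras.layers.Layer):
    """Trainable coefficient matrix Theta_s that re-expresses each latent
    code as a combination of the codes of the other samples."""
    def __init__(self, n_samples, **kwargs):
        super().__init__(**kwargs)
        self.theta_s = self.add_weight(
            name="theta_s", shape=(n_samples, n_samples),
            initializer=tf.keras.initializers.RandomNormal(stddev=1e-4),
            trainable=True)

    def call(self, z):
        # z: (N, d) latent codes of the whole batch; zero the diagonal so a
        # sample cannot reconstruct itself (diag(Theta_s) = 0).
        c = self.theta_s - tf.linalg.diag(tf.linalg.diag_part(self.theta_s))
        return tf.matmul(c, z), c  # re-expressed codes and the coefficients
```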
C. Multimodal Subspace Clustering

A number of multimodal and multiview subspace clustering approaches have been developed in recent years. Bickel et al. introduced Expectation Maximization (EM) based and agglomerative multiview clustering methods in [33]. White et al. [32] provided a convex reformulation of multiview subspace learning that, as opposed to local formulations, enables global learning. Some algorithms use dimensionality reduction methods such as Canonical Correlation Analysis (CCA) to project the multiview data onto a low-dimensional subspace for clustering [28], [34]. Some other multimodal methods are specifically designed for two views and cannot be easily generalized to multiple views [50], [35]. Kumar et al. [29] proposed a co-regularization method that enforces the clusterings to be aligned in different views. Zhao et al. [30] use the output of clustering in one view to learn discriminant subspaces in another view.
Fig. 3. Different network architectures corresponding to (a) early fusion, (b) intermediate fusion, and (c) late fusion. Note that in all the spatial fusion-based networks (a)-(c), the overall structure for the self-expressive layer and the multimodal decoder remains the same. (d) Network architecture corresponding to affinity fusion. In this case, the encoder and decoder are trained separately for each modality, but are forced to have the same self-expressive layer.
Fig. 4. In spatial fusion methods, each location of the fused representation is related to the input values at the same location. In this special case, the facial components (i.e., eyes, nose and mouth) are aligned across all the modalities (i.e., DP, S0, S1, S2, Visible).

A multiview subspace clustering method, called Low-rank Tensor constrained Multiview Subspace Clustering (LT-MSC), was recently proposed in [26]. In the LT-MSC method, all the subspace representations are integrated into a low-rank tensor, which captures the high-order correlations underlying the multiview data. In [51], a diversity-induced multiview subspace clustering was proposed in which the Hilbert-Schmidt independence criterion was utilized to explore the complementarity of multiview representations. Recently, [52] proposed a constrained multi-view video face clustering (CMVFC) framework in which pairwise constraints are employed in both the sparse subspace representation and the spectral clustering procedures for multimodal face clustering. A collaborative image segmentation framework, called Multi-task Low-rank Affinity Pursuit (MLAP), was proposed in [27]. In this method, the sparsity-consistent low-rank affinities from the joint decompositions of multiple feature matrices into pairs of sparse and low-rank matrices are exploited for segmentation.
D. Deep Multimodal Learning
In multimodal learning problems, the idea is to use the complementary information provided by the different modalities to enhance the recognition performance. Supervised deep multimodal learning was first introduced in [37], [38], and has gained a lot of attention in recent years [53], [54], [40]. Keila et al. [47] investigated deep multimodal classification of large-scale datasets. They compared a number of multimodal fusion methods in terms of accuracy and computational efficiency, and provided an analysis regarding the interpretability of multimodal classification models.
Feichtenhofer et al. [48] proposed a convolutional fusion method for two-stream 3D networks. They explored multiple fusion functions within deep architectures and studied the importance of learning the correspondences between spatial and temporal feature maps. Various deep supervised multimodal fusion approaches have also been proposed in the literature for different applications including medical image analysis [55], [56], visual recognition [41], [40] and visual question answering [53], [43]. We refer readers to [39] for a more detailed survey of deep supervised multimodal fusion methods.
While most of the deep multimodal approaches have reported improvements in supervised tasks, to the best of our knowledge, there is no deep multimodal learning method specifically designed for unsupervised subspace clustering.

III. SPATIAL FUSION-BASED DEEP MULTIMODAL SUBSPACE CLUSTERING
In this section, we present the details of the proposed spatial fusion-based networks for unsupervised subspace clustering. Spatial fusion methods find a joint representation that contains complementary information from the different modalities. The joint representation has a spatial correspondence to every modality. Figure 4 shows a visual example of spatial fusion where five different modalities (DP, S0, S1, S2, Visible) are combined to produce a fused result z. Spatial fusion methods are especially popular in supervised multimodal learning applications [47], [48]. We investigate applying these fusion techniques to our problem of deep subspace clustering. An essential component of such methods is the fusion function that merges the information from multiple input representations and returns a fused output. In the case of deep networks, flexibility in the choice of the fusion network leads to different models. In what follows, we investigate several network designs and spatial fusion functions for multimodal subspace clustering. Then, we formulate an end-to-end training objective for the proposed networks.

A. Fusion Structures
We build our deep multimodal subspace clustering networks based on the architecture proposed in [16] for unimodal subspace clustering. Our framework consists of three main components: an encoder, a fully connected self-expressive layer, and a decoder. We propose to achieve the spatial fusion using an encoder, and the fused representation is then fed to a self-expressive layer which exploits the self-expressiveness property of the joint representation. The joint representation resulting from the output of the self-expressive layer is then fed to a multimodal decoder that reconstructs the different modalities from the joint latent representation.
For the case of M input modalities, the decoder consists of M branches, each reconstructing one of the modalities. The encoders, on the other hand, can be designed such that they achieve early, late or intermediate fusion. Early fusion refers to the integration of the multimodal data at the feature level, before feeding them to the network. Late fusion, on the other hand, involves the integration of the multimodal data in the last stage of the network. The flexibility of deep networks also offers a third type of fusion, known as intermediate fusion, where the feature maps from intermediate layers of a network are combined to achieve a better joint representation. Figures 3 (a), (b) and (c) give an overview of deep multimodal subspace clustering networks with the different spatial fusion structures; a schematic sketch of the early and late fusion encoders is given below. Note that the multimodal decoder's structure remains the same in all three cases. It is worth mentioning that in the case of intermediate fusion, it is a common practice to aggregate the weak or correlated modalities at earlier stages and combine the remaining strong modalities at the in-depth stages [39].
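The sketch below contrasts an early fusion encoder (modalities concatenated at the pixel level) with a late fusion encoder (one convolutional branch per modality, fused at the deepest layer), using the Keras functional API. The filter counts and kernel sizes are illustrative assumptions and do not reproduce the architectures listed in the Appendix.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_stack(x):
    # Four-layer convolutional encoder body (filter sizes are assumed).
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same",
                          activation="relu")(x)
    return x

def early_fusion_encoder(input_shape, n_modalities):
    inputs = [tf.keras.Input(shape=input_shape) for _ in range(n_modalities)]
    fused = layers.Concatenate(axis=-1)(inputs)      # fuse at the pixel level
    latent = conv_stack(fused)
    return tf.keras.Model(inputs, latent)

def late_fusion_encoder(input_shape, n_modalities):
    inputs = [tf.keras.Input(shape=input_shape) for _ in range(n_modalities)]
    branches = [conv_stack(x) for x in inputs]       # one branch per modality
    latent = layers.Concatenate(axis=-1)(branches)   # fuse the deepest maps
    return tf.keras.Model(inputs, latent)
```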
B. Fusion Functions

Assume that for a particular data point x_i there are M feature maps corresponding to the representations of the different modalities. A fusion function f : {x^1, x^2, · · · , x^M} → z fuses the M feature maps and produces an output z. For simplicity we assume that all the input feature maps have the same dimension R^{H×W×d_in}, and the output has the dimension R^{H×W×d_out}. In fact, deep network structures offer the design option of having feature maps with the same dimensions. We use z_{i,j,k} and x^m_{i,j,k} to denote the value at the spatial position (i, j, k) in the output and in the m-th input feature map, respectively. Various fusion functions can be used to combine the input feature maps. Below, we investigate a few.
1) Sum fusion, z = sum(x^1, x^2, · · · , x^M): computes the sum of the feature maps at the same spatial positions as follows

z_{i,j,k} = Σ_{m=1}^{M} x^m_{i,j,k}.   (3)
2) Max-pooling function, z = max(x^1, x^2, · · · , x^M): returns the maximum value at the corresponding location of the input feature maps as follows

z_{i,j,k} = max{ x^1_{i,j,k}, x^2_{i,j,k}, · · · , x^M_{i,j,k} }.   (4)
3) Concatenation function, z = cat(x^1, x^2, · · · , x^M): constructs the output by concatenating the input feature maps as follows

z = [ x^1, x^2, · · · , x^M ],   (5)

where each input has the dimension R^{H×W×d_in} and the output has the dimension R^{H×W×(d_in×M)}. Note that these fusion functions are denoted as "Fusion" in the blue boxes in Figure 3 (a)-(c).
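Assuming feature maps stored as tensors of shape (N, H, W, d_in), the three fusion functions in (3)-(5) reduce to one-liners; the snippet below is a generic TensorFlow sketch rather than the paper's implementation.

```python
import tensorflow as tf

def sum_fusion(feature_maps):
    # Element-wise sum of the M feature maps, as in (3).
    return tf.add_n(feature_maps)

def max_fusion(feature_maps):
    # Element-wise maximum across the M modalities, as in (4).
    return tf.reduce_max(tf.stack(feature_maps, axis=0), axis=0)

def concat_fusion(feature_maps):
    # Concatenation along the channel axis: d_in * M output channels, as in (5).
    return tf.concat(feature_maps, axis=-1)
```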
C. End-to-End Training Objective

Given N paired data samples {x^1_i, x^2_i, · · · , x^M_i}_{i=1}^{N} from M different modalities, define the corresponding data matrices as X^m = [x^m_1, x^m_2, · · · , x^m_N], m ∈ {1, · · · , M}. Regardless of the network structure and the fusion function of choice, let Θ_{M.e} denote the parameters of the multimodal encoder. Similarly, let Θ_s be the self-expressive layer parameters and Θ_{M.d} be the multimodal decoder parameters. Then the proposed spatial fusion models can be trained end-to-end using the following loss function

min_Θ ‖Θ_s‖_p + λ1 ‖Z_{Θ_{M.e}} − Z_{Θ_{M.e}} Θ_s‖_F^2 + λ2 Σ_{m=1}^{M} ‖X^m − X̂^m_Θ‖_F^2   s.t.  diag(Θ_s) = 0,   (6)

where Θ denotes all the trainable network parameters including Θ_{M.e}, Θ_s and Θ_{M.d}. The joint representation is denoted by Z_{Θ_{M.e}}, and X̂^m_Θ is the reconstruction of X^m. Here, λ1 and λ2 are two regularization parameters, and ‖·‖_p can be either the ℓ1 or the ℓ2 norm.

IV. AFFINITY FUSION-BASED DEEP MULTIMODAL SUBSPACE CLUSTERING
In this section, we propose a new method for fusing the affinities across the data modalities to achieve better clustering. Spatial fusion methods require the samples from different modalities to be aligned (see Figure 4) to achieve better clustering. In contrast, the proposed affinity fusion approach combines the similarities from the self-expressive layer to obtain a joint representation of the multimodal data. This is done by enforcing the network to have a joint affinity matrix, which avoids the need for aligned data and does not increase the dimensionality of the fused output (as concatenation does). The motivation for enforcing a shared affinity matrix is that data that are similar (dissimilar) in one modality should be similar (dissimilar) in the other modalities as well. Figure 5 shows an example of the proposed affinity fusion method, obtained by forcing the modalities to share the same affinity matrix.
In the DSC framework [16], the affinity matrix is calculated from the self-expressive layer weights as

W = |Θ_s| + |Θ_s^T|,

where Θ_s corresponds to the self-expressive layer weights learned by an end-to-end training strategy [16]. Thus a shared Θ_s results in a common W across the modalities. We enforce the modalities to share a common Θ_s while having different encoders, decoders and latent representations.

A. Network Structure
For an M-modality problem, we propose to stack M parallel DSC networks that share a common self-expressive layer. In this model, one encoder-decoder network is trained per modality. In contrast to the spatial fusion models, which have only one joint latent representation, this model results in M distinct latent representations corresponding to the M different modalities. The latent representations are connected together by sharing the self-expressive layer. The optimal self-expressive layer should be able to jointly exploit the self-expressiveness property across all the M modalities. Figure 3 (d) gives an overview of the proposed affinity fusion-based network architecture.

Algorithm 1: Spatial and affinity fusion algorithms
procedure DMSC({X^m}_{m=1}^{M}, λ1, λ2, mode)
    if mode = spatial fusion then
        Train the networks using the loss (6).
    else if mode = affinity fusion then
        Train the networks using the loss (7).
    end if
    Extract Θ_s from the trained networks.
    Normalize the columns of Θ_s as θ_{s_i} ← θ_{s_i} / ‖θ_{s_i}‖_∞.
    Form a similarity graph with N nodes and set the weights on the edges by W = |Θ_s| + |Θ_s^T|.
    Apply spectral clustering to the similarity graph.
end procedure
Output: segmented multimodal data.
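The post-processing steps of Algorithm 1 (column normalization, affinity construction, and spectral clustering) can be reproduced with standard tools, as in the hedged NumPy/scikit-learn sketch below; the small constant added to the denominator is only a numerical-stability assumption.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_self_expressive(theta_s, n_clusters):
    # Normalize each column by its largest absolute entry (infinity norm).
    c = theta_s / (np.abs(theta_s).max(axis=0, keepdims=True) + 1e-12)
    # Symmetric affinity matrix W = |C| + |C^T|.
    w = np.abs(c) + np.abs(c).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed',
                                assign_labels='discretize').fit_predict(w)
    return labels
```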
B. End-to-End Training
We propose to find the shared self-expressive layer weights by training the networks with the following loss

min_Θ ‖Θ_s‖_p + λ1 Σ_{m=1}^{M} ‖Z^m_{Θ^m_e} − Z^m_{Θ^m_e} Θ_s‖_F^2 + λ2 Σ_{m=1}^{M} ‖X^m − X̂^m_{Θ^m}‖_F^2   s.t.  diag(Θ_s) = 0,   (7)

where Θ_s contains the common self-expressive layer weights. Here, λ1 and λ2 are regularization parameters. Z^m_{Θ^m_e} and X̂^m_{Θ^m} are respectively the latent space representation and the reconstructed decoder output corresponding to X^m. Θ^m denotes the network parameters corresponding to the m-th modality and Θ denotes all the trainable parameters. Minimizing (7) encourages the networks to learn latent representations that share the same affinity matrix.
Algorithm 1 summarizes the proposed spatial fusion and affinity fusion-based subspace clustering methods. Details of the different network architectures used in this paper are given in the Appendix.
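Under the Frobenius-norm choice (p = 2) used in the experiments, the objective in (7) can be written compactly as below. This is an illustrative TensorFlow 2 re-implementation with a single shared Θ_s across modalities, not the authors' original code.

```python
import tensorflow as tf

def affinity_fusion_loss(xs, x_hats, zs, theta_s, lam1, lam2):
    """xs, x_hats, zs: lists with one entry per modality; zs[m] has shape
    (N, d_m). theta_s: shared (N, N) self-expressive coefficients."""
    c = theta_s - tf.linalg.diag(tf.linalg.diag_part(theta_s))  # diag = 0
    reg = tf.reduce_sum(tf.square(c))                           # ||Theta_s||_F^2
    se = tf.add_n([tf.reduce_sum(tf.square(z - tf.matmul(c, z))) for z in zs])
    rec = tf.add_n([tf.reduce_sum(tf.square(x - xh))
                    for x, xh in zip(xs, x_hats)])
    return reg + lam1 * se + lam2 * rec
```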
V. EXPERIMENTAL RESULTS

We evaluate the proposed deep multimodal subspace clustering methods on several real-world multimodal datasets. The following datasets are used in our experiments.
• Multiview digit clustering using the MNIST [57] and the USPS [58] handwritten digits datasets. Here, we view images from the individual datasets as two views of the same digit. These datasets are considered to be spatially related but not aligned. Since the number of parameters in the self-expressive layer of a deep subspace clustering network scales quadratically with the size of the data, we randomly select 200 samples per digit to keep the networks at a tractable size.
• Heterogeneous face clustering using the ARL polarimetric face dataset [59]. The ARL dataset contains five spatially well-aligned modalities (Visible, DP, S0, S1, S2).
• Face clustering based on facial regions using the Extended Yale-B dataset [60]. We extract facial components (i.e., eyes, nose, mouth) from the images, view them as soft biometrics, and use them along with the entire face for clustering. Here, the modalities do not share any direct spatial correspondence.
Fig. 5. An example of affinity fusion. Affinities corresponding to different modalities are combined to produce only a single shared affinity. This method does not rely on spatial relations across the different modalities. Instead, it aggregates the similarities among data points across the different modalities and returns a shared affinity.
TABLE I
Details of the multimodal datasets that are used in the experiments. Note that, as opposed to supervised methods, we do not split datasets into training and testing sets in a deep subspace clustering task.

| Experiment | Dataset | # of modalities | # of samples per modality |
| Digits | MNIST [57], USPS [58] | 2 | 2000 |
| Heterogeneous faces | ARL [59] | 5 | 2160 |
| Facial components | Extended Yale-B [60] | 5 | 2432 |

Figure 6 (a), (b), and (c) show sample images from the digits, ARL and Extended Yale-B datasets, respectively. Table I gives an overview of their details. Note that, as opposed to supervised methods, we do not split the datasets into training and testing sets for subspace clustering. Similar to [16], the parameters of the deep subspace clustering networks are trained using the entire dataset.
To investigate the abilities and limitations of different versions of the proposed fusion methods, we evaluate the affinity fusion method along with a wide range of plausible spatial fusion methods based on different structure designs and fusion functions. For the early fusion structure, we consider only the concatenation fusion function, since applying the max-pooling and additive functions on pixel-level features might result in information loss. As for the intermediate and late fusion structures, we consider all three presented fusion functions, which results in six distinct models. Table II lists the structural variations used for the presented spatial fusion methods and the names we assign to them when reporting their performance. Besides, we compare our methods against the following state-of-the-art multimodal subspace clustering baselines: CMVFC [52], TM-MSC [26], MSSC [36], MLRR [36], KMSSC [36], and KMLRR [36].
TABLE II
Spatial fusion variations that are used in the experiments.

| Structure | Max-pooling | Additive | Concatenation |
| Early fusion | × | × | Early-concat. |
| Intermediate fusion | Interm.-mpool. | Interm.-additive | Interm.-concat. |
| Late fusion | Late-mpool. | Late-additive | Late-concat. |

TABLE III
The performance of single modality subspace clustering methods on Digits. Experiments are evaluated by average ACC, NMI and ARI over several runs. We use boldface for the top performer. Columns specify the single modality subspace clustering method (DSC [16], AE+SSC, SSC [21], LRR [22]), and rows specify the modality (MNIST or USPS) and criteria.

Also, to explore the contribution of leveraging information from multiple modalities to the performance of the subspace clustering task, we report the performance of subspace clustering methods on the single modalities as well. In particular, we report the classical SSC [21] and LRR [22] performances on the individual modalities along with the recently proposed DSC method [16]. Furthermore, we train an encoder-decoder similar to the network in [16] but without the self-expressive layer, and extract the latent space representations. These deep features are then fed to the SSC algorithm for clustering. We call this method "AE+SSC". This baseline shows the significance of using an end-to-end deep learning method for subspace clustering. In our tables, we use boldface letters to denote the top performing method, and specify the corresponding modalities or datasets in the rows and the subspace clustering methods in the columns.
Fig. 6. Sample images from (a) the MNIST [57] and USPS [58] digits datasets, (b) the ARL polarimetric face dataset [59], and (c) faces and facial components from the Extended Yale-B dataset [60]. In our experiments, samples from all the modalities are resized to the same size and rescaled to have pixel values between 0 and 255.

Structures:
We perform all the experiments on the different datasets using the same protocol and network architectures to ensure fair and meaningful comparisons (including the networks for the single modality experiments). All the encoders have four convolutional layers, and the decoders stack three deconvolution layers mimicking the inverse task of the encoder. The network details are given in the Appendix. For the spatial fusion experiments, in the case of early fusion, we apply the fusion functions on the pixel intensities, and the rest of the network is similar to that of the single modality deep subspace clustering network. The experiments for intermediate fusion use prior knowledge about the importance of the modalities: they integrate the weak modalities in the second hidden layer, combine the result in the third layer, and finally fuse the combination of all the weak modalities with the strong modality (for example, the visible domain in the ARL dataset) in the fourth layer. In the case of late fusion, all the modalities are fused in the fourth layer of the encoder.
As discussed earlier, in the affinity fusion method there is one encoder-decoder and one latent space per available modality. For example, in the case of the ARL dataset with 5 modalities, we have 5 distinct encoders and decoders connected by a shared self-expressive layer. For each modality in the experiments with the shared affinity, we use encoder-decoders similar to those of the DSC network [16] in the unimodal experiments.
Training details:
We implemented our method in Python 2 with TensorFlow 1.4 [61]. We use the adaptive momentum-based gradient descent method (ADAM) [62] to minimize our loss functions, with a fixed learning rate. The input images of all the modalities are resized to the same size and rescaled to have pixel values between 0 and 255. In our experiments, the Frobenius norm (i.e., p = 2) is used in the loss functions (2), (6) and (7) while training the networks. Similar to [16], for all the methods that have a self-expressive layer, we start training on the specified objective function of each model after a stage of pre-training on the dataset without the self-expressive layer. In particular, for all the proposed deep multimodal subspace clustering methods, and the unimodal DSC networks in the experiments with individual modalities, we pre-train the encoder-decoders for a fixed number of epochs with the following objective

min_{Θ̂} Σ_{m=1}^{M} ‖X^m − X̂^m_{Θ̂}‖_F^2,

where Θ̂ indicates the union of parameters in the encoder and decoder networks. Note that for the unimodal experiments, M = 1. We use a fixed batch size for the pre-training stage of all the experiments. However, once we start training the self-expressive layer, the method requires all the data points to be fed as a single batch. Thus, in the experiments with digits, ARL faces and Yale-B facial components the batch sizes are 2000, 2160 and 2432, respectively. We set the regularization parameters λ1 and λ2 following the experimental rule of [16], which scales the reconstruction weight with the number of subjects K in the dataset. A sensitivity analysis in Section V-E shows that as long as λ1 and λ2 are kept around the same scale as our selections, the performance of the proposed method is not very sensitive to these parameters over a wide range.
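As a rough illustration of this two-stage schedule, the hypothetical TensorFlow 2 snippet below performs the reconstruction-only pre-training step for a list of per-modality encoder-decoder pairs; the learning rate is a placeholder, since the exact value is not reproduced here.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # placeholder rate

def pretrain_step(xs, encoders, decoders):
    # Stage 1: reconstruction-only pre-training, no self-expressive layer.
    with tf.GradientTape() as tape:
        loss = tf.add_n([tf.reduce_sum(tf.square(x - dec(enc(x))))
                         for x, enc, dec in zip(xs, encoders, decoders)])
    variables = [v for net in list(encoders) + list(decoders)
                 for v in net.trainable_variables]
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```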
Evaluation metrics: We compare the performance of the different methods using the clustering accuracy rate (ACC), normalized mutual information (NMI) [63], and Adjusted Rand Index (ARI) [64] metrics. In external validation of clustering methods, where ground-truth labels are available, a correct clustering is usually understood as assigning objects belonging to the same category in the ground truth to the same cluster, and objects belonging to different categories to different clusters. With that, ACC is defined as the number of data points correctly clustered divided by the total number of data points. The ARI metric, in addition to penalizing misclustered data points, penalizes putting two objects with the same label in different clusters, and is adjusted such that a random clustering scores close to 0. The NMI captures the mutual information between the correct labels and the predicted labels, and is normalized to the range [0, 1].
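These three metrics can be computed with standard tools; clustering accuracy additionally requires the best one-to-one matching between predicted clusters and ground-truth labels, which the Hungarian algorithm provides. The following is a generic SciPy/scikit-learn sketch, not the exact evaluation script used for the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy; assumes integer labels starting from 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

# NMI and ARI come directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```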
A. Handwritten Digits
In the first set of experiments, we use the 10 classes (i.e., digits) from the MNIST and the USPS datasets. Figure 6 (a) shows example images from these datasets. For the experiments with digits, we randomly sample 200 images per class from their training sets to reduce the computations and adjust the imbalance in the tests. We randomly bundle the same-class samples across the two datasets and assume they present two modalities (views) of a digit. One can see from Figure 6 (a) that the receptive field needed for recognizing the digits in the MNIST and the USPS datasets is relatively large. Based on this logic, in the experiments with digits we use large kernels in the encoders. The detailed network settings for these experiments are described in the Appendix. Note that some structures, including the late fusion methods in Table II and the affinity fusion method, have more than one branch in some of their layers.
Table III shows the performance of deep subspace clustering on the individual digit datasets. This table reveals that the MNIST dataset is easier than the USPS dataset for the subspace clustering task. This observation coincides with the performance of other methods reported in [65].
Note that while the DSC method in Table III shows the state-of-the-art performance on both datasets, a successful multimodal method should enhance the performance by leveraging the information across the two modalities. Table IV compares the performance of the multimodal methods in terms of the accuracy, NMI and ARI metrics. We observe that most of the multimodal methods can successfully integrate the complementary information of the datasets in the subspace clustering task and provide a better performance in comparison to their unimodal counterparts. However, the proposed deep multimodal subspace clustering methods perform significantly better than the classical multimodal subspace clustering methods. In particular, the affinity fusion and late-addition methods segment the digits with the highest accuracy, NMI and ARI among the compared methods.

B. ARL Heterogeneous Face Dataset
To test our methods on clustering datasets with a large number of subjects, we use the ARL dataset [59], which consists of facial images from 60 unique individuals in different spectrums and captured from different distances. This dataset has facial images in the visible domain as well as four different polarimetric thermal domains. Each subject has several well-aligned facial images per modality. Sample images from this dataset are shown in Figure 6 (b).
Table V compares the performance of subspace clustering methods on the individual modalities in the ARL dataset. As expected, the visible modality shows the best performance among the different spectrums. As the samples are well-aligned in this dataset, we see that most of the subspace clustering methods work well across all the modalities. In particular, the LRR method, which takes advantage of the aligned data points, provides comparable results to the DSC method.
Fig. 7. Facial components are extracted by applying a fixed mask on the faces in the Extended Yale-B dataset [60].
Since the ARL dataset has multiple modalities, besides the early and late fusion structures we also use an intermediate structure when designing the multimodal encoders. Hence, in this experiment, we add the following intermediate spatial fusion structure to the multimodal methods. Assuming the visible domain is the main modality, we integrate the S0, S1 and S2 modalities in the second layer and combine their fused output with the DP samples in the third layer. Finally, we fuse the result with the visible domain at the last layer of the encoders.
The performances of the deep multimodal subspace clustering methods are compared in Table IV. We observe that most of the methods are able to leverage the complementary information of the different spectrums and provide a more accurate clustering in comparison to the unimodal performances. In particular, the affinity fusion method has the best performance, and the late-concat and early-concat methods provide comparable results. This experiment clearly shows that our proposed methods can perform well even with a large number of subjects in the dataset.

C. Facial Components
The Extended Yale-B dataset [60] consists of 64 frontal images of each of 38 individuals under varying illumination conditions. This dataset is popular in subspace clustering studies [16], [22], [21]. We crop the facial components (i.e., eyes, nose and mouth) and view them as weak modalities. In the biometrics literature, they are viewed as soft biometrics [66]. To crop the facial components, we apply a fixed face mask, as shown in Figure 7, on all the facial images. The extracted facial regions are resized to a fixed size. This experiment is especially important as the modalities do not share any spatial correspondence. For example, spatial locations in the mouth modality cannot be projected on the spatial positions in the nose modality. Sample images from this dataset are shown in Figure 6 (c). The setting in this experiment examines the proposed methods under the condition of spatially unrelated modalities.
The performance of subspace clustering methods on the individual facial components is summarized in Table VI. We observe that the nose and the mouth modalities fail to provide good clustering results. On the other hand, DSC and AE+SSC perform well on the eye and the entire-face modalities.
Since the mouth, nose, and eyes are considered as weak modalities, in the design of the intermediate spatial fusion we combine the two eyes, and the mouth and the nose, separately in the second layer of the encoders, and fuse the result of their combinations in the third layer. Finally, we fuse the combined features with the face features in the fourth layer.
TABLE IV
The performance of multimodal subspace clustering methods. Each experiment is evaluated by average ACC, NMI and ARI over several runs. We use boldface for the top performer. Columns of this table show the multimodal subspace clustering method, and the rows list the datasets and clustering metrics. N/A indicates that the corresponding method is not applicable to this experiment.

| Dataset | Metric | CMVFC [52] | TM-MSC [26] | MSSC [36] | MLRR [36] | KMSSC [36] | KMLRR [36] | Early-concat. |
| Digits | ACC | 47.6 | 80.65 | 81.65 | 80.6 | 84.4 | 86.85 | 92.2 |
| | NMI | 73.56 | 83.44 | 85.33 | 84.13 | 89.45 | 80.34 | 88.53 |
| | ARI | 38.12 | 75.67 | 77.36 | 76.53 | 79.61 | 82.76 | 84.60 |
| ARL | ACC | 96.58 | 96.64 | 97.78 | 97.5 | 97.97 | 97.74 | 98.24 |
| | NMI | 98.39 | 98.35 | 99.58 | 99.57 | 99.51 | 99.58 | 99.27 |
| | ARI | 94.85 | 95.85 | 96.40 | 95.79 | 96.09 | 95.88 | 97.21 |
| Extended Yale-B | ACC | 66.84 | 63.12 | 80.3 | 67.62 | 87.65 | 82.45 | 65.55 |
| | NMI | 72.03 | 67.06 | 82.78 | 73.36 | 81.50 | 85.43 | 78.82 |
| | ARI | 40.00 | 38.37 | 50.18 | 40.85 | 63.83 | 59.71 | 41.95 |

| Dataset | Metric | Interm.-concat. | Interm.-addition | Interm.-mpool. | Late-concat. | Late-addition | Late-mpool. | Affinity fusion |
| Digits | ACC | N/A | N/A | N/A | | | | |
| | NMI | N/A | N/A | N/A | | | | |
| | ARI | N/A | N/A | N/A | | | | |
| ARL | ACC | 97.79 | 96.21 | 94.99 | 98.22 | 96.68 | 95.77 | |
| | NMI | | | | | | | |
| | ARI | | | | | | | |
| Extended Yale-B | ACC | 94.88 | 97.65 | 7.76 | 92.45 | 67.41 | 7.06 | |
| | NMI | 93.90 | 96.88 | 9.31 | 92.53 | 66.95 | 6.39 | |
| | ARI | 88.19 | 94.96 | 0.73 | 82.91 | 33.37 | 0.48 | |
Fig. 8. Visualization of the affinity matrices for the first four subjects in the Extended Yale-B dataset, calculated from the self-expressive layer weight matrices in (a) unimodal clustering on faces using DSC, (b) the late-mpool method, (c) the late-concat method, and (d) the affinity fusion method. Note that (b) shows a failure case of the spatial fusion methods.

The performance of the various multimodal subspace clustering methods is tabulated in Table IV. It is worth highlighting several interesting observations from the results. As can be seen, the late-mpool and interm-mpool methods fail to segment the data points. That is because this fusion function, at each spatial position, returns the maximum of the activation values at the same spatial position across its input feature maps. Since the modalities do not share any spatial correspondence in this experiment, this function does not provide good performance. In addition, even though the additive and concatenation fusion functions have provided good results in some cases, for a similar reason their performances are highly dependent on the structure choices. For example, the additive function provides better performance with the intermediate fusion structure, while concatenation works better with the late fusion structure. However, the affinity fusion method provides the state-of-the-art clustering performance in this experiment in terms of accuracy, NMI and ARI. This is mainly due to the fact that this method does not rely on the spatial correspondence among the modalities.
Figure 8 compares the affinity matrices of the first four subjects in the Extended Yale-B dataset. The affinity matrices are calculated from the self-expressive layer weights of their corresponding trained networks. The depicted affinity matrices are the result of a permutation applied on the matrix so that data points of the same clusters are alongside each other. With this arrangement, a perfect affinity matrix should be block diagonal.
Figure 8 (a) shows the affinity matrix corresponding to the DSC method for clustering faces. Figure 8 (b) shows this matrix for multimodal subspace clustering with the late-mpool method. Note that this method fails to cluster the data, and as can be seen, its affinity matrix is not block diagonal. Figure 8 (c) and Figure 8 (d) show the affinity matrices of the late-concat and affinity fusion methods, respectively. We observe that both methods provide solid block-diagonal affinity matrices.
TABLE V
The performance of single modality subspace clustering methods on the ARL dataset. Experiments are evaluated by average ACC, NMI and ARI over several runs. We use boldface for the top performer. Columns specify the single modality subspace clustering method (DSC [16], AE+SSC, SSC [21], LRR [22]), and rows specify the modalities and criteria.
TABLE VI
The performance of single modality subspace clustering methods on the Extended Yale-B dataset. Experiments are evaluated by average ACC, NMI and ARI over several runs. We use boldface for the top performer. Columns specify the single modality subspace clustering method, and rows specify the facial components and criteria.

D. Convergence study
To empirically show the convergence of our proposed method, Figure 9 plots the objective function of the affinity fusion method and its clustering metrics against the training iterations when solving (7). The reported values in Figure 9 are normalized between zero and one. As can be seen from the figure, our algorithm converges in a few iterations.
Fig. 9. The affinity fusion method's loss function and the clustering metrics over different training epochs in the Yale-B facial components experiment. The reported values in this figure are normalized between zero and one. This figure shows the convergence of our objective function.
Fig. 10. The affinity fusion method's performance for different selections of the parameters λ1 and λ2.
E. Regularization parameters

In this section, we analyze the sensitivity of the proposed method to the regularization parameters λ1 and λ2 in the loss function (7). Figure 10 shows the influence of these regularization parameters on the performance of the affinity fusion method on the Extended Yale-B dataset. In Figure 10 (a), we fix one of the parameters and report the metrics for various values of the other over a wide range; in Figure 10 (b), we repeat the analysis with the roles of the two parameters swapped. As can be seen from the figure, over a wide range of values the final performance of the method is not sensitive to the choice of parameters. The experimental setting suggested in [16] also performed well in all the experiments.

F. Performance with respect to different norms on the self-expressive layer
In this section, we compare the performance of the proposed affinity fusion method when changing the p-norm on the self-expressive layer in the optimization problem (7).
TABLE VII
Analysis of different regularization norms on the self-expressive layer. Our experiments with small values of p did not converge. The results are averaged over multiple folds. We use boldface for the top performer.

Table VII reports the clustering metrics for experiments with several choices of p between 0 and 2. As can be seen from this table, while the larger choices of p have comparable performances, applying the p-norm with small p does not provide sufficient results. It is worth mentioning that in our experiments the method showed instability for norms with p between 0 and 1, and for the smallest values of p the minimization of (7) did not converge. The reason is that norms with p < 1 are non-convex, and one might need additional regularizations to keep the optimization tractable.

VI. CONCLUSION
We presented novel deep multimodal subspace clustering networks for clustering multimodal data. In particular, we presented two fusion techniques: spatial fusion and affinity fusion. We observed that spatial fusion methods in a deep multimodal subspace clustering task rely on spatial correspondences among the modalities. On the other hand, the proposed affinity fusion, which finds a shared affinity across all the modalities, provides the state-of-the-art results in all the conducted experiments; in particular, it clusters the images in the Extended Yale-B dataset with the highest accuracy, normalized mutual information and adjusted Rand index among the compared methods.

APPENDIX: NETWORK ARCHITECTURES
In this section, we provide the details of the network architectures used in the experiments. Note that all the convolutional layers use ReLU activations as well.
A. Different networks corresponding to digits experiments
TABLE VIIIE
ARLY - FUSION NETWORKS IN THE DIGITS EXPERIMENTS . Layer Input output Kernel (stride,pad)Feature Fusion Fusion Image 1 Fusion - -Image 2Convolutional layers Conv 1 Fusion Conv 1 × × × (2,1)Conv 2 Conv 1 Conv 2 × × × (2,1)Conv 3 Conv 2 Conv 3 × × × (1,0)Conv 4 Conv 3 Latent × × × (1,0)Self-expressiveness Θ s Latent L-recon
Parameters -Multimodal Decoder L-recon Recon 1 Details inDecoder layers Recon 2 Table XI
TABLE IXL
ATE - FUSION NETWORKS IN THE DIGITS EXPERIMENTS . Layer Input output Kernel (stride,pad)Branch 1 B1/Conv 1 Image B1/Conv 1 × × × (2,1)B1/Conv 2 B1/Conv 1 B1/Conv 2 × × × (2,1)B1/Conv 3 B1/Conv 2 B1/Conv 3 × × × (1,0)B1/Conv 4 B1/Conv 3 B1/out × × × (1,0)Branch 2 B2/Conv 1 Image B2/Conv 1 × × × (2,1)B2/Conv 2 B2/Conv 1 B2/Conv 2 × × × (2,1)B2/Conv 3 B2/Conv 2 B2/Conv 3 × × × (1,0)B2/Conv 4 B2/Conv 3 B2/out × × × (1,0)Feature Fusion Fusion B1/out Latent - -B2/outSelf-expressiveness Θ s Latent L-recon
Parameters -Multimodal Decoder L-recon Recon 1 Details inDecoder layers Recon 2 Table XI
TABLE XA
FFINITY FUSION NETWORKS IN THE DIGITS EXPERIMENTS . Layer Input output Kernel (stride,pad)Encoder 1 B1/Conv 1 Image B1/Conv 1 × × × (2,1)B1/Conv 2 B1/Conv 1 B1/Conv 2 × × × (2,1)B1/Conv 3 B1/Conv 2 B1/Conv 3 × × × (1,0)B1/Conv 4 B1/Conv 3 Latent 1 × × × (1,0)Encoder 2 B2/Conv 1 Image B2/Conv 1 × × × (2,1)B2/Conv 2 B2/Conv 1 B2/Conv 2 × × × (2,1)B2/Conv 3 B2/Conv 2 B2/Conv 3 × × × (1,0)B2/Conv 4 B2/Conv 3 Latent 2 × × × (1,0)Self-expressiveness Common Θ s Latent 1 L-recon 1
Parameters -layer Latent 2 L-recon 2Decoder 1 D1/deconv 1 L-recon D1/deconv 1 × × × (1,0)D1/deconv 2 D1/deconv 1 D1/deconv 2 × × × (2,1)D1/deconv 3 D1/deconv 2 Recon 1 × × × (2,1)Decoder 2 D2/deconv 2 L-recon D2/deconv 1 × × × (1,0)D2/deconv 2 D2/deconv 1 D2/deconv 2 × × × (2,1)D2/deconv 3 D2/deconv 2 Recon 2 × × × (2,1) TABLE XIM
ULTIMODAL DECODER IN THE DIGITS EXPERIMENTS . Layer Input output Kernel (stride,pad)Decoder 1 D1/deconv 1 L-recon D1/deconv 1 × × × (1,0)D1/deconv 2 D1/deconv 1 D1/deconv 2 × × × (2,1)D1/deconv 3 D1/deconv 2 Recon 1 × × × (2,1)Decoder 2 D2/deconv 2 L-recon D2/deconv 1 × × × (1,0)D2/deconv 2 D2/deconv 1 D2/deconv 2 × × × (2,1)D2/deconv 3 D2/deconv 2 Recon 2 × × × (2,1) B. Different networks corresponding to ARL experiments
TABLE XII
Early-fusion networks in the ARL
EXPERIMENTS . Layer Input output Kernel (stride,pad)Feature Fusion Fusion Image 1 Fusion - -Image 2Image 3Image 4Image 5Convolutional layers Conv 1 Fusion Conv 1 × × × (2,1)Conv 2 Conv 1 Conv 2 × × × (2,1)Conv 3 Conv 2 Conv 3 × × × (1,0)Conv 4 Conv 3 Latent × × × (1,0)Self-expressiveness Θ s Latent L-recon
Parameters -Multimodal Decoder L-recon Recon 1 Details inDecoder layers Recon 2 Table XVIRecon 3Recon 4Recon 5
TABLE XIII
Late-fusion networks in the ARL
EXPERIMENTS . Layer Input output Kernel (stride,pad)Branch 1 B1/Conv 1 Image B1/Conv 1 × × × (2,1)B1/Conv 2 B1/Conv 1 B1/Conv 2 × × × (2,1)B1/Conv 3 B1/Conv 2 B1/Conv 3 × × × (1,0)B1/Conv 4 B1/Conv 3 B1/out × × × (1,0)Branch 2 B2/Conv 1 Image B2/Conv 1 × × × (2,1)B2/Conv 2 B2/Conv 1 B2/Conv 2 × × × (2,1)B2/Conv 3 B2/Conv 2 B2/Conv 3 × × × (1,0)B2/Conv 4 B2/Conv 3 B2/out × × × (1,0)Branch 3 B3/Conv 1 Image B3/Conv 1 × × × (2,1)B3/Conv 2 B3/Conv 1 B3/Conv 2 × × × (2,1)B3/Conv 3 B3/Conv 2 B3/Conv 3 × × × (1,0)B3/Conv 4 B3/Conv 3 B3/out × × × (1,0)Branch 4 B4/Conv 1 Image B4/Conv 1 × × × (2,1)B4/Conv 2 B4/Conv 1 B4/Conv 2 × × × (2,1)B4/Conv 3 B4/Conv 2 B4/Conv 3 × × × (1,0)B4/Conv 4 B4/Conv 3 B4/out × × × (1,0)Branch 5 B5/Conv 1 Image B5/Conv 1 × × × (2,1)B5/Conv 2 B5/Conv 1 B5/Conv 2 × × × (2,1)B5/Conv 3 B5/Conv 2 B5/Conv 3 × × × (1,0)B5/Conv 4 B5/Conv 3 B5/out × × × (1,0)Feature Fusion Fusion B1/out Latent - -B2/outB3/outB4/outB5/outSelf-expressiveness Θ s Latent L-recon
Parameters -Multimodal Decoder L-recon Recon 1 Details inDecoder layers Recon 2 Table XVIRecon 3Recon 4Recon 5
TABLE XIV
Intermediate spatial fusion networks in the ARL
EXPERIMENTS . Layer Input output Kernel (stride,pad)Layer 1 B1/Conv 1 Image B1/Conv 1 × × × (2,1)B2/Conv 1 Image B2/Conv 1 × × × (2,1)B3/Conv 1 Image B3/Conv 1 × × × (2,1)B4/Conv 1 Image B4/Conv 1 × × × (2,1)B5/Conv 1 Image B5/Conv 1 × × × (2,1)Feature Fusion B345/Fusion B3/Conv 1 B345/Fusion - -B4/Conv 1B5/Conv 1Layer 2 B1/Conv 2 B1/Conv 1 B1/Conv 2 × × × (2,1)B2/Conv 2 B2/Conv 1 B2/Conv 2 × × × (2,1)B345/Conv 2 B345/Fusion B345/Conv 2 × × × (2,1)Feature Fusion B2345/Fusion B345/Conv 2 B2345/Fusion - -B2/Conv 2Layer 3 B1/Conv 3 B1/Conv 2 B1/Conv 3 × × × (1,0)B2345/Conv 3 B2345/Fusion B2345/Conv 3 × × × (1,0)Feature Fusion Ball/Fusion B1/Conv 3 Ball/Fusion - -B2345/Conv 3Layer 4 Ball/Conv 4 Ball/Conv 3 Latent × × × (1,0)Self-expressiveness Θ s Latent L-recon
Parameters -Multimodal Decoder L-recon Recon 1 Details inDecoder layers Recon 2 Table XVIRecon 3Recon 4Recon 5
C. Different networks corresponding to Extended Yale-B experiments

REFERENCES
[1] R. Basri and D. W. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.
[2] T. Hastie and P. Y. Simard, "Metrics and models for handwritten character recognition," Statistical Science, pp. 54–65, 1998.
[3] J. P. Costeira and T. Kanade, "A multibody factorization method for independently moving objects," International Journal of Computer Vision, vol. 29, no. 3, pp. 159–179, 1998.
[4] R. Vidal, "Subspace clustering," IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
[5] Y. Wu, Z. Zhang, T. S. Huang, and J. Y. Lin, "Multibody grouping via orthogonal subspace decomposition," IEEE, 2001, p. 252.
[6] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, "Unsupervised segmentation of natural images via lossy data compression," Computer Vision and Image Understanding, vol. 110, no. 2, pp. 212–225, 2008.

TABLE XV
Affinity fusion networks in the ARL
EXPERIMENTS . Layer Input output Kernel (stride,pad)Encoder 1 B1/Conv 1 Image B1/Conv 1 × × × (2,1)B1/Conv 2 B1/Conv 1 B1/Conv 2 × × × (2,1)B1/Conv 3 B1/Conv 2 B1/Conv 3 × × × (1,0)B1/Conv 4 B1/Conv 3 Latent 1 × × × (1,0)Encoder 2 B2/Conv 1 Image B2/Conv 1 × × × (2,1)B2/Conv 2 B2/Conv 1 B2/Conv 2 × × × (2,1)B2/Conv 3 B2/Conv 2 B2/Conv 3 × × × (1,0)B2/Conv 4 B2/Conv 3 Latent 2 × × × (1,0)Encoder 3 B3/Conv 1 Image B3/Conv 1 × × × (2,1)B3/Conv 2 B3/Conv 1 B3/Conv 2 × × × (2,1)B3/Conv 3 B3/Conv 2 B3/Conv 3 × × × (1,0)B3/Conv 4 B3/Conv 3 Latent 3 × × × (1,0)Encoder 4 B4/Conv 1 Image B4/Conv 1 × × × (2,1)B4/Conv 2 B4/Conv 1 B4/Conv 2 × × × (2,1)B4/Conv 3 B4/Conv 2 B4/Conv 3 × × × (1,0)B4/Conv 4 B4/Conv 3 Latent 4 × × × (1,0)Encoder 5 B5/Conv 1 Image B5/Conv 1 × × × (2,1)B5/Conv 2 B5/Conv 1 B5/Conv 2 × × × (2,1)B5/Conv 3 B5/Conv 2 B5/Conv 3 × × × (1,0)B5/Conv 4 B5/Conv 3 Latent 5 × × × (1,0)Self-expressiveness Common Θ s Latent 1 L-recon 1
Parameters -layer Latent 2 L-recon 2Latent 3 L-recon 3Latent 4 L-recon 4Latent 5 L-recon 5Decoder 1 D1/deconv 1 L-recon D1/deconv 1 × × × (1,0)D1/deconv 2 D1/deconv 1 D1/deconv 2 × × × (2,1)D1/deconv 3 D1/deconv 2 Recon 1 × × × (2,1)Decoder 2 D2/deconv 2 L-recon D2/deconv 1 × × × (1,0)D2/deconv 2 D2/deconv 1 D2/deconv 2 × × × (2,1)D2/deconv 3 D2/deconv 2 Recon 2 × × × (2,1)Decoder 3 D3/deconv 2 L-recon D3/deconv 1 × × × (1,0)D3/deconv 2 D3/deconv 1 D3/deconv 2 × × × (2,1)D3/deconv 3 D3/deconv 2 Recon 3 × × × (2,1)Decoder 4 D4/deconv 2 L-recon D4/deconv 1 × × × (1,0)D4/deconv 2 D4/deconv 1 D4/deconv 2 × × × (2,1)D4/deconv 3 D4/deconv 2 Recon 4 × × × (2,1)Decoder 5 D5/deconv 2 L-recon D5/deconv 1 × × × (1,0)D5/deconv 2 D5/deconv 1 D5/deconv 2 × × × (2,1)D5/deconv 3 D5/deconv 2 Recon 5 × × × (2,1) TABLE XVIM
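Table XV differs from the spatial-fusion variants in that each modality keeps its own encoder and decoder while a single self-expressive coefficient matrix Θs is shared across all modalities. A rough sketch of that arrangement follows, together with a simplified training objective combining reconstruction, self-expression, and a coefficient penalty. The input size, filter counts, kernel sizes, padding choices (picked only so encoder and decoder shapes match), the number of samples N_SAMPLES, and the weights lam1/lam2 are all assumptions; the actual loss and regularization terms are those defined in the main text.

import tensorflow as tf

N_SAMPLES = 500            # assumed; the shared Theta_s is an N x N matrix
IMG_SHAPE = (32, 32, 1)    # assumed input size

def make_encoder(name):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(10, 5, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(20, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(30, 3, strides=1, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(30, 3, strides=1, padding="same", activation="relu"),
    ], name=name)

def make_decoder(name):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2DTranspose(30, 3, strides=1, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(20, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(1, 5, strides=2, padding="same"),
    ], name=name)

encoders = [make_encoder("enc_%d" % (m + 1)) for m in range(5)]
decoders = [make_decoder("dec_%d" % (m + 1)) for m in range(5)]
theta_s = tf.Variable(1e-4 * tf.random.normal((N_SAMPLES, N_SAMPLES)), name="theta_s")

def affinity_fusion_loss(images, lam1=1.0, lam2=1.0):
    """images: list of 5 tensors of shape (N_SAMPLES,) + IMG_SHAPE, one per modality."""
    loss = lam2 * tf.reduce_sum(tf.square(theta_s))              # penalty on the coefficients
    for enc, dec, img in zip(encoders, decoders, images):
        z = enc(img)                                             # modality-specific latent code
        z_flat = tf.reshape(z, (N_SAMPLES, -1))
        z_se = tf.matmul(theta_s, z_flat)                        # the same Theta_s for every modality
        recon = dec(tf.reshape(z_se, tf.shape(z)))               # modality-specific reconstruction
        loss += tf.reduce_mean(tf.square(recon - img))           # reconstruction term
        loss += lam1 * tf.reduce_mean(tf.square(z_se - z_flat))  # self-expression term
    return loss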
TABLE XVI
MULTIMODAL DECODERS IN THE ARL EXPERIMENTS.

Layer                        Input            Output         Kernel   (stride, pad)
Decoders 1-5 (m = 1, ..., 5)
  Dm/deconv 1                L-recon          Dm/deconv 1    × × ×    (1, 0)
  Dm/deconv 2                Dm/deconv 1      Dm/deconv 2    × × ×    (2, 1)
  Dm/deconv 3                Dm/deconv 2      Recon m        × × ×    (2, 1)
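The multimodal decoder of Table XVI maps the single self-expressed latent code back to one reconstruction per modality through separate stacks of transposed convolutions. A minimal sketch is given below; the latent map shape, filter counts, kernel sizes, and the 32x32 single-channel output are assumptions, with only the strides and the one-decoder-per-modality layout taken from the table.

import tensorflow as tf

LATENT_SHAPE = (8, 8, 30)   # assumed shape of the fused, self-expressed latent map (L-recon)
N_MODALITIES = 5

def modality_decoder(name):
    # Dm/deconv 1..3: strides follow the table; filter counts, kernel sizes, and padding
    # are placeholders chosen to reach an assumed 32 x 32 single-channel reconstruction.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2DTranspose(30, 3, strides=1, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(20, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(1, 5, strides=2, padding="same"),
    ], name=name)

l_recon = tf.keras.Input(LATENT_SHAPE, name="l_recon")
recons = [modality_decoder("D%d" % (m + 1))(l_recon) for m in range(N_MODALITIES)]
multimodal_decoder = tf.keras.Model(l_recon, recons, name="multimodal_decoder")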
C. Different networks corresponding to Extended Yale-B experiments

TABLE XVII
EARLY-FUSION NETWORKS IN THE EXTENDED YALE-B EXPERIMENTS.

Layer                        Input                    Output     Kernel   (stride, pad)
Feature fusion
  Fusion                     Image 1, ..., Image 5    Fusion     -        -
Convolutional layers
  Conv 1                     Fusion                   Conv 1     × × ×    (2, 1)
  Conv 2                     Conv 1                   Conv 2     × × ×    (2, 1)
  Conv 3                     Conv 2                   Conv 3     × × ×    (1, 0)
  Conv 4                     Conv 3                   Latent     × × ×    (1, 0)
Self-expressiveness
  Θs                         Latent                   L-recon    Parameters   -
Multimodal decoder
  Decoder layers             L-recon                  Recon 1, ..., Recon 5    (details in Table XXI)
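Early fusion stacks the raw modality images along the channel dimension before the first convolution, so a single shared encoder processes the fused input (Table XVII). A short sketch under the same placeholder assumptions (input size, filter counts, kernel sizes) is shown below.

import tensorflow as tf

IMG_SHAPE = (32, 32, 1)    # assumed input size; filter counts and kernel sizes are placeholders

# Early fusion: concatenate the five modality images channel-wise before any convolution.
inputs = [tf.keras.Input(IMG_SHAPE, name="modality_%d" % (m + 1)) for m in range(5)]
fused = tf.keras.layers.Concatenate(axis=-1, name="fusion")(inputs)      # 32 x 32 x 5

latent = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, 5, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(20, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(30, 3, strides=1, padding="valid", activation="relu"),
    tf.keras.layers.Conv2D(30, 3, strides=1, padding="valid", activation="relu"),
], name="shared_encoder")(fused)

encoder = tf.keras.Model(inputs, latent, name="early_fusion_encoder")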
TABLE XVIII
LATE-FUSION NETWORKS IN THE EXTENDED YALE-B EXPERIMENTS.

Layer                        Input                      Output       Kernel   (stride, pad)
Branches 1-5 (m = 1, ..., 5)
  Bm/Conv 1                  Image                      Bm/Conv 1    × × ×    (2, 1)
  Bm/Conv 2                  Bm/Conv 1                  Bm/Conv 2    × × ×    (2, 1)
  Bm/Conv 3                  Bm/Conv 2                  Bm/Conv 3    × × ×    (1, 0)
  Bm/Conv 4                  Bm/Conv 3                  Bm/out       × × ×    (1, 0)
Feature fusion
  Fusion                     B1/out, ..., B5/out        Latent       -        -
Self-expressiveness
  Θs                         Latent                     L-recon      Parameters   -
Multimodal decoder
  Decoder layers             L-recon                    Recon 1, ..., Recon 5    (details in Table XXI)
TABLE XIX
AFFINITY FUSION NETWORKS IN THE EXTENDED YALE-B EXPERIMENTS.

Layer                        Input                        Output                       Kernel   (stride, pad)
Encoders 1-5 (m = 1, ..., 5)
  Bm/Conv 1                  Image                        Bm/Conv 1                    × × ×    (2, 1)
  Bm/Conv 2                  Bm/Conv 1                    Bm/Conv 2                    × × ×    (2, 1)
  Bm/Conv 3                  Bm/Conv 2                    Bm/Conv 3                    × × ×    (1, 0)
  Bm/Conv 4                  Bm/Conv 3                    Latent m                     × × ×    (1, 0)
Self-expressiveness (common layer)
  Θs                         Latent 1, ..., Latent 5      L-recon 1, ..., L-recon 5    Parameters   -
Decoders 1-5 (m = 1, ..., 5)
  Dm/deconv 1                L-recon m                    Dm/deconv 1                  × × ×    (1, 0)
  Dm/deconv 2                Dm/deconv 1                  Dm/deconv 2                  × × ×    (2, 1)
  Dm/deconv 3                Dm/deconv 2                  Recon m                      × × ×    (2, 1)
TABLE XX
INTERMEDIATE SPATIAL FUSION NETWORKS IN THE EXTENDED YALE-B EXPERIMENTS.

Layer                        Input                        Output         Kernel   (stride, pad)
Layer 1
  Bm/Conv 1 (m = 1, ..., 5)  Image                        Bm/Conv 1      × × ×    (2, 1)
Feature fusion
  B23/Fusion                 B2/Conv 1, B3/Conv 1         B23/Fusion     -        -
  B45/Fusion                 B4/Conv 1, B5/Conv 1         B45/Fusion     -        -
Layer 2
  B1/Conv 2                  B1/Conv 1                    B1/Conv 2      × × ×    (2, 1)
  B23/Conv 2                 B23/Fusion                   B23/Conv 2     × × ×    (2, 1)
  B45/Conv 2                 B45/Fusion                   B45/Conv 2     × × ×    (2, 1)
Feature fusion
  B2345/Fusion               B23/Conv 2, B45/Conv 2       B2345/Fusion   -        -
Layer 3
  B1/Conv 3                  B1/Conv 2                    B1/Conv 3      × × ×    (1, 0)
  B2345/Conv 3               B2345/Fusion                 B2345/Conv 3   × × ×    (1, 0)
Feature fusion
  Ball/Fusion                B1/Conv 3, B2345/Conv 3      Ball/Fusion    -        -
Layer 4
  Ball/Conv 4                Ball/Conv 3                  Latent         × × ×    (1, 0)
Self-expressiveness
  Θs                         Latent                       L-recon        Parameters   -
Multimodal decoder
  Decoder layers             L-recon                      Recon 1, ..., Recon 5    (details in Table XXI)
TABLE XXI
MULTIMODAL DECODER DETAILS IN THE EXTENDED YALE-B EXPERIMENTS.

Layer                        Input            Output         Kernel   (stride, pad)
Decoders 1-5 (m = 1, ..., 5)
  Dm/deconv 1                L-recon          Dm/deconv 1    × × ×    (1, 0)
  Dm/deconv 2                Dm/deconv 1      Dm/deconv 2    × × ×    (2, 1)
  Dm/deconv 3                Dm/deconv 2      Recon m        × × ×    (2, 1)

REFERENCES

[1] R. Basri and D. W. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.
[2] T. Hastie and P. Y. Simard, “Metrics and models for handwritten character recognition,” Statistical Science, pp. 54–65, 1998.
[3] J. P. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” International Journal of Computer Vision, vol. 29, no. 3, pp. 159–179, 1998.
[4] R. Vidal, “Subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
[5] Y. Wu, Z. Zhang, T. S. Huang, and J. Y. Lin, “Multibody grouping via orthogonal subspace decomposition,” in null. IEEE, 2001, p. 252.
[6] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, “Unsupervised segmentation of natural images via lossy data compression,” Computer Vision and Image Understanding, vol. 110, no. 2, pp. 212–225, 2008.
[7] W. Hong, J. Wright, K. Huang, and Y. Ma, “Multiscale hybrid linear models for lossy image representation,” IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3655–3671, 2006.
[8] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman, “Clustering appearances of objects under varying illumination conditions,” in null. IEEE, 2003, p. 11.
[9] A. Goh and R. Vidal, “Segmenting motions of different types by unsupervised manifold clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[10] J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate,” in European Conference on Computer Vision, 2006.
[11] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2790–2797.
[12] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in International Conference on Machine Learning, 2010.
[13] P. Favaro, R. Vidal, and A. Ravichandran, “A closed form solution to robust subspace estimation and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[14] C.-G. Li, R. Vidal et al., “Structured sparse subspace clustering: A unified optimization framework,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 277–286.
[15] C. You, D. Robinson, and R. Vidal, “Scalable sparse subspace clustering by orthogonal matching pursuit,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3918–3927.
[16] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, “Deep subspace clustering networks,” in Advances in Neural Information Processing Systems, 2017.
[17] M. Abavisani and V. M. Patel, “Adversarial domain adaptive subspace clustering,” in IEEE International Conference on Identity, Security, and Behavior Analysis. IEEE, 2018, pp. 1–8.
[18] M. Abavisani and V. Patel, “Domain adaptive subspace clustering,” in British Machine Vision Conference. BMVA, 2016.
[19] C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan, “Robust and efficient subspace segmentation via least squares regression,” in European Conference on Computer Vision. Springer, 2012, pp. 347–360.
[20] P. Ji, M. Salzmann, and H. Li, “Efficient dense subspace clustering,” in IEEE Winter Conference on Applications of Computer Vision. IEEE, 2014, pp. 461–468.
[21] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[22] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
[23] V. M. Patel, H. V. Nguyen, and R. Vidal, “Latent space sparse and low-rank subspace clustering,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 691–701, 2015.
[24] V. M. Patel and R. Vidal, “Kernel sparse subspace clustering,” in IEEE International Conference on Image Processing, 2014.
[25] Y. X. Wang, H. Xu, and C. Leng, “Provable subspace clustering: When LRR meets SSC,” in Advances in Neural Information Processing Systems, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds., 2013, pp. 64–72.
[26] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao, “Low-rank tensor constrained multiview subspace clustering,” in IEEE International Conference on Computer Vision. IEEE, 2015, pp. 1582–1590.
[27] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, “Multi-task low-rank affinity pursuit for image segmentation,” in IEEE International Conference on Computer Vision, 2011, pp. 2439–2446.
[28] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, “Multi-view clustering via canonical correlation analysis,” in International Conference on Machine Learning, 2009, pp. 129–136.
[29] A. Kumar, P. Rai, and H. Daume, “Co-regularized multi-view spectral clustering,” in Advances in Neural Information Processing Systems, 2011, pp. 1413–1421.
[30] X. Zhao, N. Evans, and J. L. Dugelay, “A subspace co-training framework for multi-view clustering,” Pattern Recognition Letters, vol. 41, pp. 73–82, 2014.
[31] K. Zhan, C. Zhang, J. Guan, and J. Wang, “Graph learning for multiview clustering,” IEEE Transactions on Cybernetics, pp. 1–9, 2018.
[32] M. White, X. Zhang, D. Schuurmans, and Y.-L. Yu, “Convex multi-view subspace learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1673–1681.
[33] H. Wang, C. Weng, and J. Yuan, “Multi-feature spectral clustering with minimax optimization,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 4106–4113.
[34] R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in AAAI Conference on Artificial Intelligence, 2014, pp. 2149–2155.
[35] V. R. de Sa, “Spectral clustering with two views,” in ICML Workshop on Learning With Multiple Views, 2005.
[36] M. Abavisani and V. M. Patel, “Multimodal sparse and low-rank subspace clustering,” Information Fusion, vol. 39, pp. 168–177, 2018.
[37] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning, 2011, pp. 689–696.
[38] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.
[39] D. Ramachandram and G. W. Taylor, “Deep multimodal learning: A survey on recent advances and trends,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 96–108, 2017.
[40] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski et al., “EmoNets: Multimodal deep learning approaches for emotion recognition in video,” Journal on Multimodal User Interfaces, vol. 10, no. 2, pp. 99–111, 2016.
[41] A. Jain, J. Tompson, Y. LeCun, and C. Bregler, “MoDeep: A deep learning framework using motion features for human pose estimation,” in Asian Conference on Computer Vision. Springer, 2014, pp. 302–315.
[42] A. Valada, G. L. Oliveira, T. Brox, and W. Burgard, “Deep multispectral semantic scene understanding of forested environments using multimodal fusion,” in International Symposium on Experimental Robotics. Springer, 2016, pp. 465–477.
[43] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question answering,” in IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.
[44] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[45] P. Perera, M. Abavisani, and V. M. Patel, “In2I: Unsupervised multi-image-to-image translation using generative adversarial networks,” International Conference on Pattern Recognition, 2017.
[46] X. Di and V. M. Patel, “Large margin multi-modal triplet metric learning,” in IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2017, pp. 370–377.
[47] D. Kiela, E. Grave, A. Joulin, and T. Mikolov, “Efficient large-scale multi-modal classification,” arXiv preprint arXiv:1802.02892, 2018.
[48] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 1933–1941.
[49] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Neural Information Processing Systems, vol. 2, 2002, pp. 849–856.
[50] S. Bickel and T. Scheffer, “Multi-view clustering,” in IEEE International Conference on Data Mining, vol. 4, 2004, pp. 19–26.
[51] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, “Diversity-induced multi-view subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp. 586–594.
[52] X. Cao, C. Zhang, C. Zhou, H. Fu, and H. Foroosh, “Constrained multi-view video face clustering,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4381–4393, 2015.
[53] J.-H. Kim, S.-W. Lee, D. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang, “Multimodal residual learning for visual QA,” in Advances in Neural Information Processing Systems, 2016, pp. 361–369.
[54] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “ModDrop: Adaptive multi-modal gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2016.
[55] M. Simonovsky, B. Gutiérrez-Becker, D. Mateus, N. Navab, and N. Komodakis, “A deep metric for multimodal registration,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 10–18.
[56] S. Liu, S. Liu, W. Cai, H. Che, S. Pujol, R. Kikinis, D. Feng, M. J. Fulham et al., “Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer’s disease,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 4, pp. 1132–1140, 2015.
[57] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
[58] J. J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550–554, 1994.
[59] S. Hu, N. J. Short, B. S. Riggan, C. Gordon, K. P. Gurton, M. Thielke, P. Gurram, and A. L. Chan, “A polarimetric thermal database for face recognition research,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 119–126.
[60] K. C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, May 2005.
[61] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in Operating Systems Design and Implementation, vol. 16, 2016, pp. 265–283.
[62] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[63] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2837–2854, 2010.
[64] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[65] X. Guo, X. Liu, E. Zhu, and J. Yin, “Deep clustering with convolutional autoencoders,” in International Conference on Neural Information Processing. Springer, 2017, pp. 373–382.
[66] S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Joint sparse representation for robust multimodal biometrics recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 113–126, 2014.
Mahdi Abavisani [S’11] received his M.S. degrees in Electrical and Computer Engineering (ECE) from Iran University of Science and Technology, Tehran, Iran, in 2014, and Rutgers University, NJ, USA, in 2018. He is currently a Ph.D. candidate in Electrical and Computer Engineering at Rutgers University. During his Ph.D., he has spent time at Microsoft Research & AI and Tesla’s Autopilot team designing deep neural networks for various applications. His research interests include signal and image processing, computer vision, machine learning, and deep learning.