Multimodal Sparse Bayesian Dictionary Learning
Igor Fedorov, Bhaskar D. Rao
Abstract—This paper addresses the problem of learning dictionaries for multimodal datasets, i.e. datasets collected from multiple data sources. We present an algorithm called multimodal sparse Bayesian dictionary learning (MSBDL). MSBDL leverages information from all available data modalities through a joint sparsity constraint. The underlying framework offers a considerable amount of flexibility to practitioners and addresses many of the shortcomings of existing multimodal dictionary learning approaches. In particular, the procedure includes the automatic tuning of hyperparameters and is unique in that it allows the dictionaries for each data modality to have different cardinality, a significant feature in cases when the dimensionality of data differs across modalities. MSBDL is scalable and can be used in supervised learning settings. Theoretical results relating to the convergence of MSBDL are presented and the numerical results provide evidence of the superior performance of MSBDL on synthetic and real datasets compared to existing methods.
I. INTRODUCTION
Due to improvements in sensor technology, acquiring vast amounts of data has become relatively easy. Given the ability to harvest data, the task becomes how to extract relevant information from the data. The data is often multimodal, which introduces novel challenges in learning from it. Multimodal dictionary learning has become a popular tool for fusing multimodal information [1], [2], [3], [4], [5].

Let $L$ and $J$ denote the number of data points for each modality and the number of modalities, respectively. Let $Y_j = [y^1_j \cdots y^L_j] \in \mathbb{R}^{N_j \times L}$ denote the data matrix for modality $j$, where $y^i_j$ denotes the $i$'th data point for modality $j$. We use uppercase symbols to denote matrices and lowercase symbols to denote the corresponding matrix columns. The multimodal dictionary learning problem consists of estimating dictionaries $D_j \in \mathbb{R}^{N_j \times M_j}$ given data $Y_j$ such that $Y_j \approx D_j X_j \; \forall j$. We focus on overcomplete dictionaries because they are more flexible in the range of signals they can represent [6]. Since $y^i_j$ admits an infinite number of representations under an overcomplete $D_j$, we seek sparse $x^i_j$ [7].

Without further constraints, the multimodal dictionary learning problem can be viewed as $J$ independent unimodal problems. To fully capture the multimodal nature of the problem, the learning process must be adapted to encode the prior knowledge that each set of points $y^i = \{y^i_j\}_{j \in [J]}$ is generated by a common source which is measured $J$ different ways, where $[J] = \{1, \cdots, J\}$. For any variable $x$, $\underline{x}$ denotes $\{x_j\}_{j \in [J]}$. For instance, in [8], low and high resolution image patches are modelled as $y^i_1$ and $y^i_2$, respectively, and the association between $y^i_1$ and $y^i_2$ is enforced by the constraint $x^i_1 = x^i_2$.

I. Fedorov is with the ARM Machine Learning Research Lab, Boston, MA. B. D. Rao is with the Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Dr, San Diego, CA, 92103.
The resulting multimodal dictionary learning problem, referred to here as ℓ1DL, is to solve [9]

$\arg\min_{\tilde{D}, X} \|\tilde{Y} - \tilde{D}X\|_F^2 + \lambda \|X\|_1$  (1)

$\tilde{Y} = [Y_1^T \cdots Y_J^T]^T, \quad \tilde{D} = [D_1^T \cdots D_J^T]^T,$

where $\|\cdot\|_F$ is the Frobenius norm, $\|X\|_1 = \sum_{i \in [L]} \|x^i\|_1$, and the $\ell_1$-norm is used as a convex proxy to the $\ell_0$ sparsity measure. In a classification framework, (1) can be viewed as learning a multimodal feature extractor, where the optimizer is the multimodal representation of $y^i$ that is fed into a classifier [1], [10], [2]. There are significant deficiencies associated with using ℓ1DL for multimodal dictionary learning. In fact, all existing approaches suffer from one or more of the following (see Table I):

D1 While using the same sparse code for each modality establishes an explicit relationship between the dictionaries for each modality, the same coefficient values may not be able to represent different modalities well.

D2 Some data modalities are often less noisy than others and the algorithm should be able to leverage the clean modalities to learn on the noisy ones. Since (1) constrains $\lambda$ to be the same for all modalities, it is unclear how the learning algorithm can incorporate prior knowledge about the noise level of each modality.

D3 The formulation in (1) constrains $M_j = M \; \forall j$, for some $M$. Since dimensionality can vary across modalities, it is desirable to allow $M_j$ to vary.

D4 The choice of $\lambda$ is central to the success of approaches like (1). If $\lambda$ is chosen too high, the reconstruction term is ignored completely, leading to poor dictionary quality. If $\lambda$ is chosen too low, the sparsity promoting term is effectively eliminated from the objective function. Extensive work has been done to approach this hyperparameter selection problem from various angles. Two popular approaches include treating the hyperparameter selection problem as an optimization problem of its own [11], [12] and grid search, with the latter being the prevailing strategy in the dictionary learning community [9], [13]. In either case, optimization of $\lambda$ involves evaluating (1) at various choices of $\lambda$, which can be computationally intensive and lead to suboptimal results in practice. For some of the successful multimodal dictionary learning algorithms [5], [4], where each modality is associated with at least one hyperparameter, grid search becomes prohibitive since the search space grows exponentially with $J$. For more details, see Section IV-C, specifically Table IV and Fig. 5, for a comparison of competing algorithms in terms of computational complexity when grid-search hyperparameter tuning is taken into account.

Next, we review relevant works from the dictionary learning literature, highlighting each method's benefits and drawbacks in light of D1-D4. Table I summarizes where competing algorithms fall short in terms of D1-D4. In past work, $M_j = M \; \forall j$, thus exhibiting D3. One of the seminal unimodal dictionary learning algorithms is K-SVD, which optimizes [7]

$\arg\min_{D, \{\|x^i\|_0 \le s\}_{i \in [L]}} \|Y - DX\|_F^2,$  (2)

where $s$ denotes the desired sparsity level and modality subscripts have been omitted for brevity. The K-SVD algorithm proceeds in a block-coordinate descent fashion, where $D$ is optimized while holding $X$ fixed and vice-versa.
The update of $X$ involves a greedy $\ell_0$ pseudo-norm minimization procedure [14]. In a multimodal setting, K-SVD can be naively adapted, where $Y$ is replaced by $\tilde{Y}$ and $D$ by $\tilde{D}$ in (2), as in (1).

One recent approach, referred to here as Joint ℓ0 Dictionary Learning (J0DL), builds upon K-SVD for the multimodal dictionary learning problem and proposes to solve [4]

$\arg\min_{\{\{\chi^i_j = \chi^i\}_{j \in [J]},\, |\chi^i| \le s\}_{i \in [L]}} \sum_{j=1}^{J} \lambda_j \|Y_j - D_j X_j\|_F^2$  (3)

where $\chi^i_j$ denotes the support of $x^i_j$. The J0DL algorithm tackles D1-D2 by establishing a correspondence between the supports of the sparse codes for each modality and by allowing modality specific regularization parameters, which allow for encoding prior information about the noise level of $y^i_j$. On the other hand, J0DL does not address D3 and presents even more of a challenge than ℓ1DL with respect to D4 since the size of the grid search needed to find $\{\lambda_j\}_{j \in [J]}$ grows exponentially with $J$. Another major drawback of J0DL is that, since (3) has an $\ell_0$ type constraint, solving it requires a greedy algorithm which can produce poor solutions, especially if some modalities are much noisier than others [15].

The multimodal version of (1), referred to here as Joint ℓ1 DL (J1DL), seeks [16]

$\arg\min_{\underline{D}, \underline{X}} \sum_{i \in [L], j \in [J]} \|y^i_j - D_j x^i_j\|_2^2 + \lambda \sum_{i \in [L]} \|\Pi^i\|_{1,2}$  (4)

where $\Pi^i = [x^i_1 \cdots x^i_J]$, $\|\Pi^i\|_{1,2} = \sum_{m \in [M]} \|\Pi^i[m,:]\|_2$, and $\Pi^i[m,:]$ denotes the $m$'th row of $\Pi^i$. The $\ell_{1,2}$-norm in (4) promotes row sparsity in $\Pi^i$, which promotes $\underline{x}^i$ that share the same support. Interestingly, regularizers which promote common support amongst variables appear in a wide range of works, including graphical model identification [17], [18]. Like all of the previous approaches, J1DL adopts a block-coordinate descent approach to solving (4), where an alternating direction method of multipliers algorithm is used to compute the sparse code update stage [19]. While J1DL makes progress toward addressing D1, it does so at the cost of sacrificing the hard constraint that $\underline{x}^i$ share the same support.
The authors of [16] attempt to address D2 by relaxing the constraint on the support of $\underline{x}^i$ even more, but we will not study this approach here because it moves even further from the theme of this work. In addition, J1DL does not address D3-D4. Although D4 is not as serious of an issue for J1DL as for J0DL because J1DL requires tuning a single hyperparameter (see Section IV-C), D3 limits the usefulness of J1DL in situations where the observed modalities are drastically different in type and cardinality, e.g. text and images.

          MSBDL (Proposed)   MSBDL [5]   J1DL   J0DL   ℓ1DL   K-SVD
D1                                                      ✓      ✓
D2                                        ✓             ✓      ✓
D3                            ✓           ✓      ✓      ✓      ✓
D4                            ✓           ✓      ✓      ✓      ✓
TABLE I: Deficiencies that existing dictionary learning algorithms suffer from.

One desirable property of dictionary learning approaches is that they be scalable with respect to the size of the dictionary as well as the dataset. When $L$ becomes large, the algorithm must be able to learn in a stochastic manner, where only a subset of the data samples need to be processed at each iteration. Stochastic learning strategies have been studied in the context of ℓ1DL [9], [20] and J1DL [16], but not for K-SVD or J0DL. Likewise, the algorithm should be able to accommodate large $N_j$.

When the dictionary learning algorithm is to be used as a building block in a classification framework, class information can be incorporated within the learning process. In a supervised setting, the input to the algorithm is $\{\underline{Y}, H\}$, where $H = [h^1 \cdots h^L] \in \mathbb{B}^{C \times L}$ is the binary class label matrix for the dataset and $C$ is the number of classes. Each $h^i$ is the label for the $i$'th data point in a one-of-$C$ format. This type of dictionary learning is referred to as task-driven [16], [20], label consistent [13], or discriminative [21] and the goal is to learn a dictionary $D_j$ such that $x^i_j$ is indicative of the class label. For instance, discriminative K-SVD (D-KSVD) optimizes [21]

$\arg\min_{D, W, \{\|x^i\|_0 \le s\}_{i \in [L]}} \|Y - DX\|_F^2 + \lambda_{su} \|H - WX\|_F^2$  (5)
where $W$ can be viewed as a linear classifier for $x^i$. Task-driven ℓ1DL (TD-ℓ1DL) optimizes [20]

$\arg\min_{D, W} \mathbb{E}_x[l_{su}(h, W, x^*(y, D))] + \nu \|W\|_F^2$  (6)

where $\mathbb{E}_x[\cdot]$ denotes the expectation with respect to $p(x)$, $l_{su}(\cdot)$ denotes the supervised loss function (examples of supervised losses include the squared loss in (5), logistic loss, and hinge loss [20]), $x^*(y, D)$ denotes the solution of (1) with the dictionary fixed to $D$, and the last term provides regularization on $W$ to avoid over-fitting. Task-driven J1DL (TD-J1DL) optimizes [16]

$\arg\min_{\underline{D}, \underline{W}} \mathbb{E}_x \sum_{j \in [J]} l_{su}(h_j, W_j, x^*_j(y_j, D_j)) + \nu \sum_{j \in [J]} \|W_j\|_F^2$

where $x^*_j(y_j, D_j)$ denotes the $j$'th modality sparse code in the optimizer of (4) with fixed dictionaries (see Section VIII-B).
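As a concrete illustration of how the two unsupervised objectives reviewed above differ, the following NumPy sketch (not part of the original papers; the array shapes, toy dimensions, and function names are illustrative assumptions) evaluates the stacked ℓ1DL objective (1), which forces one shared code across modalities, and the J1DL objective (4), which uses per-modality codes coupled only through the row-sparsity penalty.

```python
import numpy as np

def l1dl_objective(Ys, Ds, X, lam):
    """Objective of (1): modalities are stacked and share a single code matrix X."""
    Y_tilde = np.vstack(Ys)                 # [sum_j N_j, L]
    D_tilde = np.vstack(Ds)                 # [sum_j N_j, M]
    resid = Y_tilde - D_tilde @ X
    return np.sum(resid ** 2) + lam * np.sum(np.abs(X))

def j1dl_objective(Ys, Ds, Xs, lam):
    """Objective of (4): per-modality codes coupled by the l_{1,2} row penalty."""
    recon = sum(np.sum((Y - D @ X) ** 2) for Y, D, X in zip(Ys, Ds, Xs))
    L = Xs[0].shape[1]
    row_pen = 0.0
    for i in range(L):
        Pi = np.stack([X[:, i] for X in Xs], axis=1)    # Pi^i = [x_1^i ... x_J^i]
        row_pen += np.sum(np.linalg.norm(Pi, axis=1))   # sum_m ||Pi^i[m, :]||_2
    return recon + lam * row_pen

# toy example with J = 2 modalities of different dimensionality
rng = np.random.default_rng(0)
Ys = [rng.standard_normal((20, 50)), rng.standard_normal((30, 50))]
Ds = [rng.standard_normal((20, 40)), rng.standard_normal((30, 40))]
X = rng.standard_normal((40, 50))
print(l1dl_objective(Ys, Ds, X, lam=0.1))
print(j1dl_objective(Ys, Ds, [X, X.copy()], lam=0.1))
```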
Fig. 1: Modeling capacity of the proposed algorithm compared to existing works.
                         Proposed   [5]
Scalable                    ✓
Task-driven extension       ✓
Solves D3                   ✓
Solves D4                   ✓
TABLE II: Comparison of proposed MSBDL variant with [5].
A. Contributions
We present the multimodal sparse Bayesian dictionary learning algorithm (MSBDL). MSBDL is based on a hierarchical sparse Bayesian model, which was introduced in the context of Sparse Bayesian Learning (SBL) [22], [23], [24] as well as dictionary learning [22], and has since been extended to various structured learning problems [25], [26], [27]. We presented initial work on this approach in [5], where we made some progress towards tackling D1-D2. (The algorithm presented in [5] was also called MSBDL and we distinguish it from the algorithm presented here by appending the appropriate citation to the algorithm name whenever referencing our previous work.) Here, we go beyond our preliminary work in a number of significant ways, which are summarized in Table II. We address D1-D2 to a fuller extent by offering scalable and task-driven variants of MSBDL. More importantly, we tackle D3-D4, where our solution to D4 is crucial to the ability of MSBDL to address D1 and resolves a major hyperparameter tuning issue in [5]. By resolving D3, the present work represents a large class of algorithms that contains [5] as a special case. This point is depicted in Fig. 1, which provides a visualization of the types of models and variable interactions that can be learned by each of the competing algorithms.

In summary:
1) We extend MSBDL to address D3. To the best of our knowledge, MSBDL is the first dictionary learning algorithm capable of learning differently sized dictionaries.
2) We extend MSBDL to the task-driven learning scenario.
3) We present scalable versions of MSBDL.
4) We optimize algorithm hyperparameters during learning, obviating the need for grid search, and conduct a theoretical analysis of the approach.
5) We show that multimodal dictionary learning offers provable advantages over unimodal dictionary learning.
Fig. 2: Visualization of high level problem motivation.

Fig. 3: Graphical model for two modality MSBDL.
B. On the Motivation of this Work
Consider the scenario depicted in Fig. 2. The solid rectangles represent learning models 1 and 2, corresponding to modalities 1 and 2, respectively, independently of each other. The intuition behind this work is that model 2 can be improved by performing learning in a joint fashion, where information flows between models 1 and 2, as depicted by the dotted shapes in Fig. 2. The experimental and theoretical results reported in this paper all serve as evidence for the ability of MSBDL to leverage all available data sources to learn models for each modality that are superior to ones that could be found if learning was done independently. As such, the aims of this work are, in some sense, tangential to works like [16], where the goal is more to learn an information fusion scheme and less to learn superior models for each modality.
II. PROPOSED APPROACH
The graphical model for MSBDL is shown in Fig. 3. The signal model is given by

$y_j = D_j x_j + v_j, \quad v_j \sim \mathcal{N}(0, \sigma_j^2 I)$  (7)

where $\mathcal{N}(\cdot)$ denotes a Gaussian distribution and the noise variance is allowed to vary among modalities. In order to promote sparse $x^i_j$, we assume a Gaussian Scale Mixture (GSM) prior on each element of $x_j$ [23], [24], [28], [29]. The GSM prior is a popular class of distributions whose members include many sparsity promoting priors, such as the Laplacian and Student's-t [23], [24], [28], [29], [30], [31]. The remaining task is to specify the conditional density of $x_j$ given $\gamma$. One option is to use what we refer to as the one-to-one prior [5]:

$p(x_j \mid \gamma) = \prod_{m \in [M]} p(x_j[m] \mid \gamma[m]) = \prod_{m \in [M]} \mathcal{N}(0, \gamma[m])$  (8)

where $x_j[m]$ denotes the $m$'th element of $x_j$ and the choice of $p(\gamma[m])$ determines the marginal density of $x_j[m]$. We assume a non-informative prior on $\gamma[m]$ [23]. As will be shown in Section II-B, the conditional distribution in (8) represents a Bayesian realization of the constraint that $\underline{x}$ share the same support. The prior in (8) still constrains $M_j = M \; \forall j$, but this restriction will be lifted in Section V. When $J = 1$, this model is identical to the one used in [22].
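To make the generative assumptions in (7)-(8) concrete, the following sketch (not from the original paper; dimensions, noise levels, and variable names are illustrative) draws one multimodal sample under the one-to-one prior, so that the two sparse codes share a support but not coefficient values.

```python
import numpy as np

def sample_msbdl(Ds, gamma, sigmas, rng):
    """Draw one multimodal sample from the model (7)-(8).

    All modalities share the hyperparameters `gamma` (one-to-one prior),
    so the sparse codes x_j share a support while their values differ."""
    ys, xs = [], []
    for D, sigma in zip(Ds, sigmas):
        x = rng.standard_normal(len(gamma)) * np.sqrt(gamma)   # x_j[m] ~ N(0, gamma[m])
        v = sigma * rng.standard_normal(D.shape[0])             # v_j ~ N(0, sigma_j^2 I)
        xs.append(x)
        ys.append(D @ x + v)
    return ys, xs

rng = np.random.default_rng(0)
M = 40
Ds = [rng.standard_normal((20, M)), rng.standard_normal((30, M))]
gamma = np.zeros(M)
gamma[rng.choice(M, size=5, replace=False)] = 1.0   # 5 active atoms shared across modalities
ys, xs = sample_msbdl(Ds, gamma, sigmas=[0.1, 0.5], rng=rng)
```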
A. Inference Procedure

We adopt an empirical Bayesian inference strategy to estimate $\theta = \{\underline{D}, \{\gamma^i\}_{i \in [L]}\}$ by optimizing [32]

$\arg\max_\theta \log p(\underline{Y} \mid \theta) = \arg\max_\theta \sum_{j \in [J]} \log p(Y_j \mid \theta)$  (9)

$p(\underline{Y} \mid \theta) = \prod_{i \in [L], j \in [J]} p(y^i_j \mid \theta), \quad p(y^i_j \mid \theta) = \mathcal{N}(0, \Sigma^i_{y,j})$  (10)

$\Sigma^i_{y,j} = \sigma_j^2 I + D_j \Gamma^i D_j^T, \quad \Gamma^i = \mathrm{diag}(\gamma^i).$  (11)

We use Expectation-Maximization (EM) to maximize (9), where $\{\underline{X}, \underline{Y}, \theta\}$ and $\underline{X}$ are treated as the complete and nuisance data, respectively [33]. At iteration $t$, the E-step computes $Q(\theta, \theta^t) = \langle \log p(\underline{Y}, \underline{X}, \underline{D}, \{\gamma^i\}_{i \in [L]}) \rangle$, where $\langle \cdot \rangle$ denotes the expectation with respect to $p(\underline{X} \mid \underline{Y}, \theta^t)$, and $\theta^t$ denotes the estimate of $\theta$ at iteration $t$. Due to the conditional independence properties of the model, the posterior factors over $i$ and

$p(x^i_j \mid y^i_j, \theta) = \mathcal{N}(\mu^i_j, \Sigma^i_{x,j})$  (12)

$\Sigma^i_{x,j} = \left(\sigma_j^{-2} D_j^T D_j + (\Gamma^i)^{-1}\right)^{-1}$  (13)

$\mu^i_j = \sigma_j^{-2} \Sigma^i_{x,j} D_j^T y^i_j.$  (14)

In the M-step, $Q(\theta, \theta^t)$ is maximized with respect to $\theta$. In general, the M-step depends on the choice of $p(x \mid \gamma)$. For the choice in (8), the M-step becomes

$(\gamma^i[m])^{t+1} = J^{-1} \sum_{j=1}^{J} \Sigma^i_{x,j}[m,m] + (\mu^i_j[m])^2$  (15)

$D_j^{t+1} = Y_j U_j^T \left( U_j U_j^T + \sum_{i \in [L]} \Sigma^i_{x,j} \right)^{-1}$  (16)

$U_j = [\mu^1_j \cdots \mu^L_j].$  (17)

B. How does MSBDL solve deficiency D1?

One consequence of the GSM prior is that many of the elements of $\gamma^i$ converge to $0$ during inference [24]. When $\gamma^i[m] = 0$, $p(x^i_j[m] \mid y^i_j, \gamma^i)$ reduces to $\delta(x_j[m])$ for all $j$, where $\delta(\cdot)$ denotes the Dirac-delta function [24]. Since the only role of $\underline{X}$ in the inference procedure is in the E-step, where we take the expectation of the complete data log-likelihood with respect to $p(x^i_j \mid y^i_j, \gamma^i)$, the effect of having $p(x^i_j[m] \mid y^i_j, \gamma^i) = \delta(x^i_j[m])$ is that the E-step reduces to evaluating the complete data log-likelihood at $x^i_j[m] = 0, \forall j$. Therefore, upon convergence, the proposed approach exhibits the property that $\underline{x}^i$ share the same support.

C. Connection to J1DL

Suppose that the conditional distribution in (8) is used with $\gamma[m] \sim \mathrm{Ga}(J/2, 1/2) \; \forall m$, where $\mathrm{Ga}(\cdot)$ refers to the Gamma distribution. It can be shown that [34]

$p(\underline{x}^i) = c \prod_{m \in [M]} K_0\left(\left\|[x^i_1[m] \cdots x^i_J[m]]\right\|_2\right),$

where $c$ is a normalization constant and $K_0(\cdot)$ denotes the modified Bessel function of the second kind and order $0$. For large $x$, $K_0(x) \approx \pi \exp(-x)/\sqrt{2\pi x}$ [35]. In the following, we replace $K_0(x)$ by its approximation for purposes of exposition. Under the constraint $\sigma_j^2 = 0.5\lambda \; \forall j$, the MAP estimate of $\{\underline{D}, \underline{X}\}$ is given by

$\arg\min_{\underline{D}, \underline{X}} \sum_{i \in [L], j \in [J]} \|y^i_j - D_j x^i_j\|_2^2 + \lambda \sum_{i \in [L]} \|\Pi^i\|_{1,2} + 0.5\lambda \sum_{i \in [L], m \in [M]} \log \|\Pi^i[m,:]\|_2.$  (18)

This analysis exposes a number of similarities between MSBDL and J1DL. If we ignore the last term in (18), J1DL becomes the MAP estimator of $\{\underline{D}, \underline{X}\}$ under the one-to-one prior in (8).
If we keep the last term in (18), the effect is to add a Generalized Double Pareto prior on the $\ell_2$ norm of the rows of $\Pi^i$ [31], [36].

At the same time, there are significant differences between MSBDL and J1DL. The J1DL objective function assumes $\sigma_j$ is constant across modalities, which can lead to a strong mismatch between data and model when the dataset consists of sources with disparate noise levels. In contrast to J1DL, MSBDL enjoys the benefits of evidence maximization [37], [32], [31], which naturally embodies "Occam's razor" by penalizing unnecessarily complex models and searches for areas of large posterior mass, which are more informative than the mode when the posterior is not well-behaved.
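The following sketch (not part of the original paper; the function name, the jitter used to invert $\Gamma^i$, and the toy shapes are illustrative assumptions) shows one EM iteration of MSBDL with the one-to-one prior, combining the E-step moments (13)-(14) with the M-step updates (15)-(17).

```python
import numpy as np

def msbdl_em_step(Ys, Ds, gammas, sigmas):
    """One EM iteration of MSBDL with the one-to-one prior (8).

    Ys, Ds: per-modality data matrices and dictionaries.
    gammas: [L, M] array; gammas[i] holds the hyperparameters gamma^i.
    Returns updated gammas (15) and dictionaries (16)."""
    J, (L, M) = len(Ys), gammas.shape
    new_gammas = np.zeros_like(gammas)
    Sigmas = [[None] * L for _ in range(J)]
    mus = [np.zeros((M, L)) for _ in range(J)]
    for i in range(L):
        Gamma_inv = np.diag(1.0 / np.maximum(gammas[i], 1e-12))
        second_moment = np.zeros(M)
        for j, (Y, D, sigma) in enumerate(zip(Ys, Ds, sigmas)):
            Sigma_x = np.linalg.inv(D.T @ D / sigma**2 + Gamma_inv)   # (13)
            mu = Sigma_x @ D.T @ Y[:, i] / sigma**2                    # (14)
            Sigmas[j][i], mus[j][:, i] = Sigma_x, mu
            second_moment += np.diag(Sigma_x) + mu**2
        new_gammas[i] = second_moment / J                              # (15)
    new_Ds = []
    for j, Y in enumerate(Ys):
        U = mus[j]                                                     # (17)
        Sigma_sum = sum(Sigmas[j])
        new_Ds.append(Y @ U.T @ np.linalg.inv(U @ U.T + Sigma_sum))    # (16)
    return new_gammas, new_Ds
```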
III. COMPLETE ALGORITHM
So far, it has been assumed that $\underline{\sigma}$ is known. Although it is possible to include $\underline{\sigma}$ in $\theta$ and estimate it within the evidence maximization framework in (9), we experimentally observe that $\underline{\sigma}$ decays very quickly and the algorithm tends to get stuck in poor stationary points. An alternative approach is described next. Consider the scenario where $D_j$ is known and we seek the MAP estimate of $x_j$ given $y_j, D_j$:

$\arg\min_{x_j} \|y_j - D_j x_j\|_2^2 - 2\sigma_j^2 \log p(x_j).$  (19)

The estimator in (19) shows that $\sigma_j^2$ can be thought of as a regularization parameter which controls the trade-off between sparsity and reconstruction error. As such, we propose to anneal $\sigma_j$. The motivation for annealing $\sigma_j$ is that the quality of $D_j$ increases with $t$, so giving too much weight to the reconstruction error term early on can force EM to converge to a poor stationary point.

Let $\sigma_j^0 > \sigma_\infty \ge 0$, $\alpha_\sigma < 1$, and $\tilde{\sigma}_j^{t+1} = \max(\sigma_\infty, \alpha_\sigma \sigma_j^t)$. The proposed annealing strategy can then be stated as

$\sigma_j^{t+1} = \tilde{\sigma}_j^{t+1}$ if $\log p(Y_j \mid \theta^{t+1}, \tilde{\sigma}_j^{t+1}) > \log p(Y_j \mid \theta^{t+1}, \sigma_j^t)$, and $\sigma_j^{t+1} = \sigma_j^t$ else.  (20)

Although it may seem that we have replaced the task of selecting $\underline{\sigma}$ with that of selecting $\{\underline{\sigma}^0, \alpha_\sigma, \sigma_\infty\}$, we claim that the latter is easy to select and provide both theoretical (Section VII) and experimental (Section VIII) validation for this claim. The main benefit of the proposed approach is that it essentially traverses a continuous space of candidate $\underline{\sigma}$ without explicitly performing a grid search, which would be intractable as $J$ grows. The parameter $\sigma_\infty$ can be set arbitrarily small and $\underline{\sigma}^0$ can be set arbitrarily large. The only recommendation we make is to set $\sigma_j^0 > \sigma_{j'}^0$ if modality $j$ is a-priori believed to be more noisy than modality $j'$.

Section VII studies the motivation for and properties of the annealing strategy in greater detail. In practice, the computation of $p(Y_j \mid \theta^{t+1}, \tilde{\sigma}_j^{t+1})$ is costly because it requires the computation of the sufficient statistics in (11) for all $L$ data points and $J$ modalities. Instead, we replace the condition in (20) by checking whether decreasing $\sigma_j$ should increase $p(Y_j \mid \theta^{t+1}, \sigma_j^t)$. This check is performed by checking the sign of the first derivative of $p(Y_j \mid \theta^{t+1}, \sigma_j^t)$, replacing (20) with

$\sigma_j^{t+1} = \tilde{\sigma}_j^{t+1}$ if $\partial \log p(Y_j \mid \theta^{t+1}, \sigma_j^t)/\partial \sigma_j^t < 0$, and $\sigma_j^{t+1} = \sigma_j^t$ else.  (21)

It can be shown that (21) can be computed essentially for free by leveraging the sufficient statistics computed in the $\theta$ update step (see Supplemental Material). The complete MSBDL algorithm is summarized in Fig. 4. In practice, each $D_j$ is normalized to unit $\ell_2$ column norm at each iteration to prevent scaling instabilities.

Require: $\underline{Y}$, $\underline{\sigma}^0$, $\sigma_\infty$, $\alpha_\sigma$
while $\underline{\sigma}$ not converged do
  while $\underline{D}$ not converged do
    for $i \in [L]$ do
      Update $\Sigma^i_x$, $\mu^i$, $\gamma^i$ using (13), (14), and (15)
    end for
    {Update $D_j$ using (16) if $\sigma_j$ not converged}$_{j \in [J]}$
  end while
  {Update $\sigma_j$ using (21) if $\sigma_j$ not converged}$_{j \in [J]}$
end while
Fig. 4: MSBDL algorithm for the one-to-one prior in (8).
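A minimal sketch of the annealing rule (20) is given below (not part of the original paper; the acceptance test is stated here with the exact per-modality evidence of (10)-(11), whereas the paper's implementation uses the cheaper derivative test (21); the default values of $\alpha_\sigma$ and $\sigma_\infty$ are illustrative assumptions).

```python
import numpy as np

def modality_loglik(Y_j, D_j, gammas, sigma_j):
    """log p(Y_j | theta, sigma_j) under (10)-(11), summed over data points."""
    N, L = Y_j.shape
    total = 0.0
    for i in range(L):
        Sigma_y = sigma_j**2 * np.eye(N) + D_j @ np.diag(gammas[i]) @ D_j.T   # (11)
        _, logdet = np.linalg.slogdet(Sigma_y)
        total += -0.5 * (N * np.log(2 * np.pi) + logdet
                         + Y_j[:, i] @ np.linalg.solve(Sigma_y, Y_j[:, i]))
    return total

def anneal_sigma(sigmas, Ys, Ds, gammas, alpha_sigma=0.99, sigma_inf=1e-4):
    """Rule (20): propose sigma_tilde = max(sigma_inf, alpha * sigma_j) and accept it
    only if the modality-j evidence increases at the current parameter estimate."""
    out = list(sigmas)
    for j, (Y, D, s) in enumerate(zip(Ys, Ds, sigmas)):
        s_prop = max(sigma_inf, alpha_sigma * s)
        if modality_loglik(Y, D, gammas, s_prop) > modality_loglik(Y, D, gammas, s):
            out[j] = s_prop
    return out
```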
A. Dictionary Cleaning

We adopt the methodology in [7] and "clean" each $D_j$ every $T$ iterations. Cleaning $D_j$ means removing atoms which are aligned with one or more other atoms and replacing the removed atom with the element of $Y_j$ which has the poorest reconstruction under $D_j$. A given atom is also replaced if it does not contribute to explaining $Y_j$, as measured by the energy of the corresponding row of $U_j$.

            EM Type       $\mu/\Sigma_x$   CC              MC
MSBDL       Exact         Exact            $LN$            $LM$
MSBDL-1     Incremental   Exact            $\bar{L}N$      $LM$
MSBDL-2     Incremental   Approximate      $\bar{L}N\lceil N\rceil$   $LM$
MSBDL-3     Batch         Exact            $\bar{L}N$      $\bar{L}M$
MSBDL-4     Batch         Approximate      $\bar{L}N\lceil N\rceil$   $\bar{L}M$
TABLE III: CC, MC, $\bar{L}$, and $\lceil N\rceil$ denote worst case computational complexity, worst case memory complexity, batch size, and a quantity which is upper-bounded by $N$, respectively.

IV. SCALABLE LEARNING
When $L$ is large, it is impractical to update the sufficient statistics for all data points at each EM iteration. To draw a parallel with stochastic gradient descent (SGD), when the objective function is a sum of functions of individual data points, one can traverse the gradient with respect to a randomly chosen data point at each iteration instead of computing the gradient with respect to every sample in the dataset. In the dictionary learning community, SGD is often the optimization algorithm of choice because the objective function is separable over each data point [9], [20], [16]. In addition, as $N_j$ grows, the computation of the sufficient statistics in (13)-(14) can become intractable. In the following, we propose to address these issues using a variety of modifications of the MSBDL algorithm from Section II. The proposed methods also apply to priors other than the one in (8) (see Section V).

A. Scalability with respect to the size of the dataset
In the following, we present two alternatives to the EM MSBDL algorithm to achieve scalability with respect to $L$. The first proposed approach, referred to here as Batch EM, computes sufficient statistics only for a randomly chosen subset $\phi = \{i_1, \cdots, i_{\bar{L}}\}$ at each EM iteration, where $\bar{L}$ denotes the batch size. The M-step consists of updating $\{\gamma^i\}_{i \in \phi}$ using (15) and updating $D_j$ using (16), with the exception that only the sufficient statistics from $i \in \phi$ are employed.

Another stochastic inference alternative is called Incremental EM, which is reviewed in the Supplementary Material [38]. In the context of MSBDL, Incremental EM is tantamount to an inference procedure which, at each iteration, randomly selects a subset $\phi$ of points and updates the sufficient statistics in (13)-(14) for $i \in \phi$. During the M-step, the hyperparameters $\{\gamma^i\}_{i \in \phi}$ are updated. The dictionaries $D_j$ are updated using (16), where the update rule depends on sufficient statistics computed for all $L$ data points, only a subset of which have been updated during the given iteration.
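A minimal sketch of the Batch EM variant is given below (not from the original paper; `em_stats_fn` is a hypothetical placeholder standing in for the per-point E-step (13)-(14), and the batch bookkeeping is an illustrative assumption). In this variant both $\{\gamma^i\}$ and the dictionary update draw only on the sampled subset $\phi$; the Incremental EM variant would instead merge the refreshed statistics into running sums over all $L$ points.

```python
import numpy as np

def batch_em_iteration(Ys, Ds, gammas, sigmas, batch_size, rng, em_stats_fn):
    """One Batch EM iteration: a random subset phi of points is drawn, their
    sufficient statistics are recomputed via `em_stats_fn` (e.g. (13)-(14)),
    gamma^i for i in phi would be refreshed via (15), and the dictionary update
    (16) uses statistics from phi alone."""
    L = Ys[0].shape[1]
    phi = rng.choice(L, size=batch_size, replace=False)
    new_Ds = []
    for Y, D, sigma in zip(Ys, Ds, sigmas):
        M = D.shape[1]
        U = np.zeros((M, batch_size))
        Sigma_sum = np.zeros((M, M))
        for col, i in enumerate(phi):
            Sigma_x, mu = em_stats_fn(Y[:, i], D, gammas[i], sigma)
            U[:, col] = mu
            Sigma_sum += Sigma_x
        new_Ds.append(Y[:, phi] @ U.T @ np.linalg.inv(U @ U.T + Sigma_sum))
    return new_Ds, phi
```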
B. Scalability with respect to the size of the dictionary

In order to avoid the inversion of the $N_j \times N_j$ matrix required to compute (13), we use the conjugate gradient
algorithm to compute $\mu^i_j$ and approximate $\Sigma^i_{x,j}$ by

$\Sigma^i_{x,j} \approx \left(\mathrm{diag}\left(\sigma_j^{-2} D_j^T D_j + (\Gamma^i)^{-1}\right)\right)^{-1}$  (22)

where, in this case, $\mathrm{diag}(\cdot)$ denotes setting the off-diagonal elements of the input to $0$ [39], [40]. (Due to the matrix inversion lemma, (13) can be computed using $\Gamma^i - \Gamma^i D_j^T (\sigma_j^2 I + D_j \Gamma^i D_j^T)^{-1} D_j \Gamma^i$.)

Table III shows a taxonomy of the MSBDL algorithms considered, using exact or incremental EM and exact or approximate computation of (13)-(14), along with the corresponding worst case computational and memory complexity per EM iteration. The Supplementary Material provides a visualization of the difference between the proposed algorithms.

Algorithm              LC
MSBDL (proposed)       $TLNJ$
MSBDL [5]              $F^J TLNJ$
J1DL [16]              $FTL(MJ)^3$
J0DL [41], [42]        $F^J TLJ(N(M + s + s) + s)$
ℓ1DL [43]              $FTL(M + MNJ)$
K-SVD [41]             $FTL(JN(M + s + s) + s)$
TABLE IV: Learning complexity, where $T$ denotes the number of iterations over the entire training set and $F$ denotes the grid spacing for a grid search over the algorithm hyperparameters. We assume that the hyperparameters consist of $\lambda$ in (1) and (4), $\{\lambda_j\}_{j \in [J]}$ in (3), and $s$ in (2).
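The following sketch (not part of the original paper; the function name and the small jitter are illustrative assumptions) shows how the posterior mean (14) can be obtained matrix-free with conjugate gradients while the covariance is replaced by the diagonal approximation (22).

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def approx_posterior(y, D, gamma, sigma):
    """Posterior moments for large dictionaries: the mean solves
    (sigma^-2 D^T D + Gamma^-1) mu = sigma^-2 D^T y via conjugate gradients,
    and the covariance uses the diagonal approximation (22)."""
    M = D.shape[1]
    gamma_inv = 1.0 / np.maximum(gamma, 1e-12)

    def matvec(z):
        # apply (sigma^-2 D^T D + Gamma^-1) without forming the M x M matrix
        return D.T @ (D @ z) / sigma**2 + gamma_inv * z

    A = LinearOperator((M, M), matvec=matvec)
    mu, _ = cg(A, D.T @ y / sigma**2)
    diag_A = np.einsum('nm,nm->m', D, D) / sigma**2 + gamma_inv
    Sigma_diag = 1.0 / diag_A            # (22): invert only the diagonal entries
    return mu, Sigma_diag
```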
C. Learning complexity

Finally, we consider the learning complexity (LC) of MSBDL, specifically with respect to that of competing approaches, where we define LC to be the (worst-case) computational complexity of running the algorithm and tuning its hyperparameters. In the case of MSBDL, the hyperparameters are learned jointly with the parameters $\theta$, obviating the need for hyperparameter optimization. In the case of J1DL, J0DL, ℓ1DL, and K-SVD, we assume that a grid search is performed to estimate the hyperparameters, with a grid spacing of $F$ for each hyperparameter. Analytical results for LC are shown in Table IV and Fig. 5 plots the ratio of LC for competing approaches to that of MSBDL as a function of $J$ for $N = 100$, $M = 400$, $F = 10$, $s = 10$, $T = L = 1$. For comparison, we also plot the LC of MSBDL-1 with $\lceil N\rceil = 1$ (this assumption may not hold in practice, so the plot of LC for MSBDL-1 represents the best-case scenario). The results show that J0DL and the MSBDL variant in [5] scale least well with $J$, which can be explained by the fact that both require grid-search to tune $J$ hyperparameters. J1DL exhibits worse LC behavior than MSBDL, with the gap between the two methods growing as $J$ increases (the dependence of LC for J1DL on $(MJ)^3$ comes from the need to solve the least squares problem in [(21), [16]], with a matrix of size $MJ \times MJ$, in the worst case). Interestingly, MSBDL-1 exhibits the best LC amongst all of the algorithms. The analysis in this section shows that D4 is a serious concern for several existing approaches, especially [5], which motivates the current approach.

V. MODELING MORE COMPLEX RELATIONSHIPS
Fig. 5: Learning complexity.

Fig. 6: Prototype branches for the atom-to-subspace (6a) and hierarchical sparsity (6b) models.

The drawback of the one-to-one prior in (8) is that it constrains $M_j = M \; \forall j$. In the following, we propose two models which allow $M_j$ to be modality-dependent. For ease of exposition, we set $J = 2$, but the models we describe can be readily expanded to $J > 2$. We propose to organize $\underline{x}$ into a tree with $K$ disjoint branches. We adopt the convention that the elements of $x_1$ and $x_2$ form the roots and leaves of the tree, respectively (we adopt this convention without loss of generality since the modalities can be re-labeled arbitrarily). The root of the $k$'th branch is $x_1[k]$ and the leaves are indexed by $T_k \subseteq [M_2]$. The defining property of the models we propose is the relationship between the sparsity pattern of the root and leaf levels.

A. Atom-to-subspace sparsity
The one-to-one prior in (8) can be viewed as linking the one-dimensional subspaces spanned by $d^m_1$ and $d^m_2$ for $m \in [M]$. Whenever $d^m_1$ is used to represent $y_1$, $d^m_2$ is used to represent $y_2$, and vice-versa. The extension to the multi-dimensional subspace case stipulates that if $d^k_1$ is used to represent $y_1$, then $\{d^m_2\}_{m \in T_k}$ is used to represent $y_2$, and vice-versa. This model does not constrain $|T_k|$ to be the same for all $k$, such that $M_2$ can be chosen independently of $M_1$.

Let $\gamma_B \in \mathbb{R}_+^K$, in contrast to Section II where $\gamma \in \mathbb{R}_+^M$. We encode the atom-to-subspace sparsity prior by assigning a single hyperparameter $\gamma_B[k]$ to each branch $k$. The distribution $p(\underline{x} \mid \gamma_B)$ is then given by

$\prod_{k \in [K]} \mathcal{N}(x_1[k]; 0, \gamma_B[k]) \prod_{m \in T_k} \mathcal{N}(x_2[m]; 0, \gamma_B[k])$  (23)

(see the Supplementary Material for a derivation of the resulting prior on $\underline{x}$). Fig. 6a shows a prototype branch under the atom-to-subspace prior. Inference for the prior in (23) proceeds in much the same way as in Section II-A. The form of the marginal likelihood in (10) and posterior in (12) remain the same, with the exception that $\Sigma^i_{y,j}$ and $\Sigma^i_{x,j}$ are re-defined to be

$\Sigma^i_{y,j} = \sigma_j^2 I + D_j \Gamma^i_j D_j^T$  (24)

$\Sigma^i_{x,j} = \left(\sigma_j^{-2} D_j^T D_j + (\Gamma^i_j)^{-1}\right)^{-1}, \quad \Gamma^i_1 = \mathrm{diag}(\gamma^i_B),$

where $\Gamma^i_2$ is a diagonal matrix whose $[m,m]$'th entry is $\gamma^i_B[k]$ for $m \in T_k$. The update of $\gamma^i_B$ is given by

$(\gamma^i_B[k])^{t+1} = \dfrac{\Sigma^i_{x,1}[k,k] + (\mu^i_1[k])^2 + \sum_{m \in T_k} \Sigma^i_{x,2}[m,m] + (\mu^i_2[m])^2}{1 + |T_k|},$

while the update of $D_j$ remains identical to (16). There are many ways to extend the atom-to-subspace prior for $J > 2$, depending on the specific application. One possibility is to simply append more branches to the root at $x_1[k]$ corresponding to coefficients from modalities $j > 2$.

One problem is that, for $|T_k| > 1$, the atoms of $D_2$ indexed by $T_k$ are not identifiable. The reason for the identifiability issue is that $D_2$ appears in the objective function in (10) only through the $D_2 \Gamma_2 D_2^T$ term in (24), which can be written as

$D_2 \Gamma_2 D_2^T = \sum_{k \in [K]} \gamma_B[k] \sum_{m \in T_k} d^m_2 (d^m_2)^T.$  (25)

Therefore, any $D_2'$ which satisfies $\sum_{m \in T_k} d'^m_2 (d'^m_2)^T = \sum_{m \in T_k} d^m_2 (d^m_2)^T$ for all $k$ achieves the same objective function value as $D_2$. Since the objective function is agnostic to the individual atoms of $D_2$, the performance of this model is severely upper-bounded in terms of the ability to recover $D_2$ (see Section VIII-D). In the following, we propose an alternative model which circumvents the identifiability problem.
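The sketch below (not from the original paper; the toy tree and dictionary sizes are illustrative assumptions) constructs the leaf-level diagonal matrix $\Gamma_2$ from branch hyperparameters and numerically verifies the decomposition (25), which is what makes individual leaf atoms non-identifiable when $|T_k| > 1$.

```python
import numpy as np

def gamma_leaf_matrix(gamma_B, T):
    """Expand branch hyperparameters gamma_B into the diagonal matrix Gamma_2 whose
    [m, m] entry equals gamma_B[k] for every leaf m in T_k (atom-to-subspace prior)."""
    M2 = sum(len(Tk) for Tk in T)
    g = np.zeros(M2)
    for k, Tk in enumerate(T):
        for m in Tk:
            g[m] = gamma_B[k]
    return np.diag(g)

# toy tree: branch 0 owns leaf {0}, branch 1 owns leaves {1, 2}
T = [[0], [1, 2]]
gamma_B = np.array([0.5, 2.0])
Gamma2 = gamma_leaf_matrix(gamma_B, T)

rng = np.random.default_rng(0)
D2 = rng.standard_normal((10, 3))
# (25): D_2 Gamma_2 D_2^T depends only on branch-level sums of outer products,
# which is the source of the identifiability issue for |T_k| > 1
lhs = D2 @ Gamma2 @ D2.T
rhs = sum(gamma_B[k] * sum(np.outer(D2[:, m], D2[:, m]) for m in Tk)
          for k, Tk in enumerate(T))
assert np.allclose(lhs, rhs)
```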
B. Hierarchical Sparsity

In this section, we propose a model which allows the root of each branch to control the sparsity of the leaves, but not vice-versa. Specifically, we stipulate that if $x_1[k] = 0$, then $x_2[m] = 0 \; \forall m \in T_k$. Hierarchical sparsity was first studied in [44], [45], [46] and later incorporated into a unimodal dictionary learning framework in [47]. Later, Bayesian hierarchical sparse signal recovery techniques were developed, which form the basis for the following derivation [48], [49].

From an optimization point of view, hierarchical sparsity can be promoted through a composite regularizer [46]. In this case, the regularizer could take the form

$\sum_{k \in [K], m \in T_k} \left\|[x_1[k] \; x_2[m]]\right\|_2 + |x_2[m]|.$  (26)

As described in [46], the key to designing a composite regularizer for a given root-leaf pair is to measure the group norm of the pair along with the energy of the leaf alone (the exact form of the regularizer depends on how the energy in a given group is measured).
The combination of the group and individual norms serves two purposes which, jointly, promote hierarchical sparsity [46]:

R1 It is possible that $x_2[m] = 0$, $m \in T_k$, without requiring $x_1[k] = 0$.

R2 The infinitesimal penalty on $x_1[k]$ deviating from $0$ tends to $0$ for $|x_2[m]| > 0$, $m \in T_k$.

The choice of $\ell_2$-norm in (26) guarantees that the regularizer satisfies R2 [Theorem 1, [46]]. The conditions of [Theorem 1, [46]] are violated if the $\ell_2$ norm is replaced by an $\ell_1$ norm in (26).

In a Bayesian setting, we can mimic the effect of the regularizer in (26) through an appropriately defined prior on $\underline{x}$. Let

$\tilde{x}_j = S_j x_j, \quad S_1 \in \mathbb{B}^{M_2 \times M_1}, \quad S_2 = [I \; I]^T$  (27)

where $S_1$ is a binary matrix such that $S_1[m,k] = 1$ if and only if $m \in T_k$. Let $R_j$ be a diagonal matrix such that $S_j^T S_j R_j = I$ (a diagonal $R_j$ is guaranteed to exist because $S_j^T S_j$ is itself a diagonal matrix) and define $\hat{x}_j = S_j R_j x_j$. Let $\gamma_1$ and $\gamma_2$ be the root- and leaf-level hyperparameter vectors and

$p(\underline{x} \mid \underline{\gamma}) = \mathcal{N}(\hat{x}_1; 0, \Gamma_1)\, \mathcal{N}(\hat{x}_2; 0, \Gamma_2),$  (28)

where $\Gamma_1 = \mathrm{diag}(\gamma_1)$ and $\Gamma_2 = \mathrm{diag}([\gamma_1^T \; \gamma_2^T]^T)$. A prototype branch for the hierarchical sparsity prior is shown in Fig. 6b. To see how this model leads to hierarchical sparsity, observe that

$p(x_1[k] \mid \underline{\gamma}) = \mathcal{N}(0, \gamma_1[k])$

$p(x_2[m] \mid \underline{\gamma}) = \mathcal{N}\left(0, \left(\gamma_1^{-1}[k] + \gamma_2^{-1}[m]\right)^{-1}\right)$  (29)

for $m \in T_k$. If we infer that $\gamma_1[k] = 0$, then the prior on both $x_1[k]$ and $x_2[m]$, $\forall m \in T_k$, reduces to a Dirac-delta function, i.e. if the root is zero, then the leaves must also be zero. On the other hand, if $\gamma_2[m]$ is inferred to be $0$, only the prior on $x_2[m]$ is affected, i.e. leaf sparsity does not imply root sparsity.

Inference for the tree-structured model proceeds in a similar fashion to that shown in Section II-A, with a few variations. The goal is to optimize (9) through the EM algorithm. The difference here is that we use $\{\hat{\underline{X}}, \underline{Y}, \theta\}$ and $\hat{\underline{X}}$ as the complete and nuisance data, respectively. In order to carry out inference, we must first find the posterior density $p(\hat{\underline{X}} \mid \underline{Y}, \theta)$. It is helpful to first derive the signal model in terms of $\hat{x}$ [48]:

$p(y_j \mid D_j, \hat{x}_j, \sigma_j) = \mathcal{N}(D_j S_j^T \hat{x}_j, \sigma_j^2 I).$  (30)

Using (30), it can be shown that

$p(\hat{x}^i_j \mid y^i_j, \theta) = \mathcal{N}(\mu^i_{\hat{x},j}, \Sigma^i_{\hat{x},j})$  (31)

$\Sigma^i_{\hat{x},j} = \left(\sigma_j^{-2} S_j D_j^T D_j S_j^T + (\Gamma^i_j)^{-1}\right)^{-1}$  (32)

$\mu^i_{\hat{x},j} = \sigma_j^{-2} \Sigma^i_{\hat{x},j} S_j D_j^T y^i_j.$  (33)

The likelihood function itself is different from (10)-(11) and is given by $p(y^i_j \mid \theta) = \mathcal{N}(0, \Sigma^i_{\hat{y},j})$, where $\Sigma^i_{\hat{y},j} = \sigma_j^2 I + D_j S_j^T \Gamma^i_j S_j D_j^T$. The EM update rules are given by

$(\gamma^i_j[m])^{t+1} = 0.5 \sum_{j'=1}^{2} \Sigma^i_{\hat{x},j'}[m,m] + (\mu^i_{\hat{x},j'}[m])^2$ if $m \le M_1$, and $(\gamma^i_j[m])^{t+1} = \Sigma^i_{\hat{x},j}[m,m] + (\mu^i_{\hat{x},j}[m])^2$ else  (34)

$(D_j)^{t+1} = Y_j U_j^T S_j \left( S_j^T \left( U_j U_j^T + \sum_{i \in [L]} \Sigma^i_{\hat{x},j} \right) S_j \right)^{-1}.$  (35)
Extending the hierarchical sparsity prior for $J > 2$ is straightforward and depends on the specific application being considered. It is possible to simply append more leaves to each root $x_1[k]$ corresponding to coefficients from modality $j > 2$. Another possibility is to treat each $x_2[m]$ as itself a root with leaves from $x_3$, assigning a hyperparameter to each $x_2$-$x_3$ root-leaf pair as well as to each $x_3$ leaf, and repeating the process until all modalities are incorporated into the tree.
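To illustrate the structure of the hierarchical prior, the following sketch (not from the original paper; the tree, the helper names, and the toy hyperparameter values are illustrative assumptions) builds the binary assignment matrix $S_1$ from the branch sets $T_k$ and evaluates the effective prior variances implied by (29), showing that a pruned root forces its leaves to zero while a pruned leaf leaves the root untouched.

```python
import numpy as np

def build_S1(T, M1, M2):
    """Binary matrix with S1[m, k] = 1 iff leaf m belongs to branch k (see (27))."""
    S1 = np.zeros((M2, M1), dtype=int)
    for k, Tk in enumerate(T):
        for m in Tk:
            S1[m, k] = 1
    return S1

def prior_variances(gamma1, gamma2, T):
    """Marginal prior variances implied by (29): root variance gamma1[k] and leaf
    variance (gamma1[k]^-1 + gamma2[m]^-1)^-1, so gamma1[k] = 0 forces the leaves
    of branch k to zero, while gamma2[m] = 0 only zeroes out leaf m."""
    root_var = gamma1.copy()
    leaf_var = np.zeros_like(gamma2)
    for k, Tk in enumerate(T):
        for m in Tk:
            if gamma1[k] == 0.0 or gamma2[m] == 0.0:
                leaf_var[m] = 0.0
            else:
                leaf_var[m] = 1.0 / (1.0 / gamma1[k] + 1.0 / gamma2[m])
    return root_var, leaf_var

T = [[0], [1, 2]]                     # branch 0 -> leaf 0, branch 1 -> leaves 1, 2
gamma1 = np.array([0.0, 1.0])         # pruned root 0: its leaf is forced to zero
gamma2 = np.array([1.0, 1.0, 4.0])
print(build_S1(T, M1=2, M2=3))
print(prior_variances(gamma1, gamma2, T))
```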
C. Avoiding Poor Stationary Points

For both the atom-to-subspace and hierarchical sparsity models, we experimentally observe that MSBDL tends to get stuck in undesirable stationary points. In the following, we describe the behavior of MSBDL in these situations and offer a solution. Suppose that data is generated according to the atom-to-subspace model, where $D_j$ denotes the true dictionary for modality $j$. In this scenario, we experimentally observe that MSBDL performs well when $|T_k| = c \; \forall k$. On the other hand, if $|T_k|$ varies as a function of $k$, MSBDL tends to get stuck in poor stationary points, where the quality of a stationary point is (loosely) defined next. Let $|T_k| = 1$ for all $k$ except $k'$, for which $|T_{k'}| = 2$, i.e. $M_2 = M_1 + 1$. In this case, MSBDL is able to recover $D_1$, but recovers only the atoms of $D_2$ which are indexed by $T_k \; \forall k \ne k'$.

To avoid poor stationary points, we adopt the following strategy. If the tree describing the assignment of columns of $D_2$ to those of $D_1$ is unbalanced, i.e. $|T_k|$ varies with $k$, then we first balance the tree by adding additional leaves (a balanced tree is one which has the same number of leaves for each subtree, i.e. $|T_k| = c$). Let $\hat{M}_2$ be the number of leaves in the balanced tree. We run MSBDL until convergence to generate $\hat{D}_2$. Finally, we prune away $\hat{M}_2 - M_2$ columns of $\hat{D}_2$ using the algorithm in Fig. 7.
VI. TASK DRIVEN MSBDL (TD-MSBDL)

In the following, we describe a task-driven extension of the MSBDL algorithm. For purposes of exposition, we assume the one-to-one prior in (8), but the approach applies equally to the priors discussed in Section V, as discussed in Section VIII. To incorporate task-driven learning, we modify the MSBDL graphical model to the one shown in Fig. 8. We set $p(h \mid x_j, W_j) = \mathcal{N}(W_j x_j, \beta_j^2 I)$, where $\beta_j$ is the class label noise standard deviation for modality $j$. The class label noise standard deviation is modality dependent, affording the model an extra level of flexibility compared to [16], [13], [20].
Require: $D$, $U$, $\{T_k\}_{k \in [K]}$, $M_p$, $S$, $\epsilon$
Let $z[m] = \|(S^T U)[m,:]\|_2 \; \forall m$ and $v[k] = \max_{m_1, m_2 \in T_k \text{ s.t. } m_1 \ne m_2} \dfrac{|(d^{m_1})^T d^{m_2}|}{\|d^{m_1}\|_2 \|d^{m_2}\|_2}$
while $\exists k$ s.t. $v[k] > \epsilon$ and $|T_k| > 1$ and $M_p > 0$ do
  $k = \arg\max_k v[k]$
  Find $m = \arg\min_{m \in T_k} z[m]$
  Remove column $m$ from $D$ and $T_k$
  $M_p \leftarrow M_p - 1$
end while
while $M_p > 0$ do
  Find $m = \arg\min_{m \in T_k \text{ s.t. } |T_k| > 1} z[m]$
  Remove column $m$ from $D$ and $T_k$
  $M_p \leftarrow M_p - 1$
end while
Fig. 7: Pruning algorithm for the learning strategy in Section V-C. $M_p$ denotes the number of columns to be pruned, $S$ is given by the identity matrix for the atom-to-subspace model and by $S_1$ in (27) for the hierarchical sparsity model, and $U$ denotes the matrix of first order sufficient statistics.

Fig. 8: Graphical model for two modality TD-MSBDL.

The choice of the Gaussian distribution for the conditional density of $h$, as opposed to a multinomial or softmax, stems from the fact that the posterior $p(x_j \mid y_j, h, D_j, W_j)$ needed to perform EM remains computable in closed form, i.e.

$p(x^i_j \mid y^i_j, h^i, D_j, W_j) = \mathcal{N}\left(\mu^{TD,i}_j, \Sigma^{TD,i}_{x,j}\right)$

where

$\Sigma^{TD,i}_{x,j} = \left(\sigma_j^{-2} D_j^T D_j + \beta_j^{-2} W_j^T W_j + (\Gamma^i)^{-1}\right)^{-1}$  (36)

$\mu^{TD,i}_j = \Sigma^{TD,i}_{x,j}\left(\sigma_j^{-2} D_j^T y^i_j + \beta_j^{-2} W_j^T h^i\right).$  (37)
We employ EM to optimize arg max θ TD log p (cid:0) Y , H | θ T D (cid:1) (38)where θ T D = { θ, W } . It can be shown that p (cid:0) y j , h | x j , θ T D (cid:1) = N (cid:16)(cid:2) y Tj h T (cid:3) T ; 0 , Σ T Dy,j (cid:17) where Σ T Dy,j = (cid:20) σ j I + D j Γ D Tj β j I + W j Γ W Tj (cid:21) . (39) The update rules for D and (cid:8) γ i (cid:9) i ∈ [ L ] remain identicalto (15) and (16), respectively, with the exception that themodified posterior statistics shown in (36)-(37) are used. Theupdate of W j is given by W t +1 j = H (cid:0) U T Dj (cid:1) T U T Dj (cid:0) U T Dj (cid:1) T + (cid:88) i ∈ [ L ] Σ T D,ix,j − (40) U T Dj = (cid:104) µ T D, j · · · µ T D,Lj (cid:105) . (41)We also find it useful to add regularization in the form of ν (cid:80) j ∈ [ J ] (cid:107) W j (cid:107) F , leading to the update rule W t +1 j = (cid:104) H (cid:16) U TDj (cid:17) T (cid:105) (cid:16)(cid:104) U TDj (cid:16) U TDj (cid:17) T + (cid:80) i ∈ [ L ] Σ TD,ix,j √ ν I (cid:105)(cid:17) − . (42) TD-MSBDL has the same worst-case computational complex-ity as MSBDL with the benefit of supervised learning.
B. Complete Algorithm
Supervised learning algorithms are ultimately measured by their performance on test data. While a given algorithm may perform well on training data, it may generalize poorly to test data (in the supervised learning community, lack of generalization to test data is commonly referred to as over-fitting the training data). To maintain the generalization properties of the model, it is common to split the training data into a training set $\{\underline{Y}, H\}$ and validation set $\{\underline{Y}^V, H^V\}$, where the number of training points $L$ does not necessarily have to equal the number of validation points $L^V$. The validation set is then used during the training process as an indicator of generalization, as summarized by the following rule of thumb: Continue optimizing $\theta^{TD}$ until performance on the validation set stops improving. In the context of TD-MSBDL, the concept of generalization has a natural Bayesian definition: The parameter set $\{\theta^{TD}, \underline{\beta}\}$ which achieves optimal generalization solves

$\arg\max_{\theta^{TD} \in \mathcal{H}, \underline{\beta}} \prod_{j \in [J]} p(H^V \mid \theta^{TD}, \beta_j),$  (43)

where $\mathcal{H}$ denotes the set of solutions to (38). Note that $p(H^V \mid \theta^{TD}, \beta_j) = p(H^V \mid W_j, \beta_j)$, which is intractable to compute since it requires integrating $p(h^{V,i} \mid W_j, \gamma^{V,i}, \beta_j) = \mathcal{N}(0, \beta_j^2 I + W_j \Gamma^{V,i} W_j^T)$ over $\gamma^{V,i}$, where $\Gamma^{V,i} = \mathrm{diag}(\gamma^{V,i})$. As such, we approximate $p(H^V \mid W_j, \beta_j)$ by $p(H^V \mid W_j, \gamma^{*,V,i}, \beta_j)$, where $\gamma^{*,V,i}$ is the output of MSBDL with fixed $D_j$ for input data $y^{V,i}_j$ (assuming that the prior in (8) is used), leading to the tractable optimization problem

$\arg\max_{\theta^{TD} \in \mathcal{H}, \underline{\beta}} \prod_{j \in [J]} p\left(H^V \mid W_j, \{\gamma^{*,V,i}\}_{i \in [L^V]}, \beta_j\right).$  (44)

What remains is to select $\underline{\beta}$. As $\beta_j$ decreases, TD-MSBDL fits the parameters $\theta^{TD}$ to the training data to a larger degree, i.e. the optimizers of (38) achieve increasing objective function values. Since direct optimization of (44) over $\underline{\beta}$ presents the same challenges as the optimization of $\underline{\sigma}$, we propose an annealing strategy which proposes progressively smaller values of $\beta_j$ until the objective in (44) stops improving:

$\beta_j^{t+1} = \tilde{\beta}_j^{t+1}$ if $\log p\left(H^V \mid W_j, \{\gamma^{*,V,i}\}_{i \in [L^V]}, \tilde{\beta}_j^{t+1}\right) > \log p\left(H^V \mid W_j, \{\gamma^{*,V,i}\}_{i \in [L^V]}, \beta_j^t\right)$, and $\beta_j^{t+1} = \beta_j^t$ else,  (45)

where $\beta_j^0 > \beta_\infty \ge 0$, $\alpha_\beta < 1$, $\tilde{\beta}_j^{t+1} = \max(\beta_\infty, \alpha_\beta \beta_j^t)$, and we make the same recommendations for setting $\underline{\beta}^0$ and $\beta_\infty$ as for $\underline{\sigma}^0$ and $\sigma_\infty$ in Section III. Computing (45) is computationally intensive because it requires running MSBDL, so we only update $\log p(H^V \mid W_j, \{\gamma^{*,V,i}\}_{i \in [L^V]}, \beta_j^t)$ every $T^V$ iterations. Due to space considerations, the complete TD-MSBDL algorithm is provided in the Supplementary Material.

Given test data $Y_j^{test}$, we first run MSBDL with $D_j$ fixed and treat $\mu^{test,i}_j$ as an estimate of $x^{test,i}_j$. The data is then classified according to $\arg\max_{c \in [C]} (W_j \mu^{test,i}_j)[c]$, where $e_c$ refers to the $c$'th standard basis vector [16].
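The test-time rule described above can be sketched as follows (not from the original paper; `infer_gamma_fn` is a hypothetical placeholder for the fixed-dictionary MSBDL run that produces $\gamma$ for a test point, and the jitter is an illustrative assumption).

```python
import numpy as np

def classify(Y_test, D, W, sigma, infer_gamma_fn):
    """Section VI-B test rule: run MSBDL inference with D fixed to obtain the posterior
    means mu for each test point, then predict argmax_c (W mu)[c]."""
    labels = []
    for i in range(Y_test.shape[1]):
        y = Y_test[:, i]
        gamma = infer_gamma_fn(y, D, sigma)
        Gamma_inv = np.diag(1.0 / np.maximum(gamma, 1e-12))
        Sigma = np.linalg.inv(D.T @ D / sigma**2 + Gamma_inv)   # (13)
        mu = Sigma @ D.T @ y / sigma**2                          # (14)
        labels.append(int(np.argmax(W @ mu)))
    return np.array(labels)
```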
VII. ANALYSIS

We begin by analyzing the convergence properties of MSBDL. Note that MSBDL is essentially a block-coordinate descent algorithm with blocks $\theta$ and $\underline{\sigma}$. Therefore, we first analyze the $\theta$ update block, then the $\underline{\sigma}$ update block, and finally the complete algorithm. Unless otherwise specified, we assume a non-informative prior on $\gamma$ and the one-to-one prior (extension of Theorems 2-6 to the atom-to-subspace and hierarchical priors is straightforward, but omitted for brevity). While MSBDL relies heavily on EM, it is not strictly an EM algorithm because of the $\underline{\sigma}$ annealing procedure. It will be shown that MSBDL still admits a convergence guarantee and an argument will be presented for why annealing $\underline{\sigma}$ produces favorable results in practice. Although we focus specifically on MSBDL, the results can be readily extended to TD-MSBDL, with details omitted for brevity. Proofs for all results are shown in the Supplementary Material.

We first prove that the set of iterates $\{\theta^t\}_{t=1}^{\infty}$ produced by the inner loop of MSBDL converges to the set of stationary points of the log-likelihood with respect to $\theta$. This is established by proving that the objective function is coercive (Theorem 1), which means that $\{\theta^t\}_{t=1}^{\infty}$ admits at least one limit point (Corollary 1), and then proving that limit points are stationary (Theorem 2).

Theorem 1:
Let $\{\sigma_j > 0\}_{j \in [J]}$, let there be at least one $j^*$ for each $m$ such that $\|d^m_{j^*}\|_2 > 0$, and let $\exists i$ such that $\gamma^i[m] > 0$ for any choice of $m$. Then, $-\log p(\underline{Y} \mid \theta, \underline{\sigma})$ is a coercive function of $\{\theta, \underline{\sigma}\}$.

Corollary 1:
Let the conditions of Theorem 1 be satisfied. Then, the sequence $\{\theta^t\}_{t=1}^{\infty}$ produced by the inner loop of the MSBDL algorithm admits at least one limit point.

The proof of Corollary 1 is shown in the Appendix, but the high level argument is that $\{\theta^t\}_{t=1}^{\infty}$ is a member of a compact set. As such, $\{\theta^t\}_{t=1}^{\infty}$ admits at least one convergent subsequence. Since there may be multiple convergent subsequences, Corollary 1 does not preclude $\{\theta^t\}_{t=1}^{\infty}$ having multiple limit points, or a limit set. The point is that $\{\theta^t\}_{t=1}^{\infty}$ admits one or more limit points, and Theorem 2 below proves each of these points to be stationary.

Theorem 2:
Let the conditions of Corollary 1 be satisfied, $D_j^t$ be full rank for all $t$ and $j$, $\underline{\sigma}$ be fixed, and generate $\{\theta^t\}_{t=1}^{\infty}$ using the inner loop of MSBDL. Then, $\{\theta^t\}_{t=1}^{\infty}$ converges to the set of stationary points of $\log p(\underline{Y} \mid \theta, \underline{\sigma})$. Moreover, $\{\log p(\underline{Y} \mid \theta^t, \underline{\sigma})\}_{t=1}^{\infty}$ converges monotonically to $\log p(\underline{Y} \mid \theta^*, \underline{\sigma})$, for stationary point $\theta^*$.

The requirement that $D_j^t$ be full rank is easily satisfied in practice as long as $L > N_j$. Note that the SBL algorithm in [24] is a special case of MSBDL for fixed $\underline{D}$ and $J = 1$. To the best of our knowledge, Theorem 2 represents the first result in the literature guaranteeing the convergence of SBL. A similar result to Theorem 2 can be given in the stochastic EM regime.

Theorem 3:
Let $\underline{\sigma}$ be fixed, $U_j^t$ be full rank for all $t, j$, and generate $\{\theta^t\}_{t=1}^{\infty}$ using the inner loop of MSBDL, only updating the sufficient statistics for a batch of points at each iteration (i.e. incremental EM). Then, the limit points of $\{\theta^t\}_{t=1}^{\infty}$ are stationary points of $\log p(\underline{Y} \mid \theta, \underline{\sigma})$.

If we consider the entire MSBDL algorithm, i.e. including the update of $\underline{\sigma}$, we can still show that MSBDL is convergent.

Corollary 2:
Let the conditions of Theorems 1 and 2 be satisfied. Then, the sequence $\{\theta^t, \underline{\sigma}^t\}_{t=1}^{\infty}$ produced by the MSBDL algorithm admits at least one limit point.

Although the preceding results establish that MSBDL has favorable convergence properties, the question still remains as to why we choose to anneal $\sigma_j$. At a high level, it can be argued that setting $\sigma_j$ to a large value initially and then gradually decreasing it prevents MSBDL from getting stuck in poor stationary points with respect to $\theta$ at the beginning of the learning process. To motivate this intuition, consider the log likelihood function in (9), which decomposes into a sum of $J$ modality-specific log-likelihoods. The curvature of the log-likelihood for modality $j$ depends directly on $\sigma_j$. Setting a high $\sigma_j$ corresponds to choosing a relatively flat log likelihood surface, which, from an intuitive point of view, has fewer stationary points. This intuition can be formalized in the scenario where $D_j$ is constrained in a special way.

Theorem 4:
Let $\sigma_j^{(1)} > \sigma_j^{(2)}$, $\Psi_j = \{D_j : D_j = [\check{D}_j \; I]\}$, and $\Omega_{\sigma_j, j} = \{\Sigma_{y,j} : \Sigma_{y,j} = \sigma_j^2 I + D_j \Gamma D_j^T, D_j \in \Psi_j\}$. Then, $\Omega_{\sigma_j^{(1)}, j} \subseteq \Omega_{\sigma_j^{(2)}, j}$.

Theorem 4 suggests that as $\sigma_j$ gets smaller, the space over which the log-likelihood in (9) is optimized grows. As the optimization space grows, the number of stationary points grows as well (this holds only for the constrained scenario in Theorem 4). As a result, it may be advantageous to slowly anneal $\sigma_j$ in order to allow MSBDL to learn $D_j$ without getting stuck in a poor stationary point.

If we constrain the space over which $D_j$ is optimized to $\Psi_j$, as in Theorem 4, then we can establish a number of interesting convergence results.

Theorem 5:
Let $\alpha_\sigma$ be arbitrarily close to $1$, $\sigma_\infty = 0$, $\sigma_j^0 \ge \arg\max_{\sigma_j} \max_\theta \log p(Y_j \mid \theta, \sigma_j)$, $D_j \in \Psi_j$, $\theta^t = \arg\max_\theta \log p(Y_j \mid \theta, \sigma_j^{t-1})$, and consider updating $\sigma_j$ using (20). Then, $\sigma_j^t = \sigma_j^{t-1}$ implies $\sigma_j^t = \arg\max_{\sigma_j} \log p(Y_j \mid \theta^t, \sigma_j)$.
Theorem 5 states that, under certain conditions, annealing terminates for a given modality $j$ at a global maximum of the log-likelihood with respect to $\sigma_j$ for fixed $\theta^t$. The conditions of Theorem 5 ensure that (20) only terminates at stationary points of the log-likelihood.

Theorem 6:
Let the conditions of Theorems 2 and 5 and Corollary 2 be satisfied. Then, the sequence $\{\theta^t, \underline{\sigma}^t\}_{t=1}^{\infty}$ produced by MSBDL converges to the set of stationary points of $\log p(\underline{Y} \mid \theta, \underline{\sigma})$.

Convergence results like Theorem 6 cannot be established for K-SVD and J0DL because they rely on greedy search techniques. In addition, no convergence results are presented in [16] for J1DL.

Finally, we consider what guarantees can be given for dictionary recovery in the noiseless setting, i.e. $Y_j = D_j X_j \; \forall j$. We assume $M_j = M \; \forall j$ and that $\underline{x}^i$ share a common sparsity profile for all $i$. We do not claim that the following result applies to more general cases when $M_j \ne M \; \forall j$ or different priors. The question we seek to answer is: Under what conditions is the factorization $Y_j = D_j X_j$ unique? Due to the nature of dictionary learning, uniqueness can only be considered up to permutation and scale. Two dictionaries $D^{(1)}$ and $D^{(2)}$ are considered equivalent if $D^{(1)} = D^{(2)} Z$, where $Z$ is a permutation and scaling matrix. Guarantees of uniqueness in the unimodal setting were first studied in [50]. The results relied on several assumptions about the data generation process.

Assumption 1:
Let $s = \|x^i_j\|_0 \; \forall i, j$, and $s < \frac{\mathrm{spark}(D_j)}{2}$, where $\mathrm{spark}(D_j)$ is the minimum number of columns of $D_j$ which are linearly dependent [51]. Each $x^i_j$ has exactly $s$ non-zeros.

Assumption 2: $Y_j$ contains at least $s + 1$ different points generated from every combination of $s$ atoms of $D_j$.

Assumption 3:
The rank of every group of $s + 1$ points generated by the same $s$ atoms of $D_j$ is exactly $s$. The rank of every group of $s + 1$ points generated by different atoms of $D_j$ is exactly $s + 1$.

Lemma 1 (Theorem 3, [50]):
Let Assumptions 1-3 be true. Then, $Y_j$ admits a unique factorization $D_j X_j$. The minimum number of samples required to guarantee uniqueness is given by $(s + 1)\binom{M}{s}$.

Treating the multimodal dictionary learning problem as $J$ independent unimodal dictionary learning problems, the following result follows from Lemma 1.

Corollary 3:
Let Assumptions 1-3 be true for all $j$. Then, $Y_j$ admits a unique factorization $D_j X_j$ for all $j$. The minimum number of samples required to guarantee uniqueness is given by $J(s + 1)\binom{M}{s}$.

As the experiments in Section VIII will show, there are benefits to jointly learning multimodal dictionaries. It is therefore interesting to inquire whether or not there are provable benefits to the multimodal dictionary learning problem, at least from the perspective of the uniqueness of factorizations. To formalize this intuition, consider the scenario where some data points $i$ do not have data available for all modalities. Let the Boolean matrix $P \in \mathbb{B}^{J \times L}$ be defined such that $P[j,i]$ is $1$ if data for modality $j$ is available for instance $i$ and $0$ else. The conditions on the amount of data needed to guarantee unique recovery of $\underline{D}$ by Corollary 3 can be restated as $L = (s + 1)\binom{M}{s}$ and $P[j,i] = 1 \; \forall j, i$. The natural question to ask next is: Can uniqueness of factorization be guaranteed if $P[j,i] = 0$ for some $(j,i)$?

Theorem 7:
Let $\underline{x}^i$ share a common sparsity profile for all $i$ and Assumptions 1-2 be true for all $j$. Let Assumption 3 be true for a single $j^*$. Let $G^k_j = \{i : P[j,i] = 1 \text{ and } y^i_j \in \mathrm{span}(D_j[:, \Upsilon_k])\}$ where $\Upsilon_k$ is the $k$'th subset of size $s$ of $[M]$. Let $|G^k_j \cap G^k_{j^*}| \ge s$ for all $j \ne j^*$ and $k$. Then, the factorization $Y_j = D_j X_j$ is unique for all $j$. The minimum total number of data points required to guarantee uniqueness is given by $J\left(s + \frac{1}{J}\right)\binom{M}{s}$.

Theorem 7 establishes that the number of samples required to guarantee a unique solution to the multimodal dictionary learning problem is strictly less than in Corollary 3.

VIII. RESULTS
A. Synthetic Data Dictionary Learning
To validate how well MSBDL is able to learn unimodal and multimodal dictionaries, we conducted a series of experiments on synthetic data. We adopt the setup from [52] and generate ground-truth dictionaries $D_j$ by sampling each element from a $\mathcal{N}(0,1)$ distribution and scaling the resulting matrices to have unit $\ell_2$ column norm. We then generate $x^i_j$ by randomly selecting $s = 5$ indices and generating the non-zero entries by drawing samples from a $\mathcal{N}(0,1)$ distribution. The supports of $\underline{x}^i$ are constrained to be the same, while the coefficients are not. The elements of $v^i_j$ are generated by drawing samples from a $\mathcal{N}(0,1)$ distribution and scaling the resulting vector in order to achieve a specified Signal-to-Noise Ratio (SNR). We use $L = 1000$ and simulate both bimodal and trimodal datasets. The bimodal dataset consists of 30 dB ($j = 1$) and 10 dB ($j = 2$) SNR modalities. The trimodal dataset consists of 30 dB ($j = 1$), 20 dB ($j = 2$), and 10 dB ($j = 3$) SNR modalities. We use the empirical probability of recovering $\underline{D}$ as the measure of success, which is given by

$\frac{1}{M_j} \sum_{m \in [M_j]} \left[ \iota\left(d^m_j, \hat{D}_j\right) > 0.99 \right]$  (46)

$\iota\left(d^m_j, \hat{D}_j\right) = \max_{1 \le m' \le M_j} \frac{\left|(d^m_j)^T \hat{d}^{m'}_j\right|}{\|d^m_j\|_2 \|\hat{d}^{m'}_j\|_2},$  (47)

where $\hat{D}_j$ denotes the output of the dictionary learning algorithm and $[\cdot]$ denotes the indicator function. The experiment is repeated multiple times and averaged results are reported.

We compare MSBDL with ℓ1DL (http://spams-devel.gforge.inria.fr/downloads.html), K-SVD (∼ronrubin/software.html), J1DL, and J0DL. While code for J1DL was publicly available, it could not be run on any of our Windows or Linux machines, so we used our own implementation. Code for J0DL was not publicly available, so we used our own implementation. For all algorithms, the batch size was set to $L$.

For the bimodal setting, the parameters $\sigma_1^0$ and $\sigma_2^0$ were set separately for the two modalities; for the trimodal setting, $\sigma_1^0$, $\sigma_2^0$, and $\sigma_3^0$ were set analogously. In both cases, the same $\alpha_\sigma$ and $\sigma_\infty$ were used across experiments, where this choice of $\sigma_\infty$ corresponds to the lowest candidate $\lambda$ in the cross-validation procedure for competing algorithms.
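The success measure (46)-(47) can be computed as in the following sketch (not from the original paper; the function name is illustrative, and the 0.99 threshold follows (46)).

```python
import numpy as np

def recovery_rate(D_true, D_hat, thresh=0.99):
    """Empirical probability of recovering D per (46)-(47): a true atom counts as
    recovered if some learned atom has normalized inner product above `thresh`."""
    Dt = D_true / np.linalg.norm(D_true, axis=0, keepdims=True)
    Dh = D_hat / np.linalg.norm(D_hat, axis=0, keepdims=True)
    coherence = np.abs(Dt.T @ Dh)            # entry (m, m') is the iota term in (47)
    return np.mean(coherence.max(axis=1) > thresh)
```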
MSBDL J ℓ DL J ℓ DL ℓ DL KSVD00.20.40.60.81 P r obab ili t y o f S u cc e ss Unimodal, SNR = 30 dBMultimodal, SNR = 30 dBUnimodal, SNR = 20 dBMultimodal, SNR = 20 dBUnimodal, SNR = 10 dBMultimodal, SNR = 10 dB (b)
Fig. 9: Bimodal (9a) and trimodal (9b) synthetic data resultswith one standard deviation error bars. (a)
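To make the protocol above concrete, the following is a minimal NumPy sketch of the data generation and the success measure in (46)-(47). The dictionary dimensions in the example are illustrative placeholders (the exact sizes are not reproduced here), and the helper names are ours rather than part of any released MSBDL code.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_norm_dict(n, m, rng):
    """Gaussian dictionary scaled to unit l2 column norm."""
    D = rng.standard_normal((n, m))
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def shared_support_codes(m, L, s, n_modalities, rng):
    """Sparse codes with a common support across modalities but independent values."""
    supports = [rng.choice(m, size=s, replace=False) for _ in range(L)]
    codes = []
    for _ in range(n_modalities):
        X = np.zeros((m, L))
        for i, supp in enumerate(supports):
            X[supp, i] = rng.standard_normal(s)
        codes.append(X)
    return codes

def add_noise(Y, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR (in dB)."""
    noise = rng.standard_normal(Y.shape)
    scale = np.linalg.norm(Y) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return Y + scale * noise

def recovery_rate(D_true, D_hat, thresh=0.99):
    """Empirical probability of recovering D, cf. (46)-(47)."""
    Dt = D_true / np.linalg.norm(D_true, axis=0, keepdims=True)
    Dh = D_hat / np.linalg.norm(D_hat, axis=0, keepdims=True)
    align = np.abs(Dt.T @ Dh).max(axis=1)  # best match for each true atom
    return float((align > thresh).mean())

# Example: a bimodal dataset with 30 dB and 10 dB modalities (dimensions are illustrative)
n, m, s, L = 20, 50, 5, 1000
D = [unit_norm_dict(n, m, rng) for _ in range(2)]
X = shared_support_codes(m, L, s, 2, rng)
Y = [add_noise(D[j] @ X[j], snr, rng) for j, snr in enumerate([30, 10])]
```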
We compare MSBDL with $\ell_1$DL (using the publicly available SPAMS implementation, http://spams-devel.gforge.inria.fr/downloads.html), K-SVD (∼ronrubin/software.html), J$\ell_0$DL, and J$\ell_1$DL. While code for one of the J$\ell$DL methods was publicly available, it could not be run on any of our Windows or Linux machines, so we used our own implementation; code for the other was not publicly available, so we used our own implementation for it as well. For all algorithms, the batch size was set to $L$.

For the bimodal setting, the parameters $\sigma_1$ and $\sigma_2$ were set to fixed constants with $\sigma_1 < \sigma_2$; for the trimodal setting, $\sigma_1 < \sigma_2 < \sigma_3$ were likewise fixed. In both cases, we fixed $\alpha_\sigma$ and $\sigma_\infty$, where the choice of $\sigma_\infty$ corresponds to the lowest candidate $\lambda$ in the cross-validation procedure for competing algorithms. It was experimentally determined that MSBDL is relatively insensitive to the choice of $\sigma_j$ as long as it lies within a broad range, thus obviating the need to cross-validate these parameters. The regularization parameters $\lambda$ in (1) and in (3) were selected by a grid search over four candidate values, and both K-SVD and J$\ell_0$DL were given the true $s$. The remaining weighting parameter in (3) was held fixed for J$\ell_1$DL across all experiments because the objective function in (3) depends only on the relative weighting of the modalities. All algorithms were run until convergence.

The bimodal and trimodal dictionary recovery results are shown in Figs. 9a and 9b, respectively. For unimodal data, all of the algorithms except J$\ell_0$DL and K-SVD recover the true dictionary almost perfectly when the SNR is 30 dB. All of the tested algorithms perform relatively well on the 20 dB data and poorly on the 10 dB data, although MSBDL outperforms the other tested methods in these scenarios. In the multimodal scenario, the proposed method clearly distinguishes itself from the other methods tested. For trimodal data, MSBDL not only recovers the 30 dB data dictionary almost perfectly, but also achieves high accuracy on the 20 dB and 10 dB data dictionaries, outperforming the next best method by a wide margin on the 10 dB recovery task. (Throughout this work, we report improvements to the probability of success or the classification rate in absolute terms.) The J$\ell$DL variants were able to capture some of the multimodal information when learning the higher-SNR dictionaries, but their accuracy in recovering the 10 dB data dictionary remains low. Similar trends can be seen in the bimodal results.

Fig. 9: Bimodal (9a) and trimodal (9b) synthetic data results with one standard deviation error bars.

Next, we evaluate the performance of the MSBDL algorithms in Table III. We repeat the bimodal experiment and compare the proposed methods with J$\ell_1$DL, which is the only competing multimodal dictionary learning algorithm that has a stochastic version. The dictionary recovery results are shown in Fig. 10. The results show that J$\ell_1$DL is not able to recover any part of either the 30 dB or the 10 dB dataset dictionaries. In terms of the asymptotic performance as the batch size approaches $L$, MSBDL-1 exhibits negligible bias on both datasets, whereas the other MSBDL flavors incur a small bias, especially on the 10 dB dataset. On the other hand, it is interesting that MSBDL-2 dramatically outperforms MSBDL-1 for smaller batch sizes, which is unexpected since MSBDL-2 performs approximate sufficient statistic computations. The poor performance of MSBDL-3 and MSBDL-4 suggests that these algorithms should be considered only in extremely memory-constrained scenarios.

Fig. 10: Bimodal synthetic data results using stochastic learning as a function of batch size (100 to 1000) for the 30 dB (10a) and 10 dB (10b) modalities.

Finally, we report on the performance of the proposed annealing strategy for $\sigma_j$. For the bimodal dataset, one expects $\sigma_1$ to converge to a smaller value than $\sigma_2$. Fig. 12 shows a histogram of the values to which $\sigma_j$ converges. The results align with expectations and lend experimental validation to the annealing strategy. We observe the same trend for MSBDL-1 with batch sizes less than $L$.

Fig. 12: Histogram of $\sigma_j$ at convergence for a bimodal dataset consisting of 30 dB and 10 dB modalities.
To validate the performance of MSBDL using the atom-to-subspace model, we run a number of synthetic data experiments. In all cases, we use $J = 2$ and $L = 1000$. We use MSBDL-1 with a batch size of 500 to highlight that the algorithm works in incremental EM mode. We simulate four scenarios, summarized in Table V. For each scenario, we first generate the elements of the ground-truth dictionaries by sampling from a zero-mean Gaussian distribution and normalizing the resulting dictionaries to have unit $\ell_2$ column norm. We then set $T^k = \{k, k + M_1\}$ if $k + M_1 \leq M_2$ and $\{k\}$ otherwise. This choice of $\{T^k\}_{k \in [K]}$ represents the most uniform assignment of columns of $D_2$ to columns of $D_1$. We then generate $x_1^i$ by randomly selecting $s = 5$ indices and drawing the non-zero entries from a zero-mean Gaussian distribution. We use $\{T^k\}_{k \in [K]}$ to find the support of $x_2^i$ and draw its non-zero entries from a zero-mean Gaussian distribution. In order to assess the performance of the learning algorithm, we must first define the concept of distance between multi-dimensional subspaces. We follow [53] and compute the distance between $D_2[:, T^k]$ and $\hat{D}_2$ using
$$\iota\left(D_2[:, T^k], \hat{D}_2\right) = \max_{1 \leq k' \leq K} \sqrt{\left| V_1^T V_2 V_2^T V_1 \right|},$$
where $D_2[:, T^k]$ denotes the columns of $D_2$ indexed by $T^k$, and $V_1$, $V_2$ denote orthonormal bases for $D_2[:, T^k]$ and $\hat{D}_2[:, T^{k'}]$, respectively. We then define the distance between $D_2$ and $\hat{D}_2$ using the two quantities
$$\vartheta_1\left(D_2, \hat{D}_2\right) = c_1 \sum_{k \in \{k : |T^k| = 1\}} \left[ \iota\left(D_2[:, T^k], \hat{D}_2\right) > 0.99 \right]$$
$$\vartheta_2\left(D_2, \hat{D}_2\right) = c_2 \sum_{k \in \{k : |T^k| > 1\}} \left[ \iota\left(D_2[:, T^k], \hat{D}_2\right) > 0.99 \right]$$
where $c_1 = |\{k : |T^k| = 1\}|^{-1}$ and $c_2 = |\{k : |T^k| > 1\}|^{-1}$. We use MSBDL-1 to learn $\hat{D}$, with accuracy results reported in Table V. A sketch of the subspace-alignment computation used for this evaluation is given below.
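The following minimal sketch computes the subspace alignment defined above, assuming the determinant-based formula of [53] with orthonormal bases obtained by QR. The function names, and the convention of scoring each ground-truth group against every candidate group of the learned dictionary, are our own.

```python
import numpy as np

def subspace_alignment(A, B):
    """Alignment between span(A) and span(B): sqrt(|V1^T V2 V2^T V1|)."""
    V1, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    V2, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    M = V1.T @ V2
    return float(np.sqrt(abs(np.linalg.det(M @ M.T))))

def iota_group(D_true, T_k, D_hat, groups):
    """iota(D_true[:, T_k], D_hat): best alignment over all candidate groups of D_hat."""
    A = D_true[:, T_k]
    return max(subspace_alignment(A, D_hat[:, g]) for g in groups)

def theta_scores(D_true, D_hat, groups, thresh=0.99):
    """theta_1 over singleton groups and theta_2 over multi-atom groups."""
    singles = [T for T in groups if len(T) == 1]
    multis = [T for T in groups if len(T) > 1]
    t1 = np.mean([iota_group(D_true, T, D_hat, groups) > thresh for T in singles]) if singles else None
    t2 = np.mean([iota_group(D_true, T, D_hat, groups) > thresh for T in multis]) if multis else None
    return t1, t2
```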
Histograms of the results are provided in the Supplementary Materials for a higher-resolution perspective on the performance of the proposed approach. Note that Table V reports the accuracy of MSBDL-1 in recovering both the atoms and the subspaces of $D_2$. Test case A simulates the scenario where both modalities have a high SNR and $M_2 = 2M_1$. In other words, test case A tests whether MSBDL-1 is able to learn in the atom-to-subspace model without the complications that arise from added noise. The results show that MSBDL-1 effectively learns both the atoms of $D_1$ and the subspaces comprising $D_2$. Test case B simulates the scenario where $|T^k| = 1$ for some $k$ but not for others. In effect, test case B tests the pruning strategy described in Section V-C and summarized in Fig. 7. The results show that the pruning strategy is effective and allows MSBDL-1 to learn both the atoms of $D_1$ and the atoms and subspaces comprising $D_2$. Test case C is identical to test case B, but with noise added to modality 2. The results show that MSBDL-1 still effectively recovers $D_1$, but there is a drop in performance with respect to recovering the atoms of $D_2$ and a significant drop in recovering the subspaces of $D_2$. The histogram in Fig. 11a shows that the distribution of $\{\iota(D_2[:, T^k], \hat{D}_2)\}_{k \in [K]}$ is concentrated just below 0.99 for test case C, suggesting that an alignment threshold of 0.99 is simply too strict in this case. Finally, test case D demonstrates that MSBDL-1 exhibits robust performance when the modality comprising the roots of the tree is noisy.

Fig. 11: Histograms of $\iota(D_2[:, T^k], \hat{D}_2)$ for test case C in Table V (11a) and $\iota(D_2[:, m], \hat{D}_2)$ for test case C in Table VI (11b).

TABLE V: Recovery results (in percent) using the atom-to-subspace model, reporting $\vartheta(D_1, \hat{D}_1)$, $\vartheta_1(D_2, \hat{D}_2)$, and $\vartheta_2(D_2, \hat{D}_2)$ for test cases A ($M_1 = 50$, $M_2 = 100$, SNRs 30/30 dB), B ($M_1 = 50$, $M_2 = 60$, 30/30 dB), C ($M_1 = 50$, $M_2 = 60$, 30/10 dB), and D ($M_1 = 50$, $M_2 = 60$, 10/30 dB).

To provide experimental evidence for the fact that the atom-to-subspace model is agnostic to the atoms of $D_2$, as discussed in Section V-A, we show the ability of MSBDL-1 to recover the atoms of $D_2$ for test case A in Fig. 13. The results show that MSBDL-1 is not able to recover the individual atoms of $D_2$.

Fig. 13: Histogram of $\iota(D_2[:, m], \hat{D}_2)$ for all $m$ for test case A.

Next, we present experimental results for dictionary learning under the hierarchical model, using the same setup as for the atom-to-subspace model. We simulate four scenarios, summarized in Table VI. To evaluate the performance of the proposed approach, we measure how well it is able to recover the atoms of $D_1$ and $D_2$, where we distinguish between the atoms of $D_2$ corresponding to $|T^k| = 1$ and $|T^k| > 1$ using
$$\varrho_1\left(D_2, \hat{D}_2\right) = c_1 \sum_{k \in \{k : |T^k| = 1\}} \left[ \iota\left(D_2[:, T^k], \hat{D}_2\right) > 0.99 \right]$$
$$\varrho_2\left(D_2, \hat{D}_2\right) = c_2 \sum_{k \in \{k : |T^k| > 1\},\, m \in T^k} \left[ \iota\left(D_2[:, m], \hat{D}_2\right) > 0.99 \right]$$
where $c_1 = |\{k : |T^k| = 1\}|^{-1}$ and $c_2 = |\{k : |T^k| > 1\}|^{-1}$. The recovery results are reported in Table VI. Histograms of the results are provided in the Supplementary Material. Test case A demonstrates that MSBDL-1 is able to learn the atoms of $D_1$ and $D_2$ in the low-noise scenario with $M_2 = 2M_1$. Test case B shows that MSBDL-1 is able to learn the atoms of $D_1$ and $D_2$ for $M_1 < M_2 < 2M_1$, highlighting that the pruning strategy in Fig. 7 is also effective for the hierarchical sparsity model. Test case C adds a considerable amount of noise to the modality occupying the leaves of the tree. Although the recovery results in Table VI suggest that MSBDL-1 does not perform well in recovering $D_2$ in this scenario, the histogram in Fig. 11b shows robust performance. Finally, test case D shows the scenario where a large amount of noise is added to the modality occupying the roots of each subtree.

TABLE VI: Recovery results (in percent) using the hierarchical model, reporting $\varrho(D_1, \hat{D}_1)$, $\varrho_1(D_2, \hat{D}_2)$, and $\varrho_2(D_2, \hat{D}_2)$ for the same four test-case configurations as in Table V.

B. Photo Tweet Dataset Classification
We validate the performance of TD-MSBDL on the Photo Tweet dataset [54]. The Photo Tweet dataset consists of 603 tweets covering 21 topics. Each tweet contains text and an image with an associated binary label indicating its sentiment. The dataset consists of 5 partitions. We use these partitions to perform 5 rounds of leave-one-out cross-validation, where, during each round, we use one partition as the test set, one as the validation set, and the remaining partitions as the training set.
TABLE VII: Photo tweet dataset classification accuracy (%) for image and text features using TD-MSBDL-2, TD-J$\ell_1$DL, TD-$\ell_1$DL, and D-KSVD.

TABLE VIII: Photo tweet dataset classification accuracy (%) using TD learning and priors from Section V. Our convention is to designate text as modality 1 and images as modality 2.

Prior:       One-to-one (8)   Hierarchical (28)   Atom-to-subspace (23)
Algorithm:   TD-MSBDL-2       TD-MSBDL-1          TD-MSBDL-1
Images:      65.6             69.2                69.5
Text:        76.0             74.6                74.3
For each round, we process the images by first extracting a bag of SURF features [55] from the training set using the MATLAB computer vision system toolbox. We then encode the training, validation, and test sets using the learned bag of features, yielding a 500-dimensional representation for each image [56]. Finally, we compute the mean of each dimension across the training set, center the training, validation, and test sets using the computed means, and perform 10-component PCA to generate a 10-dimensional image representation. We process the text data by first building a 2688-dimensional bag of words from the training set using the scikit-learn Python library. We then encode the training, validation, and test sets using the learned bag of words, normalizing the resulting representations by the number of words in the tweet. We then center the data and perform 10-component PCA to yield a 10-dimensional text representation. A sketch of this text-preprocessing pipeline is given below.
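As an illustration of the text pipeline just described, the sketch below uses scikit-learn's CountVectorizer and PCA. The paper only states that scikit-learn was used, so the specific classes, parameter names, and helper functions here are our assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

def fit_text_features(train_texts, n_components=10, vocab_size=2688):
    """Bag of words -> per-tweet length normalization -> centering -> PCA."""
    vectorizer = CountVectorizer(max_features=vocab_size)
    counts = vectorizer.fit_transform(train_texts).toarray().astype(float)
    counts /= np.maximum(counts.sum(axis=1, keepdims=True), 1)  # normalize by word count
    mean = counts.mean(axis=0)
    pca = PCA(n_components=n_components).fit(counts - mean)
    return vectorizer, mean, pca

def transform_text_features(texts, vectorizer, mean, pca):
    """Apply the fitted bag of words, normalization, centering, and PCA to new tweets."""
    counts = vectorizer.transform(texts).toarray().astype(float)
    counts /= np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return pca.transform(counts - mean)
```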
We run TD-MSBDL using incremental EM and approximate posterior sufficient statistic computations, referring to the resulting algorithm as TD-MSBDL-2 in accordance with the taxonomy in Table III. Our convention in this experiment is to refer to the text and image data as modalities 1 and 2, respectively. We use $M_j = 40 \;\forall j$, $L = 200$, fixed values of $\sigma_1$, $\sigma_2$, $\beta_j$, $\sigma_\infty$, $\alpha_\sigma$, $\beta_\infty$, and $\alpha_\beta$, and $T_V = 500$. We compare TD-MSBDL with several unimodal and multimodal approaches. We use TD-$\ell_1$DL and D-KSVD to learn classifiers for images and text using unimodal data. For TD-$\ell_1$DL, we use the validation set to optimize $\lambda$ over a grid of four candidate values. We also compare with TD-J$\ell_1$DL trained on the multimodal data, using the same validation approach as for TD-$\ell_1$DL. In all cases, we run training for a fixed maximum number of iterations.

The classification results are shown in Table VII. Comparing TD-MSBDL with the unimodal methods (i.e., TD-$\ell_1$DL and D-KSVD), the results show that TD-MSBDL achieves higher performance for both feature types. Moreover, it is interesting that TD-J$\ell_1$DL performs worse than TD-$\ell_1$DL, suggesting that it is not capable of capturing the multimodal relationships from which TD-MSBDL benefits.

Finally, we show the efficacy of the priors in Section V in classifying the Photo Tweet dataset. The goal is to show that, by allowing the number of atoms of the image and text dictionaries to differ, the atom-to-subspace and hierarchical sparsity priors lead to superior classification performance. We begin by extracting features from the text and image data as before, with the exception that we use a different number of PCA components to represent the images. We then set $M_1$, the text data dictionary size, and $M_2$, the image dictionary size, to different values corresponding to the same oversampling factor $M_j / N_j$ for both modalities. We run the TD-MSBDL algorithm with the atom-to-subspace and hierarchical sparsity priors. For the atom-to-subspace prior, the only required modification is to change the update rule of $\gamma^i$ to (15). For the hierarchical sparsity prior, the $\gamma^i$, $D_j$, and $W_j$ update rules are modified to (34), (35), and
$$(W_j)^{t+1} = H\left(U_{D_j}^T\right)^T S_j \left( S_j^T \left( U_{D_j}^T \left(U_{D_j}^T\right)^T + \sum_{i \in [L]} \Sigma_{D,i,\hat{x},j}^T \right) S_j \right)^{-1},$$
respectively. (For the atom-to-subspace prior, we also found it necessary to introduce a post-processing step: we output the $W_j^t$ corresponding to the maximum measured classification accuracy on the validation set during training.) Because there are significant dependencies among the elements of $x_j$ a priori, we use exact sufficient statistic computation, referring to the resulting algorithm as TD-MSBDL-1. The classification results are presented in Table VIII, where the TD-MSBDL-2 results with the one-to-one prior from Section VIII-B are shown for reference. The results show a significant improvement in image classification. Although text classification deteriorates slightly, the text classification rate for both the atom-to-subspace and hierarchical priors is still higher than that of the competing methods in Table VII. Note that this type of experiment, where a different number of dictionary atoms is used for the image and text modalities, is not possible for any of the other dictionary learning approaches considered in this work.

C. CIFAR10
The CIFAR10 dataset consists of 32 × 32 images from 10 classes [57]. In this experiment, we extract patches from images in the dataset to form $Y_1$. We then form $Y_2$ by adding noise to $Y_1$, i.e. $y_2^i = y_1^i + \xi$, $\xi \sim \mathcal{N}(0, \sigma_\xi^2 I)$. Our goal is to learn representations $\{D_j\}_{j \in [J]}$ for $\{Y_j\}_{j \in [J]}$, where $J = 2$, and to evaluate the quality of those representations. Since ground-truth dictionaries do not exist in this scenario, we evaluate the learned $\{D_j\}_{j \in [J]}$ by measuring how well they can denoise test data.

We begin by learning $\{D_j\}_{j \in [J]}$ from the training data. Let $\{X_j\}_{j \in [J]}$ denote the sparse codes for the training data $\{Y_j\}_{j \in [J]}$ under $\{D_j\}_{j \in [J]}$ and let $P$ denote the linear mapping from $X_2$ to $X_1$ satisfying $\arg\min_P \|X_1 - P X_2\|_F$. Note that $X_1 = X_2$ for $\ell_1$DL and K-SVD, such that $P = I$ for these methods. Given a noisy test point $y_{test,2}$, we first compute the sparse code $x_{test,2}$ using $D_2$, we then form $\hat{x}_{test,1} = P x_{test,2}$, and, finally, we form the denoised image patch using $\hat{y}_{test,1} = D_1 \hat{x}_{test,1}$. A sketch of this denoising pipeline is given below.
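A minimal sketch of the denoising pipeline follows. The sparse coder is a stand-in (orthogonal matching pursuit from scikit-learn) rather than the coder actually used by each method, and the function names and sparsity level are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def fit_code_map(X2, X1):
    """Least-squares map P with X1 ~= P @ X2 (P = I when the codes are shared)."""
    # Solve min_P ||X1 - P X2||_F  =>  P = X1 X2^T (X2 X2^T)^+
    return X1 @ X2.T @ np.linalg.pinv(X2 @ X2.T)

def denoise_patch(y_noisy, D2, D1, P, n_nonzero=5):
    """Code the noisy patch under D2, map the code to modality 1, decode with D1."""
    x2 = orthogonal_mp(D2, y_noisy, n_nonzero_coefs=n_nonzero)  # stand-in sparse coder
    x1_hat = P @ x2
    return D1 @ x1_hat
```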
We drew patches from one of the classes in the CIFAR10 dataset and corrupted them with 10 dB SNR noise to create the training set $\{Y_j\}_{j \in [J]}$. We then selected validation and test images from the same class (the training, validation, and test sets were all mutually exclusive), extracted overlapping patches from each image to form $Y_{val,1}$ and $Y_{test,1}$, and added 10 dB SNR noise to generate $Y_{val,2}$ and $Y_{test,2}$. We set $M_j = 196 \;\forall j$ and learn $\{D_j\}_{j \in [J]}$ using each of the competing algorithms, performing cross-validation on the regularization parameters of the competing algorithms (a single $\lambda$ for $\ell_1$DL and one J$\ell$DL variant, and per-modality parameters $\{\lambda_j\}_{j \in [J]}$ for the other) to maximize the output SNR of the validation set patches $\hat{Y}_{val}$. For MSBDL, we set $L = 400$ and fixed values of $\sigma_1$, $\sigma_2$, $\sigma_\infty$, and $\alpha_\sigma$. During the denoising stage for MSBDL, we form $\{X_j\}_{j \in [J]}$ by solving (4) with $\{D_j\}_{j \in [J]}$ fixed to the dictionaries learned by MSBDL and cross-validate over $\lambda$. We emphasize that hyperparameter tuning is not necessary to learn $\{D_j\}_{j \in [J]}$ using MSBDL, but is useful during the denoising stage because the dictionary learning task is not necessarily aligned with the denoising task. In particular, since cross-validation was only necessary after $\{D_j\}_{j \in [J]}$ were learned, the learning complexity analysis in Section IV-C, visualized in Fig. 5, still applies to this experiment.

Denoising performance on the test set is shown in Table IX. MSBDL performed best across all algorithms tested. These results indicate that MSBDL was able to leverage information from both modalities to learn a better representation for $Y_2$ compared to the competing algorithms. On the whole, multimodal learning algorithms outperformed the unimodal algorithms on this task.

TABLE IX: CIFAR10 denoising results (output SNR in dB) for MSBDL-1, the two J$\ell$DL variants, $\ell_1$DL, and K-SVD; the input SNR is 10 dB.
The MIR Flickr dataset consists of images and associated annotations, with each image annotated by up to 38 tags [58]. For this task, we use 3 kinds of features to represent a given image: 100-dimensional dense hue, 1000-dimensional Harris SIFT, and 4096-dimensional HSV, comprising modalities one, two, and three, respectively [59]. We use PCA to reduce the feature dimensionality to $N_1 = 20$, $N_2 = 40$, and $N_3 = 60$ for modalities one through three. The choice of $\{N_j\}_{j \in [J]}$ serves to reduce the data dimensionality while preserving the relative cardinality of the modalities, i.e. $N_3 > N_2 > N_1$. Our training and validation sets both contain 2000 images, while the test set contains 12500 images.

Since each image is annotated by multiple tags, each label vector $h^i$ contains a 1 for all of the tags associated with the $i$'th image, and we denote the total number of tags for image $i$ as $k_i$. In order to evaluate how well each method is able to predict tags from images, we compute $\hat{h}^i = W_j x_j^i$, set all of the elements except the top $k_i$ to 0, and set the top $k_i$ elements to 1. We then compute the recall for image $i$ using $1 - \|h^i - \hat{h}^i\|_0 / (2\|h^i\|_0)$, which essentially measures the fraction of the true tags for image $i$ appearing in the top $k_i$ elements of $W_j x_j^i$. A sketch of this evaluation metric is given below.
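For concreteness, the following sketch computes the per-image recall described above. It assumes, as in the text, that the predicted label vector keeps exactly the top $k_i$ scores, so the overlap form used here coincides with $1 - \|h^i - \hat{h}^i\|_0 / (2\|h^i\|_0)$; the function name is ours.

```python
import numpy as np

def topk_recall(h_true, scores):
    """Fraction of true tags among the top-k_i entries of the predicted scores,
    where k_i is the number of true tags."""
    k = int(h_true.sum())
    if k == 0:
        return 1.0
    top = np.argsort(scores)[::-1][:k]        # indices of the k largest scores
    h_hat = np.zeros_like(h_true)
    h_hat[top] = 1
    return float((h_true * h_hat).sum() / k)  # equals 1 - ||h - h_hat||_0 / (2 ||h||_0)
```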
We set the dictionary size to $M_j = 4N_j$ for TD-J$\ell_1$DL, TD-$\ell_1$DL, and D-KSVD. We learn dictionaries for each modality in a multimodal fashion using TD-MSBDL-1 and TD-J$\ell_1$DL and in a unimodal fashion using TD-$\ell_1$DL and D-KSVD, identical to the setup in Section VIII-B. Since TD-MSBDL can assign different numbers of dictionary elements to different modalities, we choose the atom-to-subspace prior detailed in Section V-A and set $M_1 = 4N_1$ and $M_2 = M_3 = 2M_1$, where each atom of $D_1$ is grouped with two atoms from $D_2$ and two atoms from $D_3$. TD-MSBDL is the only algorithm capable of modeling dictionaries with varying size across modalities, which is useful in this application given that the feature size varies drastically across modalities. We perform cross-validation to set $\lambda$ for TD-J$\ell_1$DL and TD-$\ell_1$DL using the same settings as in Section VIII-B. We set $s = 10$ for D-KSVD and run all algorithms for a fixed maximum number of iterations and a fixed batch size. For MSBDL, $\sigma_1 = \sigma_2 = \sigma_3$, $\alpha_\sigma$, $\beta_1 = \beta_2 = \beta_3$, $\beta_\infty$, and $\alpha_\beta$ were set to fixed constants.

Table X reports the average recall over all of the images in the test set for each of the algorithms tested. TD-MSBDL performs best across all feature types. In particular, TD-MSBDL outperforms the unimodal learning algorithms, whereas TD-J$\ell_1$DL performs worse than the unimodal algorithms for the Harris SIFT and HSV feature types.

TABLE X: Average recall on the MIR Flickr dataset for TD-MSBDL-1, TD-J$\ell_1$DL, TD-$\ell_1$DL, and D-KSVD on the dense hue, Harris SIFT, and HSV features.
E. Discussion
The results in this section have shown that MSBDL is able to outperform competing dictionary learning methods on a number of tasks, including dictionary recovery, sentiment classification, image denoising, and image annotation. We conclude the results section by reiterating the main points of distinction between MSBDL and competing approaches, which may not be evident from inspecting performance metrics alone.

For all of the experiments performed in this section, MSBDL performed automatic hyperparameter tuning, whereas hyperparameters were tuned using a grid search for all of the competing algorithms. (The exception was the CIFAR10 denoising experiment, where we performed cross-validation for MSBDL, but only after having learned $\{D_j\}_{j \in [J]}$, such that the argument in this section still applies.) This highlights the fact that the proposed approach offers superior performance at a discount, so to speak, in terms of learning complexity (LC). The same LC benefits apply to TD-MSBDL, i.e. the LC of TD-MSBDL is identical to that of MSBDL.

None of the competing methods, including the MSBDL variant in [5], are capable of learning models where the number of dictionary elements varies across modalities. This was particularly useful in the experiments on the Photo Tweet and MIR Flickr datasets, where the data dimensionality varied significantly across modalities.
IX. CONCLUSION
We have detailed a sparse multimodal dictionary learning algorithm. Our approach incorporates the main features of existing methods, which establish a correspondence between the elements of the dictionaries for each modality, while addressing the major drawbacks of previous algorithms. Our method enjoys the theoretical guarantees and superior performance associated with the sparse Bayesian learning framework.

REFERENCES
[1] M. Cha, Y. Gwon, and H. Kung, “Multimodal sparse representation learning and applications,” arXiv preprint arXiv:1511.06238, 2015.
[2] S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Joint sparse representation for robust multimodal biometrics recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence ,vol. 36, no. 1, pp. 113–126, Jan 2014.[3] J. C. Caicedo and F. A. Gonz´alez, “Multimodal fusion for image retrievalusing matrix factorization,” in
Proceedings of the 2nd ACM internationalconference on multimedia retrieval . ACM, 2012, p. 56.[4] Y. Ding and B. D. Rao, “Joint dictionary learning and recovery algo-rithms in a jointly sparse framework,” in . IEEE, 2015, pp. 1482–1486.[5] I. Fedorov, B. D. Rao, and T. Q. Nguyen, “Multimodal sparse bayesiandictionary learning applied to multimodal data classification,” in , March 2017, pp. 2237–2241.[6] D. L. Donoho, M. Elad, and V. N. Temlyakov, “Stable recovery ofsparse overcomplete representations in the presence of noise,”
IEEETransactions on Information Theory , vol. 52, no. 1, pp. 6–18, Jan 2006.[7] M. Aharon, M. Elad, and A. Bruckstein, “k -svd: An algorithm fordesigning overcomplete dictionaries for sparse representation,”
IEEETransactions on Signal Processing , vol. 54, no. 11, pp. 4311–4322, Nov2006.[8] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution viasparse representation,”
Image Processing, IEEE Transactions on , vol. 19,no. 11, pp. 2861–2873, 2010.[9] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learningfor sparse coding,” in
Proceedings of the 26th annual internationalconference on machine learning . ACM, 2009, pp. 689–696.[10] Y. Gwon, W. Campbell, K. Brady, D. Sturim, M. Cha, and H. Kung,“Multimodal sparse coding for event detection,”
NIPS MMML , 2015.[11] J. Bergstra and Y. Bengio, “Random search for hyper-parameter opti-mization,”
Journal of Machine Learning Research , vol. 13, no. Feb, pp.281–305, 2012.[12] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. K´egl, “Algorithmsfor hyper-parameter optimization,” in
Advances in Neural InformationProcessing Systems , 2011, pp. 2546–2554.[13] Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent k-svd: Learning adiscriminative dictionary for recognition,”
Pattern Analysis and MachineIntelligence, IEEE Transactions on , vol. 35, no. 11, pp. 2651–2664,2013.[14] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matchingpursuit: Recursive function approximation with applications to waveletdecomposition,” in
Signals, Systems and Computers, 1993. 1993 Con-ference Record of The Twenty-Seventh Asilomar Conference on . IEEE,1993, pp. 40–44.[15] D. Baron, M. F. Duarte, M. B. Wakin, S. Sarvotham, and R. G. Baraniuk,“Distributed compressive sensing,” arXiv preprint arXiv:0901.3403 ,2009.[16] S. Bahrampour, N. M. Nasrabadi, A. Ray, and W. K. Jenkins, “Mul-timodal task-driven dictionary learning for image classification,”
IEEETransactions on Image Processing , vol. 25, no. 1, pp. 24–38, Jan 2016.[17] M. Zorzi and R. Sepulchre, “Ar identification of latent-variable graphicalmodels,”
IEEE Transactions on Automatic Control , vol. 61, no. 9, pp.2327–2340, 2016.[18] D. Alpago, M. Zorzi, and A. Ferrante, “Identification of sparse reciprocalgraphical models,”
IEEE control systems letters , vol. 2, no. 4, pp. 659–664, 2018.[19] N. Parikh, S. Boyd et al. , “Proximal algorithms,”
Foundations andTrends R (cid:13) in Optimization , vol. 1, no. 3, pp. 127–239, 2014.[20] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEETransactions on Pattern Analysis and Machine Intelligence , vol. 34,no. 4, pp. 791–804, 2012. [21] Q. Zhang and B. Li, “Discriminative k-svd for dictionary learning inface recognition,” in Computer Vision and Pattern Recognition (CVPR),2010 IEEE Conference on . IEEE, 2010, pp. 2691–2698.[22] M. Girolami, “A variational method for learning sparse and overcompleterepresentations,”
Neural computation , vol. 13, no. 11, pp. 2517–2532,2001.[23] M. E. Tipping, “Sparse bayesian learning and the relevance vectormachine,”
The journal of machine learning research , vol. 1, pp. 211–244, 2001.[24] D. P. Wipf and B. D. Rao, “Sparse bayesian learning for basis selection,”
Signal Processing, IEEE Transactions on , vol. 52, no. 8, pp. 2153–2164,2004.[25] I. Fedorov, R. Giri, B. D. Rao, and T. Q. Nguyen, “Robust bayesianmethod for simultaneous block sparse signal recovery with applicationsto face recognition,” in , Sept 2016, pp. 3872–3876.[26] I. Fedorov, A. Nalci, R. Giri, B. D. Rao, T. Q. Nguyen, andH. Garudadri, “A unified framework for sparse non-negative leastsquares using multiplicative updates and the non-negative matrixfactorization problem,”
Signal Processing
IEEE Transactions on Signal Processing , vol. 66,no. 12, pp. 3124–3139, 2018.[28] D. P. Wipf and B. D. Rao, “An empirical bayesian strategy for solvingthe simultaneous sparse approximation problem,”
Signal Processing,IEEE Transactions on , vol. 55, no. 7, pp. 3704–3716, 2007.[29] D. P. Wipf, B. D. Rao, and S. Nagarajan, “Latent variable bayesianmodels for promoting sparsity,”
Information Theory, IEEE Transactionson , vol. 57, no. 9, pp. 6236–6255, 2011.[30] J. A. Palmer, “Variational and scale mixture representations of non-gaussian densities for estimation in the bayesian linear model: Sparsecoding, independent component analysis, and minimum entropy seg-mentation,” 2006.[31] R. Giri and B. Rao, “Type I and Type II Bayesian Methods for SparseSignal Recovery Using Scale Mixtures,”
IEEE Transactions on SignalProcessing , vol. 64, no. 13, pp. 3418–3428, July 2016.[32] D. J. MacKay, “Bayesian interpolation,”
Neural computation , vol. 4,no. 3, pp. 415–447, 1992.[33] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihoodfrom incomplete data via the em algorithm,”
Journal of the royalstatistical society. Series B (methodological) , pp. 1–38, 1977.[34] A. Boisbunon, “The class of multivariate spherically symmetric distri-butions,” 2012.[35] T. Eltoft, T. Kim, and T.-W. Lee, “On the multivariate laplace distri-bution,”
IEEE Signal Processing Letters , vol. 13, no. 5, pp. 300–303,2006.[36] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity byreweighted 1 minimization,”
Journal of Fourier analysis and applica-tions , vol. 14, no. 5-6, pp. 877–905, 2008.[37] D. J. MacKay, “Hyperparameters: optimize, or integrate out?”
Funda-mental theories of physics , vol. 62, pp. 43–60, 1996.[38] R. M. Neal and G. E. Hinton, “A view of the em algorithm that jus-tifies incremental, sparse, and other variants,” in
Learning in graphicalmodels . Springer, 1998, pp. 355–368.[39] S. D. Babacan, R. Molina, and A. K. Katsaggelos, “Variational bayesiansuper resolution,”
IEEE Transactions on Image Processing , vol. 20,no. 4, pp. 984–999, 2011.[40] Y. Marnissi, Y. Zheng, E. Chouzenoux, and J.-C. Pesquet, “A variationalbayesian approach for image restoration. application to image deblurringwith poisson-gaussian noise,”
IEEE Transactions on ComputationalImaging , 2017.[41] B. L. Sturm and M. G. Christensen, “Comparison of orthogonal match-ing pursuit implementations,” in . IEEE, 2012, pp. 220–224.[42] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, “Simultaneous sparseapproximation via greedy pursuit,” in
Proceedings.(ICASSP’05). IEEEInternational Conference on Acoustics, Speech, and Signal Processing,2005. , vol. 5. IEEE, 2005, pp. v–721.[43] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al. , “Least angleregression,”
The Annals of statistics , vol. 32, no. 2, pp. 407–499, 2004.[44] S. Kim and E. P. Xing, “Tree-guided group lasso for multi-task regres-sion with structured sparsity,” 2010. [45] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selectionwith sparsity-inducing norms,”
Journal of Machine Learning Research ,vol. 12, no. Oct, pp. 2777–2824, 2011.[46] P. Zhao, G. Rocha, and B. Yu, “The composite absolute penalties familyfor grouped and hierarchical variable selection,”
The Annals of Statistics ,pp. 3468–3497, 2009.[47] R. Jenatton, J. Mairal, G. Obozinski, and F. R. Bach, “Proximal methodsfor sparse hierarchical dictionary learning.” in
ICML , no. 2010. Citeseer,2010, pp. 487–494.[48] G. Zhang, T. D. Roberts, and N. Kingsbury, “Image deconvolution usingtree-structured bayesian group sparse modeling,” in
Image Processing(ICIP), 2014 IEEE International Conference on . IEEE, 2014, pp. 4537–4541.[49] G. Zhang and N. Kingsbury, “Fast l0-based image deconvolution withvariational bayesian inference and majorization-minimization,” in
GlobalConference on Signal and Information Processing (GlobalSIP), 2013IEEE . IEEE, 2013, pp. 1081–1084.[50] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness ofovercomplete dictionaries, and a practical way to retrieve them,”
Linearalgebra and its applications , vol. 416, no. 1, pp. 48–67, 2006.[51] D. L. Donoho and M. Elad, “Optimally sparse representation in general(nonorthogonal) dictionaries via 1 minimization,”
Proceedings of theNational Academy of Sciences , vol. 100, no. 5, pp. 2197–2202, 2003.[52] L. Yang, J. Fang, and H. Li, “Sparse bayesian dictionary learning witha gaussian hierarchical model,” in . IEEE, 2016,pp. 2564–2568.[53] H. Gunawan, O. Neswan, and W. Setya-Budhi, “A formula for anglesbetween subspaces of inner product spaces,”
Contributions to Algebraand Geometry , vol. 46, no. 2, pp. 311–320, 2005.[54] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scalevisual sentiment ontology and detectors using adjective noun pairs,” in
Proceedings of the 21st ACM international conference on Multimedia .ACM, 2013, pp. 223–232.[55] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robustfeatures,” in
European conference on computer vision . Springer, 2006,pp. 404–417.[56] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visualcategorization with bags of keypoints,” in
Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1, no. 1-22. Prague, 2004, pp. 1–2.
[57] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[58] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge 2007 (VOC2007) results,” 2007.
[59] M. Guillaumin, J. Verbeek, and C. Schmid, “Multimodal semi-supervised learning for image classification,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.