A Bayesian Approach to Block-Term Tensor Decomposition Model Selection and Computation

Paris V. Giampouras, Athanasios A. Rontogiannis, *Member, IEEE*, and Eleftherios Kofidis, *Member, IEEE*
Abstract
The so-called block-term decomposition (BTD) tensor model, especially in its rank-$(L_r,L_r,1)$ version, has recently received increasing attention due to its enhanced ability to represent systems and signals that are composed of blocks of rank higher than one, a scenario encountered in numerous and diverse applications. Its uniqueness and approximation have thus been thoroughly studied. Nevertheless, the challenging problem of estimating the BTD model structure, namely the number of block terms and their individual ranks, has only recently started to attract significant attention. In this work, a Bayesian approach is taken to addressing the problem of rank-$(L_r,L_r,1)$ BTD model selection and computation, based on the idea of imposing column sparsity jointly on the factors and in a hierarchical manner and estimating the ranks as the numbers of factor columns of non-negligible energy. Using variational inference in the proposed probabilistic model results in an iterative algorithm that comprises closed-form updates. Its Bayesian nature completely avoids the task of hyper-parameter tuning, which is ubiquitous in regularization-based methods. Simulation results with synthetic data are reported, which demonstrate the effectiveness of the proposed scheme in terms of both rank estimation and model fitting.
Index Terms
Bayesian inference, block-term tensor decomposition (BTD), rank, sparsity, tensor, variational inference
I. INTRODUCTION
Block-Term Decomposition (BTD) was introduced in [1] as a tensor model that combines the Canonical Polyadic Decomposition (CPD) and the Tucker decomposition (TD), in the sense that it decomposes a tensor into a sum of tensors (block terms) that have low multilinear rank (instead of rank one as in CPD). Hence a BTD can be seen as a constrained TD, with its core tensor being block diagonal (see [1, Fig. 2.3]). It can also be seen as a constrained
P. V. Giampouras is with the Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, MD 21218, USA. E-mail: [email protected]. A. A. Rontogiannis is with the IAASARS, National Observatory of Athens, 152 36 Penteli, Greece. E-mail: [email protected]. E. Kofidis is with the Dept. of Statistics and Insurance Science, University of Piraeus, 185 34 Piraeus, Greece, and the Computer Technology Institute & Press "Diophantus" (CTI), 265 04 Patras, Greece. E-mail: [email protected].
CPD having factors with (some) collinear columns [1]. In a way, BTD lies between the two extremes (in terms of core tensor structure), CPD and TD, and it is interesting to recall the related remark made in [1], namely that "the rank of a higher-order tensor is actually a combination of the two aspects: one should specify the number of blocks and their size". Accurately and efficiently estimating these numbers for a given tensor is the main subject of this work.

Although [1] introduced BTD as a sum of $R$ rank-$(L_r,M_r,N_r)$ terms ($r=1,2,\ldots,R$) in general, the special case of rank-$(L_r,L_r,1)$ BTD has attracted much more attention, because of both its more frequent occurrence in applications and the existence of more concrete and easier to check uniqueness conditions. This work will also focus on this special yet very popular BTD model. Consider a 3rd-order tensor, $\mathcal{X}\in\mathbb{C}^{I\times J\times K}$. Then its rank-$(L_r,L_r,1)$ decomposition is written as
$$\mathcal{X}=\sum_{r=1}^{R}\mathbf{E}_r\circ\mathbf{c}_r,$$
where $\mathbf{E}_r$ is an $I\times J$ matrix of rank $L_r$, $\mathbf{c}_r$ is a nonzero column $K$-vector and $\circ$ denotes the outer product. Clearly, $\mathbf{E}_r$ can be written as a matrix product $\mathbf{A}_r\mathbf{B}_r^{\mathrm{T}}$, with the matrices $\mathbf{A}_r\in\mathbb{C}^{I\times L_r}$ and $\mathbf{B}_r\in\mathbb{C}^{J\times L_r}$ being of full column rank, $L_r$. The decomposition can thus be re-written as
$$\mathcal{X}=\sum_{r=1}^{R}\mathbf{A}_r\mathbf{B}_r^{\mathrm{T}}\circ\mathbf{c}_r.\qquad(1)$$
A schematic diagram of the above is shown in Fig. 1.

Fig. 1. Rank-$(L_r,L_r,1)$ block-term decomposition.

BTD has been successfully used in a wide range of application areas, and its uniqueness and computation have been thoroughly studied (cf. [2] for an extensive review). With the exception of a few recently reported BTD methods, $R$ and $L_r$, $r=1,2,\ldots,R$, are assumed a priori known (and it is commonly assumed that all $L_r$ are equal to $L$, for simplicity). However, unless external information is given (such as in a telecommunications [3] or a hyperspectral image unmixing application with given or estimated ground truth [4]), there is no way to know these values beforehand. Although it has been observed that the unmixing performance of BTD-based blind source separation methods does not strongly depend on the particular values of $R$ and $L_r$, this is not the case in general. Setting $L_r$ too high will not only increase the computational cost but will also risk hindering the interpretation of the results by letting noise/artifact sources interfere with the desired sources. The choice of $R$ is known to be more crucial to the obtained performance: setting it too high may result in source splitting, thus compromising the separation and interpretation of the components. Poor results, with difficult to interpret components, will also be obtained if the model complexity is (significantly) underestimated.
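For concreteness, the model (1) can be synthesized in a few lines of NumPy; the following sketch is purely illustrative, with dimensions and ranks of our own choosing rather than values from this work:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 20, 20, 20               # tensor dimensions (illustrative values)
L = [3, 2, 4]                      # block ranks L_r for R = 3 terms

# X = sum_r (A_r B_r^T) outer c_r, cf. (1)
X = np.zeros((I, J, K))
for L_r in L:
    A_r = rng.standard_normal((I, L_r))   # full column rank with probability 1
    B_r = rng.standard_normal((J, L_r))
    c_r = rng.standard_normal(K)
    X += np.einsum('ij,k->ijk', A_r @ B_r.T, c_r)   # (A_r B_r^T) ∘ c_r
```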
Model selection for BTD, that is, estimating the number of block terms, $R$, and their ranks, $L_r$, $r=1,2,\ldots,R$, is clearly more challenging than in CPD and TD models and has only recently started to be studied; see [2] for an extensive review of the related literature. The most recent contribution of this kind can be found in our work [2], [5], which relies on a regularization of the squared error function with the sum of the Frobenius norms of the factors, reweighted by a diagonal weighting which jointly depends on the factors at two levels: the reweighted norms of $\mathbf{A}\triangleq[\mathbf{A}_1\ \mathbf{A}_2\ \cdots\ \mathbf{A}_R]$ and $\mathbf{B}\triangleq[\mathbf{B}_1\ \mathbf{B}_2\ \cdots\ \mathbf{B}_R]$ are combined and then coupled with the reweighted norm of $\mathbf{C}\triangleq[\mathbf{c}_1\ \mathbf{c}_2\ \cdots\ \mathbf{c}_R]$. This two-level coupling naturally matches the structure of the model in (1), making explicit the different roles of $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$. This way, column sparsity is imposed jointly on the factors and in a hierarchical manner, which allows the ranks to be estimated as the numbers of factor columns of non-negligible energy. Following a block coordinate descent (BCD) solution approach, an alternating hierarchical iterative reweighted least squares (HIRLS) algorithm, called BTD-HIRLS, was developed in [5], which manages to both reveal the ranks and compute the BTD factors at a high convergence rate and low computational cost. Nevertheless, BTD-HIRLS, being a regularization-based method, faces the same challenge that all such methods have to address, namely to appropriately tune the regularization parameter so as to achieve the best possible performance. Although a rough guideline for the parameter selection has been given and utilized in [2], [5] as a reference point for the trial-and-error search, this is still only a rule of thumb, not completely relieving the algorithm from the need to spend resources on searching for the most appropriate parameter value.

To overcome this difficulty, we take in this work an alternative, Bayesian approach, viewing the unknowns as random variables and tackling the problem as one of Bayesian modeling and inference [6]. The idea is again to impose column sparsity jointly on the factors in a hierarchical, two-level manner. This is achieved through a Bayesian hierarchy of priors with a sparsity-inducing effect, which realize the coupling of the columns of $\mathbf{C}$ and the $\mathbf{A}_r$, $\mathbf{B}_r$ blocks at the outer level and that between the columns of corresponding blocks at the inner level. Thus, $R$ is estimated as the number of columns of $\mathbf{C}$ of non-negligible energy, while the $L_r$'s are found similarly from the columns of the $\mathbf{A}_r$, $\mathbf{B}_r$ blocks. Using variational inference [7] in the proposed probabilistic model results in an iterative algorithm that comprises closed-form updates. Its Bayesian nature completely avoids the need for parameter tuning. Simulation results with synthetic data are reported, which demonstrate the effectiveness of the proposed scheme in terms of both rank estimation and model fitting. Bayesian methods for tensor decomposition model selection and computation have already been reported in the literature, with [8], [9], and [10] being recent examples concerning TD, CPD, and Tensor Train (TT) decomposition, respectively. To the best of our knowledge, however, the present work is the first of its kind for rank-$(L_r,L_r,1)$ decomposition.

The rest of this preprint is organized as follows. The adopted notation is described in the following subsection. The problem is stated in Section II. Section III describes the proposed probabilistic model. Its variational approximation and the resulting algorithm are presented in Section IV. Section V reports and discusses the simulation results. Conclusions are drawn and future work plans are outlined in Section VI.
A. Notation
Lower- and upper-case bold letters are used to denote vectors and matrices, respectively. Higher-order tensors are denoted by upper-case bold calligraphic letters. For a tensor $\mathcal{X}$, $\mathbf{X}_{(n)}$ stands for its mode-$n$ unfolding. $*$ stands for the Hadamard product and $\otimes$ for the Kronecker product. The Khatri-Rao product is denoted by $\odot$ in its general (partition-wise) version and by $\odot_{\mathrm{c}}$ in its column-wise version. $\circ$ denotes the outer product. The superscript $^{\mathrm{T}}$ stands for transposition. The identity matrix of order $N$ and the all-ones $M\times N$ matrix are respectively denoted by $\mathbf{I}_N$ and $\mathbf{1}_{M\times N}$. $\mathbf{1}_N$ stands for $\mathbf{1}_{N\times 1}$. The column vectorization and the trace of a matrix $\mathbf{X}$ are denoted by $\mathrm{vec}(\mathbf{X})$ and $\mathrm{tr}(\mathbf{X})$, respectively. $\mathrm{diag}(\mathbf{x})$ is the diagonal matrix with the vector $\mathbf{x}$ on its main diagonal. The Euclidean vector norm and the Frobenius matrix norm are denoted by $\|\cdot\|_2$ and $\|\cdot\|_{\mathrm{F}}$, respectively. We also use the Matlab indexing notation wherever needed. Thus, $\mathbf{M}(:,\mathcal{S})$ stands for the submatrix of $\mathbf{M}$ consisting of its columns indexed by $\mathcal{S}$. The cardinality of a set $\mathcal{S}$ is denoted by $|\mathcal{S}|$. $\mathbb{R}$ and $\mathbb{C}$ are the fields of real and complex numbers, respectively.
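For readers implementing these operations, the two Khatri-Rao variants can be realized as in the following sketch (the function names are ours):

```python
import numpy as np

def khatri_rao_col(A, B):
    """Column-wise Khatri-Rao product (⊙_c): Kronecker product of matching columns."""
    (I, L), (J, L2) = A.shape, B.shape
    assert L == L2
    # (a_l ⊗ b_l) stacked as columns, with Kronecker row ordering i*J + j
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, L)

def khatri_rao_part(A_blocks, B_blocks):
    """Partition-wise Khatri-Rao product (⊙): block r is the Kronecker product A_r ⊗ B_r."""
    return np.hstack([np.kron(A_r, B_r) for A_r, B_r in zip(A_blocks, B_blocks)])
```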
II. PROBLEM STATEMENT

Given the $I\times J\times K$ tensor
$$\mathcal{Y}=\mathcal{X}+\sigma\mathcal{N},\qquad(2)$$
where $\mathcal{X}$ is given by (1) and $\mathcal{N}$ is an $I\times J\times K$ tensor of zero-mean unit variance i.i.d. Gaussian entries, we aim at estimating $R$, $L_r$, $r=1,2,\ldots,R$, and the factor matrices $\mathbf{A}_r=[\mathbf{a}_{r1}\ \mathbf{a}_{r2}\ \cdots\ \mathbf{a}_{rL_r}]\in\mathbb{C}^{I\times L_r}$, $\mathbf{B}_r=[\mathbf{b}_{r1}\ \mathbf{b}_{r2}\ \cdots\ \mathbf{b}_{rL_r}]\in\mathbb{C}^{J\times L_r}$, $\mathbf{C}\in\mathbb{C}^{K\times R}$, subject of course to the inherent ambiguity resulting from the fact that only the product $\mathbf{A}_r\mathbf{B}_r^{\mathrm{T}}$ can be uniquely identified (with a scaling ambiguity) [1]. In terms of its mode unfoldings $\mathbf{X}_{(1)}\in\mathbb{C}^{I\times JK}$, $\mathbf{X}_{(2)}\in\mathbb{C}^{J\times IK}$ and $\mathbf{X}_{(3)}\in\mathbb{C}^{K\times IJ}$, the tensor $\mathcal{X}$ can be written as [1]
$$\mathbf{X}_{(1)}^{\mathrm{T}}=(\mathbf{B}\odot\mathbf{C})\mathbf{A}^{\mathrm{T}}\triangleq\mathbf{P}\mathbf{A}^{\mathrm{T}},\qquad(3)$$
$$\mathbf{X}_{(2)}^{\mathrm{T}}=(\mathbf{C}\odot\mathbf{A})\mathbf{B}^{\mathrm{T}}\triangleq\mathbf{Q}\mathbf{B}^{\mathrm{T}},\qquad(4)$$
$$\mathbf{X}_{(3)}^{\mathrm{T}}=\left[(\mathbf{A}_1\odot_{\mathrm{c}}\mathbf{B}_1)\mathbf{1}_{L_1}\ \cdots\ (\mathbf{A}_R\odot_{\mathrm{c}}\mathbf{B}_R)\mathbf{1}_{L_R}\right]\mathbf{C}^{\mathrm{T}}\triangleq\mathbf{T}\mathbf{C}^{\mathrm{T}}.\qquad(5)$$
In this work, we follow a Bayesian approach to address the above problem. A fully Bayesian analysis is detailed next.
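As a sanity check of (3), the following hedged NumPy sketch (the sizes and the row-major index pairing of the mode-1 unfolding are our own illustrative choices) builds a tensor via (1) and verifies $\mathbf{X}_{(1)}^{\mathrm{T}}=\mathbf{P}\mathbf{A}^{\mathrm{T}}$:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, L, R = 4, 5, 6, 2, 3      # small illustrative sizes
A_blocks = [rng.standard_normal((I, L)) for _ in range(R)]
B_blocks = [rng.standard_normal((J, L)) for _ in range(R)]
C = rng.standard_normal((K, R))

# Tensor from (1): X = sum_r (A_r B_r^T) ∘ c_r
X = sum(np.einsum('ij,k->ijk', A_r @ B_r.T, C[:, r])
        for r, (A_r, B_r) in enumerate(zip(A_blocks, B_blocks)))

# P = B ⊙ C from (3): partition-wise Khatri-Rao, block r being B_r ⊗ c_r
P = np.hstack([np.kron(B_r, C[:, [r]]) for r, B_r in enumerate(B_blocks)])
A = np.hstack(A_blocks)

# Mode-1 unfolding X_(1) (I x JK, pairing (j, k) -> j*K + k); check X_(1)^T = P A^T
X1 = X.reshape(I, J * K)
assert np.allclose(X1.T, P @ A.T)
```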
III. PROPOSED BAYESIAN MODEL

Let $R$ and the $L_r$'s be overestimated to $R_{\mathrm{ini}}$ and $L_{\mathrm{ini}}$, respectively. Heavy-tailed multi-parameter Laplace distributions, known for their sparsity-inducing effect, are placed over the columns of the $\mathbf{A}_r$'s, $\mathbf{B}_r$'s, and $\mathbf{C}$ in a way that implicitly implements a regularization analogous to that of the BTD-HIRLS method [5]. Namely, the number of block terms and the ranks of the $\mathbf{A}_r$'s and $\mathbf{B}_r$'s are jointly penalized, while respecting the different roles that these matrices play in the BTD model. This results in the nulling of all but $R$ columns of $\mathbf{C}$, and the nulling of all but $L_r$ columns of the corresponding "surviving" $\mathbf{A}_r$, $\mathbf{B}_r$ blocks. Following the premise of the well-known automatic relevance determination (ARD) framework [6], [11], the priors are assigned via a 3-level hierarchy of prior distributions outlined next.
The likelihood function, which encodes the underlying causal relation between the data and the latent variables, can be written in three equivalent forms, with respect to (w.r.t.) the three unfoldings of $\mathcal{Y}$ (cf. (3), (4), (5)), as follows:
$$p(\mathbf{Y}_{(1)}^{\mathrm{T}}\mid\mathbf{A},\mathbf{B},\mathbf{C},\beta)=\prod_{i=1}^{I}p(\mathbf{y}_i^{(1)}\mid\mathbf{A},\mathbf{B},\mathbf{C},\beta)=\prod_{i=1}^{I}\mathcal{N}(\mathbf{y}_i^{(1)}\mid\mathbf{P}\boldsymbol{a}_i,\beta^{-1}\mathbf{I}_{JK}),\qquad(6)$$
$$p(\mathbf{Y}_{(2)}^{\mathrm{T}}\mid\mathbf{A},\mathbf{B},\mathbf{C},\beta)=\prod_{j=1}^{J}p(\mathbf{y}_j^{(2)}\mid\mathbf{A},\mathbf{B},\mathbf{C},\beta)=\prod_{j=1}^{J}\mathcal{N}(\mathbf{y}_j^{(2)}\mid\mathbf{Q}\boldsymbol{b}_j,\beta^{-1}\mathbf{I}_{IK}),\qquad(7)$$
$$p(\mathbf{Y}_{(3)}^{\mathrm{T}}\mid\mathbf{A},\mathbf{B},\mathbf{C},\beta)=\prod_{k=1}^{K}p(\mathbf{y}_k^{(3)}\mid\mathbf{A},\mathbf{B},\mathbf{C},\beta)=\prod_{k=1}^{K}\mathcal{N}(\mathbf{y}_k^{(3)}\mid\mathbf{T}\boldsymbol{c}_k,\beta^{-1}\mathbf{I}_{IJ}),\qquad(8)$$
where $\beta$ is the noise precision and $\boldsymbol{a}_i$, $\boldsymbol{b}_j$, $\boldsymbol{c}_k$ and $\mathbf{y}_i^{(1)}$, $\mathbf{y}_j^{(2)}$, $\mathbf{y}_k^{(3)}$ are the $i$th, $j$th, $k$th rows of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{Y}_{(1)}$, $\mathbf{Y}_{(2)}$, $\mathbf{Y}_{(3)}$, respectively, in column form.¹ The matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ are considered as unobserved variables and are assigned 3-level hierarchical prior distributions. Namely, at the first level of the hierarchy, Gaussian distributions are placed over $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, namely,
$$p(\mathbf{A}\mid\mathbf{s},\boldsymbol{\zeta},\beta)=\prod_{i=1}^{I}\mathcal{N}\!\left(\boldsymbol{a}_i\mid\mathbf{0},\beta^{-1}\mathbf{S}^{-1}(\mathbf{Z}^{-1}\otimes\mathbf{I}_{L_{\mathrm{ini}}})\right),\qquad(9)$$
$$p(\mathbf{B}\mid\mathbf{s},\boldsymbol{\zeta},\beta)=\prod_{j=1}^{J}\mathcal{N}\!\left(\boldsymbol{b}_j\mid\mathbf{0},\beta^{-1}\mathbf{S}^{-1}(\mathbf{Z}^{-1}\otimes\mathbf{I}_{L_{\mathrm{ini}}})\right),\qquad(10)$$
$$p(\mathbf{C}\mid\boldsymbol{\zeta},\beta)=\prod_{k=1}^{K}\mathcal{N}\!\left(\boldsymbol{c}_k\mid\mathbf{0},\beta^{-1}\mathbf{Z}^{-1}\right),\qquad(11)$$
where $\mathbf{S}=\mathrm{diag}(\mathbf{s})$ with $\mathbf{s}\in\mathbb{R}^{L_{\mathrm{ini}}R_{\mathrm{ini}}\times 1}$ and $\mathbf{Z}=\mathrm{diag}(\boldsymbol{\zeta})$, $\boldsymbol{\zeta}\in\mathbb{R}^{R_{\mathrm{ini}}\times 1}$. Note that the priors of $\mathbf{A}$ and $\mathbf{B}$ are zero-mean with the same covariance matrix, which is formed from the diagonal matrices $\mathbf{Z}$ and $\mathbf{S}$. This particular selection is of critical importance from an implicit regularization perspective, since it induces identical sparsity patterns over columns/sub-blocks of $\mathbf{A}$ and $\mathbf{B}$. A sufficiently large value of $\zeta_r$ will lead the $r$th column of $\mathbf{C}$ (cf. (11)) and the entire set of the redundant $L_{\mathrm{ini}}$ columns of the sub-matrices $\mathbf{A}_r$, $\mathbf{B}_r$ (cf. (9), (10)) to zero. At the same time, the superfluous $l$th columns of the "surviving" $\mathbf{A}_r$, $\mathbf{B}_r$, denoted by $\mathbf{a}_{rl}$, $\mathbf{b}_{rl}$, are jointly forced to zero when the value of $s_{rl}$ becomes sufficiently large (cf. (9), (10)). Thus, the $R_{\mathrm{ini}}$ diagonal values of $\mathbf{Z}$ act as weights that determine the number of nonzero block terms, $R$, while the $L_{\mathrm{ini}}R_{\mathrm{ini}}$ diagonal values of $\mathbf{S}$ determine the number of nonzero columns of $\mathbf{A}$ and $\mathbf{B}$. The sophisticated selection of both (9), (10) leads to joint sparsity imposition on the columns of the factors, at two levels. Thus, ideas of ARD and sparse Bayesian learning (SBL) [11] are put into effect to address the challenging problem of BTD model selection.

¹We denote matrix rows and columns with bold italic and roman letters, respectively.
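The coupling mechanism of (9)-(11) can be visualized by sampling the first prior level for fixed hyper-parameters; in this toy sketch (all values ours), a large $\zeta_r$ suppresses the $r$th column of $\mathbf{C}$ together with the whole $r$th blocks of $\mathbf{A}$ and $\mathbf{B}$, while a large $s_{rl}$ nulls the matching columns of $\mathbf{A}$ and $\mathbf{B}$ only:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, L_ini, R_ini = 8, 8, 8, 3, 4
beta = 1.0
zeta = np.array([1.0, 1.0, 1e6, 1.0])   # huge ζ_3: block term 3 is switched off
s = np.ones(L_ini * R_ini)
s[1] = 1e6                              # huge s_{1,2}: column 2 of A_1 and B_1 is switched off

# Per-column variances: β^{-1} S^{-1} (Z^{-1} ⊗ I_{L_ini}) for A, B; β^{-1} Z^{-1} for C
var_ab = 1.0 / (beta * s * np.kron(zeta, np.ones(L_ini)))
A = rng.standard_normal((I, L_ini * R_ini)) * np.sqrt(var_ab)
B = rng.standard_normal((J, L_ini * R_ini)) * np.sqrt(var_ab)
C = rng.standard_normal((K, R_ini)) / np.sqrt(beta * zeta)

print(np.round(np.linalg.norm(A, axis=0), 3))   # near-zero energy at the nulled columns
print(np.round(np.linalg.norm(C, axis=0), 3))
```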
At the second level of the hierarchy of priors, inverse Gamma priors are assigned over $\mathbf{s}$ and $\boldsymbol{\zeta}$,
$$p(\mathbf{s})=\prod_{r=1}^{R_{\mathrm{ini}}}\prod_{l=1}^{L_{\mathrm{ini}}}\mathcal{IG}\!\left(s_{rl}\,\Big|\,\frac{I+J+1}{2},\,\delta_{rl}\right),\qquad(12)$$
$$p(\boldsymbol{\zeta})=\prod_{r=1}^{R_{\mathrm{ini}}}\mathcal{IG}\!\left(\zeta_r\,\Big|\,\frac{(I+J)L_{\mathrm{ini}}+KR_{\mathrm{ini}}+1}{2},\,\rho_r\right),\qquad(13)$$
where $\delta_{rl}$ and $\rho_r$ are the scale parameters of the distributions over $s_{rl}$ and $\zeta_r$, respectively. The third level involves Gamma prior distributions over the scale parameters of the inverse Gamma distributions of $s_{rl}$ and $\zeta_r$, namely,
$$p(\delta_{rl})=\mathcal{G}(\delta_{rl}\mid\psi,\tau),\qquad(14)$$
$$p(\rho_r)=\mathcal{G}(\rho_r\mid\mu,\nu),\qquad(15)$$
where $\psi,\tau,\mu,\nu$ take very small positive values, rendering the respective priors non-informative. Note that these priors are conjugate w.r.t. the likelihood functions and w.r.t. each other, which guarantees that the posterior distribution will belong to the same class of distributions as the priors [6]. Moreover, one can show that heavy-tailed multi-parameter Laplace marginal distributions arise over the columns of $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ after integrating out the variables $\mathbf{s}$ and $\boldsymbol{\zeta}$.

The above Bayesian model is depicted in the form of a graphical model in Fig. 2.

Fig. 2. The proposed Bayesian model (variables: $\psi,\tau\to\boldsymbol{\delta}\to\mathbf{s}$; $\mu,\nu\to\boldsymbol{\rho}\to\boldsymbol{\zeta}$; $\kappa,\theta\to\beta$; feeding $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$).

IV. APPROXIMATE POSTERIOR INFERENCE
Let $\boldsymbol{\Theta}$ be the cell array which includes all unobserved variables, that is, $\boldsymbol{\Theta}\triangleq\{\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{s},\boldsymbol{\zeta},\beta,\boldsymbol{\rho},\boldsymbol{\delta}\}$. The exact joint posterior of the variables of the adopted Bayesian model is given by
$$p(\boldsymbol{\Theta}\mid\mathcal{Y})=\frac{p(\mathcal{Y},\boldsymbol{\Theta})}{\int p(\mathcal{Y},\boldsymbol{\Theta})\,d\boldsymbol{\Theta}}.\qquad(16)$$
Due to the complexity of the model, the marginal distribution of $\mathcal{Y}$ in the denominator is computationally intractable. Therefore, we follow a variational inference (VI) approach for approximating (16). The idea is to approximate the posterior by a distribution which is as close as possible to the exact posterior in terms of the Kullback-Leibler divergence [7]. VI allows for an efficient approximate inference process even in vastly complicated Bayesian models that involve high-dimensional variables. It is usually coupled with the mean-field approximation, namely, the assumption that the posterior distribution can be factorized w.r.t. the involved variables, implying statistical independence among them. In our case, the approximate posterior $q(\boldsymbol{\Theta})$ of $p(\boldsymbol{\Theta}\mid\mathcal{Y})$ is written in the form
$$q(\boldsymbol{\Theta})=q(\beta)\prod_{i=1}^{I}q(\boldsymbol{a}_i)\prod_{j=1}^{J}q(\boldsymbol{b}_j)\prod_{k=1}^{K}q(\boldsymbol{c}_k)\prod_{r=1}^{R_{\mathrm{ini}}}\prod_{l=1}^{L_{\mathrm{ini}}}q(s_{rl})q(\delta_{rl})\prod_{r=1}^{R_{\mathrm{ini}}}q(\zeta_r)q(\rho_r).\qquad(17)$$
Denoting the individual variables above by $\theta_i$, the corresponding VI-based posteriors are known to satisfy
$$q(\theta_i)=\frac{\exp\left(\langle\ln p(\mathcal{Y},\boldsymbol{\Theta})\rangle_{i\neq j}\right)}{\int\exp\left(\langle\ln p(\mathcal{Y},\boldsymbol{\Theta})\rangle_{i\neq j}\right)d\theta_i},\qquad(18)$$
where $\langle\cdot\rangle_{i\neq j}$ denotes expectation w.r.t. all $q(\theta_j)$'s but $q(\theta_i)$. To solve (18), a block coordinate ascent approach is taken, employing the cyclic update rule, namely solving for $q(\theta_i)$ given the $q(\theta_j)$, $j\neq i$. It follows that the posterior distribution of $\boldsymbol{a}_i$ is given by
$$q(\boldsymbol{a}_i)=\mathcal{N}(\langle\boldsymbol{a}_i\rangle,\boldsymbol{\Sigma}_a),\qquad(19)$$
with
$$\langle\boldsymbol{a}_i\rangle=\langle\beta\rangle\boldsymbol{\Sigma}_a\langle\mathbf{P}\rangle^{\mathrm{T}}\mathbf{y}_i^{(1)},\qquad(20)$$
$$\boldsymbol{\Sigma}_a=\langle\beta\rangle^{-1}\left(\langle\mathbf{P}^{\mathrm{T}}\mathbf{P}\rangle+\langle\mathbf{S}\rangle(\langle\mathbf{Z}\rangle\otimes\mathbf{I}_{L_{\mathrm{ini}}})\right)^{-1},\qquad(21)$$
where $\langle\cdot\rangle$ denotes expectation w.r.t. the posterior of the involved variable.² The posterior of $\boldsymbol{b}_j$ results in an analogous manner:
$$q(\boldsymbol{b}_j)=\mathcal{N}(\langle\boldsymbol{b}_j\rangle,\boldsymbol{\Sigma}_b),\qquad(22)$$
with
$$\langle\boldsymbol{b}_j\rangle=\langle\beta\rangle\boldsymbol{\Sigma}_b\langle\mathbf{Q}\rangle^{\mathrm{T}}\mathbf{y}_j^{(2)},\qquad(23)$$
$$\boldsymbol{\Sigma}_b=\langle\beta\rangle^{-1}\left(\langle\mathbf{Q}^{\mathrm{T}}\mathbf{Q}\rangle+\langle\mathbf{S}\rangle(\langle\mathbf{Z}\rangle\otimes\mathbf{I}_{L_{\mathrm{ini}}})\right)^{-1}.\qquad(24)$$
The posterior of $\boldsymbol{c}_k$ is
$$q(\boldsymbol{c}_k)=\mathcal{N}(\langle\boldsymbol{c}_k\rangle,\boldsymbol{\Sigma}_c),\qquad(25)$$
with
$$\langle\boldsymbol{c}_k\rangle=\langle\beta\rangle\boldsymbol{\Sigma}_c\langle\mathbf{T}\rangle^{\mathrm{T}}\mathbf{y}_k^{(3)},\qquad(26)$$
$$\boldsymbol{\Sigma}_c=\langle\beta\rangle^{-1}\left(\langle\mathbf{T}^{\mathrm{T}}\mathbf{T}\rangle+\langle\mathbf{Z}\rangle\right)^{-1}.\qquad(27)$$

²All $\boldsymbol{a}_i$'s have the same covariance matrix, $\boldsymbol{\Sigma}_a$, and similarly for the $\boldsymbol{b}_j$'s and the $\boldsymbol{c}_k$'s.
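In code, the update (20)-(21) amounts to one matrix inversion and one matrix product for all rows of $\mathbf{A}$ at once; a minimal sketch, assuming the moments $\langle\mathbf{P}\rangle$, $\langle\mathbf{P}^{\mathrm{T}}\mathbf{P}\rangle$, $\langle\beta\rangle$, $\langle\mathbf{S}\rangle$, $\langle\mathbf{Z}\rangle$ have already been formed (function and argument names are ours):

```python
import numpy as np

def update_A(Y1, P_mean, PtP_mean, beta_mean, s_mean, zeta_mean, L_ini):
    """Closed-form update of q(a_i), cf. (19)-(21): all rows of A share the
    covariance Sigma_a, and the posterior means are stacked into <A>."""
    D = np.diag(s_mean * np.kron(zeta_mean, np.ones(L_ini)))   # <S>(<Z> ⊗ I_{L_ini})
    Sigma_a = np.linalg.inv(beta_mean * (PtP_mean + D))        # eq. (21)
    A_mean = beta_mean * Y1 @ P_mean @ Sigma_a                 # eq. (20) for all rows at once
    return A_mean, Sigma_a
```

The updates of $\mathbf{B}$ and $\mathbf{C}$ follow the same pattern, with (23)-(24) and (26)-(27), respectively.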
One can readily verify that
$$\langle\mathbf{P}\rangle=\langle\mathbf{B}\rangle\odot\langle\mathbf{C}\rangle,\qquad(28)$$
$$\langle\mathbf{Q}\rangle=\langle\mathbf{C}\rangle\odot\langle\mathbf{A}\rangle,\qquad(29)$$
$$\langle\mathbf{T}\rangle=\left[(\langle\mathbf{A}_1\rangle\odot_{\mathrm{c}}\langle\mathbf{B}_1\rangle)\mathbf{1}_{L_{\mathrm{ini}}}\ \cdots\ (\langle\mathbf{A}_{R_{\mathrm{ini}}}\rangle\odot_{\mathrm{c}}\langle\mathbf{B}_{R_{\mathrm{ini}}}\rangle)\mathbf{1}_{L_{\mathrm{ini}}}\right],\qquad(30)$$
$$\langle\mathbf{P}^{\mathrm{T}}\mathbf{P}\rangle=\langle(\mathbf{B}^{\mathrm{T}}\mathbf{B})*(\mathbf{C}^{\mathrm{T}}\mathbf{C}\otimes\mathbf{1}_{L_{\mathrm{ini}}\times L_{\mathrm{ini}}})\rangle=\langle\mathbf{B}^{\mathrm{T}}\mathbf{B}\rangle*(\langle\mathbf{C}^{\mathrm{T}}\mathbf{C}\rangle\otimes\mathbf{1}_{L_{\mathrm{ini}}\times L_{\mathrm{ini}}}),\qquad(31)$$
$$\langle\mathbf{Q}^{\mathrm{T}}\mathbf{Q}\rangle=\langle(\mathbf{A}^{\mathrm{T}}\mathbf{A})*(\mathbf{C}^{\mathrm{T}}\mathbf{C}\otimes\mathbf{1}_{L_{\mathrm{ini}}\times L_{\mathrm{ini}}})\rangle=\langle\mathbf{A}^{\mathrm{T}}\mathbf{A}\rangle*(\langle\mathbf{C}^{\mathrm{T}}\mathbf{C}\rangle\otimes\mathbf{1}_{L_{\mathrm{ini}}\times L_{\mathrm{ini}}}),\qquad(32)$$
$$\langle\mathbf{T}^{\mathrm{T}}\mathbf{T}\rangle=(\mathbf{I}_{R_{\mathrm{ini}}}\otimes\mathbf{1}_{L_{\mathrm{ini}}}^{\mathrm{T}})\left(\langle\mathbf{A}^{\mathrm{T}}\mathbf{A}\rangle*\langle\mathbf{B}^{\mathrm{T}}\mathbf{B}\rangle\right)(\mathbf{I}_{R_{\mathrm{ini}}}\otimes\mathbf{1}_{L_{\mathrm{ini}}}),\qquad(33)$$
where the second-order statistics appearing in the above expressions can be computed as follows:
$$\langle\mathbf{A}^{\mathrm{T}}\mathbf{A}\rangle=\langle\mathbf{A}\rangle^{\mathrm{T}}\langle\mathbf{A}\rangle+I\boldsymbol{\Sigma}_a,\qquad(34)$$
$$\langle\mathbf{B}^{\mathrm{T}}\mathbf{B}\rangle=\langle\mathbf{B}\rangle^{\mathrm{T}}\langle\mathbf{B}\rangle+J\boldsymbol{\Sigma}_b,\qquad(35)$$
$$\langle\mathbf{C}^{\mathrm{T}}\mathbf{C}\rangle=\langle\mathbf{C}\rangle^{\mathrm{T}}\langle\mathbf{C}\rangle+K\boldsymbol{\Sigma}_c.\qquad(36)$$
Next, the approximate posteriors of the variables belonging to the second level of the hierarchy are given. The posterior of $s_{rl}$ turns out [12] to be a generalized inverse Gaussian (GIG) pdf,
$$q(s_{rl})=\mathcal{GIG}\!\left(s_{rl}\,\Big|\,-\frac{1}{2},\,\langle\beta\rangle\langle\zeta_r\rangle\left(\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle\right),\,\langle\delta_{rl}\rangle\right),\qquad(37)$$
with mean
$$\langle s_{rl}\rangle=\sqrt{\frac{\langle\delta_{rl}\rangle}{\langle\beta\rangle\langle\zeta_r\rangle\left(\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle\right)}},\qquad(38)$$
where $\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle$ and $\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle$ are the $((r-1)L_{\mathrm{ini}}+l,(r-1)L_{\mathrm{ini}}+l)$ entries of (34) and (35), respectively. Similarly, the approximate posterior of $\zeta_r$ is also GIG,
$$q(\zeta_r)=\mathcal{GIG}\!\left(\zeta_r\,\Bigg|\,-\frac{1}{2},\,\langle\beta\rangle\!\left[\sum_{l=1}^{L_{\mathrm{ini}}}\langle s_{rl}\rangle\left(\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle\right)+\langle\mathbf{c}_r^{\mathrm{T}}\mathbf{c}_r\rangle\right],\,\langle\rho_r\rangle\right),\qquad(39)$$
with $\langle\zeta_r\rangle$ given by
$$\langle\zeta_r\rangle=\sqrt{\frac{\langle\rho_r\rangle}{\langle\beta\rangle\left(\sum_{l=1}^{L_{\mathrm{ini}}}\langle s_{rl}\rangle\left(\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle\right)+\langle\mathbf{c}_r^{\mathrm{T}}\mathbf{c}_r\rangle\right)}},\qquad(40)$$
and $\langle\mathbf{c}_r^{\mathrm{T}}\mathbf{c}_r\rangle$ denoting the $(r,r)$ entry of (36). It can be shown [12] that, at the third level of the hierarchy,
$$q(\delta_{rl})=\mathcal{G}(\delta_{rl}\mid\bar{\psi},\bar{\tau}),\qquad(41)$$
where
$$\bar{\psi}=\psi+\frac{I+J+1}{2},\qquad(42)$$
$$\bar{\tau}=\tau+\left\langle\frac{1}{s_{rl}}\right\rangle,\qquad(43)$$
and
$$q(\rho_r)=\mathcal{G}(\rho_r\mid\bar{\mu},\bar{\nu}),\qquad(44)$$
with
$$\bar{\mu}=\mu+\frac{(I+J)L_{\mathrm{ini}}+KR_{\mathrm{ini}}+1}{2},\qquad(45)$$
$$\bar{\nu}=\nu+\left\langle\frac{1}{\zeta_r}\right\rangle.\qquad(46)$$
The expectations $\left\langle\frac{1}{s_{rl}}\right\rangle$ and $\left\langle\frac{1}{\zeta_r}\right\rangle$ can be obtained as [12]
$$\left\langle\frac{1}{s_{rl}}\right\rangle=\frac{1}{\langle\delta_{rl}\rangle}+\frac{1}{\langle s_{rl}\rangle},\qquad(47)$$
$$\left\langle\frac{1}{\zeta_r}\right\rangle=\frac{1}{\langle\rho_r\rangle}+\frac{1}{\langle\zeta_r\rangle}.\qquad(48)$$
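The second- and third-level updates are elementwise scalar recursions and can be vectorized over $r$ and $l$; a hedged sketch (names ours, with the Gamma posterior means of (41)-(46) substituted for $\langle\delta_{rl}\rangle$ and $\langle\rho_r\rangle$):

```python
import numpy as np

def update_hyperparams(aa, bb, cc, beta, delta, rho, zeta,
                       I, J, K, L_ini, R_ini, psi, tau, mu, nu):
    """One sweep of (38), (40)-(48). aa, bb hold the diagonal entries
    <a_rl^T a_rl>, <b_rl^T b_rl> of (34)-(35) as (R_ini*L_ini,) vectors;
    cc holds the diagonal <c_r^T c_r> of (36) as an (R_ini,) vector."""
    zeta_rep = np.kron(zeta, np.ones(L_ini))                        # repeat ζ_r over its block
    s = np.sqrt(delta / (beta * zeta_rep * (aa + bb)))              # eq. (38)
    block_sum = (s * (aa + bb)).reshape(R_ini, L_ini).sum(axis=1)
    zeta = np.sqrt(rho / (beta * (block_sum + cc)))                 # eq. (40)
    inv_s = 1.0 / delta + 1.0 / s                                   # eq. (47)
    inv_zeta = 1.0 / rho + 1.0 / zeta                               # eq. (48)
    delta = (psi + (I + J + 1) / 2) / (tau + inv_s)                 # mean of (41)-(43)
    rho = (mu + ((I + J) * L_ini + K * R_ini + 1) / 2) / (nu + inv_zeta)  # mean of (44)-(46)
    return s, zeta, delta, rho
```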
Finally, the posterior of the noise precision parameter $\beta$ is given by
$$q(\beta)=\mathcal{G}(\beta\mid\bar{\kappa},\bar{\theta}),\qquad(49)$$
where
$$\bar{\kappa}=\kappa+\frac{(I+J)L_{\mathrm{ini}}R_{\mathrm{ini}}+KR_{\mathrm{ini}}+IJK}{2},\qquad(50)$$
$$\bar{\theta}=\theta+\frac{1}{2}\left\langle\|\mathbf{Y}_{(1)}^{\mathrm{T}}-\mathbf{P}\mathbf{A}^{\mathrm{T}}\|_{\mathrm{F}}^2\right\rangle+\frac{1}{2}\sum_{r=1}^{R_{\mathrm{ini}}}\left[\sum_{l=1}^{L_{\mathrm{ini}}}\left(\langle\mathbf{a}_{rl}^{\mathrm{T}}s_{rl}\zeta_r\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}s_{rl}\zeta_r\mathbf{b}_{rl}\rangle\right)+\langle\mathbf{c}_r^{\mathrm{T}}\zeta_r\mathbf{c}_r\rangle\right].\qquad(51)$$
The quantity $\left\langle\|\mathbf{Y}_{(1)}^{\mathrm{T}}-\mathbf{P}\mathbf{A}^{\mathrm{T}}\|_{\mathrm{F}}^2\right\rangle$ appearing in $\bar{\theta}$ is computed as
$$\left\langle\|\mathbf{Y}_{(1)}^{\mathrm{T}}-\mathbf{P}\mathbf{A}^{\mathrm{T}}\|_{\mathrm{F}}^2\right\rangle=\mathrm{tr}\{\mathbf{Y}_{(1)}^{\mathrm{T}}\mathbf{Y}_{(1)}\}-2\,\mathrm{tr}\{\langle\mathbf{A}\rangle^{\mathrm{T}}\mathbf{Y}_{(1)}\langle\mathbf{P}\rangle\}+\mathrm{tr}\{\langle\mathbf{A}^{\mathrm{T}}\mathbf{A}\rangle(\langle\mathbf{B}^{\mathrm{T}}\mathbf{B}\rangle*(\langle\mathbf{C}^{\mathrm{T}}\mathbf{C}\rangle\otimes\mathbf{1}_{L_{\mathrm{ini}}\times L_{\mathrm{ini}}}))\},\qquad(52)$$
and for the remaining terms there holds
$$\langle\mathbf{a}_{rl}^{\mathrm{T}}s_{rl}\zeta_r\mathbf{a}_{rl}\rangle=\langle s_{rl}\rangle\langle\zeta_r\rangle\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle,\qquad(53)$$
$$\langle\mathbf{b}_{rl}^{\mathrm{T}}s_{rl}\zeta_r\mathbf{b}_{rl}\rangle=\langle s_{rl}\rangle\langle\zeta_r\rangle\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle,\qquad(54)$$
$$\langle\mathbf{c}_r^{\mathrm{T}}\zeta_r\mathbf{c}_r\rangle=\langle\zeta_r\rangle\langle\mathbf{c}_r^{\mathrm{T}}\mathbf{c}_r\rangle.\qquad(55)$$
The whole process can be viewed as a coordinate ascent algorithm applied on a lower bound of the evidence function (ELBO) [12]. The resulting Bayesian BTD (BBTD) algorithm is summarized in Table I. The iterations stop when a convergence criterion is met (e.g., the relative difference of the tensor reconstruction errors in two consecutive iterations becomes less than a user-defined threshold) or the maximum number of iterations is reached.
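A sketch of this last update, under the same naming assumptions as the previous snippets:

```python
import numpy as np

def update_beta(kappa, theta, sq_err, s, zeta, aa, bb, cc, I, J, K, L_ini, R_ini):
    """Posterior mean of the noise precision, cf. (49)-(51); sq_err is the
    expected squared reconstruction error of (52)."""
    zeta_rep = np.kron(zeta, np.ones(L_ini))
    penalty = np.sum(s * zeta_rep * (aa + bb)) + np.sum(zeta * cc)   # eqs. (53)-(55)
    kappa_bar = kappa + ((I + J) * L_ini * R_ini + K * R_ini + I * J * K) / 2  # eq. (50)
    theta_bar = theta + 0.5 * sq_err + 0.5 * penalty                 # eq. (51)
    return kappa_bar / theta_bar                                     # Gamma mean <beta>
```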
V. SIMULATION RESULTS
In this section, we evaluate the effectiveness of the BBTD algorithm in revealing the ranks and computing the model parameters. First, an $I\times J\times K$ tensor $\mathcal{Y}$ is generated as in (2), with $R=3$ and the $L_r$'s being randomly selected from the set $\{\cdot,\ldots,\cdot\}$. The entries of $\mathbf{A}_r$, $\mathbf{B}_r$ and $\mathbf{C}$ are i.i.d., sampled from the standard Gaussian distribution. The noise power is set so as to result in a signal-to-noise ratio $\mathrm{SNR}=10\log_{10}\left(\|\mathcal{X}\|_{\mathrm{F}}^2/(\sigma^2\|\mathcal{N}\|_{\mathrm{F}}^2)\right)$ of 15 dB.
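A hedged sketch of this data-generation step, with placeholder dimensions and a placeholder rank range of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
I = J = K = 30                      # placeholder dimensions (our choice)
R, snr_db = 3, 15.0
L = rng.integers(2, 8, size=R)      # stand-in for the experiment's rank set

X = np.zeros((I, J, K))
for L_r in L:
    X += np.einsum('ij,k->ijk',
                   rng.standard_normal((I, L_r)) @ rng.standard_normal((L_r, J)),
                   rng.standard_normal(K))

# Scale the i.i.d. Gaussian noise so that 10*log10(||X||_F^2 / (sigma^2 ||N||_F^2)) = snr_db
N = rng.standard_normal((I, J, K))
sigma = np.linalg.norm(X) / (np.linalg.norm(N) * 10 ** (snr_db / 20))
Y = X + sigma * N
```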
TABLE I
THE BAYESIAN BTD (BBTD) ALGORITHM
Input: $\mathcal{Y}$, $R_{\mathrm{ini}}$, $L_{\mathrm{ini}}$.
Output: $\hat{R}$; $\hat{L}_r$, $r=1,2,\ldots,\hat{R}$; $\hat{\mathbf{A}}$, $\hat{\mathbf{B}}$, $\hat{\mathbf{C}}$.
Initialize $\mathbf{B}$, $\mathbf{C}$, $\beta$, $\mathbf{S}$, $\mathbf{Z}$, $\boldsymbol{\delta}$, $\boldsymbol{\rho}$.
repeat
  $\boldsymbol{\Sigma}_a=\langle\beta\rangle^{-1}(\langle\mathbf{P}^{\mathrm{T}}\mathbf{P}\rangle+\langle\mathbf{S}\rangle(\langle\mathbf{Z}\rangle\otimes\mathbf{I}_{L_{\mathrm{ini}}}))^{-1}$
  $\langle\mathbf{A}\rangle=\langle\beta\rangle\mathbf{Y}_{(1)}\langle\mathbf{P}\rangle\boldsymbol{\Sigma}_a$
  $\boldsymbol{\Sigma}_b=\langle\beta\rangle^{-1}(\langle\mathbf{Q}^{\mathrm{T}}\mathbf{Q}\rangle+\langle\mathbf{S}\rangle(\langle\mathbf{Z}\rangle\otimes\mathbf{I}_{L_{\mathrm{ini}}}))^{-1}$
  $\langle\mathbf{B}\rangle=\langle\beta\rangle\mathbf{Y}_{(2)}\langle\mathbf{Q}\rangle\boldsymbol{\Sigma}_b$
  $\boldsymbol{\Sigma}_c=\langle\beta\rangle^{-1}(\langle\mathbf{T}^{\mathrm{T}}\mathbf{T}\rangle+\langle\mathbf{Z}\rangle)^{-1}$
  $\langle\mathbf{C}\rangle=\langle\beta\rangle\mathbf{Y}_{(3)}\langle\mathbf{T}\rangle\boldsymbol{\Sigma}_c$
  $\langle s_{rl}\rangle=\sqrt{\frac{\langle\delta_{rl}\rangle}{\langle\beta\rangle\langle\zeta_r\rangle(\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle)}}$, $r=1,2,\ldots,R_{\mathrm{ini}}$, $l=1,2,\ldots,L_{\mathrm{ini}}$
  $\left\langle\frac{1}{s_{rl}}\right\rangle=\frac{1}{\langle\delta_{rl}\rangle}+\frac{1}{\langle s_{rl}\rangle}$, $r=1,2,\ldots,R_{\mathrm{ini}}$, $l=1,2,\ldots,L_{\mathrm{ini}}$
  $\langle\delta_{rl}\rangle=\frac{\psi+\frac{I+J+1}{2}}{\tau+\langle 1/s_{rl}\rangle}$, $r=1,2,\ldots,R_{\mathrm{ini}}$, $l=1,2,\ldots,L_{\mathrm{ini}}$
  $\langle\zeta_r\rangle=\sqrt{\frac{\langle\rho_r\rangle}{\langle\beta\rangle\left(\sum_{l=1}^{L_{\mathrm{ini}}}\langle s_{rl}\rangle(\langle\mathbf{a}_{rl}^{\mathrm{T}}\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}\mathbf{b}_{rl}\rangle)+\langle\mathbf{c}_r^{\mathrm{T}}\mathbf{c}_r\rangle\right)}}$, $r=1,2,\ldots,R_{\mathrm{ini}}$
  $\left\langle\frac{1}{\zeta_r}\right\rangle=\frac{1}{\langle\rho_r\rangle}+\frac{1}{\langle\zeta_r\rangle}$, $r=1,2,\ldots,R_{\mathrm{ini}}$
  $\langle\rho_r\rangle=\frac{\mu+\frac{(I+J)L_{\mathrm{ini}}+KR_{\mathrm{ini}}+1}{2}}{\nu+\langle 1/\zeta_r\rangle}$, $r=1,2,\ldots,R_{\mathrm{ini}}$
  $\langle\beta\rangle=\frac{\kappa+\frac{(I+J)L_{\mathrm{ini}}R_{\mathrm{ini}}+KR_{\mathrm{ini}}+IJK}{2}}{\theta+\frac{1}{2}\left\langle\|\mathbf{Y}_{(1)}^{\mathrm{T}}-\mathbf{P}\mathbf{A}^{\mathrm{T}}\|_{\mathrm{F}}^2\right\rangle+\frac{1}{2}\sum_{r=1}^{R_{\mathrm{ini}}}\left[\sum_{l=1}^{L_{\mathrm{ini}}}(\langle\mathbf{a}_{rl}^{\mathrm{T}}s_{rl}\zeta_r\mathbf{a}_{rl}\rangle+\langle\mathbf{b}_{rl}^{\mathrm{T}}s_{rl}\zeta_r\mathbf{b}_{rl}\rangle)+\langle\mathbf{c}_r^{\mathrm{T}}\zeta_r\mathbf{c}_r\rangle\right]}$
until convergence
$\mathcal{I}=\{i\in\{1,2,\ldots,R_{\mathrm{ini}}\}\mid i\text{th column of }\langle\mathbf{C}\rangle\text{ is of non-negligible energy}\}$
$\hat{R}=|\mathcal{I}|$, $\hat{\mathbf{C}}=\langle\mathbf{C}\rangle(:,\mathcal{I})$
Let the elements of $\mathcal{I}$ be sorted in increasing order as $i_1,i_2,\ldots,i_{\hat{R}}$
$\mathcal{I}_r=\{l\in\{1,2,\ldots,L_{\mathrm{ini}}\}\mid l\text{th column of }\langle\mathbf{A}\rangle_r\triangleq\langle\mathbf{A}\rangle(:,(i_r-1)L_{\mathrm{ini}}+1:i_rL_{\mathrm{ini}})\text{ is of non-negligible energy}\}$
$\phantom{\mathcal{I}_r}=\{l\in\{1,2,\ldots,L_{\mathrm{ini}}\}\mid l\text{th column of }\langle\mathbf{B}\rangle_r\triangleq\langle\mathbf{B}\rangle(:,(i_r-1)L_{\mathrm{ini}}+1:i_rL_{\mathrm{ini}})\text{ is of non-negligible energy}\}$, $r=1,2,\ldots,\hat{R}$
$\hat{L}_r=|\mathcal{I}_r|$, $r=1,2,\ldots,\hat{R}$
$\hat{\mathbf{A}}_r=\langle\mathbf{A}\rangle_r(:,\mathcal{I}_r)$, $\hat{\mathbf{B}}_r=\langle\mathbf{B}\rangle_r(:,\mathcal{I}_r)$, $r=1,2,\ldots,\hat{R}$
$\hat{\mathbf{A}}=[\hat{\mathbf{A}}_1\ \hat{\mathbf{A}}_2\ \cdots\ \hat{\mathbf{A}}_{\hat{R}}]$, $\hat{\mathbf{B}}=[\hat{\mathbf{B}}_1\ \hat{\mathbf{B}}_2\ \cdots\ \hat{\mathbf{B}}_{\hat{R}}]$

Both $R$ and the $L_r$'s are initialized to 10. The evolution of the median value of the
$$\mathrm{NMSE}=\sum_{r=1}^{R}\frac{\|\mathbf{A}_r\mathbf{B}_r^{\mathrm{T}}\circ\mathbf{c}_r-\hat{\mathbf{A}}_r\hat{\mathbf{B}}_r^{\mathrm{T}}\circ\hat{\mathbf{c}}_r\|_{\mathrm{F}}^2}{\|\mathbf{A}_r\mathbf{B}_r^{\mathrm{T}}\circ\mathbf{c}_r\|_{\mathrm{F}}^2}$$
values obtained in 10 independent runs is plotted in Fig. 3, for a single initialization of the algorithm.³⁴

The second experiment aims at evaluating the ability of the algorithm to correctly estimate $R$ and the $L_r$'s in a scenario of strong noise. $\mathcal{Y}$ is now of dimensions $I=30$, $J=30$, $K=30$, with $R=5$, $L_1=8$, $L_2=6$, $L_3=\cdot$, $L_4=5$ and $L_5=3$, and SNR = 5 dB. Both $R$ and the $L_r$'s are again overestimated to 10. Fig. 4 depicts the success rates in the recovery of the true $R$ and $L_r$'s obtained out of 100 independent runs of the experiment. The algorithm is seen to recover the correct number of non-zero block terms in the vast majority of the cases, achieving a success rate of 97% for the estimation of $R$. It should also be noted that in the rare cases where it fails, this happens only with a small deviation from the true $R$, namely 6. Fig. 4(b) shows the success rates for the $L_r$'s in the cases where $R$ is correctly estimated. These are nearly 100%, demonstrating the effectiveness of the proposed algorithm in revealing the true ranks of the $\mathbf{A}_r$'s and $\mathbf{B}_r$'s.

³The Hungarian algorithm is employed to resolve the permutation ambiguity, that is, to match the $\hat{R}$ estimated non-zero block terms with the true ones.
⁴Note that the proposed algorithm is shown to be robust to initialization in the current experimental setting, hence a single initialization is sufficient.

Fig. 3. NMSE vs. iterations.

Fig. 4. Success rates of the recovery of (a) $R$ and (b) the $L_r$'s.

VI. CONCLUSIONS AND FUTURE WORK
The problem of estimating the BTD model structure, namely the number of block terms and their individual ranks, is significantly more challenging than that for the TD and CPD models and has only recently started to attract significant attention. In this preprint, a Bayesian approach to the problem of rank-revealing rank-$(L_r,L_r,1)$ BTD tensor modeling is developed for the first time. Using ideas that stem from the celebrated ARD framework, we devise a Bayesian model that is composed of 3 levels of prior distributions. The resulting marginal priors are heavy-tailed and have been carefully crafted so as to jointly enforce sparsity on the columns of the factor matrices. In doing so, model selection can be performed in parallel with the model computation. This is achieved with an iterative algorithm that results from the application of variational inference in the proposed probabilistic model. Its Bayesian nature completely avoids the task of hyper-parameter tuning, which is ubiquitous in regularization-based methods. Simulation results with synthetic data demonstrate its effectiveness in terms of both rank estimation and model fitting. The extended version of this preprint will provide further theoretical insights into the Bayesian model and possible variants thereof. The connections of the proposed algorithm with our earlier regularization-based method [5] will be explored in more detail. Moreover, the performance of the algorithm will be evaluated in additional synthetic- and real-data experiments. Finally, nonnegativity constraints will be incorporated into the current Bayesian approach. In doing so, we aspire to gain new insights into the interpretability of the BTD model, thus allowing for its use in a wide range of blind source separation-type applications.
REFERENCES

[1] L. De Lathauwer, "Decompositions of a higher-order tensor in block terms — Part II: Definitions and uniqueness," SIAM J. Matrix Anal. Appl., vol. 30, no. 3, pp. 1033–1066, 2008.
[2] A. A. Rontogiannis, E. Kofidis, and P. V. Giampouras, "Block-term tensor decomposition: Model selection and computation," arXiv:2002.09759v2 [cs.NA], Oct. 2020.
[3] L. De Lathauwer, "Block component analysis: a new concept for blind source separation," in Proc. LVA/ICA-2012, Tel Aviv, Israel, Mar. 2012.
[4] Y. Qian, F. Xiong, S. Zeng, J. Zhou, and Y. Y. Tang, "Matrix-vector nonnegative tensor factorization for blind unmixing of hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 3, pp. 1776–1792, Mar. 2017.
[5] A. A. Rontogiannis, E. Kofidis, and P. V. Giampouras, "Block-term tensor decomposition: Model selection and computation," in Proc. EUSIPCO-2020, Amsterdam, The Netherlands, Aug. 2020 (deferred to Jan. 2021).
[6] S. Theodoridis, Machine Learning — A Bayesian and Optimization Perspective, 2nd ed. Academic Press, 2020.
[7] D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, "The variational approximation for Bayesian inference," IEEE Signal Process. Mag., pp. 131–146, Nov. 2008.
[8] Q. Zhao, L. Zhang, and A. Cichocki, "Bayesian sparse Tucker models for dimension reduction and tensor completion," arXiv:1505.02343v1 [cs.LG], May 2015.
[9] L. Cheng, Z. Chen, Q. Shi, Y.-C. Wu, and S. Theodoridis, "Towards probabilistic tensor canonical polyadic decomposition 2.0: Automatic tensor rank learning using generalized hyperbolic prior," arXiv:2009.02472v1 [cs.LG], Sep. 2020.
[10] L. Xu, L. Cheng, N. Wong, and Y.-C. Wu, "Learning tensor train representation with automatic rank determination from incomplete noisy data," arXiv:2010.06564v1 [eess.SP], Oct. 2020.
[11] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, Jun. 2001.
[12] P. V. Giampouras, A. A. Rontogiannis, K. E. Themelis, and K. D. Koutroumbas, "Online sparse and low-rank subspace learning from incomplete data: A Bayesian view," Signal Processing, vol. 137, pp. 199–212, 2017.