On learning with shift-invariant structures
Cristian Rusu

The author is with the Istituto Italiano di Tecnologia (IIT), Genova, Italy. Contact e-mail address: [email protected]. Demo source code: https://github.com/cristian-rusu-research/shift-invariance
Abstract—In this paper, we describe new results and algorithms, based on circulant matrices, for the task of learning shift-invariant components from training data. We deal with the shift-invariant dictionary learning problem, which we formulate using circulant and convolutional matrices (including unions of such matrices), define optimization problems that describe our goals, and propose efficient ways to solve them. Based on these findings, we also show how to learn a wavelet-like dictionary from training data. We connect our work with various previous results from the literature and we show the effectiveness of our proposed algorithms using synthetic as well as real ECG signals and images.
I. INTRODUCTION
Circulant matrices [1] are highly structured matrices where each column is a circular shift of the previous one. Because of their structure and their connection to the fast Fourier transform [2] (circulant matrices are diagonalized by the Fourier matrix [1, Section 3]), these matrices have seen many applications in the past: computing the shift between two signals (the shift retrieval problem exemplified in the circular convolution and cross-correlation theorems) for the GPS locking problem [3], time delay estimation [4], compressed shift retrieval from Fourier components [5], matching or alignment problems for image processing [6], designing numerically efficient linear transformations [7] and overcomplete dictionaries [8], matrix decompositions [9], convolutional dictionary learning [10], [11], [12] and sparse coding [13], and learning shift-invariant structures from data [14], [15] for medical imaging [16], EEG [17] and audio [18] signal analysis. A recent review of the methods, solutions and applications related to circulant and convolutional representations is given in [19].

In this paper, we propose several numerically efficient algorithms to extract shift-invariant components or alignments from data using several structured dictionaries related to circulant matrices.

Previously, several dictionary learning techniques that accommodate shift invariance have been proposed: extending the well-known K-SVD algorithm to deal with shift-invariant structures [17], [20], [21], a shift-invariant iterative least squares dictionary learning algorithm [22], extending the dictionary while solving an eigenvalue problem [23], a fast online learning approach [24], work that combines shift and 2D rotation invariance [25], and new algorithms that optimize directly the dictionary learning objective functions with circulant matrices [14], [15], [26]. The convolutional sparse representation model [27], [28], [29], where the dictionary is a concatenation of circulant matrices, has been extensively studied in the past. Furthermore, recent work [13]
uses tools developed in the sparse representations literature to provide theoretical insights into convolutional sparse coding, where the dictionary is a concatenation of banded circulant matrices, and its connection to convolutional neural networks [30]. Detailed literature reviews of these learning and convolutional sparse representation problems and proposed solutions have been recently described in [19, Section II].

Structured dictionaries have received a lot of attention mainly for two reasons: the structure means that the dictionaries are easier to store and use (lower memory footprint and lower computational complexity to perform, for example, matrix-vector multiplication or solving linear systems) and they act as regularizers modeling some property of the data that is of interest. In the case of shift-invariant dictionaries these two advantages are transparent: manipulation of circulant matrices is done via the fast Fourier transform while storing them takes only linear space (instead of quadratic), and they are able to model patterns from the data that are repetitive, as we expect real-world data (especially image data and time series) to exhibit such patterns.

We start by outlining in Section II circulant matrices and their properties, particularly their factorization with the Fourier matrix, and other structured matrices that we use in this paper. Then, we propose algorithms to learn shift-invariant (circulant, convolutional and unions of these) and wavelet-like components from training data (Section III). Finally, in Section IV, we show experimental results with various data sources (synthetic, ECG, images) that highlight the learning capabilities of the proposed methods.
Notation: bold lowercase $x \in \mathbb{R}^n$ is used to denote a column vector, bold uppercase $X \in \mathbb{R}^{n \times m}$ is used to denote a matrix, non-bold lowercase Greek letters like $\alpha \in \mathbb{R}$ are used to denote scalar values, calligraphic letters $\mathcal{K}$ are used to denote sets and $|\mathcal{K}|$ is the cardinality of $\mathcal{K}$ (abusing this notation, $|\alpha|$ is the magnitude of a scalar). Then $\|x\|_2$ is the $\ell_2$ norm, $\|x\|_0$ is the $\ell_0$ pseudo-norm, $\|X\|_F = \sqrt{\mathrm{tr}(X^H X)}$ is the Frobenius norm, $\mathrm{tr}(X)$ denotes the trace, $\mathrm{vec}(X) \in \mathbb{R}^{nm}$ vectorizes the matrix $X \in \mathbb{R}^{n \times m}$ columnwise, $\mathrm{diag}(x)$ denotes the diagonal matrix with the vector $x$ on its diagonal, $X^H$ is the complex conjugate transpose, $X^T$ is the matrix transpose, $X^*$ is the complex conjugate, $X^{-1}$ denotes the inverse of a square matrix, and $x_{kj}$ is the $(k,j)$th entry of $X$. Tilde variables like $\tilde{X}$ represent the columnwise Fourier transform of $X$, $X \otimes Y$ denotes the Kronecker product and $X \odot Y$ is the Khatri-Rao product [31].

II. THE PROPOSED STRUCTURED DICTIONARIES
A. Circulant dictionaries
We consider in this paper circulant matrices $C$. These square matrices are completely defined by their first column $c \in \mathbb{R}^n$: every column is a circular shift of the previous one. With a down shift direction, the right circulant matrices are
$$ C = \mathrm{circ}(c) \overset{\mathrm{def}}{=} \begin{bmatrix} c_1 & c_n & \dots & c_3 & c_2 \\ c_2 & c_1 & \dots & c_4 & c_3 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ c_{n-1} & c_{n-2} & \dots & c_1 & c_n \\ c_n & c_{n-1} & \dots & c_2 & c_1 \end{bmatrix} = \begin{bmatrix} c & Pc & P^2 c & \dots & P^{n-1} c \end{bmatrix} \in \mathbb{R}^{n \times n}. \quad (1)$$
The matrix $P \in \mathbb{R}^{n \times n}$ denotes the orthonormal circulant matrix that circularly shifts a target vector $c$ by one position, i.e., $P = \mathrm{circ}(e_2)$ where $e_2$ is the second vector of the standard basis of $\mathbb{R}^n$. Notice that $P^{q-1} = \mathrm{circ}(e_q)$ is also orthonormal circulant and denotes a cyclic shift by $q-1$. The main property of circulant matrices (1) is their eigenvalue factorization, which reads
$$ C = F^H \Sigma F, \quad \Sigma = \mathrm{diag}(\sigma) \in \mathbb{C}^{n \times n}, \quad (2)$$
where $F \in \mathbb{C}^{n \times n}$ is the unitary Fourier matrix ($F^H F = F F^H = I$) and the diagonal is $\sigma = \sqrt{n}\, F c$, $\sigma \in \mathbb{C}^n$. Note that the multiplication with $F$ on the right is equivalent to the application of the fast Fourier transform, i.e., $Fc = \mathrm{FFT}(c)$, while the multiplication with $F^H$ is equivalent to the inverse Fourier transform, i.e., $F^H c = \mathrm{IFFT}(c)$. Both transforms are applied in $O(n \log n)$ time and memory.
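The factorization (2) is easy to check numerically. The following NumPy sketch (our illustration, not part of the paper's reference code) builds $\mathrm{circ}(c)$ explicitly and verifies that a product $Cx$ can be computed with two FFTs and an elementwise scaling by the eigenvalues $\sigma$.

```python
import numpy as np

# Minimal check of (1)-(2): a circulant matrix built from its first column c is
# diagonalized by the Fourier matrix, so C @ x costs two FFTs instead of O(n^2).
n = 8
rng = np.random.default_rng(0)
c = rng.standard_normal(n)

# circ(c): each column is a down-shift of the previous one.
C = np.column_stack([np.roll(c, k) for k in range(n)])

sigma = np.fft.fft(c)                               # eigenvalues (= sqrt(n) F c for unitary F)
x = rng.standard_normal(n)
y_fft = np.fft.ifft(sigma * np.fft.fft(x)).real     # C @ x in O(n log n)

assert np.allclose(C @ x, y_fft)
```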
B. Convolutional dictionaries

Convolutional dictionaries can be reduced to circulant dictionaries by observing that, given $c \in \mathbb{R}^n$ and $x \in \mathbb{R}^m$ with $m \geq n$, the result of their convolution is a vector $y$ of size $p = n + m - 1$:
$$ y = c * x = \mathrm{toep}(c)\, x, \quad \mathrm{toep}(c) = \begin{bmatrix} c_1 & & & \\ c_2 & c_1 & & \\ \vdots & c_2 & \ddots & \\ c_n & \vdots & \ddots & c_1 \\ & c_n & & c_2 \\ & & \ddots & \vdots \\ & & & c_n \end{bmatrix}, \quad y = \mathrm{circ}\!\left( \begin{bmatrix} c \\ 0_{(m-1) \times 1} \end{bmatrix} \right) \begin{bmatrix} x \\ 0_{(n-1) \times 1} \end{bmatrix} = C_{\mathrm{conv}}\, x_{\mathrm{conv}}, \quad (3)$$
where $\mathrm{toep}(c)$ is a Toeplitz matrix of size $p \times m$. Padding with zeros such that all variables are of size $p$ leads again to a matrix-vector multiplication by a circulant matrix. An alternative, but ultimately equivalent, way to write the convolution in terms of a circulant matrix is to notice that a Toeplitz matrix can be embedded into an extended circulant matrix of twice the size (see [15, Section 4]).

For our purposes, there is a fundamental difference between $C$ and $C_{\mathrm{conv}}$: in the case of $C$ it is exactly equivalent whether we choose to operate with $c$ or $\sigma$, while for $C_{\mathrm{conv}}$ we necessarily have to work with $c_{\mathrm{conv}}$ in order to impose its sparse structure. As we will see, this means that in the convolutional case we cannot exploit some simplifications that occur in the Fourier domain (when working directly with $\sigma$).
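A similarly small sketch of (3) (our illustration): zero-padding both the kernel and the signal to length $p = n + m - 1$ turns the linear convolution into a product with a circulant matrix, again evaluated with FFTs.

```python
import numpy as np

# Sketch of (3): full convolution c * x equals multiplication by the
# (n + m - 1) x (n + m - 1) circulant matrix built from the zero-padded kernel.
n, m = 4, 9
p = n + m - 1
rng = np.random.default_rng(1)
c = rng.standard_normal(n)
x = rng.standard_normal(m)

c_conv = np.concatenate([c, np.zeros(m - 1)])   # first column of C_conv
x_conv = np.concatenate([x, np.zeros(n - 1)])   # zero-padded input

y_circ = np.fft.ifft(np.fft.fft(c_conv) * np.fft.fft(x_conv)).real
assert np.allclose(np.convolve(c, x), y_circ)
```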
C. Wavelet-like dictionaries

Multiples, powers, products, and sums of circulant matrices are themselves circulant. Therefore, extending this class of structures to include other dictionaries is not straightforward. In order to represent a richer class of dictionaries, still based on circulants, consider now the following structured matrix
$$ C_k^{(p)} = \begin{bmatrix} G_k S & H_k S \end{bmatrix} \in \mathbb{R}^{p \times p}, \quad (4)$$
where $G_k = \mathrm{circ}(g_k)$ and $H_k = \mathrm{circ}(h_k)$ are both $p \times p$ circulant matrices and $S \in \mathbb{R}^{p \times p/2}$ is a selection matrix that keeps only every other column, i.e., $G_k S = \begin{bmatrix} g_k & P^2 g_k & P^4 g_k & \dots & P^{p-2} g_k \end{bmatrix} \in \mathbb{R}^{p \times p/2}$ (downsampling the columns by a factor of 2). In general we assume that the filters $g_k$ and $h_k$ have compact support with length denoted $n \leq p$. As such, these transformations are related to the convolutional dictionaries previously described. We call $g_k$ and $h_k$ filters because (3) is equivalent to a filtering operation on a signal $x$ where the filter coefficients are stored in the circulant matrix.

Now we define a new transformation that operates only on the first $p2^{-k+1}$ coordinates and keeps the others unchanged:
$$ W_k^{(p)} = \begin{bmatrix} C_k^{(p2^{-k+1})} & 0_{p2^{-k+1} \times (p - p2^{-k+1})} \\ 0_{(p - p2^{-k+1}) \times p2^{-k+1}} & I_{p - p2^{-k+1}} \end{bmatrix} \in \mathbb{R}^{p \times p}. \quad (5)$$
Finally, we define a wavelet-like synthesis transformation that is a cascade of the fundamental stages (5):
$$ W = W_1^{(p)} \cdots W_{m-1}^{(p)} W_m^{(p)}. \quad (6)$$
We call this transformation wavelet-like because $Wx$ applies convolutions to parts of the signal $x$ at different scales (for example, see [32] for a description of discrete wavelet transformations from the perspective of matrix linear algebra).
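To make the construction (4)–(6) concrete, the sketch below (our illustration; the function names are ours) builds one stage from two length-$n$ filters and cascades the stages into $W$. The Haar filters are used only as a convenient placeholder, in which case the resulting $W$ is orthonormal.

```python
import numpy as np

# Sketch of (4)-(6): one stage C_k = [circ(g) S, circ(h) S] with downsampling by 2,
# cascaded over m stages into the wavelet-like synthesis matrix W.
def circ(first_col):
    q = len(first_col)
    return np.column_stack([np.roll(first_col, k) for k in range(q)])

def stage(g, h, q):
    """C^(q) = [circ(g) S, circ(h) S], with S keeping every other column."""
    gq = np.concatenate([g, np.zeros(q - len(g))])
    hq = np.concatenate([h, np.zeros(q - len(h))])
    return np.hstack([circ(gq)[:, ::2], circ(hq)[:, ::2]])

def wavelet_like(p, m, g, h):
    W = np.eye(p)
    for k in range(1, m + 1):
        q = p // 2 ** (k - 1)          # stage k acts on the first p 2^{-k+1} coordinates
        Wk = np.eye(p)
        Wk[:q, :q] = stage(g, h, q)
        W = W @ Wk                     # W = W_1 W_2 ... W_m as in (6)
    return W

g = np.array([1.0, 1.0]) / np.sqrt(2)   # Haar low-pass (placeholder filters)
h = np.array([1.0, -1.0]) / np.sqrt(2)  # Haar high-pass
W = wavelet_like(p=8, m=3, g=g, h=h)
assert np.allclose(W.T @ W, np.eye(8))  # orthonormal for the Haar choice
```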
III. THE PROPOSED DICTIONARY LEARNING ALGORITHMS

Dictionary learning [33] provides heuristics that approximate solutions to the following problem: given a dataset $Y \in \mathbb{R}^{n \times N}$, a sparsity level $s$ and the size of the dictionary $S$, we want to create a dictionary $D \in \mathbb{R}^{n \times S}$ and the sparse representations $X \in \mathbb{R}^{S \times N}$ by solving
$$ \underset{D,\, X}{\mathrm{minimize}} \ \| Y - DX \|_F^2 \quad \text{subject to } X \text{ is } s\text{-sparse}. \quad (7)$$
A classic approach to this problem, which we also use, is the iterative alternating optimization algorithm: keep the dictionary fixed and update the representations, and vice-versa, in a loop until convergence.

In this paper, we constrain the dictionary $D$ to the structures previously discussed: circulant and convolutional (including unions in both cases) and wavelet-like. Our goal is to propose numerically efficient dictionary update rules for these structures. While for the general dictionary learning problem several computationally efficient online algorithms have been proposed [34], [35], [36], in this paper we consider only batch dictionary learning and we focus on the computational complexity of the dictionary update step.
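For reference, the alternating scheme can be summarized as in the sketch below, where the dictionary update is a generic (MOD-style) least squares step that the structured updates of this section replace; scikit-learn's orthogonal_mp stands in for the OMP routine (our choice of library, not prescribed by the paper).

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def update_dictionary(Y, X):
    # Generic least squares (MOD-style) update: D = Y X^T (X X^T)^{-1}, unit-norm columns.
    D = Y @ X.T @ np.linalg.pinv(X @ X.T)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

def dictionary_learning(Y, S, s, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n, N = Y.shape
    D = rng.standard_normal((n, S))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(iters):
        X = orthogonal_mp(D, Y, n_nonzero_coefs=s)   # sparse coding step (OMP)
        D = update_dictionary(Y, X)                  # dictionary update step
    return D, X
```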
A. Circulant dictionary learning

Given a dataset $Y \in \mathbb{R}^{n \times N}$ and a sparsity level $s \geq 1$ for the representations $X \in \mathbb{R}^{n \times N}$, the work in [15] introduces an efficient way of learning a circulant dictionary $C \in \mathbb{R}^{n \times n}$ by approximately solving the optimization problem
$$ \underset{c,\, X}{\mathrm{minimize}} \ \| Y - CX \|_F^2 \quad \text{subject to } \|\mathrm{vec}(X)\|_0 \leq sN, \ C = \mathrm{circ}(c), \ \|c\|_2 = 1. \quad (8)$$
For fixed $X$, to update $c$ we develop the objective function to
$$ \| Y - CX \|_F^2 = \| FY - \Sigma FX \|_F^2 = \| \tilde{Y} - \Sigma \tilde{X} \|_F^2, \quad (9)$$
and in order to minimize it we set
$$ \sigma_1 = \frac{\tilde{x}_1^H \tilde{y}_1}{\| \tilde{x}_1 \|_2^2}, \quad \sigma_k = \frac{\tilde{x}_k^H \tilde{y}_k}{\| \tilde{x}_k \|_2^2}, \quad \sigma_{n-k+2} = \sigma_k^*, \quad k = 2, \dots, n, \quad (10)$$
where $\tilde{y}_k^T$ and $\tilde{x}_k^T$ are the rows of $\tilde{Y} = FY$ and $\tilde{X} = FX$.
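A minimal NumPy rendition of the update (9)–(10) (an illustrative sketch; the small eps guard for all-zero rows is our addition):

```python
import numpy as np

# Circulant dictionary update: with X fixed, the optimal spectrum sigma is
# obtained row-by-row in the Fourier domain as in (10); c is its inverse FFT.
def circulant_update(Y, X, eps=1e-12):
    Yt = np.fft.fft(Y, axis=0)                      # columnwise transform of Y
    Xt = np.fft.fft(X, axis=0)                      # columnwise transform of X
    num = np.sum(np.conj(Xt) * Yt, axis=1)          # x~_k^H y~_k for every row k
    den = np.sum(np.abs(Xt) ** 2, axis=1) + eps     # ||x~_k||_2^2
    sigma = num / den
    c = np.real(np.fft.ifft(sigma))                 # first column of C = circ(c)
    return c, sigma
```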
Remark 1. Given $Y \in \mathbb{R}^{n \times N}$ and $X \in \mathbb{R}^{n \times N}$, the best circulant dictionary $C$ in terms of the Frobenius norm achieves
$$ \underset{c}{\mathrm{minimum}} \ \| Y - CX \|_F^2 = \sum_{k=1}^n \left( \| y_k \|_2^2 - \frac{ | \tilde{x}_k^H \tilde{y}_k |^2 }{ \| \tilde{x}_k \|_2^2 } \right). \quad (11)$$

Proof.
Expand the objective of (8) using the optimal (10):
$$\begin{aligned} \| Y - CX \|_F^2 &= \| Y \|_F^2 + \| CX \|_F^2 - 2\,\mathrm{tr}(CXY^H) = \| Y \|_F^2 + \| F^H \Sigma F X \|_F^2 - 2\,\mathrm{tr}(F^H \Sigma F X Y^H) \\ &= \| Y \|_F^2 + \| \Sigma \tilde{X} \|_F^2 - 2\,\mathrm{tr}(\Sigma \tilde{X} \tilde{Y}^H) = \| Y \|_F^2 + \sum_{k=1}^n \frac{|\tilde{x}_k^H \tilde{y}_k|^2}{\|\tilde{x}_k\|_2^2} - 2\sum_{k=1}^n \frac{|\tilde{x}_k^H \tilde{y}_k|^2}{\|\tilde{x}_k\|_2^2} \\ &= \| Y \|_F^2 - \sum_{k=1}^n \frac{|\tilde{x}_k^H \tilde{y}_k|^2}{\|\tilde{x}_k\|_2^2} = \sum_{k=1}^n \left( \| y_k \|_2^2 - \frac{|\tilde{x}_k^H \tilde{y}_k|^2}{\|\tilde{x}_k\|_2^2} \right). \quad (12) \end{aligned}$$
In the development of (12) we use the definition of the Frobenius norm, the invariance of the Frobenius norm under unitary transformations, i.e., $\|FX\|_F = \|F^H X\|_F = \|X\|_F$, and the cyclic permutation property of the trace, i.e., $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$. By the Cauchy–Schwarz inequality we have $|\tilde{x}_k^H \tilde{y}_k| \leq \|\tilde{x}_k\|_2 \|\tilde{y}_k\|_2$, which holds with equality if and only if the rows $\tilde{x}_k$ and $\tilde{y}_k$ are multiples of each other. In this case, the objective function (12) develops to $\|Y - CX\|_F^2 = \sum_{k=1}^n \|y_k\|_2^2 - \sum_{k=1}^n \|\tilde{y}_k\|_2^2 = 0$. This is the only case where the circulant dictionary can reach zero representation error. A necessary condition that the optimal circulant dictionary $C$ obeys is $\|\sigma\|_2^2 = \sum_{k=1}^n \frac{|\tilde{x}_k^H \tilde{y}_k|^2}{\|\tilde{x}_k\|_2^4} = 1$, i.e., the $\ell_2$ norm of the optimal solution of (8) is one. ■

In the context of dictionary learning, to obey the unit $\ell_2$ norm constraint on $c$ we should normalize the optimal solution, $\sigma \leftarrow \|\sigma\|_2^{-1}\sigma$. This is avoided because we can always group this normalizing factor with the representations $X$ instead of the circulant dictionary, i.e., $\|\sigma\|_2^{-1} CX = C(\|\sigma\|_2^{-1} X)$. This grouping is correct because, by (2), the Fourier transform preserves $\ell_2$ norms (up to the fixed $\sqrt{n}$ scaling), so rescaling $\sigma$ rescales all the columns of $C$ by the same factor and the unit-norm constraint can be enforced on either quantity.

The algorithm called C–DLA, first introduced in [15], has low computational complexity that is dominated by the $O(nN\log n)$ computation of $\tilde{Y}$ (once) and that of $\tilde{X}$ (at each iteration). The calculations in (10) take approximately $O(nN)$ operations: there are $n$ components in $\sigma$ to be computed, while $\|\tilde{x}_k\|_2^2$ and $\tilde{x}_k^H \tilde{y}_k$ take $O(N)$ operations each. We can limit which and how many of all the possible $n$ shifts of $c$ are allowed. We achieve this by ensuring that the rows of $X$ corresponding to unwanted shifts are zero.
Remark 2 (Approximating linear operators by circulant matrices). Given $Y \in \mathbb{R}^{n \times n}$, let us consider the special case of (8) when $N = n$ and we fix $X = I$. Now we calculate the closest, in Frobenius norm, circulant matrix to a given linear transformation $Y$. Because the Frobenius norm is elementwise, the optimal solution is directly $c_k = \frac{1}{n} \sum_{(i - j) \bmod n = k - 1} y_{ij}$, $k = 1, \dots, n$. Unfortunately, in general, circulant matrices do not approximate all linear transformations with high accuracy. The result is intuitive since matrices have $n^2$ degrees of freedom while circulants have only $n$. Furthermore, if we add other constraints, such as orthogonality for example, the situation is even worse: [37] shows that the set of orthonormal circulants is finite and constructs it. Therefore, researchers have proposed approximating a linear transformation by a product of $O(n)$ circulant, or Toeplitz, and diagonal matrices [9], [38]. ■
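The closed-form solution of Remark 2 amounts to averaging $Y$ along its circulant diagonals, as in the following sketch (the function name is ours):

```python
import numpy as np

# Remark 2: the circulant matrix closest in Frobenius norm to a square matrix Y
# averages the entries of Y along each circulant diagonal.
def nearest_circulant_first_column(Y):
    n = Y.shape[0]
    i, j = np.indices((n, n))
    return np.array([Y[(i - j) % n == k].mean() for k in range(n)])

Y = np.random.default_rng(2).standard_normal((5, 5))
c = nearest_circulant_first_column(Y)
C = np.column_stack([np.roll(c, k) for k in range(5)])   # minimizes ||Y - circ(c)||_F
```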
B. Union of circulant dictionary learning

Let us now consider overcomplete dictionaries (matrices that have significantly more columns than rows) that are unions of circulant matrices. In particular, take a dictionary which is the union of $L$ circulants:
$$ D = \begin{bmatrix} C^{(1)} & C^{(2)} & \dots & C^{(L)} \end{bmatrix} \in \mathbb{R}^{n \times nL}, \quad (13)$$
where each $C^{(\ell)} = \mathrm{circ}(c^{(\ell)}) = F^H \Sigma^{(\ell)} F$, $\Sigma^{(\ell)} = \mathrm{diag}(\sigma^{(\ell)})$, $\sigma^{(\ell)} = \sqrt{n}\, F c^{(\ell)}$, is a circulant matrix. Given training data $Y \in \mathbb{R}^{n \times N}$, with this structure, the dictionary learning problem has the objective
$$\begin{aligned} \| Y - DX \|_F^2 &= \Big\| Y - \begin{bmatrix} C^{(1)} & C^{(2)} & \dots & C^{(L)} \end{bmatrix} X \Big\|_F^2 = \Big\| Y - \begin{bmatrix} F^H \Sigma^{(1)} F & \dots & F^H \Sigma^{(L)} F \end{bmatrix} X \Big\|_F^2 \\ &= \Big\| Y - F^H \begin{bmatrix} \Sigma^{(1)} F & \dots & \Sigma^{(L)} F \end{bmatrix} X \Big\|_F^2 = \Big\| FY - \sum_{\ell=1}^L \Sigma^{(\ell)} F X^{(\ell)} \Big\|_F^2 = \Big\| \tilde{Y} - \sum_{\ell=1}^L \Sigma^{(\ell)} \tilde{X}^{(\ell)} \Big\|_F^2, \quad (14) \end{aligned}$$
where the tilde matrices indicate the Fourier transforms (taken columnwise) and the representations $X \in \mathbb{R}^{nL \times N}$ are separated row-wise into $L$ contiguous non-overlapping blocks of size $n$ denoted $X^{(\ell)} \in \mathbb{R}^{n \times N}$. A way to update all circulant components using the Fourier transforms $\tilde{Y}$ and $\tilde{X}$ presents itself. Denote by $\tilde{y}_k^T$ the $k$th row of $\tilde{Y}$ and by $(\tilde{x}_k^{(\ell)})^T$ the $k$th row of $\tilde{X}^{(\ell)}$. The objective function of the dictionary learning problem separates into $k = 1, \dots, \lceil n/2 \rceil + 1$ (given real-valued training data $Y$) distinct least squares problems of the form
$$ \underset{\sigma_k^{(1)}, \dots, \sigma_k^{(L)}}{\mathrm{minimize}} \ \Big\| \tilde{y}_k^T - \sum_{\ell=1}^L \sigma_k^{(\ell)} (\tilde{x}_k^{(\ell)})^T \Big\|_2^2. \quad (15)$$
Therefore, for a fixed $k$, the diagonal entries $(k,k)$ of all $\Sigma^{(\ell)}$ (which are denoted $\sigma_k^{(\ell)}$) are updated simultaneously by solving the least squares problem (15). Given real-valued data, the mirror relations $\sigma_{n-k+2}^{(\ell)} = (\sigma_k^{(\ell)})^*$, $k = 2, \dots, n$, hold analogously to (10) for all $\ell = 1, \dots, L$. Notice that this formulation is just a natural extension of the one-dimensional least squares problems in (10). To compute all the components of all $\sigma^{(\ell)}$ we solve this least squares problem $n$ times; the computational complexity is $O(nL^2N)$.

In comparison, the union of circulants dictionary learning method presented in [15], UC–DLA, updates each circulant block $C^{(\ell)}$ sequentially and separately (this can be seen as a block coordinate descent approach). The new proposed learning method, called the Union of Circulant Dictionary Learning Algorithm with Simultaneous Updates (UCirc–DLA–SU), is described in Algorithm 1.

Algorithm 1 – UCirc–DLA–SU.
Input: The dataset $Y \in \mathbb{R}^{n \times N}$, the number of circulant atoms $L$ and the sparsity $s \leq n$.
Output: The union of $L$ circulant dictionaries $D \in \mathbb{R}^{n \times nL}$ as in (13) and the sparse representations $X \in \mathbb{R}^{nL \times N}$ such that $\|Y - DX\|_F^2$ is reduced.
Initialization: compute the singular value decomposition of the dataset $Y = U\Sigma V^T$; set $c^{(\ell)} = u_\ell$ for $\ell \leq n$ and set $c^{(\ell)}$ to random $\ell_2$ normalized vectors of size $n$ for $L \geq \ell > n$; compute the representations $X = \mathrm{OMP}(D, Y, s)$; compute the Fourier transform $\tilde{Y}$ and set its first row to zero.
For $1, \dots, K$:
• Update dictionary:
– Compute all the $L$ Fourier transforms $\tilde{X}^{(\ell)}$.
– Construct each optimal $\{\sigma^{(\ell)}\}_{\ell=1}^L$ from (15) separately: set $\{\sigma_1^{(\ell)}\}_{\ell=1}^L = 0$ and compute $\{\sigma_k^{(\ell)}\}_{\ell=1}^L$ for $k = 2, \dots, \lceil n/2 \rceil + 1$ with $\{\sigma_{n-k+2}^{(\ell)}\}_{\ell=1}^L = \{(\sigma_k^{(\ell)})^*\}_{\ell=1}^L$, analogously to (10); then normalize the $L$ circulants $\sigma^{(\ell)} \leftarrow \|\sigma^{(\ell)}\|_2^{-1}\sigma^{(\ell)}$.
• Update sparse representations $X = \mathrm{OMP}(D, Y, s)$.
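A compact illustration of the simultaneous update (15) used by UCirc–DLA–SU (a sketch: it solves all $n$ frequency subproblems directly instead of exploiting the conjugate symmetry, and it omits the normalization and OMP steps of Algorithm 1):

```python
import numpy as np

# Simultaneous update (15): for every frequency k, solve a small L-variable
# least squares problem that couples all circulant blocks.
def union_circulant_update(Y, X, L):
    n, N = Y.shape
    Yt = np.fft.fft(Y, axis=0)                                   # rows y~_k^T
    Xt = [np.fft.fft(X[l * n:(l + 1) * n], axis=0) for l in range(L)]
    sigma = np.zeros((n, L), dtype=complex)
    for k in range(n):
        A = np.stack([Xt[l][k] for l in range(L)], axis=1)       # N x L system
        sigma[k], *_ = np.linalg.lstsq(A, Yt[k], rcond=None)
    C_cols = [np.real(np.fft.ifft(sigma[:, l])) for l in range(L)]  # first columns c^(l)
    return C_cols, sigma
```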
Remark 3 (Updating an unused circulant component). Assuming that $X^{(\ell)} = 0_{n \times N}$, i.e., the $\ell$th circulant matrix is never used in the representations, then we use the update $c^{(\ell)} = \arg\max_{z;\,\|z\|_2=1} \big\| \big( Y - \sum_{i=1, i \neq \ell}^{L} C^{(i)} X^{(i)} \big)^T z \big\|_2^2$, i.e., the leading left singular vector of the residual. This is the block update method used in UC–DLA [15]. Furthermore, similarly to [39], this update could also be used when the atoms of block $\ell$ have a lower contribution to the reconstruction than atoms from other blocks, i.e., $\|X^{(\ell)}\|_F \ll \|X^{(i)}\|_F$, $\forall i \neq \ell$. ■
C. Union of convolutional dictionary learning

Analogously to (13), we define the dictionary which is a union of $L$ convolutional matrices as
$$ D_{\mathrm{conv}} = \begin{bmatrix} C_{\mathrm{conv}}^{(1)} & C_{\mathrm{conv}}^{(2)} & \dots & C_{\mathrm{conv}}^{(L)} \end{bmatrix} \in \mathbb{R}^{p \times pL}, \quad (16)$$
and the objective function to minimize with respect to this union dictionary, given the fixed representations $X_{\mathrm{conv}} \in \mathbb{R}^{pL \times N}$ (separated into $L$ contiguous non-overlapping blocks denoted $X_{\mathrm{conv}}^{(\ell)} = \begin{bmatrix} X^{(\ell)} \\ 0_{(n-1) \times N} \end{bmatrix} \in \mathbb{R}^{p \times N}$), is developed as
$$ \| Y - D_{\mathrm{conv}} X_{\mathrm{conv}} \|_F^2 = \Big\| \mathrm{vec}(\tilde{Y}) - \sum_{\ell=1}^L \mathrm{vec}\big( \Sigma_{\mathrm{conv}}^{(\ell)} \tilde{X}_{\mathrm{conv}}^{(\ell)} \big) \Big\|_2^2 = \Big\| \tilde{y} - \sum_{\ell=1}^L A^{(\ell)} F_{:,n}\, c^{(\ell)} \Big\|_2^2 = \| \tilde{y} - Bc \|_2^2, \quad (17)$$
where $B = \begin{bmatrix} A^{(1)} F_{:,n} & \dots & A^{(L)} F_{:,n} \end{bmatrix} \in \mathbb{C}^{pN \times nL}$, $A^{(\ell)} = \begin{bmatrix} \tilde{x}_1^{(\ell)} \otimes e_1 & \dots & \tilde{x}_p^{(\ell)} \otimes e_p \end{bmatrix} \in \mathbb{C}^{pN \times p}$ is built from the rows of $\tilde{X}_{\mathrm{conv}}^{(\ell)}$, $c = \begin{bmatrix} (c^{(1)})^T & (c^{(2)})^T & \dots & (c^{(L)})^T \end{bmatrix}^T \in \mathbb{R}^{nL}$ and $F_{:,n} \in \mathbb{C}^{p \times n}$ is the $p \times p$ Fourier matrix restricted to its first $n$ columns. The solution is given by the least squares
$$ c = (B^H B) \backslash (B^H \tilde{y}), \quad (18)$$
where $B^H B \in \mathbb{R}^{nL \times nL}$ is a positive definite block symmetric matrix in which each $n \times n$ block is a Toeplitz matrix, $T_{\ell_1 \ell_2} = F_{:,n}^H (A^{(\ell_1)})^H A^{(\ell_2)} F_{:,n}$ for the $(\ell_1, \ell_2)$th block; the diagonal blocks are symmetric positive definite Toeplitz. Therefore, $B^H B$ is determined by $nL + (2n-1)\frac{L(L-1)}{2}$ parameters: the first term covers the parameters of the $L$ symmetric Toeplitz diagonal blocks and the second term covers the parameters of all $\frac{L(L-1)}{2}$ non-diagonal Toeplitz blocks. The computational burden is highest in calculating the diagonals $W^{(\ell_1,\ell_2)} = (A^{(\ell_1)})^H A^{(\ell_2)}$ with entries $w_k^{(\ell_1,\ell_2)} = (\tilde{x}_k^{(\ell_1)})^H \tilde{x}_k^{(\ell_2)}$, i.e., the inner products of the corresponding rows from $\tilde{X}_{\mathrm{conv}}^{(\ell_1)}$ and $\tilde{X}_{\mathrm{conv}}^{(\ell_2)}$, respectively. These calculations take $O(pL^2N)$ operations. The inverse Fourier transforms of $W^{(\ell_1,\ell_2)}$ needed to recover the entries of $T_{\ell_1 \ell_2}$ take only $O(p \log p)$ operations each.

The inverse problem in (18) can be solved exactly in $O(n^3 L^3)$ via a block Cholesky factorization [40, Chapter 4.2]. When $nL$ is large, an alternative approach is to use an iterative procedure like the conjugate gradient method (already used in convolutional problems [41]), where the computational burden falls on matrix-vector products with the matrix $B^H B$; exploiting the Toeplitz structure of its blocks, each such product takes only $O(L^2 p \log p)$ operations. The vector $v = B^H \tilde{y}$ has the structure $v = \begin{bmatrix} (v^{(1)})^T & \dots & (v^{(L)})^T \end{bmatrix}^T$, where $v^{(\ell)} = F_{:,n}^H z^{(\ell)} \in \mathbb{R}^n$ and $z^{(\ell)} \in \mathbb{C}^p$ has entries $z_j^{(\ell)} = (\tilde{x}_j^{(\ell)})^H \tilde{y}_j$, i.e., the $j$th entry of the $\ell$th component is the inner product of the corresponding rows from $\tilde{X}_{\mathrm{conv}}^{(\ell)}$ and $\tilde{Y}$, respectively. The cost of computing $v \in \mathbb{R}^{nL}$ is $O(pLN)$ since the inverse Fourier transforms are only $O(p \log p)$.
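As an illustration of the derivation above (a sketch under the stated notation, not an optimized routine), both $B^H B$ and $v$ can be assembled entirely with FFTs, the Toeplitz blocks coming from inverse transforms of cross-spectra between representation blocks:

```python
import numpy as np
from scipy.linalg import toeplitz

# Assemble and solve the normal equations of (18): Toeplitz blocks of B^H B from
# cross-spectra between representation blocks, v = B^H y from data correlations.
def unionconv_normal_equations(Y, Xconv, L, n):
    p, N = Y.shape
    Yt = np.fft.fft(Y, axis=0)
    Xt = [np.fft.fft(Xconv[l * p:(l + 1) * p], axis=0) for l in range(L)]
    BHB = np.zeros((n * L, n * L))
    v = np.zeros(n * L)
    for l1 in range(L):
        z = np.sum(np.conj(Xt[l1]) * Yt, axis=1)                  # z_j = x~_j^H y~_j
        v[l1 * n:(l1 + 1) * n] = np.real(np.fft.ifft(z))[:n]      # v^(l1)
        for l2 in range(L):
            w = np.sum(np.conj(Xt[l1]) * Xt[l2], axis=1)          # cross-spectrum
            t = np.fft.ifft(w)                                    # circulant first column
            T = toeplitz(np.real(t[:n]), np.real(np.roll(t[::-1], 1)[:n]))
            BHB[l1 * n:(l1 + 1) * n, l2 * n:(l2 + 1) * n] = T
    c = np.linalg.solve(BHB, v)                                   # stacked filters c^(l)
    return c.reshape(L, n), BHB, v
```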
Also, we need to compute once the Fourier transform of the dataset ($\tilde{Y}$) and, at each iteration, all the $L$ Fourier transforms of the sparse representations $\tilde{X}_{\mathrm{conv}}^{(\ell)}$, which take $O(Np\log p)$ and $O(LNp\log p)$ operations overall, respectively. The new proposed learning method, called the Union of Convolutional Dictionary Learning Algorithm with Simultaneous Updates (UConv–DLA–SU), is described in Algorithm 2.

The theoretical and algorithmic importance of dictionaries that are unions of convolutional matrices whose first columns have different (and potentially overlapping) supports has been highlighted in [13] (see in particular the clear presentation of the convolutional sparse model in [13, Figure 1]): group together the first columns of each $C_{\mathrm{conv}}^{(\ell)}$ into the local dictionary $D^{(1)}$ of size $p \times L$, do the same with the second columns into the local dictionary $D^{(2)}$, and so on until $D^{(p)}$. These dictionaries are called local because they are localized to the reconstruction of only a few (depending on the size of the support $n$) grouped entries from a signal, as compared to a global signal model.

Notice that (17) makes use of the matrix $F_{:,n}$ due to the sparsity pattern in $c_{\mathrm{conv}}$, i.e., only the first $n$ entries are non-zero. This can be generalized: if the support of $c_{\mathrm{conv}}$ is denoted by $\mathcal{S}(c_{\mathrm{conv}})$, then the least squares problem (17) can be solved only on this support by making use of $F_{:,\mathcal{S}(c_{\mathrm{conv}})} \in \mathbb{C}^{p \times |\mathcal{S}(c_{\mathrm{conv}})|}$, i.e., the Fourier matrix of size $p \times p$ restricted to the columns indexed in $\mathcal{S}(c_{\mathrm{conv}})$. Alternatively, if we do not want to (or cannot) decide the support a priori, we can add an $\ell_1$ penalty to (17) to promote sparse solutions.

Without the sparsity constraints in $c_{\mathrm{conv}}$, UConv–DLA–SU essentially reduces to UCirc–DLA–SU, but with a major numerical disadvantage: the decoupling that allows for (15) is no longer valid and therefore the least squares problem (18) is significantly larger and harder to solve. This extra computational effort seems unavoidable if we want to impose the sparsity in $c_{\mathrm{conv}}$. This motivates us to discuss the possibility of updating each convolutional dictionary sequentially, just like [15] does for circulant dictionaries.

Algorithm 2 – UConv–DLA–SU.
Input: The dataset $Y \in \mathbb{R}^{p \times N}$, the number of convolutional atoms $L$, the length of $c$ denoted $n$, the length of the input signals $m \geq n$ (both $n$ and $m$ are chosen such that $p = n + m - 1$) and the sparsity $s \leq m$.
Output: The union of $L$ convolutional dictionaries $D_{\mathrm{conv}} \in \mathbb{R}^{p \times pL}$ as in (16) and the sparse representations $X_{\mathrm{conv}} \in \mathbb{R}^{pL \times N}$ such that $\|Y - D_{\mathrm{conv}} X_{\mathrm{conv}}\|_F^2$ is reduced.
Initialization: set $c^{(\ell)}$, $\ell = 1, \dots, L$, to random $\ell_2$ normalized vectors of size $p$ with non-zeros only in the first $n$ entries and compute $X = \mathrm{OMP}(D_{\mathrm{conv}}, Y, s)$; compute the Fourier transform $\tilde{Y}$ and set its first row to zero.
For $1, \dots, K$:
• Update dictionary:
– Compute all the $L$ Fourier transforms $\tilde{X}_{\mathrm{conv}}^{(\ell)}$.
– Efficiently compute (18):
→ Construct $v = \begin{bmatrix} (v^{(1)})^T & \dots & (v^{(L)})^T \end{bmatrix}^T$: for $\ell_1 = 1, \dots, L$ set $z_1^{(\ell_1)} = 0$, compute $z_k^{(\ell_1)} = (\tilde{x}_k^{(\ell_1)})^H \tilde{y}_k$, $z_{p-k+2}^{(\ell_1)} = (z_k^{(\ell_1)})^*$, $k = 2, \dots, \lceil p/2 \rceil + 1$, compute the inverse Fourier transform of $z^{(\ell_1)}$ and keep only its first $n$ entries: $v^{(\ell_1)} = F_{:,n}^H z^{(\ell_1)}$.
→ Explicitly construct $B^H B$: for $\ell_1 = 1, \dots, L$ and $\ell_2 = \ell_1, \dots, L$ compute the first column and row of the block $T_{\ell_1 \ell_2}$ ($T_{\ell_2 \ell_1} = T_{\ell_1 \ell_2}^T$) by the inverse Fourier transform of $w_k^{(\ell_1,\ell_2)} = (\tilde{x}_k^{(\ell_1)})^H \tilde{x}_k^{(\ell_2)}$, $w_{p-k+2}^{(\ell_1,\ell_2)} = (w_k^{(\ell_1,\ell_2)})^*$, $k = 1, \dots, \lceil p/2 \rceil + 1$.
→ Get $c$: solve (18) by the Cholesky decomposition and normalize the $L$ filters $c^{(\ell)} \leftarrow \|c^{(\ell)}\|_2^{-1} c^{(\ell)}$.
• Update sparse representations $X = \mathrm{OMP}(D_{\mathrm{conv}}, Y, s)$.

Remark 4 (The special case of $L = 1$ – the computational simplifications of constructing a single convolutional dictionary). Following a similar development as [15, Remark 2], consider again the objective function of (8) and develop
$$ \| Y - C_{\mathrm{conv}} X_{\mathrm{conv}} \|_F^2 = \| \tilde{y} - A F_{:,n}\, c \|_2^2, \quad (19)$$
where we have defined $A = \begin{bmatrix} \tilde{x}_1 \otimes e_1 & \dots & \tilde{x}_p \otimes e_p \end{bmatrix} \in \mathbb{C}^{pN \times p}$, with $\{e_k\}_{k=1}^p$ the standard basis for $\mathbb{R}^p$, i.e., $A$ is composed of the columns from $(\tilde{X}_{\mathrm{conv}}^T \otimes I)$ corresponding to the non-zero diagonal entries of $\Sigma_{\mathrm{conv}}$ (see also the Khatri-Rao product [31]). By the construction in (3), only the first $n$ entries of $c_{\mathrm{conv}}$ are non-zero and therefore the optimal minimizer of (19) is
$$ c = (A F_{:,n}) \backslash \tilde{y}. \quad (20)$$
The large matrix $AF_{:,n}$ is never explicitly constructed in the computation of $c = (F_{:,n}^H A^H A F_{:,n})^{-1} F_{:,n}^H A^H \tilde{y} = (F_{:,n}^H W F_{:,n})^{-1} v$, where we have defined
$$ W = \mathrm{diag}\big( \begin{bmatrix} \|\tilde{x}_1\|_2^2 & \dots & \|\tilde{x}_p\|_2^2 \end{bmatrix} \big), \quad v = F_{:,n}^H A^H \tilde{y}. \quad (21)$$
Observe that $T = F_{:,n}^H W F_{:,n}$ is a real-valued symmetric positive definite Toeplitz matrix; it is the upper left (leading principal) $n \times n$ submatrix of the $p \times p$ circulant matrix $F^H W F$. The matrix $T$ is never explicitly computed, but its first column is contained in the first $n$ entries of the vector $F^H \mathrm{diag}(W)$. Also, notice that $A^H \tilde{y}$ is a vector whose $j$th entry is the inner product of the corresponding rows from $\tilde{X}$ and $\tilde{Y}$, respectively, i.e., $\tilde{x}_j^H \tilde{y}_j$.

The computations of $W$ and $v$ take approximately $O(pN)$ operations each; because $X$ is real-valued, the Fourier transforms exhibit symmetries and only half of the $\ell_2$ norms in $W$ and of the entries in $A^H \tilde{y}$ need to be computed. These calculations dominate the computational complexity since they depend on the size of the dataset $N \gg p$, together with the computations of the Fourier transforms $\tilde{X}_{\mathrm{conv}}$ (at each iteration) and $\tilde{Y}$ (only once), which take $O(LNp\log p)$ and $O(Np\log p)$ operations respectively. The least squares problem (20) is solved via the Levinson-Durbin algorithm [40, Section 4.7.3], whose complexity is $O(n^2)$ instead of the regular $O(n^3)$ computational complexity for unstructured inverse problems.

Also, observe that this least squares solution, when applied to minimizing (9), leads to the same optimal solution as (10): $c = (F^H A^H A F)^{-1} F^H A^H \tilde{y} = F^H W^{-1} F F^H A^H \tilde{y} = F^H W^{-1} A^H \tilde{y} = F^H \sigma$, with $W = \mathrm{diag}\big( \begin{bmatrix} \|\tilde{x}_1\|_2^2 & \dots & \|\tilde{x}_n\|_2^2 \end{bmatrix} \big)$. The approach in (10) is preferred to (20) since in the Fourier domain the overall problem is decoupled into a series of smaller independent subproblems that can be efficiently solved in parallel.

Therefore, each single convolutional dictionary can be updated efficiently and an algorithm in the style of UC–DLA [15] can be proposed whenever running time is of concern or the dataset is large. In this case, because the convolutional components would not be updated simultaneously, we would expect worse results on average. ■
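A short sketch of this $L = 1$ update (the function name is ours; the final normalization absorbs the scaling discussed earlier): the Toeplitz system of (20)–(21) is formed from two inverse FFTs and solved with a Levinson-type routine.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Remark 4 (L = 1): the normal equations have a symmetric Toeplitz system
# matrix, so (20) can be solved without a generic O(n^3) factorization.
def single_conv_update(Y, Xconv, n):
    Yt = np.fft.fft(Y, axis=0)
    Xt = np.fft.fft(Xconv, axis=0)
    w = np.sum(np.abs(Xt) ** 2, axis=1)                 # diag(W): ||x~_k||_2^2
    z = np.sum(np.conj(Xt) * Yt, axis=1)                # entries x~_j^H y~_j
    t = np.real(np.fft.ifft(w))[:n]                     # first column of the Toeplitz T
    v = np.real(np.fft.ifft(z))[:n]                     # right-hand side (restricted)
    c = solve_toeplitz(t, v)                            # Levinson-Durbin solve of (20)
    return c / max(np.linalg.norm(c), 1e-12)            # unit l2 norm kernel
```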
There are several places where structures related to unions of circulant and convolutional dictionaries appear. We briefly discuss Gabor frames next and then, in the following section, wavelet-like dictionaries.

Remark 5 (A special case of time-frequency synthesis dictionary).
Unions of circulant matrices also show up when studying discrete time-frequency analysis/synthesis matrices [42]. Consider the Gabor synthesis matrix given by
$$ G = \begin{bmatrix} D^{(1)} C & D^{(2)} C & \dots & D^{(m)} C \end{bmatrix} \in \mathbb{C}^{m \times m^2}, \quad (22)$$
where $C = \mathrm{circ}(g)$ for $g \in \mathbb{C}^m$, which is called the Gabor window function, i.e., the matrix $G$ contains circular shifts and modulations of $g$. The matrices $D^{(\ell)}$, $\ell = 1, \dots, m$, are diagonal with entries $d_{kk}^{(\ell)} = \omega^{(\ell-1)(k-1)}$ and $\omega = e^{2\pi i/m}$.

In the context of compressed sensing with structured random matrices, Gabor measurement matrices have been used for sparse signal recovery [43]: the Gabor function of length $m$ is chosen with random entries independently and uniformly distributed on the torus $\{z \in \mathbb{C} \mid |z| = 1\}$ [43, Theorem 2.3]. Here, our goal is to learn the Gabor function $g$ from a given dataset $Y$ such that the dictionary $G$ allows good sparse representations. Now the objective function develops to
$$\begin{aligned} \| Y - GX \|_F^2 &= \| Y - D (I \otimes C) X \|_F^2 = \| y - (((I \otimes F)X)^T \otimes D(I \otimes F^H))\, \mathrm{vec}(I \otimes \Sigma) \|_2^2 \\ &= \Big\| y - \begin{bmatrix} \sum_{\ell=1}^m \tilde{x}_1^{(\ell)} \otimes \tilde{d}_1^{(\ell)} & \dots & \sum_{\ell=1}^m \tilde{x}_m^{(\ell)} \otimes \tilde{d}_m^{(\ell)} \end{bmatrix} \sigma \Big\|_2^2 = \| y - A\sigma \|_2^2 = \| y - AFg \|_2^2, \quad (23) \end{aligned}$$
where $A \in \mathbb{C}^{mN \times m}$ and we have denoted $y = \mathrm{vec}(Y)$, $D = \begin{bmatrix} D^{(1)} & \dots & D^{(m)} \end{bmatrix}$, $(\tilde{x}_k^{(\ell)})^T$ is the $k$th row of $\tilde{X}^{(\ell)}$ and $\tilde{d}_k^{(\ell)}$ is the $k$th column of $D^{(\ell)} F^H$. Similarly to the convolutional dictionary learning case, Gabor atoms typically have compact support and we can add the sparse structure to $g$ (a support of size $n \leq m$) and find the minimizer by solving the least squares problem, which this time is unstructured (there is no Toeplitz structure). ■
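For concreteness, (22) can be instantiated as follows (a sketch; the random unimodular window mirrors the compressed sensing setting of [43]):

```python
import numpy as np

# Gabor synthesis matrix of (22): all circular shifts and modulations of a
# single window g of length m.
def gabor_synthesis(g):
    m = len(g)
    C = np.column_stack([np.roll(g, k) for k in range(m)])     # circ(g): all shifts
    omega = np.exp(2j * np.pi / m)
    blocks = []
    for l in range(m):                                          # modulations D^(l) C
        D = np.diag(omega ** (l * np.arange(m)))
        blocks.append(D @ C)
    return np.hstack(blocks)                                    # G, size m x m^2

rng = np.random.default_rng(3)
g = np.exp(2j * np.pi * rng.random(8))                          # |g_k| = 1
G = gabor_synthesis(g)
```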
D. Wavelet-like dictionary learning

Starting from (6), in the spirit of dictionary learning, our goal is to learn a transformation from given data such that it has sparse representations. The strategy we use is to update each $W_k^{(p)}$ (actually, the $C_k^{(p2^{-k+1})}$ component) while keeping all the other transformations fixed. Therefore, for the $k$th component we want to minimize
$$\begin{aligned} \| Y - WX \|_F^2 &= \| Y - W_A W_k^{(p)} W_B X \|_F^2 = \Big\| Y - \begin{bmatrix} W_{A,1} & W_{A,2} \end{bmatrix} \begin{bmatrix} C_k^{(p2^{-k+1})} & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \end{bmatrix} \Big\|_F^2 \\ &= \| \bar{Y} - W_{A,1} C_k^{(p2^{-k+1})} \bar{X}_1 \|_F^2 = \| \bar{Y} - W_{A,1} \begin{bmatrix} G_k S & H_k S \end{bmatrix} \bar{X}_1 \|_F^2 \\ &= \| \bar{Y} - W_{A,1} F^H \begin{bmatrix} \Sigma_{g_k} & \Sigma_{h_k} \end{bmatrix} (I_2 \otimes FS) \bar{X}_1 \|_F^2 = \| \bar{y} - (((I_2 \otimes FS)\bar{X}_1)^T \otimes W_{A,1} F^H)\, \mathrm{vec}\big( \begin{bmatrix} \Sigma_{g_k} & \Sigma_{h_k} \end{bmatrix} \big) \|_2^2 \\ &= \Big\| \bar{y} - A \begin{bmatrix} Fg_k \\ Fh_k \end{bmatrix} \Big\|_2^2 = \Big\| \bar{y} - A(I_2 \otimes F) \begin{bmatrix} g_k \\ h_k \end{bmatrix} \Big\|_2^2, \quad (24) \end{aligned}$$
where $F$ is the Fourier matrix of size $p2^{-k+1}$, we have denoted $W_A = W_1^{(p)} \cdots W_{k-1}^{(p)}$, $W_B = W_{k+1}^{(p)} \cdots W_m^{(p)}$, $\bar{X} = W_B X$, $\bar{Y} = Y - W_{A,2}\bar{X}_2$, $\bar{y} = \mathrm{vec}(\bar{Y})$, $W_{A,1}$ are the first $p2^{-k+1}$ columns of $W_A$ and $\bar{X}_1$ are the first $p2^{-k+1}$ rows of $\bar{X}$. We have also denoted $\Sigma_{g_k} = \mathrm{diag}(Fg_k)$ and similarly for $\Sigma_{h_k}$. The matrix $A \in \mathbb{C}^{pN \times p2^{-k+2}}$ is made up of a subset of the columns from $(((I_2 \otimes FS)\bar{X}_1)^T \otimes W_{A,1}F^H)$ corresponding to the non-zero entries from $\mathrm{vec}\big( \begin{bmatrix} \Sigma_{g_k} & \Sigma_{h_k} \end{bmatrix} \big)$. Minimizing the quantity in (24) leads to a least squares problem where both $g_k$ and $h_k$ have a fixed non-zero support of known size $n$, which obeys $n \leq p2^{-m+1}$ so that the circulants of the last stage can be constructed.

Similarly to the union of circulants cases described before, some computational benefits arise when minimizing (24), i.e., when computing $\big((I_2 \otimes F^H) A^H A (I_2 \otimes F)\big) \backslash \big((I_2 \otimes F^H) A^H \bar{y}\big)$. For convenience we denote $Q = ((I_2 \otimes FS)\bar{X}_1)^T$ and $R = W_{A,1}F^H$, such that $A = Q \odot R$. Notice that $A^H A = \begin{bmatrix} D^{(1)} & (D^{(2)})^* \\ D^{(2)} & D^{(3)} \end{bmatrix}$, where the blocks are diagonal with entries $d_{ii}^{(1)} = \|q_i\|_2^2 \|r_i\|_2^2$, $d_{ii}^{(3)} = \|q_{p2^{-k+1}+i}\|_2^2 \|r_i\|_2^2$ and $d_{ii}^{(2)} = q_{p2^{-k+1}+i}^H q_i\, \|r_i\|_2^2$, where $q_i$ and $r_i$ are columns of $Q$ and $R$, respectively, $i = 1, \dots, p2^{-k+1}$. Therefore, $(I_2 \otimes F^H) A^H A (I_2 \otimes F)$ is a $2 \times 2$ block matrix whose blocks are real-valued circulant matrices (and the diagonal blocks are also symmetric). Also, because $A^H A$ is symmetric positive definite it allows a Cholesky factorization $LL^T$ where $L$ has non-zero entries only on its main diagonal and on the lower diagonal at offset $p2^{-k+1}$. A further computational benefit arises when $n \ll p$: we solve a least squares problem only in the $2n$ non-zero filter coefficients, i.e.,
$$ A(I_2 \otimes F) \begin{bmatrix} g_k \\ h_k \end{bmatrix} = A(I_2 \otimes F_{:,n}) \begin{bmatrix} \bar{g}_k \\ \bar{h}_k \end{bmatrix}, $$
with both $\bar{g}_k, \bar{h}_k \in \mathbb{R}^n$. Finally, notice that $(I_2 \otimes F_{:,n}^H) A^H A (I_2 \otimes F_{:,n})$ has Toeplitz blocks, like in the case of UConv–DLA–SU.

The linear transformation (6) has two major advantages: i) the computational complexity of matrix-vector multiplications $Wx$ with a fixed $x \in \mathbb{R}^p$ is controlled by the number of stages $m$ and the length of the filters $n$, instead of the $O(p^2)$ complexity of an unstructured matrix-vector multiplication, and ii) it allows for learning atoms that capture features from the data at different scales, i.e., atoms of different sparsity levels. The new proposed procedure, called the Wavelet-like Dictionary Learning Algorithm (W–DLA), is described in Algorithm 3.

Algorithm 3 – W–DLA.
Input: The dataset $Y \in \mathbb{R}^{p \times N}$ such that $2^m$ divides $p$ exactly, the number of stages $m \leq \log_2 p$, the size of the support of the filters denoted $n \leq p2^{-m+1}$ and the sparsity $s \leq p$.
Output: The wavelet-like dictionary $W \in \mathbb{R}^{p \times p}$ as in (6), the diagonal $D \in \mathbb{R}^{p \times p}$ such that $WD$ has unit $\ell_2$ norm columns and the sparse representations $X \in \mathbb{R}^{p \times N}$ such that $\|Y - WDX\|_F^2$ is reduced.
Initialization: set all stages $C_k^{(p2^{-k+1})} = I$, $k = 1, \dots, m$; compute the singular value decomposition of the dataset $Y = U\Sigma V^T$ and compute the sparse representations $X = T_s(U^T Y)$, i.e., project and keep the $s$ largest entries in magnitude for each element (column) of the dataset $Y$.
For $1, \dots, K$:
• Update dictionary: with all other components fixed, update only the $k$th non-trivial component of $W$, denoted $C_k^{(p2^{-k+1})}$ (by computing both $g_k$ and $h_k$ on the support of size $n$), for each $k = 1, \dots, m$, one at a time, by minimizing the least squares problem (24).
• Update $D$ such that $WD$ has unit $\ell_2$ norm columns and update the sparse representations $X = \mathrm{OMP}(WD, Y, s)$.

Remark 6 (Extending and constraining the wavelet-like structure).
Unlike wavelets, which use the same filters $g$ and $h$ (known as the low and high pass filters, respectively) at each stage $k$ of the transformation, we learn different filters $g_k$ and $h_k$. Also, the support of the filters (and their size) can be decided dynamically at every stage and the downsampling can be replaced with a general column selection matrix $S$. Note that if we allow full support then the least squares problem can be solved in the Fourier domain with the complex-valued unknowns $\tilde{g}_k = Fg_k$ and $\tilde{h}_k = Fh_k$. Furthermore, each stage in (5) applies only to the decomposition of the left-most (so-called low-frequency) components from the previous stage. In the spirit of optimal sub-band tree structuring (also known as wavelet packet decompositions) [44], we can propose a transformation $W = C_1^{(p)} \cdots C_{m-1}^{(p)} C_m^{(p)}$ factored into $m$ stages all of the form (4) (where again each stage has its own filters $g_k$ and $h_k$, which now can have support $n \leq p$) and use the same optimization procedure. Lastly, we could also propose a structured transformation similar to (4) but based on more than two filters, e.g., $C_k^{(p)} = \begin{bmatrix} G_k S & H_k S & J_k S \end{bmatrix} \in \mathbb{R}^{p \times p}$, for a new selection (downsampling by 3) matrix $S$. Heuristics to choose $m$, $n$ (maybe at each stage $k$, i.e., having $n_k$), the location of the $n$ non-zero entries in each filter, and the structure and sizes of $C_k$ and $S$ may be proposed to further improve the accuracy of the algorithm (or the trade-off between numerical efficiency and accuracy in terms of the representation error).

Orthonormal wavelets, i.e., in our case meaning that $W$ and all $C_k^{(p)}$ are orthonormal, are also extensively used in many applications. With these orthogonality constraints the objective function develops into a simpler form than (24): since $W_A$ is orthonormal and the lower block of $W_k^{(p)}$ is the identity,
$$ \| Y - W_A W_k^{(p)} W_B X \|_F^2 = \| W_A^T Y - W_k^{(p)} \bar{X} \|_F^2 = \| \bar{y} - (((I_2 \otimes FS)\bar{X}_1)^T \otimes I)\, \mathrm{vec}\big( \begin{bmatrix} \Sigma_{g_k} & \Sigma_{h_k} \end{bmatrix} \big) \|_2^2 + \text{const.}, \quad (25) $$
where we have now denoted $\bar{y} = \mathrm{vec}\big(F (W_A^T Y)_1\big)$, with $(W_A^T Y)_1$ the first $p2^{-k+1}$ rows of $W_A^T Y$, the constant term does not depend on the filters, and of course the Kronecker products are never explicitly built. We minimize this quadratic objective with the additional orthogonality constraint $g_k^T h_k = 0$ (in order to keep $C_k^{(p)}$ orthogonal). Minimizing quadratic functions under quadratic equality constraints has been studied in the past and numerical procedures are available [45]. This constraint ensures orthogonality between the columns of $G_k S$ and $H_k S$. To ensure orthogonality among the columns of $G_k S$ (and $H_k S$, respectively) we can explicitly add symmetry constraints to the filter coefficients, e.g., if $g_k$ has a non-zero support of size four, pairs of coefficients can be tied together with opposite signs (for instance $g_{k,1} = g_{k,4}$ and $g_{k,2} = -g_{k,3}$). Alternatively, both orthogonality constraints can be added, leading to a quadratic
optimization problem with two quadratic constraints [46]. ■

Fig. 1. Average kernel recovery rate as a function of the noise level for different shift-invariant dictionary learning methods. We compare against SI–K–SVD [20], SI–ILS–DLA [22], M–DLA [25] and UC–DLA [15]. The K–SVD approach serves as a baseline performance indicator since it is not explicitly developed to recover shift-invariant structures.

Fig. 2. Average frequency of utilization of the $n = 20$ atoms of the $L = 45$ circulants $C^{(\ell)}$. With perfect recovery, the $q = 3$ peaks should be $\frac{Ns}{Lq} \approx 59$.

IV. EXPERIMENTAL RESULTS
We now discuss numerical results that show how well the proposed methods extract shift-invariant structures from data. Since we are dealing with the dictionary learning problem, our goal is to build good representations of the datasets we consider. In our proposed algorithms, in the sparse recovery phase, we use the orthogonal matching pursuit (OMP) algorithm [47], but any other sparse approximation method could be chosen. Also, because the sparse approximation steps are numerically expensive (at least quadratic complexity in general, and applied to all $N \gg n$ data points) and repetitive (done at each iterative step of the alternating optimization algorithms), OMP is chosen for practical considerations, as numerically efficient implementations are available (for example [48]). Notice from the description of all the proposed algorithms that each dictionary update step necessarily decreases the objective function but, unfortunately, the overall algorithms may not be monotonically convergent to a local minimum since OMP is not guaranteed in general to always reduce the objective function. As such, in this section, we also provide some experimental results where we empirically observe the convergence of the proposed methods.

For datasets with a strong DC component the learned circulant dictionary might be $C \approx \frac{1}{\sqrt{n}} 1_{n \times n}$. Therefore, preprocessing the dataset $Y$ by removing the mean component is necessary, and we have $\sigma_1 = 0$ in (10) since $\tilde{y}_1 = 0_{N \times 1}$. This operation is assumed to be performed before the application of any algorithm developed in this paper.
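A small sketch of these two housekeeping steps (mean removal per training signal and OMP-based sparse coding), using scikit-learn's OMP as one possible implementation:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def preprocess(Y):
    # Remove the mean (DC component) of every column, so y~_1 = 0 in (10).
    return Y - Y.mean(axis=0, keepdims=True)

def sparse_code(D, Y, s):
    # s-sparse representations of the columns of Y in the dictionary D.
    return orthogonal_mp(D, Y, n_nonzero_coefs=s)
```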
Fig. 3. Average learning times for UC–DLA [15] and the proposed UCirc–DLA–SU over 100 random realizations, for sparsity levels $s \in \{4, 8, 12\}$, as a function of the number of bases $L$. For the same parameters, the running time of UConv–DLA–SU is several times higher and therefore not shown in this plot; this highlights the importance of solving the learning problem in the Fourier domain, when possible.
A. Synthetic experiments
We create a synthetic dataset generated from a fixed number of atoms and their shifts and then proceed to measure how well we recover them from the dictionary learning perspective. The experimental setup is as follows: generate $N = 2000$ signals of length $n = 20$ that are linear combinations (with sparsity $s = 4$) of $L = 45$ randomly generated kernel columns which are allowed to have only $q = 3$ circular shifts (out of the possible $n = 20$), i.e., $Y \in \mathbb{R}^{n \times N}$ where each column is $y_i = \sum_{\ell=1}^L \alpha_{i\ell} P^{q_{i\ell}} c_\ell + n_i$ for $i = 1, \dots, N$, with fixed $\|c_\ell\|_2 = 1$, $\|\alpha_i\|_0 = s$, where $\alpha_{i\ell} \in [-1, 1]$ and $q_{i\ell} \in \{0, \dots, q-1\}$ are randomly uniformly distributed and $n_i$ is a random Gaussian vector representing noise.

First, using the synthetic dataset, we show in Figure 1 how UCirc–DLA–SU outperforms previously proposed methods in the task of recovering the atoms used in creating the dataset. This shows the benefit (as compared to UC–DLA) of updating all the circulant components simultaneously at each step of the algorithm. We observe that UCirc–DLA–SU achieves the lower error in most cases. The typical counter-example is one where UC–DLA converges slower (in more iterations) to a slightly lower representation error, i.e., sub-optimal block calculations ultimately lead to a better final result. This observation is not surprising since both heuristic methods only approximately solve the overall original dictionary learning problem (with unknowns both $C$ and $X$). To show this, with the same synthetic dataset at noise level SNR = 30 dB, in Figure 2 we calculate how many times on average each atom in all circulant components (from all the $L = 45$) is used in the sparse representations. On average, UCirc–DLA–SU recovers the correct supports (in effect, the indices of the shifts used) more often than UC–DLA.

Figure 3 shows the learning times for the union of circulants algorithms, with blocks and with simultaneous updates, for a fixed number of $K = 100$ iterations. For this test, we created a synthetic dataset $Y$ of size $n = 64$ with $N = 8192$ data points and we vary the sparsity level $s$ and the number of bases in the union $L$. For $s \in \{8, 12\}$, in our experiments, UCirc–DLA–SU is always faster than UC–DLA [15], while for $s = 4$ the average speedup is smaller and for large $L$ there are cases where UCirc–DLA–SU is slower than UC–DLA. In principle, UCirc–DLA–SU should always be faster, but in practice the algorithm involves memory manipulations in order to build all the matrices for all the subproblems (15). This is the reason why the running time difference is not larger. Furthermore, for small $s$ the blocks used by UC–DLA are calculated rapidly (the calculations are similar to the matrix operations in Remark 5) because $X$ is very sparse, and the overall running time is, therefore, lower.
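For reproducibility of the setup described above, a possible generator is sketched below; the per-signal SNR scaling and the choice of one random shift per active kernel are our interpretation of the description (parameter names such as snr_db are ours).

```python
import numpy as np

# Synthetic data: N signals of length n, each an s-sparse combination of L
# unit-norm kernels restricted to q circular shifts, plus Gaussian noise.
def make_synthetic(n=20, N=2000, L=45, s=4, q=3, snr_db=30, seed=0):
    rng = np.random.default_rng(seed)
    kernels = rng.standard_normal((n, L))
    kernels /= np.linalg.norm(kernels, axis=0)          # ||c_l||_2 = 1
    Y = np.zeros((n, N))
    for i in range(N):
        used = rng.choice(L, size=s, replace=False)     # s active kernels
        for l in used:
            alpha = rng.uniform(-1, 1)                  # alpha_{il} in [-1, 1]
            shift = rng.integers(0, q)                  # q_{il} in {0, ..., q-1}
            Y[:, i] += alpha * np.roll(kernels[:, l], shift)
        noise = rng.standard_normal(n)
        noise *= np.linalg.norm(Y[:, i]) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        Y[:, i] += noise
    return Y, kernels
```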
B. Experiments on ECG data

Electrocardiography (ECG) signals [49] have many repetitive sub-structures that could be recovered by shift-invariant dictionary learning. Therefore, in this section, we use the proposed UConv–DLA–SU to find in ECG signals short (compact support) features that are repeated. We use the MIT-BIH arrhythmia database, from which we extract a normal sinus rhythm signal composed of equal length samples from five different patients, all sampled at 128 Hz. This signal is reshaped into a matrix of centered, non-overlapping sections of length $p = 64$, leading to the training dataset $Y \in \mathbb{R}^{64 \times N}$.

Because we are searching for sub-signals with limited support, we use UConv–DLA–SU with parameters $n = 12$, $s = 2$ and $L = 2$ to recover the shift-invariant structure. In Figure 4 we show the original ECG signal and its reconstruction in the union of convolutional dictionaries. Of course, the reconstruction is not perfect, but it is able to accurately capture the spikes in the data and remove some of the high-frequency features, i.e., the signal looks filtered (denoised). The second plot, Figure 5, shows the $L = 2$ atoms learned from the data, which capture the spiky nature of the training signal.

Fig. 4. Original ECG sample and its reconstruction by UConv–DLA–SU with support $n = 12$, sparsity $s = 4$ and $L = 2$ circulants. With these parameters, the approximation error (26) for the whole training dataset is small, showing that such data can indeed be well represented in a simple (in terms of small $L$) shift-invariant dictionary.

Fig. 5. The $L = 2$ kernel atoms of length $p = 64$ learned by UConv–DLA–SU with support $n = 12$ and sparsity $s = 4$.

C. Experiments on image data
The training data $Y$ that we consider are taken from popular test images from the image processing literature (pirate, peppers, boat, etc.). The test dataset $Y \in \mathbb{R}^{p \times N}$ consists of $8 \times 8$ non-overlapping image patches with their means removed. We consider $N = 12288$ and we have $p = 64$. To evaluate the learning algorithms, in this section we consider the relative representation error of the dataset $Y$ in the dictionary $D$, given the sparse representations $X$, as
$$ \epsilon = \| Y - DX \|_F^2\, \| Y \|_F^{-2} \ (\%). \quad (26)$$
We consider image data because there are well-known wavelet transforms that efficiently encode such data. We will use the filters of the Haar and Daubechies D4 wavelet transforms, i.e., with $n = 2$, $m = \log_2 p$,
$$ \bar{g}_k = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}, \quad \bar{h}_k = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \end{bmatrix}, $$
and with $n = 4$, $m = \log_2 p - 1$,
$$ \bar{g}_k = \begin{bmatrix} \tfrac{1+\sqrt{3}}{4\sqrt{2}} & \tfrac{3+\sqrt{3}}{4\sqrt{2}} & \tfrac{3-\sqrt{3}}{4\sqrt{2}} & \tfrac{1-\sqrt{3}}{4\sqrt{2}} \end{bmatrix}, \quad \bar{h}_k = \begin{bmatrix} \tfrac{1-\sqrt{3}}{4\sqrt{2}} & -\tfrac{3-\sqrt{3}}{4\sqrt{2}} & \tfrac{3+\sqrt{3}}{4\sqrt{2}} & -\tfrac{1+\sqrt{3}}{4\sqrt{2}} \end{bmatrix}, $$
respectively, for all $k$ (the standard coefficient values, up to the usual ordering and sign conventions). The filters are chosen such that the resulting $W$ is orthonormal.

First, we show in Figure 6 the experimental convergence of the proposed W–DLA. If we also impose an orthogonality constraint on $W$ then we can avoid OMP for the sparse representations: a simple projection operation guarantees optimal sparse representations. The figure shows that, when available, it is convenient to initialize W–DLA with well-known wavelet filters since the algorithm converges faster and to slightly lower representation errors. Still, the differences are not significant, and wavelets do not exist for every choice of $n$ and $m$.

Fig. 6. A typical convergence of the proposed W–DLA algorithm with the number of iterations. Since $n = 2$, $m = \log_2 p$, we also have the opportunity to initialize the filters with the Haar values.

Then, in Figure 7 we show how the representation error varies with the sparsity level $s$. We also run W–DLA initialized with the well-known wavelet filter coefficients, Haar and D4, respectively. W–DLA is always able to improve the representation performance, even when starting with the wavelet coefficients. In the Haar case the improvement is small
due to the small number of free filter parameters to learn, i.e., only 24: $m = \log_2 p = 6$ stages, each with 2 filters and each filter with 2 coefficients. In the D4 case, the representation error is improved significantly. Both figures show that, when available, wavelet coefficients provide an excellent initialization, even slightly better than the proposed W–DLA (also confirmed in Figure 6). Since these wavelets are not available for all choices of $n$ and $m$, the purpose of this plot is to show that the proposed initialization provides very good results in general.

Finally, in Figure 8 we show the effect that the parameters $n$ and $m$ have on the representation error. For reference, we show the representation error of C–DLA, which has $p = 64$ free parameters to learn. The performance of this dictionary is approximately matched by W–DLA with $n = 4$ and $m = \log_2 p - 1$, which has 40 free parameters to learn: $m = 5$ stages, each with 2 filters of support 4. Notice that the representation error plateaus after $n = 8$. In general, dictionaries built with W–DLA have $2nm$ degrees of freedom. We also show a version of W–DLA where we keep $m = 1$ and vary only $n$, in which case, of course, the representation error decreases. Note that in this case each run of W–DLA is initialized with a random set of coefficients. To show monotonic convergence it would help to initialize the filters of size $n$ with those previously computed of support $n - 1$.

Fig. 7. Representation errors as a function of the sparsity level $s$ for the wavelet and W–DLA transformations. The top figure has parameters $n = 2$, $m = \log_2 p$ and therefore allows a Haar initialization, while the bottom figure has parameters $n = 4$, $m = \log_2 p - 1$ and allows a D4 initialization.

Fig. 8. Representation error for W–DLA as a function of the size of the filter support $n$. For reference we show C–DLA [15], while W–DLA runs twice: once with fixed $m = 1$ and once with the largest $m$ such that $n \leq p2^{-m+1}$ is obeyed.
V. CONCLUSIONS
In this paper, we propose several algorithms that learn, under different constraints, shift-invariant structures from data. Our work is based on using circulant matrices and finding numerically efficient closed-form solutions to the dictionary update steps, via least squares. We analyze the behavior of the algorithms on various data sources, and we compare with and show that we outperform previously proposed algorithms from the literature.
REFERENCES

[1] R. M. Gray, "Toeplitz and circulant matrices: a review," Found. Trends Commun. Inf. Theory, vol. 2, no. 3, pp. 155–239, 2006.
[2] A. Hero, H. Messer, J. Goldberg, D. Thomson, M. Amin, G. Giannakis, A. Swami, J. Tugnait, A. Nehorai, A. Swindlehurst, J.-F. Cardoso, L. Tong, and J. Krolik, "Highlights of statistical signal and array processing," IEEE Signal Processing Magazine, vol. 15, no. 5, pp. 21–64, 1998.
[3] H. Hassanieh, F. Adib, D. Katabi, and P. Indyk, "Faster GPS via the sparse Fourier transform," in Proc. ACM 8th Annual International Conference on Mobile Computing and Networking (Mobicom), 2012, pp. 353–364.
[4] G. C. Carter, "Coherence and time delay estimation," Proceedings of the IEEE, vol. 75, no. 2, pp. 236–255, 1987.
[5] H. Ohlsson, Y. C. Eldar, A. Y. Yang, and S. S. Sastry, "Compressive shift retrieval," IEEE Trans. Sig. Proc., vol. 62, no. 16, pp. 4105–4113, 2014.
[6] B. Lucas and T. Kanade, "An iterative image registration technique with an application in stereo vision," in Int. J. Conf. Artificial Intelligence, 1981, pp. 674–679.
[7] S. Jain and J. Haupt, "Convolutional approximations to linear dimensionality reduction operators," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2017, pp. 5885–5889.
[8] O. Chabiron, F. Malgouyres, H. Wendt, and J.-Y. Tourneret, "Optimization of a fast transform structured as a convolutional tree," working paper or preprint, 2016. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01258514
[9] K. Ye and L.-H. Lim, "Every matrix is a product of Toeplitz matrices," Found. of Comp. Math., vol. 16, no. 3, pp. 577–598, 2016.
[10] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), 2013, pp. 391–398.
[11] B. Kong and C. C. Fowlkes, "Fast convolutional sparse coding (FCSC)," Tech. Rep., University of California, Irvine, 2014.
[12] C. Garcia-Cardona and B. Wohlberg, "Convolutional dictionary learning: A comparative review and new algorithms," IEEE Trans. on Comp. Imag., vol. 4, no. 3, pp. 366–381, 2018.
[13] V. Papyan, J. Sulam, and M. Elad, "Working locally thinking globally: Theoretical guarantees for convolutional sparse coding," IEEE Trans. Sig. Process., vol. 65, no. 21, pp. 5687–5701, 2017.
[14] G. Pope, C. Aubel, and C. Studer, "Learning phase-invariant dictionaries," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2013, pp. 5979–5983.
[15] C. Rusu, B. Dumitrescu, and S. A. Tsaftaris, "Explicit shift-invariant dictionary learning," IEEE Sig. Proc. Let., vol. 21, no. 1, pp. 6–9, 2014.
[16] C. Rusu, R. Morisi, D. Boschetto, R. Dharmakumar, and S. A. Tsaftaris, "Synthetic generation of myocardial blood-oxygen-level-dependent MRI time series via structural sparse decomposition modeling," IEEE Trans. Med. Imaging, vol. 33, no. 7, pp. 1422–1433, 2014.
[17] B. Mailhe, S. Lesage, R. Gribonval, F. Bimbot, and P. Vandergheynst, "Shift-invariant dictionary learning for sparse representations: extending K-SVD," in European Signal Processing Conference, 2008.
[18] T. Blumensath and M. Davies, "Sparse and shift-invariant representations of music," IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 1, pp. 50–57, 2006.
[19] B. Wohlberg, "Efficient algorithms for convolutional sparse representations," IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, 2016.
[20] M. Aharon, "Overcomplete dictionaries for sparse representation of signals," Ph.D. thesis, Technion - Israel Institute of Technology, 2006.
[21] J. J. Thiagarajan, K. N. Ramamurthy, and A. Spanias, "Shift-invariant sparse representation of images using learned dictionaries," in IEEE Machine Learning for Signal Processing, 2008.
[22] K. Skretting, J. Husoy, and S. Aase, "General design algorithm for sparse frame expansions," Signal Process., vol. 86, pp. 117–126, 2006.
[23] P. Jost, P. Vandergheynst, S. Lesage, and R. Gribonval, "MoTIF: An efficient algorithm for learning translation invariant dictionaries," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2006.
[24] G. Zheng, Y. Yang, and J. Carbonell, "Efficient shift-invariant dictionary learning," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: ACM, 2016, pp. 2095–2104.
[25] Q. Barthelemy, A. Larue, A. Mayoue, D. Mercier, and J. I. Mars, "Shift & 2D rotation invariant sparse coding for multivariate signals," IEEE Trans. Sig. Process., vol. 60, no. 4, pp. 1597–1611, 2012.
[26] F. Barzideh, K. Skretting, and K. Engan, "Imposing shift-invariance using flexible structure dictionary learning (FSDL)," Digital Signal Processing, vol. 69, pp. 162–173, 2017.
[27] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, "Shift-invariant sparse coding for audio classification," in Uncertainty in Artificial Intelligence, 2007, pp. 149–158.
[28] F. Heide, W. Heidrich, and G. Wetzstein, "Fast and flexible convolutional sparse coding," in IEEE Computer Vision and Pattern Recognition, 2015, pp. 5135–5143.
[29] H. Bristow and S. Lucey, "Optimization methods for convolutional sparse coding," Technical Report, 2014.
[30] P. Vardan, Y. Romano, and M. Elad, "Convolutional neural networks analyzed via convolutional sparse coding," arXiv:1607.08194, 2016.
[31] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[32] G. Strang and T. Nguyen, Wavelets and Filter Banks. Wellesley, MA: Wellesley-Cambridge Press, 1997.
[33] B. Dumitrescu and P. Irofti, Dictionary Learning Algorithms and Applications. Springer, 2018.
[34] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 689–696.
[35] Y. Naderahmadian, S. Beheshti, and M. A. Tinati, "Correlation based online dictionary learning algorithm," IEEE Transactions on Signal Processing, vol. 64, no. 3, pp. 592–602, 2016.
[36] F. Giovanneschi, K. V. Mishra, M. A. González-Huici, Y. C. Eldar, and J. H. G. Ender, "Dictionary learning for adaptive GPR target classification," CoRR, vol. abs/1806.04599, 2018.
[37] A. Bottcher, "Orthogonal symmetric Toeplitz matrices," Complex Analysis and Operator Theory, vol. 2, no. 2, pp. 285–298, 2008.
[38] M. Huhtanen and A. Perämäki, "Factoring matrices into the product of circulant and diagonal matrices," J. of Fourier Anal. and App., vol. 21, no. 5, pp. 1018–1033, 2015.
[39] C. Rusu and B. Dumitrescu, "Stagewise K-SVD to design efficient dictionaries for sparse representations," IEEE Signal Processing Letters, vol. 19, no. 10, pp. 631–634, 2012.
[40] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1996.
[41] B. Wohlberg, "Efficient algorithms for convolutional sparse representations," IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, 2016.
[42] H. G. Feichtinger and T. Strohmer, Advances in Gabor Analysis. Birkhauser, 2003.
[43] G. Pfander and H. Rauhut, "Sparsity in time-frequency representations," J. Fourier Anal. Appl., vol. 16, pp. 233–260, 2010.
[44] R. R. Coifman and M. V. Wickerhauser, "Entropy-based algorithms for best basis selection," IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 713–718, 1992.
[45] H. Hmam, "Quadratic optimisation with one quadratic equality constraint," Electronic Warfare and Radar Division, DSTO Defence Science and Technology Organisation, Australia, Report DSTO-TR-2416, 2010.
[46] S. M. Guu and Y. C. Liou, "On a quadratic optimization problem with equality constraints," Journal of Optimization Theory and Applications, vol. 98, no. 3, pp. 733–741, 1998.
[47] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, "Orthogonal matching pursuit: recursive function approximation with application to wavelet decomposition," in Asilomar Conf. on Signals, Systems and Comput., 1993, pp. 40–44.
[48] R. Rubinstein, M. Zibulevsky, and M. Elad, "Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit," CS Technion, 2008.
[49] G. B. Moody and R. G. Mark, "The impact of the MIT-BIH arrhythmia database."