On-the-fly Approximation of Multivariate Total Variation Minimization
Jordan Frecon, Nelly Pustelnik, Patrice Abry, Laurent Condat
Abstract

In the context of change-point detection, addressed by Total Variation minimization strategies, an efficient on-the-fly algorithm has been designed leading to exact solutions for univariate data. In this contribution, an extension of such an on-the-fly strategy to multivariate data is investigated. The proposed algorithm relies on the local validation of the Karush-Kuhn-Tucker conditions on the dual problem. Showing that the non-local nature of the multivariate setting precludes obtaining an exact on-the-fly solution, we devise an on-the-fly algorithm delivering an approximate solution, whose quality is controlled by a practitioner-tunable parameter acting as a trade-off between quality and computational cost. Performance assessment shows that high-quality solutions are obtained on-the-fly while benefiting from computational costs several orders of magnitude lower than standard iterative procedures. The proposed algorithm thus provides practitioners with an efficient on-the-fly multivariate change-point detection procedure.

∗ J. Frecon (corresponding author), N. Pustelnik and P. Abry are with the Physics Department of the ENS Lyon, France (email: fi[email protected]). † L. Condat is with the GIPSA-lab, University of Grenoble, France (email: [email protected]).

Introduction
Total Variation (TV) has been involved in a variety of signal processing problems, such as nonparametric function estimation [13, 26] or signal denoising [12, 19, 32]. The first contributions on this subject were formulated within the framework of taut string theory [13, 26], while the term TV was first introduced in image restoration [9, 30]. The equivalence between both formalisms has been clarified in [17]. Formally, the univariate TV framework aims at finding a piece-wise constant estimate $\widehat{x} \in \mathbb{R}^N$ of a noisy univariate signal $y \in \mathbb{R}^N$ by solving the following non-smooth convex optimization problem,

$$\widehat{x} = \arg\min_{x=(x_k)_{1\le k\le N}} \frac{1}{2}\|x - y\|_2^2 + \lambda \sum_{k=1}^{N-1} |x_{k+1} - x_k|, \qquad (1)$$

where $\lambda > 0$ denotes the regularization parameter.

Related works: recent developments and issues.
It is well known and documented that the unique solution of the optimization problem (1) can be reached by iterative fixed-point algorithms. On the one hand, solving this problem in the primal space requires dealing with the non-differentiability of the $\ell_1$-norm, which is either handled by adding a small additional smoothing parameter [33] or by considering proximal algorithms [?, 1–3, 5, 10–12, 27, 29, 31]. On the other hand, one can make use of the Fenchel-Rockafellar dual formulation [8, 31] or Lagrangian duality [23, 34], which can be solved with quadratic programming techniques [4, 31]. Both primal and dual solutions suffer from high computational loads, stemming from their iterative nature. To address the computational load issue, alternative procedures were investigated, such as the taut string algorithm of common use in the statistics literature [26]. Very recently, elaborating on the dual formulation and thoroughly analysing the related Karush-Kuhn-Tucker (KKT) conditions, a fast algorithm has been proposed by one of the authors in [12] to solve the univariate optimization problem (1). Compared to the taut string strategy, it avoids running sums that potentially lead to overflow values and thus to numerical errors. Another specificity concerns its on-the-fly behavior, which does not require the observation of the whole time sequence before a solution can be obtained. On-the-fly algorithms might be of critical interest for real-time monitoring, such as in medical applications [7, 15].

Along another line, the extension of the univariate optimization problem (1) to multivariate purposes has been recently investigated in [6, 20, 32]. The multivariate extension arises very naturally in numerous contexts, such as biomedical applications, for which the purpose is to extract simultaneous change points from multivariate data, e.g., EEG data [16]. It also encompasses denoising of complex-valued data, which can naturally be interpreted as bivariate data.
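As a point of reference for problem (1), one of the classical iterative schemes alluded to above can be sketched as projected gradient descent on its Fenchel-Rockafellar dual. The sketch below is only an illustrative baseline, not the fast on-the-fly algorithm of [12], and the function name is ours:

```python
import numpy as np

def tv_denoise_dual(y, lam, n_iter=10000, tau=0.25):
    """Univariate TV denoising, problem (1):
        min_x 0.5*||x - y||^2 + lam * sum_k |x_{k+1} - x_k|,
    solved by projected gradient on the dual variable u in R^{N-1},
    with the primal recovered through x = y + L* u."""
    y = np.asarray(y, dtype=float)
    u = np.zeros(len(y) - 1)
    for _ in range(n_iter):
        # primal-dual relation: x = y + L* u (adjoint of the difference operator)
        x = y + np.concatenate(([-u[0]], u[:-1] - u[1:], [u[-1]]))
        # gradient of the dual objective is L x = diff(x); tau <= 1/||L||^2 = 1/4
        u = np.clip(u - tau * np.diff(x), -lam, lam)  # gradient step + box projection
    return y + np.concatenate(([-u[0]], u[:-1] - u[1:], [u[-1]]))
```

For instance, for $y = [0, 0, 10, 10]$ and $\lambda = 1$, the minimizer can be checked by hand (each half-mean is pulled toward the other by $\lambda/2$), giving $\widehat{x} = [0.5, 0.5, 9.5, 9.5]$, which the iteration recovers.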
Multivariate optimization of this form is known as the group fused Lasso in the statistics literature [24, 35]. From a Bayesian point of view, elegant solutions have been proposed in [14, 28], and efficient iterative strategies have recently been proposed in [6, 21].

Multivariate on-the-fly TV. In this context, the present contribution elaborates on [12] to propose an on-the-fly algorithm solving the multivariate extension of (1). In Section 2, the group fused Lasso problem is first detailed. It is then illustrated that the multivariate procedure has a non-local behavior, as opposed to the local nature of the univariate problem (1). Consequently, any on-the-fly algorithm solving the multivariate minimization problem will only lead to an approximate solution. The KKT conditions resulting from the dual formulation of the multivariate problem are specified in Section 3. From such conditions, a fast and on-the-fly, yet approximate, algorithm is derived in Section 4. The performance in terms of achieved solution and computational gain is presented in Section 5. A video demonstrating the on-the-fly behavior of the algorithm as well as a Matlab toolbox are available at http://perso.ens-lyon.fr/jordan.frecon .

Notations.
Let $u = (u_{m,k})_{1\le m\le M,\,1\le k\le N} \in \mathbb{R}^{M\times N}$ denote a multivariate signal where, for every $m \in \{1,\ldots,M\}$, $u_m = (u_{m,k})_{1\le k\le N} \in \mathbb{R}^N$ stands for the $m$-th component, while the $k$-th values will be shortened as $u_k = (u_{m,k})_{1\le m\le M} \in \mathbb{R}^M$. For every $k \in \{1,\ldots,N\}$, use will also be made of the following functions: $\mathrm{abs}(u_k) = (|u_{m,k}|)_{1\le m\le M} \in \mathbb{R}^M$ and $\mathrm{sgn}(u_k) = (\mathrm{sgn}(u_{m,k}))_{1\le m\le M} \in \mathbb{R}^M$. We denote by $y$ the multivariate signal of interest.

A multivariate extension of (1) reads:

$$\widehat{x} = \arg\min_{x=(x_1,\ldots,x_M)} \frac{1}{2}\sum_{m=1}^{M} \|x_m - y_m\|_2^2 + \lambda \sum_{k=1}^{N-1} \sqrt{\sum_{m=1}^{M} |(L x_m)_k|^2}, \qquad (2)$$

where $\lambda > 0$ and $L \in \mathbb{R}^{(N-1)\times N}$ denotes the first order difference operator, that is, for $m \in \{1,\ldots,M\}$ and $k \in \{1,\ldots,N-1\}$,

$$(L x_m)_k = x_{m,k+1} - x_{m,k}. \qquad (3)$$

Despite formal similarity, there is however a fundamental difference in nature between the univariate ($M = 1$) and multivariate ($M > 1$) cases: the former is intrinsically local [12, 25] while the latter is non-local.¹ To make explicit such a notion, we have designed the following experiment, whose results are illustrated in Fig. 1. The results associated to the univariate (resp. bivariate) case are presented on the right plots (resp. left plots). A univariate signal $y \in \mathbb{R}^N$ with $N = 180$, consisting of the additive sum of a piece-wise constant signal and white Gaussian noise (in gray, in Fig. 1, right top plot), is considered first. The solution of the minimization problem (1) is displayed in solid red lines in Fig. 1. Also, we search for the solution of the minimization problem (1) applied to two partitions of $y$, obtained by splitting it in half, i.e., $y^- = (y_k)_{1\le k\le N/2}$ and $y^+ = (y_k)_{N/2 < k\le N}$. The solutions $x^-$ and $x^+$ of (1), respectively associated to $y^-$ and $y^+$, are concatenated and displayed with dashed blue lines in Fig. 1. There is strictly no difference between $x$ and the concatenation of $x^-$ and $x^+$, as reported in Fig. 1 (bottom right plot), except for the segment that contains the concatenation point. The difference around the concatenation point is expected, as $x$ makes use of an information (the continuity between $y^-$ and $y^+$) that is not available to compute $x^-$ and $x^+$. The fact that there is no difference elsewhere shows the local nature of the univariate solution to Problem (1).

¹ In this article, we denote a problem as local if the solution at a given location does not depend on the signal located earlier (later) than the previous (next) change-point.

Figure 1: Non-local vs. local nature. Left: bivariate TV (upper plots: first component, lower plots: second component). Right: univariate TV. Observations $y$ (gray), solution $\widehat{x}$ (red), concatenation of solutions $\widehat{x}^-$ and $\widehat{x}^+$ (dashed blue).

This experiment is now repeated for $M = 2$ (as the simplest representative of $M > 1$). A bivariate signal $y = (y_1, y_2) \in \mathbb{R}^{2\times N}$ with $N = 180$, consisting of the additive sum of piece-wise constant signals and white Gaussian noises (in gray, in Fig. 1, left plots, 1st and 3rd lines), is considered. Two partitions, obtained by splitting in half, $y^- = (y_{1,k}, y_{2,k})_{1\le k\le N/2}$ and $y^+ = (y_{1,k}, y_{2,k})_{N/2 < k\le N}$, are also considered. The corresponding solutions of (2), applied to $y$, $y^-$, $y^+$, labeled $\widehat{x}$, $\widehat{x}^-$ and $\widehat{x}^+$, are obtained by means of the primal-dual algorithm proposed in [?] with $\lambda = 20$. Solutions $\widehat{x}^-$ and $\widehat{x}^+$ of (2), respectively associated to $y^-$ and $y^+$, are concatenated and displayed with dashed blue lines in Fig. 1, while $\widehat{x}$ is shown in red. Contrary to the case $M = 1$, the differences between $\widehat{x}$ and the concatenation of $\widehat{x}^-$ and $\widehat{x}^+$, shown in black in the bottom plots, differ unambiguously from 0 over the entire support of $y$, clearly showing the non-local nature of $\widehat{x}$ when $M > 1$.

In the univariate case, the local nature of the solution permits the design of an efficient taut string algorithm, which consists in finding the string of minimal length (i.e., the taut string) that holds in the tube of radius $\lambda$ around the antiderivative of $y$. The solution $\widehat{x}$ of (1) is then obtained by computing the derivative of the taut string. An efficient strategy has been proposed in [13] in order to straightforwardly compute $\widehat{x}$ by determining the points of contact between the taut string and the tube. Even though this approach can be generalized to multivariate signals, the detection of points of contact additionally requires the angle of contact between the taut string and the tube.
However, this information is non-local, and thus the on-the-fly minimization problem results in a challenging contact problem which cannot be solved locally. This interpretation will be further discussed in Section 3.

The non-local nature of the multivariate ($M > 1$) Problem (2) implies that one cannot expect to find an exact multivariate on-the-fly algorithm. Therefore, in the present work, we will derive an approximate on-the-fly algorithm that provides a good-quality approximation of the exact solution to Problem (2). A control parameter $|\mathcal{Q}|$, defined in Section 3, will govern the trade-off between the quality of the approximation and the computational cost.
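The split-and-concatenate experiment of Fig. 1 can be reproduced with any offline solver of (2); a minimal one is projected gradient on the dual, where each dual column $u_k$ is projected onto the $\ell_2$-ball of radius $\lambda$. This is only an illustrative reference solver under our own naming, not the on-the-fly algorithm proposed in this paper:

```python
import numpy as np

def mtv_denoise_dual(Y, lam, n_iter=10000, tau=0.25):
    """Multivariate TV (group fused lasso), problem (2):
        min_x 0.5*sum_m ||x_m - y_m||^2 + lam*sum_k ||x_{.,k+1} - x_{.,k}||_2,
    via projected gradient on the dual U in R^{M x (N-1)}."""
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    U = np.zeros((Y.shape[0], Y.shape[1] - 1))
    for _ in range(n_iter):
        # primal-dual relation: x_m = y_m + L* u_m (adjoint of the difference operator)
        X = Y + np.concatenate((-U[:, :1], U[:, :-1] - U[:, 1:], U[:, -1:]), axis=1)
        U = U - tau * np.diff(X, axis=1)          # gradient step, tau <= 1/||L||^2 = 1/4
        norms = np.maximum(np.linalg.norm(U, axis=0), lam)
        U = U * (lam / norms)                     # project each column onto the l2-ball of radius lam
    return Y + np.concatenate((-U[:, :1], U[:, :-1] - U[:, 1:], U[:, -1:]), axis=1)
```

Running it on a bivariate $y$, then on each half of $y$, and concatenating, generally differs from the full solution over the whole support, unlike the $M = 1$ case. For $M = 1$ the group penalty reduces to univariate TV; e.g. $y = [0, 0, 10, 10]$ with $\lambda = 1$ yields $\widehat{x} = [0.5, 0.5, 9.5, 9.5]$.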
Multivariate Total Variation Minimization
The Fenchel-Rockafellar dual formulation of (2) reads²:

$$\underset{u \in \mathbb{R}^{M\times(N-1)}}{\text{minimize}}\ \frac{1}{2}\sum_{m=1}^{M} \|y_m + L^* u_m\|_2^2 \quad \text{subject to} \quad (\forall k \in \{1,\ldots,N-1\})\ \|u_k\| \le \lambda, \qquad (4)$$

where, for every $m \in \{1,\ldots,M\}$ and $k \in \{2,\ldots,N-1\}$,

$$(L^* \widehat{u}_m)_k = \widehat{u}_{m,k-1} - \widehat{u}_{m,k} \qquad (5)$$

and

$$(L^* \widehat{u}_m)_1 = -\widehat{u}_{m,1}, \qquad (L^* \widehat{u}_m)_N = \widehat{u}_{m,N-1}. \qquad (6)$$

The optimal solutions $\widehat{u} \in \mathbb{R}^{M\times(N-1)}$ and $\widehat{x} \in \mathbb{R}^{M\times N}$ of the dual problem and of the primal problem, respectively, are related by

$$\begin{cases} (\forall m \in \{1,\ldots,M\}) & \widehat{x}_m = y_m + L^* \widehat{u}_m, \\ (\forall k \in \{1,\ldots,N-1\}) & \widehat{u}_k \in -\lambda\, \partial\|\cdot\|(\widehat{x}_{k+1} - \widehat{x}_k). \end{cases} \qquad (7)$$

From (7), we directly obtain the following necessary and sufficient conditions.

Proposition 3.1 The solutions of the primal problem (2) and the dual problem (4) satisfy, for every $m \in \{1,\ldots,M\}$,

$$\widehat{x}_m = y_m + L^* \widehat{u}_m, \qquad (8)$$

and, for every $k \in \{1,\ldots,N-1\}$,

$$\begin{cases} \text{if } \widehat{x}_k = \widehat{x}_{k+1} & \text{then } \|\widehat{u}_k\| \le \lambda, \\ \text{if } \widehat{x}_k \ne \widehat{x}_{k+1} & \text{then } \widehat{u}_k = -\lambda\, \dfrac{\widehat{x}_{k+1} - \widehat{x}_k}{\|\widehat{x}_{k+1} - \widehat{x}_k\|}. \end{cases} \qquad (9)$$

The first condition corresponds to the configuration where every component keeps the same value from location $k$ to $k+1$. This configuration is illustrated in the bivariate case ($M = 2$) in Fig. 2 (left plot). The second condition models situations where some components of $\widehat{x}$ admit change points between locations $k$ and $k+1$. An interesting configuration is that of non-simultaneous change points, as illustrated in Fig. 2 (right plot). In the presence of noise, this second situation is rarely encountered. Thus, in the sequel, we will only consider simultaneous change points.

Remark 3.2
Proposition 3.1 for $M = 1$ leads to the usual KKT conditions associated with the minimization problem (1):

$$\begin{cases} \text{if } \widehat{x}_k > \widehat{x}_{k+1} & \text{then } \widehat{u}_k = +\lambda, \\ \text{if } \widehat{x}_k < \widehat{x}_{k+1} & \text{then } \widehat{u}_k = -\lambda, \\ \text{if } \widehat{x}_k = \widehat{x}_{k+1} & \text{then } \widehat{u}_k \in [-\lambda, +\lambda]. \end{cases} \qquad (10)$$

The on-the-fly univariate TV algorithm proposed in [12] is derived from Conditions (10).

² Note that the usual dual formulation and the resulting stationarity conditions would involve $u$ rather than $-u$. The choice made in this article enables us to be consistent with the results obtained in [12] for the univariate case.

Figure 2: Comparing joint vs disjoint changes in the dual space. Left: location $k$ is suitable for a joint negative amplitude change on both components. Right: configuration suitable for introducing a negative amplitude change at $k$ on the second component only.

Contrary to Conditions (10), the multivariate conditions derived in Proposition 3.1 are not directly usable in practice to devise an on-the-fly algorithm, because $\widehat{x}_{k+1} - \widehat{x}_k$ is a priori unknown at instant $k$. Therefore, we propose to rewrite the second condition in (9) by means of auxiliary variables $(\widehat{z}_k)_{1\le k\le N-1}$ such that

$$\begin{cases} \text{if } \widehat{x}_k = \widehat{x}_{k+1} & \text{then } \|\widehat{u}_k\| \le \lambda, \\ \text{if } \widehat{x}_k \ne \widehat{x}_{k+1} & \text{then } \widehat{u}_k = -\operatorname{sign}(\widehat{x}_{k+1} - \widehat{x}_k) \circ \widehat{z}_k, \end{cases} \qquad (11)$$

with $\widehat{z}_k = \lambda\, \dfrac{\operatorname{abs}(\widehat{x}_{k+1} - \widehat{x}_k)}{\|\widehat{x}_{k+1} - \widehat{x}_k\|}$ and where $\circ$ denotes the Hadamard product. Then, Proposition 3.1 can be reformulated component-wise as follows.

Proposition 3.3
The solutions of the primal problem (2) and of the dual problem (4) satisfy the following necessary and sufficient conditions. There exist nonnegative auxiliary variables $(\widehat{z}_k)_{1\le k\le N-1}$ such that, for every $m \in \{1,\ldots,M\}$ and $k \in \{1,\ldots,N-1\}$,

$$\begin{cases} \text{if } \widehat{x}_{m,k} > \widehat{x}_{m,k+1} & \text{then } \widehat{u}_{m,k} = +\widehat{z}_{m,k}, \\ \text{if } \widehat{x}_{m,k} < \widehat{x}_{m,k+1} & \text{then } \widehat{u}_{m,k} = -\widehat{z}_{m,k}, \\ \text{if } \widehat{x}_{m,k} = \widehat{x}_{m,k+1} & \text{then } \widehat{u}_{m,k} \in [-\widehat{z}_{m,k}, +\widehat{z}_{m,k}], \end{cases} \qquad (12)$$

with $\|\widehat{z}_k\| = \lambda$ and $\widehat{x}_m = y_m + L^* \widehat{u}_m$.

Comparing Eqs. (10) and (12) highlights the similarity between the necessary conditions of the univariate and multivariate minimization problems: conditions involving $\lambda$ in the univariate case involve the auxiliary vector $\widehat{z}$ in the multivariate one. The fact that $\widehat{z}$ differs for each pair $(m, k)$ can be interpreted, in taut string procedures, as the fact that the point of contact with the taut string may vary on the tube of radius $\lambda$. This significantly increases the difficulty of deriving an on-the-fly algorithm.

Approximate solution

If we first assume that $\widehat{z}$ is known and such that, for every $k \in \{1,\ldots,N-1\}$, $\|\widehat{z}_k\| = \lambda$, the primal problem associated with Conditions (12) reads

$$\min_{x} \sum_{m=1}^{M} \left( \frac{1}{2}\|y_m - x_m\|_2^2 + \sum_{k=1}^{N-1} \widehat{z}_{m,k}\, |(L x_m)_k| \right) \qquad (13)$$

and can be interpreted as $M$ univariate minimization problems having time-varying regularization parameters $(\widehat{z}_m)_{1\le m\le M}$.

The proposed approximation consists in restricting the estimation of $\widehat{z}$ to a predefined set $\mathcal{Q} = \{\zeta^{(1)}, \ldots, \zeta^{(|\mathcal{Q}|)}\}$ chosen such that, for every $q \in \{1,\ldots,|\mathcal{Q}|\}$, $\zeta^{(q)} = (\zeta^{(q)}_m)_{1\le m\le M} \in \mathbb{R}^M$ satisfies $\|\zeta^{(q)}\| = \lambda$.

The most naive strategy would consist in solving $M$ univariate minimization problems for each of the $|\mathcal{Q}|$ candidate values of $\widehat{z}$, i.e., finding, for every $m \in \{1,\ldots,M\}$ and $q \in \{1,\ldots,|\mathcal{Q}|\}$,

$$\widehat{x}^{(q)}_m = \arg\min_{x_m} \frac{1}{2}\|y_m - x_m\|_2^2 + \zeta^{(q)}_m \|L x_m\|_1 \qquad (14)$$

and then devising a method to pick the solution amongst the $|\mathcal{Q}|$ candidates, for instance the one that maximizes some quality criterion $f$, i.e.,

$$\widehat{x} = \widehat{x}^{(q^*)} \quad \text{with} \quad q^* = \arg\max_{1\le q\le |\mathcal{Q}|} f(\widehat{x}^{(q)}). \qquad (15)$$

Although it benefits from parallel on-the-fly implementations, this strategy would correspond to a constant estimate $\widetilde{z} = \zeta^{(q^*)}$ over the $N$ samples. Therefore, changes in the mean would be processed independently on all components, and group-sparsity would not be enforced. In order to benefit from an on-the-fly implementation and to enforce group-sparsity, we propose an algorithmic solution based on a piece-wise constant estimator of $\widehat{z}$, detailed in the next section.

In the following, we first extend the on-the-fly algorithm proposed in [12] to the multivariate case, with $\widehat{z}$ assumed to be known a priori. This strong assumption, unrealistic in practice, permits a clear description of the behaviour of the multivariate on-the-fly algorithm. Then, we will focus on the question of the automated and on-the-fly estimation of $\widehat{z}$ taking its values in $\mathcal{Q}$, which consequently introduces a parameter $|\mathcal{Q}|$ controlling the quality of the approximation. The main steps of the on-the-fly algorithm are summarized in Algorithm 1. It is based on the range control of both unknown primal and dual solutions $\widehat{x}$ and $\widehat{u}$ by lower and upper bounds updated with the incoming data stream. The design of Algorithm 1 results in specifying Rule 1 and Rule 2, allowing respectively to detect a change point and to find suitable change-point locations according to Proposition 3.3.

Algorithm 1: On-the-fly scheme for multivariate TV
Data: Multivariate signal $y = (y_1,\ldots,y_M) \in \mathbb{R}^{M\times N}$; regularization parameter $\lambda > 0$; starting location $k_0 = 1$.
while $k < N$ do
    Set $k \leftarrow k_0$; initialize primal/dual bounds
    while Rule 1 is satisfied do
        Set $k \leftarrow k + 1$
        for $m \leftarrow 1$ to $M$ do
            Update primal/dual bounds
            if Rule 2 is not satisfied then
                Revise the update of primal/dual bounds
    Estimate the change point $k_{\mathrm{rupt}}$
    Estimate $(\widehat{x}_j)_{k_0 \le j \le k_{\mathrm{rupt}}}$
    Set $k_0 \leftarrow k_{\mathrm{rupt}} + 1$
Result: Solution $\widehat{x}_{\mathrm{approx}}$

$\widehat{z}$ known

According to Proposition 3.3, the solution of the primal problem, the solution of the dual problem and the auxiliary variable have to satisfy, for every $k \in \{1,\ldots,N-1\}$,

$$\widehat{u}_{k+1} = y_{k+1} + \widehat{u}_k - \widehat{x}_{k+1}, \qquad \mathrm{abs}(\widehat{u}_{k+1}) \le \widehat{z}_{k+1}, \qquad \|\widehat{z}_{k+1}\| = \lambda, \qquad (16)$$

with $\widehat{u}_0 = \widehat{u}_N = 0$. Considering the first two conditions, the prolongation condition $\widehat{x}_{k+1} = \widehat{x}_k$ leads to

$$\begin{cases} y_{k+1} \ge \widehat{x}_k - \widehat{z}_{k+1} - \widehat{u}_k, \\ y_{k+1} \le \widehat{x}_k + \widehat{z}_{k+1} - \widehat{u}_k. \end{cases} \qquad (17)$$

Following the solution proposed for the univariate case in [12], one can check that (17) is satisfied by reasoning on lower and upper bounds of $\widehat{u}_k$ and $\widehat{x}_k$. For every $k \in \{1,\ldots,N-1\}$, we define the lower and upper bounds of $\widehat{x}_k$, labeled $\underline{x}_k$ and $\overline{x}_k$ respectively, as

$$\underline{x}_k \le \widehat{x}_k \le \overline{x}_k, \qquad (18)$$

and we set $\underline{u}_k$ and $\overline{u}_k$ as follows:

$$(\forall m \in \{1,\ldots,M\}) \quad \begin{cases} \widehat{u}_{m,k} = \overline{u}_{m,k} & \text{if } \widehat{x}_{m,k} = \underline{x}_{m,k}, \\ \widehat{u}_{m,k} = \underline{u}_{m,k} & \text{if } \widehat{x}_{m,k} = \overline{x}_{m,k}, \end{cases} \qquad (19)$$

where $\underline{u}_k$ and $\overline{u}_k$ appear to be the lower and upper bounds of $\widehat{u}_k$ respectively, i.e.,

$$\underline{u}_k \le \widehat{u}_k \le \overline{u}_k, \qquad (20)$$

as detailed in Appendix 7.1.

Updating rules & Rule 1

The prolongation condition $\widehat{x}_{k+1} = \widehat{x}_k$, which has led to (17), becomes

$$\begin{cases} y_{k+1} \ge \underline{x}_k - \widehat{z}_{k+1} - \overline{u}_k, \\ y_{k+1} \le \overline{x}_k + \widehat{z}_{k+1} - \underline{u}_k. \end{cases} \qquad (21)$$

If the latter condition, labeled as Rule 1, holds, then according to the primal-dual relation, we perform the update of the lower and upper bounds at location $k+1$ as follows:

$$\begin{cases} \overline{u}_{k+1} = y_{k+1} + \overline{u}_k - \underline{x}_k, \\ \underline{u}_{k+1} = y_{k+1} + \underline{u}_k - \overline{x}_k, \end{cases} \qquad (22)$$

and

$$\begin{cases} \underline{x}_{k+1} = \underline{x}_k, \\ \overline{x}_{k+1} = \overline{x}_k. \end{cases} \qquad (23)$$

Remark 4.1
Equivalently, one can systematically update dual (resp. primal) bounds according to (22) (resp. (23)) and verify that the following rewriting of the prolongation condition (21) holds:

$$\begin{cases} \overline{u}_{k+1} \ge -\widehat{z}_{k+1}, \\ \underline{u}_{k+1} \le +\widehat{z}_{k+1}. \end{cases} \qquad (24)$$

If Rule 1 (i.e., Condition (21) or equivalently (24)) holds, then the assumption $\widehat{x}_{k+1} = \widehat{x}_k$ is valid. However, the upper and lower bounds may have to be updated in order to be consistent with $\widehat{u}_{k+1} \in [-\widehat{z}_{k+1}, +\widehat{z}_{k+1}]$. According to (20), this condition requires verifying that the following Rule 2 holds:

$$\begin{cases} \overline{u}_{k+1} \le +\widehat{z}_{k+1}, \\ \underline{u}_{k+1} \ge -\widehat{z}_{k+1}. \end{cases} \qquad (25)$$

For every $m \in \{1,\ldots,M\}$, three configurations can be encountered:

• When both Conditions (25) are satisfied, the bounds are left unchanged.

• When $\overline{u}_{m,k+1} = \overline{u}_{m,k} + y_{m,k+1} - \underline{x}_{m,k} > +\widehat{z}_{m,k+1}$, then the updating rules specified in (23) have under-evaluated the bound

$$\underline{\nu}_m \equiv \underline{x}_{m,j} \quad (\forall j \in \{k_0,\ldots,k+1\}), \qquad (26)$$

where $k_0$ denotes the last starting location of a new segment. Since $\overline{u}_{m,k+1}$ is upper-bounded by $+\widehat{z}_{m,k+1}$ and, for such a value, it can be shown (see Appendix 7.2) that

$$\underline{\nu}_m = \underline{x}_{m,k} + \frac{\overline{u}_{m,k+1} - \widehat{z}_{m,k+1}}{k - k_0 + 1}, \qquad (27)$$

we propose the following updates:

$$\begin{cases} (\forall j \in \{k_0,\ldots,k+1\})\ \underline{x}_{m,j} = \underline{\nu}_m, \\ \overline{u}_{m,k+1} = +\widehat{z}_{m,k+1}. \end{cases} \qquad (28)$$

• When $\underline{u}_{m,k+1} < -\widehat{z}_{m,k+1}$, then it results that the upper bound

$$\overline{\nu}_m \equiv \overline{x}_{m,j} \quad (\forall j \in \{k_0,\ldots,k+1\}) \qquad (29)$$

has been over-evaluated. Similarly, since $\underline{u}_{m,k+1}$ is lower-bounded by $-\widehat{z}_{m,k+1}$, we can show that the upper bound

$$\overline{\nu}_m = \overline{x}_{m,k} + \frac{\underline{u}_{m,k+1} + \widehat{z}_{m,k+1}}{k - k_0 + 1} \qquad (30)$$

permits to ensure the consistency of the following updates:

$$\begin{cases} (\forall j \in \{k_0,\ldots,k+1\})\ \overline{x}_{m,j} = \overline{\nu}_m, \\ \underline{u}_{m,k+1} = -\widehat{z}_{m,k+1}. \end{cases} \qquad (31)$$

Change-point creation $k_{\mathrm{rupt}}$

When Rule 1 does not hold, a change point has to be created. For every $m \in \{1,\ldots,M\}$, we can distinguish three cases:

• When $\overline{u}_{m,k+1} = \overline{u}_{m,k} + y_{m,k+1} - \underline{x}_{m,k} < -\widehat{z}_{m,k+1}$, then, since $\widehat{u}_{m,k}$ is bounded, it means that $\underline{x}_{m,k}$ is over-evaluated, and therefore a negative amplitude change has to be introduced on the $m$-th component in the time index set $\{k_0,\ldots,k\}$ in order to decrease its value. Following Proposition 3.3 and Eq. (20), the set of locations $\kappa_m$ suitable for a change-point on the $m$-th component reads:

$$\kappa_m = \{ j \in \{k_0,\ldots,k\} \ | \ \overline{u}_{m,j} = +\widehat{z}_{m,j} \}. \qquad (32)$$

Such locations correspond to the indexes where the value of the bound $\overline{u}_{m,j}$ has been updated in order to be consistent with the condition $\widehat{u}_{m,j} \in [-\widehat{z}_{m,j}, +\widehat{z}_{m,j}]$ (see the previous paragraph).

• When $\underline{u}_{m,k+1} > +\widehat{z}_{m,k+1}$, then a positive amplitude change has to be introduced in the $m$-th component within the time index set $\{k_0,\ldots,k\}$. The set of locations suitable for a change-point on the $m$-th component reads:

$$\kappa_m = \{ j \in \{k_0,\ldots,k\} \ | \ \underline{u}_{m,j} = -\widehat{z}_{m,j} \}. \qquad (33)$$

This set of locations corresponds to indexes where the value of the bound $\underline{u}_{m,j}$ was updated in order to be consistent with $\widehat{u}_{m,j} \in [-\widehat{z}_{m,j}, +\widehat{z}_{m,j}]$.

• Else, when component $m$ does satisfy (17), we set $\kappa_m = \{k_0,\ldots,k\}$.

The change-point location $k_{\mathrm{rupt}}$ corresponds to the last location suitable for introducing the adequate amplitude change on each component, i.e.,

$$k_{\mathrm{rupt}} = \max_{j \in \cap_{m=1}^{M} \kappa_m} j. \qquad (34)$$

Once the change-point location has been specified, we are able to assign a value to $(\widehat{x}_j)_{k_0 \le j \le k_{\mathrm{rupt}}}$. When a negative amplitude change is detected on the $m$-th component, we set

$$(\forall j \in \{k_0,\ldots,k_{\mathrm{rupt}}\}) \quad \widehat{x}_{m,j} = \underline{x}_{m,k+1}, \qquad (35)$$

in consistence with (19). Similarly, when a positive amplitude change is detected, we set

$$(\forall j \in \{k_0,\ldots,k_{\mathrm{rupt}}\}) \quad \widehat{x}_{m,j} = \overline{x}_{m,k+1}. \qquad (36)$$

When a segment has been created, we start the detection of a new segment considering $k_0 = k_{\mathrm{rupt}} + 1$, as long as $k_0 < N$. According to (7) and by definition of the bounds, for every $k \in \{1,\ldots,N\}$,

$$\begin{cases} \underline{x}_k = y_k - \overline{u}_k + \widehat{u}_{k-1}, \\ \overline{x}_k = y_k - \underline{u}_k + \widehat{u}_{k-1}. \end{cases} \qquad (37)$$

In particular, for $k = k_0$, combining (12), (18), (19) and (22) allows us to find the following initialization procedure:

$$\begin{cases} \overline{u}_{k_0} = +\widehat{z}_{k_0}, \\ \underline{u}_{k_0} = -\widehat{z}_{k_0}, \end{cases} \qquad \begin{cases} \underline{x}_{k_0} = y_{k_0} - \widehat{z}_{k_0} + \widehat{u}_{k_0-1}, \\ \overline{x}_{k_0} = y_{k_0} + \widehat{z}_{k_0} + \widehat{u}_{k_0-1}, \end{cases} \qquad (38)$$

where the value of $\widehat{u}_{k_0-1}$ is given according to Proposition 3.3. In addition, according to the writing of (16), $\widehat{u}_0 = 0$.

On-the-fly estimation of $\widehat{z}$

In order to describe the generic behavior of the multivariate on-the-fly algorithm, we have so far assumed $\widehat{z}$ to be known a priori. We now focus on the simultaneous estimation of the multivariate vector $\widehat{z}$ and of the multivariate signal $\widehat{x}$. To provide an on-the-fly approximate solution, we propose:

• to build a piece-wise constant estimator $\widetilde{z}$ of $\widehat{z}$,
• to only consider amplitude changes jointly on all components $m \in \{1,\ldots,M\}$.

Piece-wise constant estimator of $\widehat{z}$

The proposed estimate is assumed to be constant between each change-point, with values belonging to the predefined set $\mathcal{Q}$ defined in Section 3.3. For each candidate value $\zeta^{(q)}$ with $q \in \{1,\ldots,|\mathcal{Q}|\}$, we create upper and lower bounds labeled $\overline{u}^{(q)}_k$, $\underline{u}^{(q)}_k$, $\overline{x}^{(q)}_k$, and $\underline{x}^{(q)}_k$. They are initialized at each new segment location $k_0$ and are updated independently according to (22) and (23) until the prolongation condition

$$\begin{cases} \overline{u}^{(q)}_{k+1} \ge -\zeta^{(q)}, \\ \underline{u}^{(q)}_{k+1} \le +\zeta^{(q)}, \end{cases} \qquad (39)$$

based on (24), does not hold anymore. In the following, we investigate how to modify the algorithm described in Section 4.1 to account for the automated selection of $\widetilde{z}$ in $\mathcal{Q}$. The resulting algorithm is reported in Algorithm 2.

Change-point detection $k^{(q)}_{\mathrm{rupt}}$

For every $q \in \{1,\ldots,|\mathcal{Q}|\}$, we create change points as described in Section 4.1.4. The main difference consists in the restriction to simultaneous change points. As detailed after Proposition 3.1, non-simultaneous changes have a zero probability to occur; the restriction to simultaneous change-points will thus not impact the solution. It results that, if there exists at least one component $m \in \{1,\ldots,M\}$ such that $\overline{u}^{(q)}_{m,k+1} < -\zeta^{(q)}_m$ (resp. $\underline{u}^{(q)}_{m,k+1} > \zeta^{(q)}_m$), then

$$\kappa^{(q)}_m = \{ j \in \{k_0,\ldots,k\} \ | \ \overline{u}^{(q)}_{m,j} = +\zeta^{(q)}_m \} \qquad (40)$$

(resp.

$$\kappa^{(q)}_m = \{ j \in \{k_0,\ldots,k\} \ | \ \underline{u}^{(q)}_{m,j} = -\zeta^{(q)}_m \}), \qquad (41)$$

and, for every $m^- \ne m$ such that $\overline{u}^{(q)}_{m^-,k+1} + \underline{u}^{(q)}_{m^-,k+1} < 0$,

$$\kappa^{(q)}_{m^-} = \{ j \in \{k_0,\ldots,k\} \ | \ \overline{u}^{(q)}_{m^-,j} = +\zeta^{(q)}_{m^-} \}, \qquad (42)$$

or, for every $m^+ \ne m$ such that $\overline{u}^{(q)}_{m^+,k+1} + \underline{u}^{(q)}_{m^+,k+1} \ge 0$,

$$\kappa^{(q)}_{m^+} = \{ j \in \{k_0,\ldots,k\} \ | \ \underline{u}^{(q)}_{m^+,j} = -\zeta^{(q)}_{m^+} \}. \qquad (43)$$

A bivariate example of these configurations, where the second component breaks Condition (39), is provided in Fig. 3. The change-point location $k^{(q)}_{\mathrm{rupt}}$ and the assignment of $\widehat{x}^{(q)}$ on the current segment follow (34), (35) and (36).

Selection of $k_{\mathrm{rupt}}$

According to the previous paragraph, the piece-wise estimation procedure leads to several possible change-point locations (at most $|\mathcal{Q}|$). Here we select the solution indexed by $q^*$ with tightest bounds $\underline{x}^{(q^*)}$ and $\overline{x}^{(q^*)}$, i.e.,

$$q^* \in \underset{1\le q\le |\mathcal{Q}|}{\mathrm{Argmin}} \left\| \left( \overline{x}^{(q)}_{k^{(q)}_{\mathrm{rupt}}} - \underline{x}^{(q)}_{k^{(q)}_{\mathrm{rupt}}} \right) \sigma^{-1} \right\|, \qquad (44)$$

Figure 3:
Example of configurations leading to the detection of a change-point. In this example, $M = 2$ and $\zeta^{(q)}_1 = \zeta^{(q)}_2 = \lambda/\sqrt{2}$. Since $\underline{u}^{(q)}_{2,k+1} > \zeta^{(q)}_2$, condition (39) is violated. The left (resp. right) plot displays the configuration $\overline{u}^{(q)}_{1,k+1} + \underline{u}^{(q)}_{1,k+1} < 0$ (resp. $\overline{u}^{(q)}_{1,k+1} + \underline{u}^{(q)}_{1,k+1} \ge 0$) described in Section 4.2.2.

with

$$\sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_M), \qquad (45)$$

where, for every $m \in \{1,\ldots,M\}$, $\sigma_m$ stands for the standard deviation of $y_m$. The factor $\sigma^{-1}$ ensures that every component contributes equally to the criterion (44). When the minimizer of (44) is not unique, we select the index $q^*$ yielding the largest $k^{(q^*)}_{\mathrm{rupt}}$; in other words, we choose the set of auxiliary variables which permits to hold the prolongation condition (39) as long as possible. Therefore, it finally leads to an index $q^*$ which permits to estimate $k_{\mathrm{rupt}} = k^{(q^*)}_{\mathrm{rupt}}$ and

$$(\forall j \in \{k_0,\ldots,k_{\mathrm{rupt}}\}) \quad \widetilde{z}_j = \zeta^{(q^*)}, \quad \widehat{x}_j = \widehat{x}^{(q^*)}_j. \qquad (46)$$

The starting location for the next segment is then $k_0 = k_{\mathrm{rupt}} + 1$, and the algorithm iterates as long as $k_0 < N$.

Let us consider the location $k_0$ of a new segment. For every $q \in \{1,\ldots,|\mathcal{Q}|\}$, the initialization step can be recast into

$$\begin{cases} \overline{u}^{(q)}_{k_0} = +\zeta^{(q)}, \\ \underline{u}^{(q)}_{k_0} = -\zeta^{(q)}, \end{cases} \qquad \begin{cases} \underline{x}^{(q)}_{k_0} = y_{k_0} - \zeta^{(q)} + \widehat{u}_{k_0-1}, \\ \overline{x}^{(q)}_{k_0} = y_{k_0} + \zeta^{(q)} + \widehat{u}_{k_0-1}, \end{cases} \qquad (47)$$

with $\widehat{u}_0 = 0$.

Remark 4.2 The initialization step (47) implicitly depends on the estimation of $\widehat{z}$ made on the last segment, through the term $\widehat{u}_{k_0-1}$. Simulations have shown that (47) may lead to an inconsistent solution $\widehat{x}$ as soon as $\widehat{z}$ has been poorly estimated on a segment. Empirically, a better approximation of the iterative solution is observed if each segment is treated independently, i.e.,

$$\begin{cases} \overline{u}^{(q)}_{k_0} = +\zeta^{(q)}, \\ \underline{u}^{(q)}_{k_0} = -\zeta^{(q)}, \end{cases} \qquad \begin{cases} \underline{x}^{(q)}_{k_0} = y_{k_0} - \zeta^{(q)}, \\ \overline{x}^{(q)}_{k_0} = y_{k_0} + \zeta^{(q)}. \end{cases} \qquad (48)$$

Algorithm 2: On-the-fly Multivariate TV
Data: Multivariate signal $y = (y_1,\ldots,y_M) \in \mathbb{R}^{M\times N}$; regularization parameter $\lambda > 0$; set $\mathcal{Q} = \{\zeta^{(1)},\ldots,\zeta^{(|\mathcal{Q}|)}\}$; starting location $k_0 = 1$.
while $k < N$ do
    for $q \leftarrow 1$ to $|\mathcal{Q}|$ do
        Set $k \leftarrow k_0$; initialize primal/dual bounds according to (48)
        while (39) is satisfied do
            Set $k \leftarrow k + 1$
            for $m \leftarrow 1$ to $M$ do
                Update primal/dual bounds
                if $\overline{u}^{(q)}_{m,k+1} > +\zeta^{(q)}_m$ or $\underline{u}^{(q)}_{m,k+1} < -\zeta^{(q)}_m$ then
                    Revise the update of primal/dual bounds
        Estimate $k^{(q)}_{\mathrm{rupt}}$ and $(\widehat{x}^{(q)}_j)_{k_0 \le j \le k^{(q)}_{\mathrm{rupt}}}$ according to Section 4.2.2
    Estimate $k_{\mathrm{rupt}}$ amongst $(k^{(q)}_{\mathrm{rupt}})_{1\le q\le |\mathcal{Q}|}$ according to Section 4.2.3
    Estimate $(\widehat{x}_j)_{k_0 \le j \le k_{\mathrm{rupt}}}$ and $(\widetilde{z}_j)_{k_0 \le j \le k_{\mathrm{rupt}}}$ according to (46)
    Set $k_0 \leftarrow k_{\mathrm{rupt}} + 1$
Result: Solution $\widehat{x}$

Unless specified otherwise, we consider that data consist of an $M$-multivariate piece-wise constant signal $x \in \mathbb{R}^{M\times N}$ (solid black), to which a centered Gaussian noise $\epsilon$ is additively superimposed: $y = x + \epsilon \in \mathbb{R}^{M\times N}$. Signal $x$ is generated as follows. First, the length of each segment is drawn according to a folded Gaussian distribution. Then, for every $m \in \{1,\ldots,M\}$, the amplitudes of the corresponding changes are drawn independently from a Gaussian distribution. The exact solution, $\widehat{x}$, is computed by means of the ADMM algorithm proposed in [34]; iterations are stopped when the relative error on the criterion falls below a prescribed tolerance. The proposed solution computed with the predefined set $\mathcal{Q}$ is denoted $\widehat{x}_{\mathrm{approx},\mathcal{Q}}$. In a second set of experiments (see Section 5.4), the proposed on-the-fly algorithmic solution will be compared to an on-the-fly ADMM solution.

Design of $\mathcal{Q}$

We propose to compare solutions $\widehat{x}_{\mathrm{approx},\mathcal{Q}}$ obtained with two sets $\mathcal{Q} = \{\zeta^{(1)},\ldots,\zeta^{(|\mathcal{Q}|)}\}$ in the bivariate case (i.e., $M = 2$). For both configurations, we choose

$$(\forall q \in \{1,\ldots,|\mathcal{Q}|\}) \quad \zeta^{(q)} = (\lambda \cos\theta_q,\ \lambda \sin\theta_q), \qquad (49)$$

with $\theta_q \in [0, \pi/2]$. The first set corresponds to a homogeneous covering of the $\ell_2$-ball, with angles $(\theta_q)_{1\le q\le |\mathcal{Q}|}$ equally spaced in $[0, \pi/2]$. The second set is of the same size, with values $(\theta_q)_{1\le q\le |\mathcal{Q}|}$ drawn from a uniform distribution on $[0, \pi/2]$.

Figure 4: Influence of the design of $\mathcal{Q}$. Comparison over 100 realizations of a homogeneous covering of the $\ell_2$-ball (blue) against a random covering (red). Two experimental settings are considered, depending on whether $y_1$ is one order of magnitude larger than $y_2$ (left) or not (right). Top: MSE$(\widehat{x}_{\mathrm{approx},\mathcal{Q}}, \widehat{x})$ for different set sizes $|\mathcal{Q}|$. Bottom (2nd and 3rd lines): distributions of $\theta_{q^*}$, where $q^*$ has been selected by criterion (44), for $|\mathcal{Q}| = 127$.

Two experimental settings are considered: in the first one, $y_1$ is one order of magnitude larger than $y_2$ (Fig. 4, left plots), whereas in the second one, both are of the same order of magnitude (Fig. 4, right plots). Estimation performances in terms of mean squared error $\mathrm{MSE}(\widehat{x}_{\mathrm{approx},\mathcal{Q}}, \widehat{x}) = \widehat{\mathbb{E}}\big[\tfrac{1}{N}\|\widehat{x}_{\mathrm{approx},\mathcal{Q}} - \widehat{x}\|^2\big]$ (where $\widehat{\mathbb{E}}$ stands for the sample mean estimator computed over 100 realizations) are reported on the first line. They show that a random covering of the $\ell_2$-ball provides solutions as good as the homogeneous covering, except for small $|\mathcal{Q}|$. On the 2nd and 3rd lines, the distributions of $\theta_{q^*}$, where $q^*$ has been selected by criterion (44), are reported for $|\mathcal{Q}| = 127$. These histograms show the impact of the relative amplitude of the components on the distribution of $\theta_{q^*}$: components with the same order of magnitude yield a symmetric distribution, while unbalanced components yield an asymmetric distribution. For instance, in Fig. 4 (right plots), it appears more meaningful to draw $\theta_q$ according to a Gaussian distribution than to a uniform distribution. Therefore, if one has some knowledge of the component amplitudes, it can be incorporated to better design the set $\mathcal{Q}$. This will also decrease the computational cost discussed in Section 5.4. In the following, we restrict ourselves to a random covering of the $\ell_2$-ball.

In this section, we focus on the comparison of offline performance, extended to $M = 10$, for two different signal-to-noise ratios (SNRs), namely 4 dB and 10 dB.
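As an illustration of Eq. (49), the two coverings of the quarter $\ell_2$-sphere compared above can be generated as follows; this is a sketch for $M = 2$ only, and the function name is ours, not from the released toolbox:

```python
import numpy as np

def make_Q(lam, size, how="random", rng=None):
    """Build the candidate set Q = {zeta^(1), ..., zeta^(size)} of Eq. (49):
    zeta^(q) = (lam*cos(theta_q), lam*sin(theta_q)), theta_q in [0, pi/2],
    so that every candidate satisfies ||zeta^(q)||_2 = lam."""
    if how == "homogeneous":
        # equally spaced angles strictly inside [0, pi/2]
        theta = np.arange(1, size + 1) * (np.pi / 2) / (size + 1)
    else:
        # random covering: theta_q uniform on [0, pi/2]
        rng = np.random.default_rng(rng)
        theta = rng.uniform(0.0, np.pi / 2, size)
    return np.stack((lam * np.cos(theta), lam * np.sin(theta)), axis=1)

Q = make_Q(lam=20.0, size=127, how="random", rng=0)  # one candidate per row
```

Prior knowledge of the component amplitudes could be injected by sampling $\theta_q$ non-uniformly (e.g., concentrated around the angle matching the expected amplitude ratio), in line with the observation made above on the distribution of $\theta_{q^*}$.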
Qualitative impact of |Q| on x̂_approx,Q. For a single realization of noise, x̂_approx,Q and x̂ are plotted in Fig. 5 for λ = 29, adjusted to provide the best visual (qualitative) performance. The solution x̂_approx,Q for |Q| = 5 × (light orange) provides a visually better approximation of x̂ (dashed blue) than the one for |Q| = 10 (red mixed line).

Figure 5:
Qualitative impact of |Q| on x̂_approx,Q. For visibility, only 3 components out of M = 10 are displayed. SNR = 4 dB. x̂_approx,Q for |Q| = 5 × is more satisfying than for |Q| = 10, since it has more discontinuities in common with x̂.

Estimation performance, x̂_approx,Q vs. x̂. The quality of the approximation is further quantified in Fig. 6 in terms of MSE(x̂_approx,Q, x̂) as a function of λ for different |Q|. It shows that the MSE systematically decreases when |Q| increases. Further, on the examples considered here and depending on λ, increasing |Q| beyond a certain value no longer yields significantly improved solutions, showing that the selection of |Q| does not require a complicated tuning procedure.

Estimation performance, x̂ vs. x and x̂_approx,Q vs. x. Let us now compare the absolute quality of the solutions against x. MSE(x̂, x) and MSE(x̂_approx,Q, x), for different |Q|, are reported in Fig. 7. The MSEs are consistent with the previous paragraph: increasing |Q| up to a certain value significantly lowers the MSE. However, x̂ has a lower estimation error than x̂_approx,Q.

In this section, we focus on the comparison between two online solutions. The first one is derived from the proposed on-the-fly algorithm, whereas the second one is based on an iterative algorithm. The comparison is made for different values of λ on a signal x ∈ R^{M×N} (N = 400), to which Gaussian noise is superimposed such that SNR = 3 dB. Performance is provided for M = 2 and M = 5 components.

Proposed online solution x̂_online,Q. As the time step k increases, x̂_approx,Q is only computed up to the last change point k₀, and the algorithm has not yet output a solution on {k₀ + 1, ..., k}. In that sense, the solution is said to be "on-the-fly". However, a solution x̂_online,Q, providing an online approximation of x, can be output up to k by imposing limit conditions at k.
Windowed iterative solution x̂_win,K. We consider a naive online ADMM version, where at each time step k a solution x̂_win,K is computed by optimizing over the previous K points. The choice of K is of critical importance: if it is too small, the observer may miss amplitude changes in the multivariate data stream; if the window is too large, the computational cost may be too high to handle online observation. Three window sizes have been investigated, namely K = 20, 50 and 80.
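The windowed baseline can be emulated with a standard offline solver applied to the last K samples. Below is a minimal sketch, assuming the multivariate TV problem min_x ½‖x − y‖²_F + λ Σ_k ‖x_{·,k+1} − x_{·,k}‖₂ and a textbook ADMM splitting with a group soft-threshold on the finite differences; it is not the authors' implementation, and the choices of rho and n_iter are ours.

```python
import numpy as np

def tv_admm(y, lam, rho=1.0, n_iter=300):
    """Offline ADMM sketch for multivariate TV denoising:
    min_x 0.5 * ||x - y||_F^2 + lam * sum_k ||x[:, k+1] - x[:, k]||_2,
    with y of shape (M, N)."""
    M, N = y.shape
    D = np.diff(np.eye(N), axis=0)        # (N-1, N) finite-difference matrix
    A = np.eye(N) + rho * D.T @ D         # x-update system matrix
    z = np.zeros((M, N - 1))              # split variable, z ~ differences of x
    u = np.zeros((M, N - 1))              # scaled dual variable
    x = y.copy()
    for _ in range(n_iter):
        # x-update: solve (I + rho D^T D) x^T = y^T + rho D^T (z - u)^T
        x = np.linalg.solve(A, y.T + rho * D.T @ (z - u).T).T
        # z-update: group soft-thresholding of the columns of D x^T + u
        v = (D @ x.T).T + u
        norms = np.maximum(np.linalg.norm(v, axis=0, keepdims=True), 1e-12)
        z = v * np.maximum(1.0 - (lam / rho) / norms, 0.0)
        # dual update
        u = u + (D @ x.T).T - z
    return x

def windowed_solution(y, lam, K, **kw):
    """Naive windowed strategy: at each time step k, re-solve over the
    previous K points and keep the most recent estimate."""
    M, N = y.shape
    out = np.zeros_like(y)
    for k in range(N):
        lo = max(0, k - K + 1)
        out[:, k] = tv_admm(y[:, lo:k + 1], lam, **kw)[:, -1]
    return out
```

Re-solving a K-sample problem at every incoming sample is precisely what makes this baseline costly, which the comparisons below quantify.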
Computational cost. Comparisons of median computational costs per incoming sample (in seconds), over 10 realizations of noise, are reported in Fig. 8 (left plots) as functions of λ. As expected, the computational cost increases with the size of Q; therefore, |Q| acts as a trade-off between computational cost and MSE. However, the computational cost of x̂_online,Q remains several orders of magnitude lower than that of the online ADMM. Interestingly, computational costs are comparable for M = 2 (top left plot) and M = 5 (bottom left plot). Note that a warm-start strategy for the online ADMM only reduces the computational cost by a factor of two with respect to the implementation displayed in Fig. 8.

The computational cost of x̂_online,Q could be further reduced in two ways. First, one could design the set Q according to a priori knowledge of the components' amplitudes (see Section 5.2). Second, one could benefit from the separable form of the algorithm and compute the solutions x̂^(q) in parallel for every q ∈ {1, ..., |Q|}.
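The second speed-up is straightforward because the |Q| candidate solves are mutually independent. A sketch using Python's concurrent.futures follows; the routine solve_for_q is a hypothetical stand-in for running the on-the-fly algorithm with the auxiliary vector fixed to one candidate z_q (here it merely clips y into a tube of half-width z_q, which is not the actual solver).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def solve_for_q(y, z_q):
    """Hypothetical per-candidate solver: stands in for the on-the-fly
    algorithm run with the auxiliary vector fixed to z_q. As a placeholder,
    it clips each component of y into a tube of half-width z_q."""
    return np.clip(y, -z_q[:, None], z_q[:, None])

def solve_all_parallel(y, Q, max_workers=4):
    """The |Q| candidate solutions are mutually independent,
    so they can be computed concurrently; order is preserved."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(lambda z_q: solve_for_q(y, z_q), Q))
```

Since ThreadPoolExecutor.map preserves input order, the selection criterion can then be applied to the returned list exactly as in the sequential case.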
Change-point detection accuracy. The Jaccard index J(α, β) ∈ [0, 1] between any α and β ∈ [0, +∞)^N is defined as [18, 22]

$$J(\alpha, \beta) = \frac{\sum_{i=1}^{N} \min(\alpha_i, \beta_i)}{\sum_{\substack{1 \le i \le N \\ \alpha_i > 0,\ \beta_i > 0}} \frac{\alpha_i + \beta_i}{2} \; + \; \sum_{\substack{1 \le i \le N \\ \beta_i = 0}} \alpha_i \; + \; \sum_{\substack{1 \le i \le N \\ \alpha_i = 0}} \beta_i}. \qquad (50)$$

It varies from 0, when α and β have disjoint supports, up to 1 when α = β. The Jaccard index is a demanding measure: as an example, if β ∈ {0, 1}^N is the truth and if α ∈ {0, 1}^N has correctly identified half of the non-zero values of β but has misidentified the other half, then J(α, β) = 1/3.

To compare the change points of x with those obtained during the computation of x̂_win,K and x̂_online,Q, we consider the change-point indicator vector r = (r_i)_{1≤i≤N} of x (as well as r̂_win,K and r̂_online,Q, respectively associated with x̂_win,K and x̂_online,Q), defined as

$$r_i = \begin{cases} 1, & \text{if } x \text{ has a change point at location } i,\\ 0, & \text{otherwise.} \end{cases} \qquad (51)$$

In order to incorporate a tolerance level on change-point locations, r, r̂_win,K and r̂_online,Q are first convolved with a Gaussian kernel of size 10 with a standard deviation of 3. J(r̂_win,K, r) and J(r̂_online,Q, r), averaged over 10 realizations of noise, are reported in Fig. 8 (right plots) as functions of λ for different set sizes |Q| and window sizes K. The results show that J(r̂_online,Q, r) ≥ J(r̂_win,K, r) for almost all λ and |Q|; therefore, x̂_online,Q provides a better online detection of the change points of x. They also show that J(r̂_online,Q, r) does not vary significantly with |Q| but slightly decreases with M. Indeed, as M increases, the prolongation condition (39) is more likely to be violated, thus leading to more change points.
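A direct implementation of the Jaccard index (50) and of the Gaussian-kernel tolerance step can be sketched as follows (normalizing the kernel is our assumption); with this implementation, the half-identified example discussed in the text indeed yields J = 1/3.

```python
import numpy as np

def jaccard(alpha, beta):
    """Generalized Jaccard index (50) for nonnegative vectors alpha, beta."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    num = np.minimum(alpha, beta).sum()
    both = (alpha > 0) & (beta > 0)
    den = ((alpha[both] + beta[both]) / 2.0).sum() \
        + alpha[beta == 0].sum() + beta[alpha == 0].sum()
    return num / den if den > 0 else 1.0

def smooth(r, size=10, std=3.0):
    """Tolerance on change-point locations: convolve the binary indicator
    with a Gaussian kernel (size 10, standard deviation 3, normalized here)."""
    t = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-t ** 2 / (2.0 * std ** 2))
    return np.convolve(r, g / g.sum(), mode="same")
```

For binary indicators, the denominator reduces to the count of locations flagged by either vector, recovering the usual set-based Jaccard index.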
Figure 6:
Estimation performance, x̂_approx,Q vs. x̂. MSE(x̂_approx,Q, x̂) for different |Q|. The SNR is set to 4 dB (left plot) and 10 dB (right plot), respectively.

Figure 7:
Estimation performance, x̂ vs. x and x̂_approx,Q vs. x. MSE(x̂, x) and MSE(x̂_approx,Q, x) for different |Q|. The SNR is set to 4 dB (left plot) and 10 dB (right plot), respectively.

In this contribution, we have developed an algorithm that provides an on-the-fly approximate solution to the multivariate total variation minimization problem. Besides a thorough examination of the KKT conditions, the key step of the algorithm lies in updating and controlling the range of the upper and lower bounds of the dual solution within a tube of radius λ. An on-the-fly derivation is achieved by means of an auxiliary vector ẑ, which needs to be estimated, providing information on the angle of contact with the tube. This estimation strongly affects the quality of the solution, and the proposed on-the-fly estimation of ẑ is currently achieved by assigning a value chosen within a predefined set Q. It has been shown that the size of Q permits achieving a desired trade-off between the targeted quality of the solution and the application-dependent affordable computational cost. In addition, the proposed method could be extended to other ℓ_p-based penalization norms in the right-hand side of (2), for p > 1; however, one would still face the issue of estimating ẑ, which would then have to lie within an ℓ_p ball of radius λ. Under current interest is the investigation of how to estimate ẑ in the case where the assumption of piecewise constant behavior is a priori relaxed.

Proof of (20). According to the primal-dual relation (7), for every m ∈ {1, ..., M} and k ∈ {1, ..., N − 1},

$$\widehat{u}_{m,k} = y_{m,k} + \widehat{u}_{m,k-1} - \widehat{x}_{m,k}, \qquad (52)$$
Figure 8:
Online Performance.
The proposed solution x̂_online,Q is displayed in solid lines, while the online ADMM solution x̂_win,K is displayed in dashed lines. Performance for M = 2 (resp. M = 5) is illustrated on top (resp. bottom). Left: median computational cost per incoming sample (in seconds). Right: J(r̂_win,K, r) and J(r̂_online,Q, r) for different values of |Q| and K.

and by definition of the lower and upper bounds of x̂_{m,k} and û_{m,k}, we have

$$\overline{u}_{m,k} = y_{m,k} + \widehat{u}_{m,k-1} - \underline{x}_{m,k}, \qquad (53)$$
$$\underline{u}_{m,k} = y_{m,k} + \widehat{u}_{m,k-1} - \overline{x}_{m,k}. \qquad (54)$$

By subtracting (53) from (52), we obtain

$$\widehat{u}_{m,k} - \overline{u}_{m,k} = \underline{x}_{m,k} - \widehat{x}_{m,k} \qquad (55)$$

and, according to (18), û_{m,k} − ū_{m,k} ≤
0. The arguments are similar for proving that û_{m,k} ≥ u̲_{m,k}.

Proof of (27). For every m ∈ {1, ..., M} and k ∈ {k₀, ..., N − 1}, if

$$\overline{u}_{m,k+1} = \overline{u}_{m,k} + y_{m,k+1} - \underline{x}_{m,k} > +\widehat{z}_{m,k+1}, \qquad (56)$$

then the updating rules of x̲_{m,k}, specified in (23), have under-evaluated its value ν_m. To modify the lower bounds (x̲_{m,j})_{k₀ ≤ j ≤ k+1}, on the one hand, we consider the cumulative sum of the observations which, according to (7), leads to

$$\sum_{j=k_0+1}^{k+1} y_{m,j} = \overline{u}_{m,k+1} - \widehat{u}_{m,k_0} + (k - k_0 + 1)\,\underline{x}_{m,k+1}, \qquad (57)$$

and thus, if ū_{m,k+1} = +ẑ_{m,k+1}, would lead to

$$\sum_{j=k_0+1}^{k+1} y_{m,j} = \widehat{z}_{m,k+1} - \widehat{u}_{m,k_0} + (k - k_0 + 1)\,\nu_m, \qquad (58)$$

by definition of x̲_{m,k+1} = ν_m. On the other hand, the updating rules (22) and (23) have led to

$$\overline{u}_{m,k+1} = \overline{u}_{m,k_0} + \sum_{j=k_0+1}^{k+1} y_{m,j} - (k - k_0 + 1)\,\underline{x}_{m,k_0}. \qquad (59)$$

The combination of (58) and (59) leads to

$$\nu_m = \underline{x}_{m,k_0} + \frac{-\overline{u}_{m,k_0} + \overline{u}_{m,k+1} - \widehat{z}_{m,k+1} + \widehat{u}_{m,k_0}}{k - k_0 + 1}. \qquad (60)$$

Because x̲_{m,k₀} has been under-evaluated and since, by definition, û_{m,k₀} ≤ ū_{m,k₀}, we can propose the following value

$$\nu_m = \underline{x}_{m,k_0} + \frac{\overline{u}_{m,k+1} - \widehat{z}_{m,k+1}}{k - k_0 + 1}, \qquad (61)$$

in order to adjust the lower bounds, i.e.,

$$(\forall j \in \{k_0, \ldots, k+1\}) \quad \underline{x}_{m,j} = \nu_m. \qquad (62)$$

In addition, as a result of û_{m,k+1} ∈ [−ẑ_{m,k+1}, +ẑ_{m,k+1}] and according to the inequality (20), we set

$$\overline{u}_{m,k+1} = +\widehat{z}_{m,k+1}. \qquad (63)$$

References

[1] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo. An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems.
IEEE Trans. Image Process., 20(3):681–695, Mar. 2011.
[2] C. Alais, A. Barbero, and J. Dorronsoro. Group fused lasso. In V. Mladenov, P. Koprinkova-Hristova, G. Palm, A. Villa, B. Appollini, and N. Kasabov, editors, Artificial Neural Networks and Machine Learning, ICANN 2013, volume 8131 of Lecture Notes in Computer Science, pages 66–73, 2013.
[3] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[4] A. Barbero and S. Sra. Fast Newton-type methods for total variation regularization. In International Conference on Machine Learning, pages 313–320, June 2011.
[5] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2011.
[6] K. Bleakley and J. P. Vert. The group fused Lasso for multiple change-point detection. Technical report, Preprint, Hal-00602121, 2011.
[7] J. Cannon, P. A. Krokhmal, R. V. Lenth, and R. Murphey. An algorithm for online detection of temporal changes in operator cognitive state using real-time psychophysiological data. Biomedical Signal Processing and Control, 5(3):229–236, 2010.
[8] A. Chambolle. An algorithm for total variation minimization and applications. J. Math. Imag. Vis., 20(1-2):89–97, Jan. 2004.
[9] A. Chambolle, V. Caselles, D. Cremers, M. Novaga, and T. Pock. An introduction to total variation for image analysis. In Theoretical Foundations and Numerical Methods for Sparse Recovery, volume 9, pages 263–340. De Gruyter, 2010.
[10] P. L. Combettes and J.-C. Pesquet. A proximal decomposition method for solving convex variational inverse problems. Inverse Problems, 24(6):x+27, Dec. 2008.
[11] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer-Verlag, New York, 2010.
[12] L. Condat. A direct algorithm for 1D total variation denoising. IEEE Signal Process. Lett., 20(11):1054–1057, 2013.
[13] P. Davies and A. Kovac. Local extremes, runs, strings and multiresolution. Ann. Stat., 29(1):1–65, 2001.
[14] N. Dobigeon, J.-Y. Tourneret, and J. D. Scargle. Joint segmentation of multivariate astronomical time series: Bayesian sampling with a hierarchical model. IEEE Trans. Signal Process., 55(2):414–423, Feb. 2007.
[15] M. Godinez, A. Jimenez, R. Ortiz, and M. Pena. On-line fetal heart rate monitor by phonocardiography. International Conference of the IEEE Engineering in Medicine and Biology Society, pages 3141–3144, 2003.
[16] A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Physics in Medicine and Biology, 57(7):1937–1961, Mar. 2012.
[17] M. Grasmair. The equivalence of the taut string algorithm and BV-regularization. J. Math. Imag. Vis., 27:59–66, 2007.
[18] R. Hamon, P. Borgnat, P. Flandrin, and C. Robardet. Discovering the structure of complex networks by minimizing cyclic bandwidth sum. arXiv preprint arXiv:1410.6108, 2014.
[19] Z. Harchaoui and C. Levy-Leduc. Catching change-points with Lasso. In Proc. Ann. Conf. Neur. Inform. Proc. Syst., volume 20, pages 617–624, Cambridge, MA, USA, 2008.
[20] W. Hinterberger, M. Hintermüller, K. Kunisch, M. von Oehsen, and O. Scherzer. Tube methods for BV regularization. J. Math. Imag. Vis., 19(3):219–235, 2003.
[21] H. Hoefling. A path algorithm for the fused Lasso signal approximator. Technical report, Preprint, arXiv:0910.0526, 2009.
[22] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, (37):241–272, 1901.
[23] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale ℓ1-regularized least squares. J. Mach. Learn. Res., 1(4):606–617, 2007.
[24] J. Liu, L. Yuan, and J. Ye. An efficient algorithm for a class of fused Lasso problems. In Proc. ACM SIGKDD, pages 323–332, 2010.
[25] C. Louchet and L. Moisan. Total variation as a local filter. SIAM J. Imaging Sci., 4(2):651–694, 2011.
[26] E. Mammen and S. Van de Geer. Locally adaptive regression splines. Ann. Stat., 25(1):1–434, 1997.
[27] J.-C. Pesquet and N. Pustelnik. A parallel inertial proximal optimization method. Pac. J. Optim., 8(2):273–305, Apr. 2012.
[28] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J.-J. Daudin. A statistical approach for array CGH data analysis. BMC Bioinformatics, 6:27, 2005.
[29] A. Rinaldo. Properties and refinements of the fused Lasso. Ann. Stat., 37(5B):2922–2952, 2009.
[30] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Phys. D, 60(1-4):259–268, Nov. 1992.
[31] G. Steidl, S. Didas, and J. Neumann. Splines in higher order TV regularization. Int. J. Comp. Vis., 2006.
[32] J.-P. Vert and K. Bleakley. Fast detection of multiple change-points shared by many signals using group LARS. In Proc. Ann. Conf. Neur. Inform. Proc. Syst., volume 23, pages 2343–2352, Vancouver, Canada, 2010.
[33] C. Vogel. Computational Methods for Inverse Problems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.
[34] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang. An ADMM algorithm for a class of total variation regularized estimation problems. In , pages 83–88, 2012.
[35] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.