Optimization viewpoint on Kalman smoothing, with applications to robust and sparse estimation
Aleksandr Y. Aravkin and James V. Burke and Gianluigi Pillonetto
Abstract
In this chapter, we present the optimization formulation of the Kalman filtering and smoothing problems, and use this perspective to develop a variety of extensions and applications. We first formulate classic Kalman smoothing as a least squares problem, highlight special structure, and show that the classic filtering and smoothing algorithms are equivalent to a particular algorithm for solving this problem. Once this equivalence is established, we present extensions of Kalman smoothing to systems with nonlinear process and measurement models, systems with linear and nonlinear inequality constraints, systems with outliers in the measurements or sudden changes in the state, and systems where the sparsity of the state sequence must be accounted for. All extensions preserve the computational efficiency of the classic algorithms, and most of the extensions are illustrated with numerical examples, which are part of an open source Kalman smoothing Matlab/Octave package.
Kalman filtering and smoothing methods form a broad category of computational algorithms used for inference on noisy dynamical systems. Over the last fifty years,
Aleksandr Y. Aravkin, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, e-mail: [email protected]
James V. Burke, Department of Mathematics, University of Washington, Seattle, WA, e-mail: [email protected]
Gianluigi Pillonetto, Control and Dynamic Systems, Department of Information Engineering, University of Padova, Padova, Italy, e-mail: [email protected]

these algorithms have become a gold standard in a range of applications, including space exploration, missile guidance systems, general tracking and navigation, and weather prediction. In 2009, Rudolf Kalman received the National Medal of Science from President Obama for the invention of the Kalman filter. Numerous books and papers have been written on these methods and their extensions, addressing modifications for use in nonlinear systems, smoothing data over time intervals, improving algorithm robustness to bad measurements, and many other topics.
Fig. 1 Dynamic systems amenable to Kalman smoothing methods: a hidden state sequence $x_0, x_1, \dots, x_N$ evolving through process functions $g_k$, with measurements $z_1, \dots, z_N$ obtained through functions $h_k$.
The classic Kalman filter [29] is almost always presented as a set of recursive equations, and the classic Rauch-Tung-Striebel (RTS) fixed-interval smoother [42] is typically formulated as two coupled Kalman filters. An elegant derivation based on projections onto spaces spanned by random variables can be found in [2]. In this chapter, we use the terms 'Kalman filter' and 'Kalman smoother' much more broadly, including any method of inference on any dynamical system fitting the graphical representation of Figure 1.
Specific mathematical extensions we consider include:

• Nonlinear process and measurement models.
• Inequality state space constraints.
• Different statistical models for process and measurement errors.
• Sparsity constraints.

We also show numerous applications of these extensions. The key to designing tractable inference methods for the above applications is an optimization viewpoint, which we develop in the classic Kalman smoothing case and then use to formulate and solve all of the above extensions. Though it has been known for many years that the Kalman filter provides the maximum a posteriori estimate for linear systems subject to Gaussian noise, the optimization perspective underlying this idea has not been fully deployed across engineering applications. Notably, several groups (starting in 1977) have discovered and used variants of this perspective to implement extensions to Kalman filtering and smoothing, including singular filtering ([39, 33, 40]), robust smoothing ([22, 7]), nonlinear smoothing with inequality state space constraints ([9, 11]), and sparse Kalman smoothing [1].

We focus exclusively on smoothing here, leaving online applications of these ideas to future work (see [41] for an example of using a smoother for an online application). We start by presenting the classic RTS smoothing algorithm in Section 2, and show that the well-known recursive equations are really an algorithm to solve a least squares system with special structure. Once this is clear, it becomes much easier to discuss novel extensions, since as long as the special structure is preserved, their computational cost is on par with the classic smoother (or, put another way, the classic smoothing equations are viewed as a particular way to solve key subproblems in the extended approaches).

In the subsequent sections, we build novel extensions, briefly review theory, discuss the special structure, and present numerical examples for a variety of applications. In Section 3, we formulate the problem of smoothing with nonlinear process and measurement models, and show how to solve it. In Section 4, we show how state space constraints can be incorporated, and the resulting problem solved using interior point techniques. In Section 5, we review two recent Kalman smoothing formulations that are highly robust to measurement errors. Finally, in Section 6, we review recent work in sparse Kalman smoothing, and show how sparsity can be incorporated into the other extensions. We end the chapter with a discussion in Section 7.
The model corresponding to Figure 1 is specified as follows:
\[
\begin{aligned}
x_1 &= g_1(x_0) + w_1, && \\
x_k &= g_k(x_{k-1}) + w_k, && k = 2, \dots, N, \\
z_k &= h_k(x_k) + v_k, && k = 1, \dots, N,
\end{aligned} \tag{1}
\]
where $w_k, v_k$ are mutually independent random variables with known positive definite covariance matrices $Q_k$ and $R_k$, respectively. We have $x_k, w_k \in \mathbb{R}^n$ and $z_k, v_k \in \mathbb{R}^{m(k)}$, so measurement dimensions can vary between time points. The classic case is obtained by making the following assumptions:

1. $x_0$ is known, and $g_k, h_k$ are known linear functions, which we denote by
\[
g_k(x_{k-1}) = G_k x_{k-1}, \qquad h_k(x_k) = H_k x_k, \tag{2}
\]
where $G_k \in \mathbb{R}^{n \times n}$ and $H_k \in \mathbb{R}^{m(k) \times n}$;
2. $w_k, v_k$ are mutually independent Gaussian random variables.

In later sections, we will show how to relax these classic assumptions, and what gains can be achieved once they are relaxed. In this section, we will formulate estimation of the entire state sequence, $x_1, x_2, \dots, x_N$, as an optimization problem, and show how the RTS smoother solves it.

Maximum a posteriori formulation

To begin, we formulate the maximum a posteriori (MAP) problem under linear and Gaussian assumptions. Using Bayes' theorem, we have
\[
\begin{aligned}
P\big(\{x_k\} \mid \{z_k\}\big) &\propto P\big(\{z_k\} \mid \{x_k\}\big)\,P(\{x_k\}) = \prod_{k=1}^N P(v_k)\,P(w_k) \\
&\propto \prod_{k=1}^N \exp\Big(-\tfrac12 (z_k - H_k x_k)^\top R_k^{-1}(z_k - H_k x_k) - \tfrac12 (x_k - G_k x_{k-1})^\top Q_k^{-1}(x_k - G_k x_{k-1})\Big).
\end{aligned} \tag{3}
\]
A better (equivalent) formulation of (3) is minimizing its negative log posterior:
\[
\min_{\{x_k\}} \; f(\{x_k\}) := \sum_{k=1}^N \tfrac12 (z_k - H_k x_k)^\top R_k^{-1}(z_k - H_k x_k) + \tfrac12 (x_k - G_k x_{k-1})^\top Q_k^{-1}(x_k - G_k x_{k-1}). \tag{4}
\]
To simplify the problem, we now introduce data structures that capture the entire state sequence, measurement sequence, covariance matrices, and initial conditions. Given a sequence of column vectors $\{u_k\}$ and matrices $\{T_k\}$, we use the notation
\[
\mathrm{vec}(\{u_k\}) = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{bmatrix}, \qquad
\mathrm{diag}(\{T_k\}) = \begin{bmatrix} T_1 & 0 & \cdots & 0 \\ 0 & T_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & T_N \end{bmatrix}.
\]
We now make the following definitions:
\[
\begin{aligned}
R &= \mathrm{diag}(\{R_k\}), & Q &= \mathrm{diag}(\{Q_k\}), & H &= \mathrm{diag}(\{H_k\}), \\
x &= \mathrm{vec}(\{x_k\}), & w &= \mathrm{vec}(\{g_0, 0, \dots, 0\}), & z &= \mathrm{vec}(\{z_1, \dots, z_N\}),
\end{aligned}
\qquad
G = \begin{bmatrix} I & & & \\ -G_2 & I & & \\ & \ddots & \ddots & \\ & & -G_N & I \end{bmatrix}, \tag{5}
\]
where $g_0 := g_1(x_0) = G_1 x_0$. With the definitions in (5), problem (4) can be written
\[
\min_x \; f(x) = \tfrac12\|Hx - z\|^2_{R^{-1}} + \tfrac12\|Gx - w\|^2_{Q^{-1}}, \tag{6}
\]
where $\|a\|^2_M = a^\top M a$. We knew the MAP was a least squares problem already, but now the structure is fully transparent. In fact, we can write down the closed form solution by taking the gradient of (6) and setting it equal to 0:
\[
0 = H^\top R^{-1}(Hx - z) + G^\top Q^{-1}(Gx - w) = (H^\top R^{-1}H + G^\top Q^{-1}G)x - H^\top R^{-1}z - G^\top Q^{-1}w.
\]
The smoothing estimate is therefore given by solving the linear system
\[
(H^\top R^{-1}H + G^\top Q^{-1}G)\,x = H^\top R^{-1}z + G^\top Q^{-1}w. \tag{7}
\]
The linear system in (7) has a very special structure: its matrix is symmetric positive definite and block tridiagonal. Positive definiteness follows since $G$ is invertible and $Q^{-1}$ is positive definite. To be specific, the matrix is given by
\[
C = H^\top R^{-1}H + G^\top Q^{-1}G = \begin{bmatrix} C_1 & A_2^\top & & \\ A_2 & C_2 & A_3^\top & \\ & \ddots & \ddots & \ddots \\ & & A_N & C_N \end{bmatrix}, \tag{8}
\]
with $A_k \in \mathbb{R}^{n \times n}$ and $C_k \in \mathbb{R}^{n \times n}$ defined as follows:
\[
A_k = -Q_k^{-1}G_k, \qquad C_k = Q_k^{-1} + G_{k+1}^\top Q_{k+1}^{-1}G_{k+1} + H_k^\top R_k^{-1}H_k, \tag{9}
\]
with the convention $G_{N+1} = 0$. The special structure of the matrix $C$ in (8) can be exploited to solve the linear system, and hence the Kalman smoothing problem. While a structure-agnostic matrix inversion scheme has complexity $O(n^3N^3)$, exploiting the block tridiagonal structure reduces this complexity to $O(n^3N)$.

A straightforward algorithm for solving any symmetric positive definite block tridiagonal linear system is given in [10]. We review it here, since it is essential to build the connection to the standard viewpoint of the RTS smoother. Suppose for $k = 1, \dots, N$ we have $c_k \in \mathbb{R}^{n \times n}$, $e_k \in \mathbb{R}^{n \times \ell}$, $r_k \in \mathbb{R}^{n \times \ell}$, and for $k = 2, \dots, N$, $a_k \in \mathbb{R}^{n \times n}$. We define the corresponding block tridiagonal system of equations
\[
\begin{bmatrix} c_1 & a_2^\top & & & \\ a_2 & c_2 & a_3^\top & & \\ & \ddots & \ddots & \ddots & \\ & & a_{N-1} & c_{N-1} & a_N^\top \\ & & & a_N & c_N \end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{N-1} \\ e_N \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_{N-1} \\ r_N \end{bmatrix}. \tag{10}
\]
The following algorithm for (10) is given in [10, Algorithm 4].
Algorithm 2.1
The inputs to this algorithm are $\{a_k\}$, $\{c_k\}$, and $\{r_k\}$. The output is a sequence $\{e_k\}$ that solves equation (10).
1. Set $d_1 = c_1$ and $s_1 = r_1$.
2. For $k = 2, \dots, N$, set $d_k = c_k - a_k d_{k-1}^{-1} a_k^\top$ and $s_k = r_k - a_k d_{k-1}^{-1} s_{k-1}$.
3. Set $e_N = d_N^{-1} s_N$.
4. For $k = N-1, \dots, 1$, set $e_k = d_k^{-1}\big(s_k - a_{k+1}^\top e_{k+1}\big)$.
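For concreteness, the following Octave/MATLAB sketch implements Algorithm 2.1 directly (the function name and array layout are our own; the ckbs package [6] provides its own implementation):

\begin{verbatim}
% Minimal sketch of Algorithm 2.1 (block tridiagonal solve).
% a is n-by-n-by-N with a(:,:,k) = a_k (a(:,:,1) is unused),
% c is n-by-n-by-N (diagonal blocks), r is n-by-ell-by-N.
function e = tridiag_solve(a, c, r)
  N = size(c, 3);
  d = zeros(size(c)); s = zeros(size(r)); e = zeros(size(r));
  d(:,:,1) = c(:,:,1); s(:,:,1) = r(:,:,1);
  for k = 2:N                          % forward block elimination
    dk1 = d(:,:,k-1);
    d(:,:,k) = c(:,:,k) - a(:,:,k) * (dk1 \ a(:,:,k)');
    s(:,:,k) = r(:,:,k) - a(:,:,k) * (dk1 \ s(:,:,k-1));
  end
  e(:,:,N) = d(:,:,N) \ s(:,:,N);
  for k = N-1:-1:1                     % back-substitution
    e(:,:,k) = d(:,:,k) \ (s(:,:,k) - a(:,:,k+1)' * e(:,:,k+1));
  end
end
\end{verbatim}

Each step factors only an $n \times n$ block, which is the source of the $O(n^3N)$ complexity noted above.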
Note that after the first two steps of Algorithm 2.1, we have arrived at a linear system equivalent to (10) but upper triangular:
\[
\begin{bmatrix} d_1 & a_2^\top & & \\ & d_2 & \ddots & \\ & & \ddots & a_N^\top \\ & & & d_N \end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}
=
\begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_N \end{bmatrix}. \tag{11}
\]
The last two steps of the algorithm then simply back-solve for the $e_k$.

Connection of Algorithm 2.1 to the Kalman Filter and RTS Smoother

Looking at the very first blocks, we now substitute the Kalman data structures (9) into step 2 of Algorithm 2.1:
\[
\begin{aligned}
d_2 &= c_2 - a_2 d_1^{-1} a_2^\top \\
&= Q_2^{-1} + H_2^\top R_2^{-1}H_2 + G_3^\top Q_3^{-1}G_3 - (Q_2^{-1}G_2)\Big[\underbrace{Q_1^{-1} + H_1^\top R_1^{-1}H_1}_{P_{1|1}^{-1}} + G_2^\top Q_2^{-1}G_2\Big]^{-1}(Q_2^{-1}G_2)^\top \\
&= \underbrace{P_{2|1}^{-1} + H_2^\top R_2^{-1}H_2}_{P_{2|2}^{-1}} + G_3^\top Q_3^{-1}G_3.
\end{aligned} \tag{12}
\]
These relationships can be seen quickly from [5, Theorem 2.2.7]. The matrices $P_{k|k}$ and $P_{k|k-1}$ are common in the Kalman filter framework: they represent the covariance of the state at time $k$ given the measurements $\{z_1, \dots, z_k\}$, and the covariance of the a priori state estimate at time $k$ given the measurements $\{z_1, \dots, z_{k-1}\}$, respectively. From the above computation we see that $d_2 = P_{2|2}^{-1} + G_3^\top Q_3^{-1}G_3$, and since also $d_1 = c_1 = P_{1|1}^{-1} + G_2^\top Q_2^{-1}G_2$, it is easy to see by induction that in fact
\[
d_k = P_{k|k}^{-1} + G_{k+1}^\top Q_{k+1}^{-1}G_{k+1}.
\]
We can play the same game with $s_k$. Keeping in mind that $r = H^\top R^{-1}z + G^\top Q^{-1}w$, we have
\[
s_2 = r_2 - a_2 d_1^{-1} s_1 = H_2^\top R_2^{-1}z_2 + (Q_2^{-1}G_2)\Big[P_{1|1}^{-1} + G_2^\top Q_2^{-1}G_2\Big]^{-1}\underbrace{\big(H_1^\top R_1^{-1}z_1 + P_{1|0}^{-1}x_{1|0}\big)}_{a_{1|1}} = a_{2|2}. \tag{13}
\]
These relationships also follow from [5, Theorem 2.2.7]. The quantities $a_{1|1}$ and $a_{2|2}$ are from the information filtering literature, and are less commonly known: they are preconditioned estimates
\[
a_{k|k} = P_{k|k}^{-1}x_{k|k}, \qquad a_{k|k-1} = P_{k|k-1}^{-1}x_{k|k-1}. \tag{14}
\]
Again, by induction we have precisely that $s_k = a_{k|k}$. When you put all of this together, you see that step 3 of Algorithm 2.1 is given by
\[
e_N = d_N^{-1}s_N = \big(P_{N|N}^{-1} + 0\big)^{-1}P_{N|N}^{-1}x_{N|N} = x_{N|N}, \tag{15}
\]
so in fact $e_N$ is the Kalman filter estimate (and the RTS smoother estimate) for time point $N$.
Step 4 of Algorithm 2.1 then implements the backward Kalman filter, computing the smoothed estimates $x_{k|N}$ by back-substitution. Therefore the RTS smoother is Algorithm 2.1 applied to (7). The consequences are profound: instead of working with the kinds of expressions seen in (12) and (13), we can think at a high level, focusing on (6), and simply use Algorithm 2.1 (or variants) as a subroutine. As will become apparent, the key to all extensions is preserving the block tridiagonal structure in the subproblems, so that Algorithm 2.1 can be used.

Fig. 2 Tracking a smooth signal (sine wave) using a generic linear process model (16) and direct (noisy) measurements (18). Red solid line is the true signal, blue dashed line is the Kalman (RTS) smoother estimate. Measurements are displayed as circles.
In this example, we focus on a very useful and simple model: the process model for a smooth signal. Smooth signals arise in a range of applications: physics-based models, biological data, and financial data all have some inherent smoothness. A surprisingly versatile technique for modeling any such process is to treat it as integrated Brownian motion. We illustrate this on a scalar time series $x$. We introduce a new derivative state $\dot x$, with process model $\dot x_{k+1} = \dot x_k + \dot w_k$, and then model the signal $x$ of interest as $x_{k+1} = x_k + \dot x_k \Delta t + w_k$. Thus we obtain an augmented (2D) state with process model
\[
\begin{bmatrix} \dot x_{k+1} \\ x_{k+1} \end{bmatrix} = \begin{bmatrix} I & 0 \\ \Delta t\, I & I \end{bmatrix}\begin{bmatrix} \dot x_k \\ x_k \end{bmatrix} + \begin{bmatrix} \dot w_k \\ w_k \end{bmatrix}. \tag{16}
\]
Using a well-known connection to stochastic differential equations (see [26, 38, 11]), we use the covariance matrix
\[
Q_k = \sigma^2 \begin{bmatrix} \Delta t & \Delta t^2/2 \\ \Delta t^2/2 & \Delta t^3/3 \end{bmatrix}. \tag{17}
\]
Model equations (16) and (17) can be applied as a process model for any smooth process. For our numerical example, we take direct measurements of the $\sin$ function, which is very smooth. Our measurement model therefore is
\[
z_k = H_k x_k + v_k, \qquad H_k = \begin{bmatrix} 0 & 1 \end{bmatrix}. \tag{18}
\]
The resulting fit is shown in Figure 2. The measurements guide the estimate to the true smooth time series, giving very nice results. The figure was generated using the ckbs package [6], specifically using the example file affine_ok.m. Measurement errors were generated with a fixed variance $R_k$, and this value was given to the smoother. The $\sigma^2$ in (17) was taken to be 1. The program and example are available for download from COIN-OR.
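The assembly of the system (5)-(7) for this model is compact enough to spell out. The following self-contained Octave/MATLAB sketch is our own stand-in for affine_ok.m; the measurement standard deviation and initial state below are arbitrary choices, not values from the text:

\begin{verbatim}
% Sketch: smooth-signal Kalman smoothing via the least squares system (7).
N = 100; dt = 2*pi/N; sigma2 = 1;
t = dt*(1:N); z = sin(t) + 0.35*randn(1, N); % noisy direct measurements
Gk = [1 0; dt 1];                            % process model (16), state [xdot; x]
Qk = sigma2*[dt dt^2/2; dt^2/2 dt^3/3];      % covariance (17)
Hk = [0 1];                                  % measurement model (18)
sub = spdiags(ones(N,1), -1, N, N);          % subdiagonal pattern
G = speye(2*N) - kron(sub, Gk);              % I on diagonal, -G_k below
Qinv = kron(speye(N), inv(Qk));
H = kron(speye(N), Hk);
Rinv = speye(N) / 0.35^2;
w = zeros(2*N, 1); w(1:2) = Gk * [1; 0];     % g_0 = G_1 x_0, with x_0 = [1; 0]
C = H'*Rinv*H + G'*Qinv*G;                   % block tridiagonal matrix (8)
x = C \ (H'*Rinv*z(:) + G'*Qinv*w);          % smoothed states, solves (7)
xhat = x(2:2:end);                           % estimated signal component
\end{verbatim}

The sparse backslash exploits the banded structure automatically; Algorithm 2.1 is the explicit block version of the same factorization.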
In the previous section, we showed that when $g_k$ and $h_k$ in model (1) are linear and $v_k, w_k$ are Gaussian, the smoother is equivalent to solving a least squares problem (6). We also showed that the filter estimates appear as intermediate results when one uses Algorithm 2.1 to solve the problem. In this section, we turn to the case where $g_k$ and $h_k$ are nonlinear. We first formulate the smoothing problem as a maximum a posteriori (MAP) problem, and show that it is a nonlinear least squares (NLLS) problem. To set up for later sections, we also introduce the broader class of convex composite problems. We then review the standard Gauss-Newton method in the broader context of convex composite models, and show that when applied to the NLLS problem, each iteration is equivalent to solving (6), and therefore to a full execution of the RTS smoother. We also show how to use a simple line search to guarantee convergence of the method to a local optimum of the MAP problem.

This powerful approach, known for at least 20 years [21, 12, 9], is rarely used in practice; instead practitioners favor the EKF or the UKF [18, 28], neither of which converges to a (local) MAP solution. MAP approaches work very well for a broad range of applications, and it is not clear why one would throw away an efficient MAP solver in favor of another scheme. To our knowledge, the optimization (MAP) approach has never been included in a performance comparison of 'cutting edge' methods, such as [34]. While such a comparison is not in the scope of this work, we lay the foundation by providing a straightforward exposition of the optimization approach and a reproducible numerical illustration (with publicly available code) for smoothing the Van der Pol oscillator, a well known problem where the process model is a nonlinear ODE.

In order to develop a notation analogous to (6), we define functions $g: \mathbb{R}^{nN} \to \mathbb{R}^{nN}$ and $h: \mathbb{R}^{nN} \to \mathbb{R}^M$, with $M = \sum_k m(k)$, from the components $g_k$ and $h_k$ as follows:
\[
g(x) = \begin{bmatrix} x_1 \\ x_2 - g_2(x_1) \\ \vdots \\ x_N - g_N(x_{N-1}) \end{bmatrix}, \qquad
h(x) = \begin{bmatrix} h_1(x_1) \\ h_2(x_2) \\ \vdots \\ h_N(x_N) \end{bmatrix}. \tag{1}
\]
With this notation, the MAP problem, obtained exactly as in Section 2.2, is given by
\[
\min_x \; f(x) = \tfrac12\|g(x) - w\|^2_{Q^{-1}} + \tfrac12\|h(x) - z\|^2_{R^{-1}}, \tag{2}
\]
where $z$ and $w$ are exactly as in (5), so that $z$ is the entire vector of measurements, and $w$ contains the initial estimate $g_1(x_0)$ in the first $n$ entries and zeros in the remaining $n(N-1)$ entries.

We have formulated the nonlinear smoothing problem as a nonlinear least squares (NLLS) problem; compare (2) with (6). We take this opportunity to note that NLLS problems are a special example of a more general structure. Objective (2) may be written as a composition of a convex function $\rho$ with a smooth function $F$:
\[
f(x) = \rho(F(x)), \tag{3}
\]
where
\[
\rho\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \tfrac12\|y_1\|^2_{Q^{-1}} + \tfrac12\|y_2\|^2_{R^{-1}}, \qquad F(x) = \begin{bmatrix} g(x) - w \\ h(x) - z \end{bmatrix}. \tag{4}
\]
As we show in the next subsection, problems of the general form (3) can be solved using the Gauss-Newton method, which is typically associated specifically with NLLS problems. Presenting Gauss-Newton right away in this more general setting will make it easier to understand extensions in the following sections of the chapter.

The Gauss-Newton method can be used to solve problems of the form (3), and it uses a very simple strategy: iteratively linearizing the smooth function $F$ [15]. More specifically, the Gauss-Newton method is an iterative method of the form
\[
x^{\nu+1} = x^\nu + \gamma^\nu d^\nu, \tag{5}
\]
where $d^\nu$ is the Gauss-Newton search direction, and $\gamma^\nu$ is a scalar that guarantees
\[
f(x^{\nu+1}) < f(x^\nu). \tag{6}
\]
The direction $d^\nu$ is obtained by solving the subproblem
\[
d^\nu = \arg\min_d \; \tilde f(d) := \rho\big(F(x^\nu) + \nabla F(x^\nu)^\top d\big). \tag{7}
\]
We then set $\tilde\Delta f(x^\nu) = \tilde f(d^\nu) - f(x^\nu)$. By [15, Lemma 2.3, Theorem 3.6],
\[
f'(x^\nu; d^\nu) \le \tilde\Delta f(x^\nu) \le 0, \tag{8}
\]
with equality if and only if $x^\nu$ is a first-order stationary point for $f$. This implies that a suitable stopping criterion for the algorithm is the condition $\tilde\Delta f(x^\nu) \sim 0$. Moreover, if $x^\nu$ is not a first-order stationary point for $f$, then the direction $d^\nu$ is a direction of strict descent for $f$ at $x^\nu$.

Once the direction $d^\nu$ is obtained with $\tilde\Delta f(x^\nu) < 0$, a step size $\gamma^\nu$ is obtained by a standard backtracking line search procedure: pick values $0 < \lambda < 1$ and $0 < \kappa < 1$, and evaluate $f(x^\nu + \lambda^s d^\nu)$ for $s = 0, 1, 2, \dots$, until
\[
f(x^\nu + \lambda^s d^\nu) \le f(x^\nu) + \kappa\lambda^s \tilde\Delta f(x^\nu) \tag{9}
\]
is satisfied for some $\bar s$; then set $\gamma^\nu = \lambda^{\bar s}$ and make the GN update (5). The fact that there is a finite value of $s$ for which (9) is satisfied follows from the inequality $f'(x^\nu; d^\nu) \le \tilde\Delta f(x^\nu) < 0$. The inequality (9) is called the Armijo inequality. A general convergence theory for this algorithm, as well as a wide range of others, is found in [15]. For the NLLS case the situation is simple, since $\rho$ is a quadratic, and standard convergence theory is given for example in [27]. However, the more general theory is essential in the later sections.

To implement the Gauss-Newton method described above, one must compute the solution $d^\nu$ to the Gauss-Newton subproblem (7) for (2). That is, one must compute
\[
d^\nu = \arg\min_d \; \tilde f(d) = \tfrac12\big\|G^\nu d - \underbrace{(w - g(x^\nu))}_{w^\nu}\big\|^2_{Q^{-1}} + \tfrac12\big\|H^\nu d - \underbrace{(z - h(x^\nu))}_{z^\nu}\big\|^2_{R^{-1}}, \tag{10}
\]
where
\[
G^\nu = \begin{bmatrix} I & & & \\ -g_2^{(1)}(x_1^\nu) & I & & \\ & \ddots & \ddots & \\ & & -g_N^{(1)}(x_{N-1}^\nu) & I \end{bmatrix}, \qquad
H^\nu = \mathrm{diag}\{h_1^{(1)}(x_1^\nu), \dots, h_N^{(1)}(x_N^\nu)\}. \tag{11}
\]
However, problem (10) has exactly the same structure as (6), a fact that we have emphasized by defining
\[
w^\nu := w - g(x^\nu), \qquad z^\nu := z - h(x^\nu). \tag{12}
\]
Therefore, we can solve it efficiently by using Algorithm 2.1. The linearization step in (10) should remind the reader of the EKF. Note, however, that the Gauss-Newton method is iterative, and we iterate until convergence to a local minimum of (2). We also linearize along the entire state sequence $x^\nu$ at once in (10), rather than re-linearizing as we make our way through the $x_k^\nu$'s.

The Van der Pol oscillator is a popular nonlinear process for comparing Kalman filters; see [24] and [30, Section 4.1]. The oscillator is governed by the nonlinear ODE model
\[
\dot X_1(t) = X_2(t) \quad\text{and}\quad \dot X_2(t) = \mu\big[1 - X_1(t)^2\big]X_2(t) - X_1(t). \tag{13}
\]
In contrast to the linear model (16), which was a generic process for a smooth signal, we now take the Euler discretization of (13) to be the specific process model for this situation. Given $X(t_{k-1}) = x_{k-1}$, the Euler approximation for $X(t_{k-1} + \Delta t)$ is
\[
g_k(x_{k-1}) = \begin{pmatrix} x_{1,k-1} + x_{2,k-1}\Delta t \\ x_{2,k-1} + \big\{\mu[1 - x_{1,k-1}^2]\,x_{2,k-1} - x_{1,k-1}\big\}\Delta t \end{pmatrix}. \tag{14}
\]
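The Gauss-Newton loop above is short enough to sketch in full. The following Octave/MATLAB code is our own illustration, not the ckbs implementation; the function handles, iteration caps, tolerances, and the constants $\lambda$ and $\kappa$ are all placeholder choices:

\begin{verbatim}
% Sketch: Gauss-Newton with Armijo backtracking for the NLLS problem (2).
% g, h return the stacked vectors of (1); Gjac, Hjac return G^nu, H^nu of (11).
function x = gn_smoother(x, g, h, Gjac, Hjac, Qinv, Rinv, w, z)
  lambda = 0.5; kappa = 1e-4;
  f = @(x) 0.5*(g(x)-w)'*Qinv*(g(x)-w) + 0.5*(h(x)-z)'*Rinv*(h(x)-z);
  for nu = 1:100
    G = Gjac(x); H = Hjac(x);
    wn = w - g(x); zn = z - h(x);               % residuals, as in (12)
    C = H'*Rinv*H + G'*Qinv*G;                  % block tridiagonal, as in (8)
    d = C \ (G'*Qinv*wn + H'*Rinv*zn);          % direction, subproblem (10)
    ft = 0.5*(G*d-wn)'*Qinv*(G*d-wn) + 0.5*(H*d-zn)'*Rinv*(H*d-zn);
    dec = ft - f(x);                            % predicted decrease, <= 0
    if dec > -1e-10, break; end                 % near-stationary: stop
    gam = 1;
    while f(x + gam*d) > f(x) + kappa*gam*dec   % Armijo inequality (9)
      gam = lambda*gam;
    end
    x = x + gam*d;
  end
end
\end{verbatim}

The only model-specific ingredients are the handles for $g$, $h$, and their Jacobians; for the Van der Pol model these come directly from (14).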
Fig. 3 Tracking the Van der Pol oscillator using a nonlinear process model (14) and direct (noisy) measurements (16) of the $X_1$-component only. Black solid line is the true signal, blue dashed line is the nonlinear Kalman smoother estimate. Measurements are displayed as circles.

For the simulation, the 'ground truth' is obtained from a stochastic Euler approximation of the Van der Pol oscillator. To be specific, with $N = 80$ time points $t_k = k\Delta t$, the ground truth state vector $x_k$ is given by $x_0 = (0, -0.5)^\top$ and, for $k = 1, \dots, N$,
\[
x_k = g_k(x_{k-1}) + w_k, \tag{15}
\]
where $\{w_k\}$ is a realization of independent Gaussian noise with variance 0.01 and $g_k$ is given in (14).
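This simulation is easy to reproduce in outline. In the sketch below, the values of $\mu$, $\Delta t$, and the measurement variance are placeholders, while $x_0$, $N$, and the process noise variance follow the text:

\begin{verbatim}
% Sketch: stochastic Euler ground truth (15) for the Van der Pol model (14).
mu = 2; N = 80; dt = 0.05;                  % mu and dt are placeholder values
gk = @(x) [ x(1) + x(2)*dt;
            x(2) + (mu*(1 - x(1)^2)*x(2) - x(1))*dt ];
x = zeros(2, N+1); x(:,1) = [0; -0.5];      % x_0 as in the text
for k = 2:N+1
  x(:,k) = gk(x(:,k-1)) + sqrt(0.01)*randn(2, 1);
end
z = x(1, 2:end) + randn(1, N);              % noisy measurements of X_1 only
\end{verbatim}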
Our model for state transitions is also (15), with $Q_k = 0.01\,I$ for $k > 1$, and so is identical to the model used to simulate the ground truth $\{x_k\}$. Thus, we have precise knowledge of the process that generated the ground truth $\{x_k\}$. The initial state $x_0$ is imprecisely specified by setting $g_1(x_0) \ne x_0$, with corresponding variance $Q_1$. For $k = 1, \dots, N$, noisy direct measurements $z_k$ of the first component only were used,
\[
z_k = x_{1,k} + v_k, \tag{16}
\]
with $v_k$ zero-mean Gaussian noise. The resulting fit is shown in Figure 3. Despite the noisy measurements of only $X_1$, we are able to get a good fit for both components. The figure was generated using the ckbs package [6]; see the file vanderpol_experiment_simple.m. The program and example are available for download from COIN-OR.

In almost every real-world problem, additional prior information is known about the state. In many cases, this information can be represented using state space constraints. For example, in tracking physical bodies, we often know (roughly or approximately) the topography of the terrain; this information can be encoded as a simple box constraint on the state. We may also know physical limitations (e.g. maximum acceleration or velocity) of objects being tracked, or hard bounds set by biological or financial systems. These and many other examples can be formulated using state space constraints. The ability to incorporate this information is particularly useful when measurements are inaccurate or far between.

In this section, we first show how to add affine inequality constraints to the affine smoother formulation in Section 2. This requires a novel methodology: interior point (IP) methods, an important topic in optimization [49, 32, 37]. IP methods work directly with optimality conditions, so we derive these conditions for the smoothing problem. Rather than review theoretical results about IP methods, we give a general overview and show how they specialize to the linear constrained smoother. The constrained Kalman smoother was originally proposed in [11], but we improve on that work here, and present a simplified algorithm, which is also faster and more numerically stable. We illustrate the algorithm using a numerical example, building on the example in Section 2.
Once the linear smoother with linear inequality constraints is understood, we review the constrained nonlinear smoother (which can have nonlinear process, measurement, and constraint functions). Using [11] and references therein, we show that the constrained nonlinear smoother is iteratively solved using linear constrained smoothing subproblems, analogously to how the nonlinear smoother in Section 3 is iteratively solved using linear smoothing subproblems from Section 2. Because of this hierarchy, the improvements to the affine algorithm immediately carry over to the nonlinear case. We end with a nonlinear constrained numerical example.
We start with the linear smoothing problem (6), and impose linear inequality constraints on the state space $x$:
\[
B_k x_k \le b_k. \tag{1}
\]
By choosing the matrix $B_k$ and vector $b_k$ appropriately, one can ensure $x_k$ lies in any polyhedral set, since such a set is defined by a finite intersection of half-spaces. Box constraints, one of the simplest and most useful modeling tools ($l_k \le x_k \le u_k$), can be imposed via
\[
\begin{bmatrix} I \\ -I \end{bmatrix} x_k \le \begin{bmatrix} u_k \\ -l_k \end{bmatrix}.
\]
In order to formulate the problem for the entire state sequence, we define
\[
B = \mathrm{diag}(\{B_k\}), \qquad b = \mathrm{vec}(\{b_k\}), \tag{2}
\]
so that all of the constraints can be written simultaneously as $Bx \le b$. The constrained optimization problem is now given by
\[
\begin{aligned}
\min_x \quad & f(x) = \tfrac12\|Hx - z\|^2_{R^{-1}} + \tfrac12\|Gx - w\|^2_{Q^{-1}} \\
\text{subject to} \quad & Bx + s = b, \quad s \ge 0.
\end{aligned} \tag{3}
\]
Note that we have rewritten the inequality constraint as an equality constraint by introducing a new 'slack' variable $s$. We derive the Karush-Kuhn-Tucker (KKT) conditions using the Lagrangian formulation. The Lagrangian corresponding to (3) is given by
\[
L(x, u, s) = \tfrac12\|Hx - z\|^2_{R^{-1}} + \tfrac12\|Gx - w\|^2_{Q^{-1}} + u^\top(Bx + s - b). \tag{4}
\]
The KKT conditions are now obtained by differentiating $L$ with respect to its arguments. Recall that the gradient of (6) is given by $(H^\top R^{-1}H + G^\top Q^{-1}G)x - H^\top R^{-1}z - G^\top Q^{-1}w$. As in (8), set $C = H^\top R^{-1}H + G^\top Q^{-1}G$, and for convenience set
\[
c = H^\top R^{-1}z + G^\top Q^{-1}w. \tag{5}
\]
The KKT necessary and sufficient conditions for optimality are given by
\[
\begin{aligned}
\nabla_x L &= Cx - c + B^\top u = 0, \\
\nabla_u L &= Bx + s - b = 0, \\
& u_i s_i = 0 \;\;\forall i; \qquad u_i, s_i \ge 0.
\end{aligned} \tag{6}
\]
The last set of nonlinear equations is known as the complementarity conditions. In primal-dual interior point methods, the key idea for solving (3) is to successively solve relaxations of the system (6) that converge to a triplet $(\bar x, \bar u, \bar s)$ satisfying (6). IP methods work directly to find solutions of (6). They do so by iteratively relaxing the complementarity conditions $u_i s_i = 0$ to $u_i s_i = \mu$ as they drive the relaxation parameter $\mu$ to 0. The relaxed KKT system is defined by
\[
F_\mu(s, u, x) = \begin{bmatrix} s + Bx - b \\ SU\mathbf{1} - \mu\mathbf{1} \\ Cx - c + B^\top u \end{bmatrix}, \tag{7}
\]
where $S$ and $U$ are diagonal matrices with $s$ and $u$ on the diagonal, so the second equation in $F_\mu$ implements the relaxation $u_i s_i = \mu$ of (6). Note that the relaxation requires $u_i, s_i > 0$ for all $i$. Since the solution to (3) is found by driving the KKT system to 0, at every iteration IP methods attempt to drive $F_\mu$ to 0 using Newton's method for root finding. Newton's root finding method solves the linear system
\[
F^{(1)}_\mu(s, u, x)\begin{bmatrix} \Delta s \\ \Delta u \\ \Delta x \end{bmatrix} = -F_\mu(s, u, x). \tag{8}
\]
It is important to see the full details of solving (8) in order to see why it is so effective for constrained Kalman smoothing. The full system is given by
\[
\begin{bmatrix} I & 0 & B \\ U & S & 0 \\ 0 & B^\top & C \end{bmatrix}\begin{bmatrix} \Delta s \\ \Delta u \\ \Delta x \end{bmatrix} = -\begin{bmatrix} s + Bx - b \\ SU\mathbf{1} - \mu\mathbf{1} \\ Cx - c + B^\top u \end{bmatrix}. \tag{9}
\]
Applying the row operations
\[
\text{row}_2 \leftarrow \text{row}_2 - U\,\text{row}_1, \qquad \text{row}_3 \leftarrow \text{row}_3 - B^\top S^{-1}\,\text{row}_2,
\]
we obtain the equivalent system
\[
\begin{bmatrix} I & 0 & B \\ 0 & S & -UB \\ 0 & 0 & C + B^\top S^{-1}UB \end{bmatrix}\begin{bmatrix} \Delta s \\ \Delta u \\ \Delta x \end{bmatrix}
= -\begin{bmatrix} s + Bx - b \\ -U(Bx - b) - \mu\mathbf{1} \\ Cx - c + B^\top u + B^\top S^{-1}\big(U(Bx - b) + \mu\mathbf{1}\big) \end{bmatrix}.
\]
In order to find the update for $\Delta x$, we have to solve the system
\[
\big(C + B^\top S^{-1}UB\big)\Delta x = -\Big(Cx - c + B^\top u + B^\top S^{-1}\big(U(Bx - b) + \mu\mathbf{1}\big)\Big). \tag{10}
\]
Note the structure of the matrix on the left-hand side of (10): the matrix $C$ is the same as in (6), so it is symmetric positive definite and block tridiagonal. The matrices $S^{-1}$ and $U$ are diagonal, and we always ensure they have only positive elements. The matrices $B$ and $B^\top$ are both block diagonal. Therefore $C + B^\top S^{-1}UB$ has the same structure as $C$, and we can solve (10) using Algorithm 2.1. Once we have $\Delta x$, the remaining two updates are obtained by back-solving:
\[
\Delta u = S^{-1}U\big(B(x + \Delta x) - b\big) + \mu S^{-1}\mathbf{1} \tag{11}
\]
and
\[
\Delta s = -s + b - B(x + \Delta x). \tag{12}
\]
This approach improves on the algorithm presented in [11] solely by changing the order of variables and equations in (7). It simplifies the derivation while also improving speed and numerical stability.

It remains to explain how $\mu$ is driven to 0. There are several strategies; see [49, 32, 37]. For the Kalman smoothing application, we use one of the simplest: for two out of every three iterations, $\mu$ is aggressively decreased via the update $\mu \leftarrow \mu/10$, while in the remaining iterations $\mu$ is unchanged. In practice, one seldom needs more than 10 interior point iterations; the constrained linear smoother therefore performs a constant multiple of the work of the linear smoother.
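To make the iteration concrete, here is a minimal Octave/MATLAB sketch of the interior point loop (our own code, not the ckbs implementation). A standard fraction-to-boundary damping is added so that $s$ and $u$ stay strictly positive, a detail not spelled out above:

\begin{verbatim}
% Sketch: interior point loop for the constrained smoother (3).
% C, cvec, B, b as in (5) and (2); start with s > 0, u > 0.
function x = ip_smoother(C, cvec, B, b, x, u, s)
  mu = 1;
  for it = 1:30
    M   = C + B' * diag(u./s) * B;           % matrix of (10), block tridiagonal
    rhs = -(C*x - cvec + B'*u + B'*((u.*(B*x - b) + mu)./s));
    dx  = M \ rhs;                           % solvable by Algorithm 2.1
    du  = (u.*(B*(x + dx) - b) + mu)./s;     % update (11)
    ds  = -s + b - B*(x + dx);               % update (12)
    a   = 0.99*min([1; -s(ds < 0)./ds(ds < 0); -u(du < 0)./du(du < 0)]);
    x = x + a*dx; u = u + a*du; s = s + a*ds;
    if mod(it, 3) ~= 0, mu = mu/10; end      % two of every three steps shrink mu
  end
end
\end{verbatim}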
In this section, we present two simple examples, both with linear constraints.

Fig. 4 Two examples of linear constraints. Black solid line is the true signal, magenta dash-dot line is the unconstrained Kalman smoother, and blue dashed line is the constrained Kalman smoother. Measurements are displayed as circles, and bounds are shown as green horizontal lines. In the left panel, note that the performance of the bounded smoother is significantly better around times 4-10; the unconstrained smoother is fooled by the measurements at times 4 and 8. In the right panel, as the oscillations die down due to damping, the measurement variance remains unchanged, so it becomes much more difficult to track the signal without the bound constraints.

Constant Box Constraints

In the first example, we impose box constraints in the example of Section 2.6. Specifically, we take advantage of the fact that the state is bounded:
\[
-1 \le x_{2,k} \le 1. \tag{13}
\]
We can encode this information in the form (1) with
\[
B_k = \begin{bmatrix} 0 & 1 \\ 0 & -1 \end{bmatrix}, \qquad b_k = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{14}
\]
We contrast the performance of the constrained linear smoother with that of the linear smoother without constraints. To show the advantages of modeling with constraints, we increase the measurement noise in both situations to $\sigma = 1$.
The results are shown in Figure 4. The constrained smoother avoids some of the problems encountered by the unconstrained smoother. Of particular interest are the middle and end parts of the track, where the unconstrained smoother goes far afield because of bad measurements. The constrained smoother is able to track portions of the track extremely well, having avoided the bad measurements with the aid of the bound constraints. The figure was generated using the ckbs package [6], specifically using the example file affine_ok_boxC.m.

Variable Box Constraints

In the second example, we impose time-varying constraints on the state. Specifically, we track an exponentially damped sinusoid with a linear trend,
\[
\exp(-\alpha t)\sin(\beta t) + ct,
\]
using the 'smooth signal' process model and direct measurements, as in Section 2.6. The challenge here is that as the oscillations die down because of the exponential damping, the variance of the measurements remains the same. We can improve the performance by giving the smoother the exponential damping envelope as constraints. We included the second example to emphasize that 'linearity' of constraints means 'with respect to the state'; in fact, the constraints in the second example are simply box constraints that are time dependent. The second example is no more complicated than the first one for the constrained smoother.

We now consider the nonlinear constrained smoother, where we allow the process functions $g_k$ and measurement functions $h_k$ to be nonlinear, and also allow nonlinear smooth constraints $\xi_k(x_k) \le b_k$. To be consistent with the notation used throughout this chapter, we define a new function
\[
\xi(x) = \begin{bmatrix} \xi_1(x_1) \\ \xi_2(x_2) \\ \vdots \\ \xi_N(x_N) \end{bmatrix}, \tag{15}
\]
so that all the constraints can be written simultaneously as $\xi(x) \le b$. The problem we would like to solve now is a constrained reformulation of (2):
\[
\min_x \; f(x) = \tfrac12\|g(x) - w\|^2_{Q^{-1}} + \tfrac12\|h(x) - z\|^2_{R^{-1}} \quad \text{subject to } \xi(x) - b \le 0. \tag{16}
\]
At this point, we come back to the convex composite representation described in Section 3.1. The constraint $\xi(x) - b \le 0$ can be captured by the penalty term
\[
\delta\big(\xi(x) - b \mid \mathbb{R}_-\big), \tag{17}
\]
where $\delta(x \mid C)$ is the convex indicator function:
\[
\delta(x \mid C) = \begin{cases} 0 & x \in C, \\ \infty & x \notin C. \end{cases} \tag{18}
\]
Therefore, the objective of (16) can be represented as $\rho(F(x))$, with
\[
\rho\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \tfrac12\|y_1\|^2_{Q^{-1}} + \tfrac12\|y_2\|^2_{R^{-1}} + \delta(y_3 \mid \mathbb{R}_-), \qquad
F(x) = \begin{bmatrix} g(x) - w \\ h(x) - z \\ \xi(x) - b \end{bmatrix}. \tag{19}
\]
The approach to nonlinear smoothing in [11] is essentially the Gauss-Newton method described in Section 3.2, applied to (19). In other words, at each iteration $\nu$, the function $F$ is linearized, and the direction finding subproblem is obtained by solving
\[
\begin{aligned}
\min_d \quad & \tfrac12\big\|G^\nu d - \underbrace{(w - g(x^\nu))}_{w^\nu}\big\|^2_{Q^{-1}} + \tfrac12\big\|H^\nu d - \underbrace{(z - h(x^\nu))}_{z^\nu}\big\|^2_{R^{-1}} \\
\text{subject to} \quad & B^\nu d \le \underbrace{b - \xi(x^\nu)}_{b^\nu},
\end{aligned} \tag{20}
\]
where $G^\nu$ and $H^\nu$ are exactly as in (10), and $B^\nu = \nabla_x\xi(x^\nu)$ is a block diagonal matrix because of the structure of $\xi$ in (15); we have written the indicator function in (19) as an explicit constraint to emphasize the structure of the subproblem. Note that (20) has exactly the same structure as the linear constrained smoothing problem (3), and therefore can be solved using the interior point approach of the previous section.
Because the convex composite objective (19) is not finite valued (due to the indicator function of the feasible set), to prove convergence of the nonlinear smoother, [11] uses results from [14]. We refer the interested reader to [11, Lemma 8, Theorem 9] for theoretical convergence results, and to [11, Algorithm 6] for the full algorithm, including line search details. Because of the hierarchical dependence of the nonlinear constrained smoother on the linear constrained smoother, the simplified improved approach we presented in Section 4.2 pays off even more in the nonlinear case, where it is used repeatedly as a subroutine.
Fig. 5 Smoother results for the ship tracking example with linear process model, nonlinear measurement model, and nonlinear constraints (with respect to the state). Black solid line is the true state, red triangles denote the constraint, magenta dash-dot line is the unconstrained estimate, and blue dashed line gives the constrained nonlinear smoothed estimate.
The example in this section is reproduced from [11]. Consider the problem of tracking a ship traveling close to shore, where we are given distance measurements from two fixed stations to the ship, as well as the location of the shoreline. Distance to a fixed station is a nonlinear function of position, so the measurement model here is nonlinear. In addition, the corresponding constraint functions $\{\xi_k\}$ are not affine, because the shoreline is not a straight line. For the purpose of simulating the measurements $\{z_k\}$, the ship velocities and positions are given by
\[
X(t) = \big[\,1,\; t,\; -\cos(t),\; c_0 - \sin(t)\,\big]^\top,
\]
where the first and third components are the velocities, the second and fourth components are the position coordinates, and $c_0$ is a fixed shore offset. Both components of the ship's position are modeled using the smooth signal model in Section 2.6. Therefore we introduce two velocity components, and the process model is given by
\[
G_k = \begin{bmatrix} 1 & & & \\ \Delta t & 1 & & \\ & & 1 & \\ & & \Delta t & 1 \end{bmatrix}, \qquad
Q_k = \begin{bmatrix} \Delta t & \Delta t^2/2 & & \\ \Delta t^2/2 & \Delta t^3/3 & & \\ & & \Delta t & \Delta t^2/2 \\ & & \Delta t^2/2 & \Delta t^3/3 \end{bmatrix}.
\]
The initial state estimate is given by $g_1(x_0) = X(t_1)$ and $Q_1 = I$, where $I$ is the four-by-four identity matrix. The measurement variance is constant for this example and is denoted by $\sigma^2$. The distance measurements are made from two stationary locations on shore, one located at $(0, 0)$ and the other at $(2\pi, 0)$. The measurement model is given by
\[
h_k(x_k) = \begin{pmatrix} \sqrt{x_{2,k}^2 + x_{4,k}^2} \\ \sqrt{(x_{2,k} - 2\pi)^2 + x_{4,k}^2} \end{pmatrix}, \qquad
R_k = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}.
\]
We know that the ship does not cross land, so $X_4(t) \ge c_0 - \sin[X_2(t)]$. This information is encoded by the constraints
\[
\xi_k(x_k) = c_0 - \sin(x_{2,k}) - x_{4,k} \le 0.
\]
The initial point for the smoother is $(0, 0, 0, 0)^\top$, which is not feasible. The results are plotted in Figure 5. The constrained smoother performs significantly better than the unconstrained smoother in this example. The experiment was done using the ckbs program; see sine_wave_example.m.
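The nonlinear pieces of this example are simple to code. The sketch below is our own illustration of the measurement and constraint functions and their Jacobians, under the state ordering and station locations reconstructed above; the numeric value of $c_0$ is a placeholder:

\begin{verbatim}
% Sketch: ship example's nonlinear measurement and constraint functions.
% State x = (xdot_1, x_1, xdot_2, x_2); stations at (0,0) and (2*pi, 0).
c0 = 1.25;                                   % placeholder shoreline offset
hk = @(x) [ sqrt(x(2)^2 + x(4)^2);
            sqrt((x(2) - 2*pi)^2 + x(4)^2) ];
Hk = @(x) [ 0, x(2)/sqrt(x(2)^2 + x(4)^2), ...
            0, x(4)/sqrt(x(2)^2 + x(4)^2);
            0, (x(2) - 2*pi)/sqrt((x(2) - 2*pi)^2 + x(4)^2), ...
            0, x(4)/sqrt((x(2) - 2*pi)^2 + x(4)^2) ];
xik = @(x) c0 - sin(x(2)) - x(4);            % constraint xi_k(x_k) <= 0
Bk  = @(x) [ 0, -cos(x(2)), 0, -1 ];         % its Jacobian, one row of B^nu
\end{verbatim}

These handles plug directly into the linearized subproblem (20).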
In many applications, the probabilistic model for the dynamics and/or the observations (1) is not well described by a Gaussian distribution. This occurs in the model for the observations when they are contaminated by outliers, or more generally, when the measurement noise $v_k$ is heavy tailed [44], and it occurs in the model for the dynamics when tracking systems with rapidly changing dynamics, or jumps in the state values [31]. A robust Kalman filter or smoother is one that can obtain an acceptable estimate of the state when Gaussian assumptions are violated, and which continues to perform well when they are not violated.

We show how to accommodate non-Gaussian densities by starting with the simple case of non-Gaussian heavy tailed measurement noise $v_k$ [7]; the general approach can be extended to $w_k$ as well. Heavy tailed measurement noise occurs in applications related to glint noise [25], turbulence, asset returns, and sensor failure or machine malfunction. It can also occur in the presence of secondary noise sources or other kinds of data anomalies. Although it is possible to compute a minimum variance estimate of the state using stochastic simulation methods such as Markov chain Monte Carlo (MCMC) or particle filters [24, 35], these methods are very computationally intensive, and convergence often relies on heuristic techniques and is highly variable. The approach taken here is very different. It is based on the optimization perspective presented in the previous sections. We develop a method for computing the MAP estimate of the state sequence under the assumption that the observation noise comes from the $\ell_1$-Laplace density often used in robust estimation; see, e.g., [23, equation 2.3]. As we will see, the resulting optimization problem will again be of convex composite type, allowing us to apply a Gauss-Newton strategy for computing the MAP estimate. Again, the key to a successful computational strategy is the preservation of the underlying tridiagonal structure.

The $\ell_1$-Laplace Smoother

For $u \in \mathbb{R}^m$ we use the notation $\|u\|_1$ for the $\ell_1$ norm of $u$; i.e., $\|u\|_1 = |u_1| + \dots + |u_m|$. The multivariate $\ell_1$-Laplace distribution with mean $\mu$ and covariance $R$ has the following density:
\[
p(v_k) = \det(2R)^{-1/2}\exp\Big[-\sqrt{2}\,\big\|R^{-1/2}(v_k - \mu)\big\|_1\Big], \tag{1}
\]
where $R^{1/2}$ denotes a Cholesky factor of the positive definite matrix $R$; i.e., $R^{1/2}(R^{1/2})^\top = R$. One can verify that this is a probability distribution with covariance $R$ using the change of variables $u = \sqrt{2}\,R^{-1/2}(v_k - \mu)$. A comparison of the Gaussian and Laplace distributions is displayed in Figure 6. This comparison includes the densities, negative log densities, and influence functions for both distributions.

Fig. 6 Gaussian and Laplace densities, negative log densities, and influence functions (for scalar $v_k$).

Maximum a posteriori formulation

Assume that the model for the dynamics and the observations is given by (1), where $w_k$ is assumed to be Gaussian and $v_k$ is modeled by the $\ell_1$-Laplace density (1). Under these assumptions, the MAP objective function is given by
\[
\begin{aligned}
P\big(\{x_k\} \mid \{z_k\}\big) &\propto P\big(\{z_k\} \mid \{x_k\}\big)P(\{x_k\}) = \prod_{k=1}^N P(v_k)P(w_k) \\
&\propto \prod_{k=1}^N \exp\Big(-\sqrt{2}\big\|R_k^{-1/2}(z_k - h_k(x_k))\big\|_1 - \tfrac12(x_k - g_k(x_{k-1}))^\top Q_k^{-1}(x_k - g_k(x_{k-1}))\Big). 
\end{aligned} \tag{2}
\]
Dropping terms that do not depend on $\{x_k\}$, minimizing this MAP objective with respect to $\{x_k\}$ is equivalent to minimizing
\[
f(\{x_k\}) := \sqrt{2}\sum_{k=1}^N \big\|R_k^{-1/2}[z_k - h_k(x_k)]\big\|_1 + \sum_{k=1}^N \tfrac12[x_k - g_k(x_{k-1})]^\top Q_k^{-1}[x_k - g_k(x_{k-1})],
\]
where, as in (1), $x_0$ is known and $g_0 = g_1(x_0)$. Setting
\[
R = \mathrm{diag}(\{R_k\}), \quad Q = \mathrm{diag}(\{Q_k\}), \quad x = \mathrm{vec}(\{x_k\}), \quad w = \mathrm{vec}(\{g_0, 0, \dots, 0\}), \quad z = \mathrm{vec}(\{z_1, \dots, z_N\}),
\]
\[
g(x) = \begin{bmatrix} x_1 \\ x_2 - g_2(x_1) \\ \vdots \\ x_N - g_N(x_{N-1}) \end{bmatrix}, \qquad
h(x) = \begin{bmatrix} h_1(x_1) \\ h_2(x_2) \\ \vdots \\ h_N(x_N) \end{bmatrix}, \tag{3}
\]
as in (5) and (1), the MAP estimation problem is equivalent to
\[
\underset{x \in \mathbb{R}^{Nn}}{\text{minimize}} \;\; f(x) = \tfrac12\|g(x) - w\|^2_{Q^{-1}} + \sqrt{2}\,\big\|R^{-1/2}(h(x) - z)\big\|_1. \tag{4}
\]
The objective in (4) can again be written as the composition of a convex function $\rho$ with a smooth function $F$:
\[
f(x) = \rho(F(x)), \tag{5}
\]
where
\[
\rho\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \tfrac12\|y_1\|^2_{Q^{-1}} + \sqrt{2}\,\|R^{-1/2}y_2\|_1, \qquad F(x) = \begin{bmatrix} g(x) - w \\ h(x) - z \end{bmatrix}. \tag{6}
\]
Consequently, the generalized Gauss-Newton methodology described in Section 3.2 again applies. That is, given an approximate solution $x^\nu$ to (4), we compute a new approximate solution of the form $x^{\nu+1} = x^\nu + \gamma^\nu d^\nu$, where $d^\nu$ solves the subproblem
\[
\underset{d \in \mathbb{R}^{nN}}{\text{minimize}} \;\; \rho\big(F(x^\nu) + F'(x^\nu)d\big), \tag{7}
\]
and $\gamma^\nu$ is computed using the backtracking line search procedure described in Section 3.2. Following the pattern described in (10), the subproblem (7), where $\rho$ and $F$ are given in (6), has the form
\[
d^\nu = \arg\min_d \; \tilde f(d) = \tfrac12\big\|G^\nu d - \underbrace{(w - g(x^\nu))}_{w^\nu}\big\|^2_{Q^{-1}} + \sqrt{2}\,\big\|R^{-1/2}\big(H^\nu d - \underbrace{(z - h(x^\nu))}_{z^\nu}\big)\big\|_1, \tag{8}
\]
where
\[
G^\nu = \begin{bmatrix} I & & & \\ -g_2^{(1)}(x_1^\nu) & I & & \\ & \ddots & \ddots & \\ & & -g_N^{(1)}(x_{N-1}^\nu) & I \end{bmatrix}, \qquad
H^\nu = \mathrm{diag}\{h_1^{(1)}(x_1^\nu), \dots, h_N^{(1)}(x_N^\nu)\}. \tag{9}
\]
By (8), the basic subproblem that must be solved takes the form
\[
\min_d \; \tfrac12\|Gd - w\|^2_{Q^{-1}} + \sqrt{2}\,\|R^{-1/2}(Hd - z)\|_1, \tag{10}
\]
where, as in (5),
\[
R = \mathrm{diag}(\{R_k\}), \quad Q = \mathrm{diag}(\{Q_k\}), \quad H = \mathrm{diag}(\{H_k\}), \quad w = \mathrm{vec}(\{w_1, \dots, w_N\}), \quad z = \mathrm{vec}(\{z_1, \dots, z_N\}),
\]
and $G$ is as in (5) (identity blocks on the diagonal, $-G_k$ on the subdiagonal). (11)

Using standard optimization techniques, one can introduce a pair of auxiliary nonnegative variables $p^+, p^- \in \mathbb{R}^M$ (with $M = \sum_{k=1}^N m(k)$) so that this problem can be rewritten as
\[
\begin{aligned}
\underset{d \in \mathbb{R}^{nN},\; p^+, p^- \in \mathbb{R}^M}{\text{minimize}} \quad & \tfrac12 d^\top C d + c^\top d + \sqrt{2}\,\mathbf{1}^\top(p^+ + p^-) \\
\text{subject to} \quad & Bd + b = p^+ - p^-, \quad p^+, p^- \ge 0,
\end{aligned} \tag{12}
\]
where
\[
C = G^\top Q^{-1}G = \begin{bmatrix} C_1 & A_2^\top & & \\ A_2 & C_2 & A_3^\top & \\ & \ddots & \ddots & \ddots \\ & & A_N & C_N \end{bmatrix}, \qquad
\begin{aligned}
A_k &= -Q_k^{-1}G_k, \\
C_k &= Q_k^{-1} + G_{k+1}^\top Q_{k+1}^{-1}G_{k+1},
\end{aligned} \qquad
\begin{aligned}
c &= -G^\top Q^{-1}w, \\
B &= R^{-1/2}H, \\
b &= -R^{-1/2}z.
\end{aligned}
\]
The problem (12) is a convex quadratic program. If we define
\[
F_\mu(p^+, p^-, s^+, s^-, d) = \begin{bmatrix} p^+ - p^- - b - Bd \\ \mathrm{diag}(p^-)\,\mathrm{diag}(s^-)\mathbf{1} - \mu\mathbf{1} \\ s^+ + s^- - 2\sqrt{2}\,\mathbf{1} \\ \mathrm{diag}(p^+)\,\mathrm{diag}(s^+)\mathbf{1} - \mu\mathbf{1} \\ Cd + c + B^\top(s^- - s^+)/2 \end{bmatrix} \tag{13}
\]
for $\mu \ge 0$, then the KKT conditions for (12) can be written as $F_0(p^+, p^-, s^+, s^-, d) = 0$. The set of solutions to $F_\mu(p^+, p^-, s^+, s^-, d) = 0$ for $\mu > 0$ forms the central path, and the interior point method finds a KKT point by approximately tracking this path as $\mu \downarrow 0$. At each iteration of the interior point method we need to solve a system of the form
\[
F_\mu(p^+, p^-, s^+, s^-, d) + F'_\mu(p^+, p^-, s^+, s^-, d)\begin{bmatrix} \Delta p^+ \\ \Delta p^- \\ \Delta s^+ \\ \Delta s^- \\ \Delta d \end{bmatrix} = 0,
\]
where the vectors $p^+, p^-, s^+$, and $s^-$ are componentwise strictly positive. Using standard methods of Gaussian elimination (as in Section 4.2), we obtain the solution
\[
\begin{aligned}
\Delta d &= \big[C + B^\top T^{-1}B\big]^{-1}\big(\bar e + B^\top T^{-1}\bar f\big), \\
\Delta s^- &= T^{-1}B\,\Delta d - T^{-1}\bar f, \\
\Delta s^+ &= -\Delta s^- + 2\sqrt{2}\,\mathbf{1} - s^+ - s^-, \\
\Delta p^- &= \mathrm{diag}(s^-)^{-1}\big[\mu\mathbf{1} - \mathrm{diag}(p^-)\Delta s^-\big] - p^-, \\
\Delta p^+ &= \Delta p^- + B\,\Delta d + b + Bd - p^+ + p^-,
\end{aligned}
\]
where $\bar d$, $\bar e$, and $\bar f$ are auxiliary vectors assembled from the current iterate and $\mu$ (their explicit formulas are given in [7, Section V]), and
\[
T = \mathrm{diag}(s^+)^{-1}\mathrm{diag}(p^+) + \mathrm{diag}(s^-)^{-1}\mathrm{diag}(p^-).
\]
Since the matrices $T$ and $B$ are block diagonal, the matrix $B^\top T^{-1}B$ is also block diagonal. Consequently, the key matrix $C + B^\top T^{-1}B$ has exactly the same form as the block tridiagonal matrix in (10), with
\[
c_k = Q_k^{-1} + G_{k+1}^\top Q_{k+1}^{-1}G_{k+1} + H_k^\top T_k^{-1}H_k, \quad k = 1, \dots, N, \qquad a_k = -Q_k^{-1}G_k, \quad k = 2, \dots, N,
\]
where $T_k = \mathrm{diag}(s_k^+)^{-1}\mathrm{diag}(p_k^+) + \mathrm{diag}(s_k^-)^{-1}\mathrm{diag}(p_k^-)$. Algorithm 2.1 can be applied to solve this system accurately and stably with $O(n^3N)$ floating point operations, which preserves the efficiency of the classic Kalman filter algorithm. Further discussion of how to incorporate approximate solutions of the quadratic programming subproblems can be found in [7, Section V].
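In an implementation, the only difference from the Gaussian smoother visible outside the subproblem solver is the measurement penalty. For instance, the MAP objective (4), as used in the line search of Section 3.2, is a one-liner (a sketch with our own names, where Rih denotes a block diagonal $R^{-1/2}$):

\begin{verbatim}
% Sketch: the ell_1-Laplace MAP objective (4) for use in a line search.
f = @(x) 0.5*(g(x)-w)'*Qinv*(g(x)-w) + sqrt(2)*norm(Rih*(h(x)-z), 1);
\end{verbatim}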
In the linear case, the functions $g_k$ and $h_k$ in (1) are affine, so they equal their linearizations. In this case, problems (4) and (7) are equivalent, and only one subproblem of the form (10), or equivalently (12), needs to be solved. We illustrate the $\ell_1$-Laplace smoother described in Section 5.1.1 by applying it to the example studied in Section 2.6, except that now the noise term $v_k$ is modeled using the $\ell_1$-Laplace density. The numerical experiment described below is taken from [7, Section VI].

The numerical experiment uses two full periods of $X(t)$, generated with $\Delta t = 4\pi/N$; i.e., discrete time points equally spaced over the interval $[0, 4\pi]$. For $k = 1, \dots, N$, the measurements $z_k$ were simulated by $z_k = X_2(t_k) + v_k$. In order to test the robustness of the $\ell_1$ model to measurement noise containing outlier data, we generated $v_k$ as a mixture of two normals, with $p$ denoting the fraction of outlier contamination; i.e.,
\[
v_k \sim (1 - p)\,\mathbf{N}(0, 0.25) + p\,\mathbf{N}(0, \phi). \tag{14}
\]
This was done both without contamination ($p = 0$) and with a positive contamination fraction, over a range of outlier variances $\phi$. The model for the mean of $z_k$ given $x_k$ is $h_k(x_k) = (0 \;\; 1)\,x_k = x_{2,k}$; here $x_{2,k}$ denotes the second component of $x_k$. The model for the variance of $z_k$ given $x_k$ is $R_k = 0.25$. This simulates a lack of knowledge of the distribution of the outliers, i.e., of $p\,\mathbf{N}(0, \phi)$. Note that we are recovering estimates for the smooth function $-\sin(t)$ and its derivative $-\cos(t)$, using noisy measurements (with outliers) of the function values.

Table 1 Median MSE and 95% confidence intervals for the different estimation methods

p     φ     GKF              IGS             ILS
0     -     .34 (.24, .47)   .04 (.02, .1)   .04 (.01, .1)
...   ...   ...              ...             ...
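The contaminated noise (14) is straightforward to simulate. The helper below is our own sketch; the nominal variance 0.25 follows the text, while $p$ and $\phi$ are passed in:

\begin{verbatim}
% Sketch: draw contaminated measurement noise as in (14).
function v = mixture_noise(N, p, phi)
  v = sqrt(0.25) * randn(N, 1);             % nominal N(0, 0.25) errors
  out = rand(N, 1) < p;                     % which samples are outliers
  v(out) = sqrt(phi) * randn(nnz(out), 1);  % outlier draws from N(0, phi)
end
\end{verbatim}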
Fig. 7 Simulation: measurements (+), outliers (o) (absolute residuals more than three standard deviations), true function (thick line), $\ell_1$-Laplace estimate (thin line), Gaussian estimate (dashed line), Gaussian outlier removal estimate (dotted line).

We simulated 1000 realizations of the sequence $\{z_k\}$, keeping the ground truth fixed, and for each realization and each estimation method we computed the corresponding state sequence estimate $\{\hat x_k\}$. The mean square error (MSE) corresponding to such an estimate is defined by
\[
\mathrm{MSE} = \frac1N\sum_{k=1}^N [x_{1,k} - \hat x_{1,k}]^2 + [x_{2,k} - \hat x_{2,k}]^2, \tag{15}
\]
where $x_k = X(t_k)$. In Table 1, the Gaussian Kalman filter is denoted by GKF, the iterated Gaussian smoother by IGS, and the iterated $\ell_1$-Laplace smoother by ILS. For each estimation technique, each value of $p$, and each value of $\phi$, the corresponding table entry is the median MSE followed by the centralized 95% confidence interval for the MSE. For this problem, the model functions $\{g_k(x_{k-1})\}$ and $\{h_k(x_k)\}$ are linear, so the iterated smoothers IGS and ILS require only one iteration to estimate the sequence $\{\hat x_k\}$.

Note that the $\ell_1$-Laplace smoother performs nearly as well as the Gaussian smoother at the nominal conditions ($p = 0$), while performing better and more consistently in the cases with data contamination ($p > 0$ and large $\phi$). For realizations $\{z_k\}$ with substantial contamination, the Gaussian smoother can be pulled far off course by the outliers, whereas the $\ell_1$-Laplace smoother does not suffer from these difficulties.

We now illustrate the behavior of the $\ell_1$-Laplace smoother on the Van der Pol oscillator described in Section 3.4. The numerical experiment we describe is taken from [7, Section VI]. The corresponding nonlinear differential equation is
\[
\dot X_1(t) = X_2(t) \quad\text{and}\quad \dot X_2(t) = \mu\big[1 - X_1(t)^2\big]X_2(t) - X_1(t).
\]
Given $X(t_{k-1}) = x_{k-1}$, the Euler approximation for $X(t_{k-1} + \Delta t)$ is
\[
g_k(x_{k-1}) = \begin{pmatrix} x_{1,k-1} + x_{2,k-1}\Delta t \\ x_{2,k-1} + \big\{\mu[1 - x_{1,k-1}^2]\,x_{2,k-1} - x_{1,k-1}\big\}\Delta t \end{pmatrix}.
\]
For this simulation, the 'ground truth' is obtained from a stochastic Euler approximation of the Van der Pol oscillator. To be specific, with time points $t_k = k\Delta t$, the ground truth state vector $x_k$ is given by $x_0 = (0, -0.5)^\top$ and, for $k = 1, \dots, N$,
\[
x_k = g_k(x_{k-1}) + w_k, \tag{16}
\]
where $\{w_k\}$ is a realization of independent Gaussian noise with variance 0.01. Our model for state transitions (1) uses $Q_k = 0.01\,I$ for $k > 1$, and so is identical to the model used to simulate the ground truth $\{x_k\}$. Thus, we have precise knowledge of the process that generated the ground truth $\{x_k\}$. The initial state $x_0$ is imprecisely specified by setting $g_1(x_0) \ne x_0$, with corresponding variance $Q_1$.

Table 2 Median MSE over 1000 runs and confidence intervals containing 95% of MSE results

p     φ     IGS    ILS
...   ...   ...    ...

For $k = 1, \dots, N$, the measurements $z_k$ were simulated by $z_k = x_{1,k} + v_k$. The measurement noise $v_k$ was generated as
\[
v_k \sim (1 - p)\,\mathbf{N}(0, r_0) + p\,\mathbf{N}(0, \phi), \tag{17}
\]
where $r_0$ is the nominal measurement variance. This was done for a grid of contamination fractions $p$ and outlier variances $\phi$, including the pair $(p, \phi) = (0.2, 100)$ illustrated in Figure 8. The model for the mean of $z_k$ given $x_k$ is $h_k(x_k) = (1 \;\; 0)\,x_k = x_{1,k}$. As in the previous simulation, we simulated a lack of knowledge of the distribution of the outliers, i.e., of $p\,\mathbf{N}(0, \phi)$; in (1), the model for the variance of $z_k$ given $x_k$ is the nominal value $R_k = r_0$.

We simulated 1000 realizations of the ground truth $\{x_k\}$ and the corresponding measurement sequences $\{z_k\}$. For each realization, we computed the corresponding state sequence estimate $\{\hat x_k\}$ using both the IGS and ILS procedures. The mean square error (MSE) corresponding to such an estimate is defined by equation (15), where $x_k$ is given by equation (16). The results of the simulation appear in Table 2. As the proportion and variance of the outliers increase, the Gaussian smoother degrades, but the $\ell_1$-Laplace smoother is not affected.

Figure 8 provides a visual illustration of one realization $\{x_k\}$ and its corresponding estimates $\{\hat x_k\}$. The left two panels demonstrate that, when no outliers are present, both the IGS and ILS generate accurate estimates. Note that we only observe the first component of the state, and that the variance of the observations is relatively large (see the top two panels). The right two panels show what can go wrong when outliers are present.
Fig. 8 The left two panels show estimation of $x_1$ (top) and $x_2$ (bottom) with errors from the nominal model. The stochastic realization is represented by a thick black line; the Gaussian smoother is the blue dashed line, and the $\ell_1$-smoother is the magenta dash-dotted line. The right two panels show the same stochastic realization, but with measurement errors drawn from $(p, \phi) = (0.2, 100)$. Outliers appear on the top and bottom boundary in the top right panel.

The Van der Pol oscillator can have sharp peaks as a result of the nonlinearity in its process model, and outliers in the measurements can 'trick' the IGS into these modes when they are not really present. In contrast, the iterated $\ell_1$-Laplace smoother avoids this problem.

Let us step back for a moment and examine a theme common to all of the variations on the Kalman smoother that we have examined thus far, comparing the objective functions in (4), (6), (2), (3), and (16). In all cases, the objective function takes the form
\[
\sum_{k=1}^N V_k\big(h_k(x_k) - z_k; R_k\big) + J_k\big(x_k - g_k(x_{k-1}); Q_k\big), \tag{18}
\]
where the mappings $V_k$ and $J_k$ are associated with log-concave densities of the form
\[
p_{v,k}(z) \propto \exp\big(-V_k(z; R_k)\big) \quad\text{and}\quad p_{w,k}(x) \propto \exp\big(-J_k(x; Q_k)\big),
\]
with $p_{v,k}$ and $p_{w,k}$ having covariance matrices $R_k$ and $Q_k$, respectively. The choice of the penalty functions $V_k$ and $J_k$ reflects the underlying model for the distribution of the observations and the state, respectively. In many applications, the functions $V_k$ and $J_k$ are members of the class of extended piecewise linear-quadratic penalty functions.

Definition 1. For a nonempty polyhedral set $U \subset \mathbb{R}^m$ and a symmetric positive semidefinite matrix $M \in \mathbb{R}^{m \times m}$ (possibly $M = 0$), define $\theta_{U,M}: \mathbb{R}^m \to \mathbb{R} \cup \{\infty\} =: \bar{\mathbb{R}}$ by
\[
\theta_{U,M}(w) := \sup_{u \in U}\Big\{\langle u, w\rangle - \tfrac12\langle u, Mu\rangle\Big\}. \tag{19}
\]
Given an injective matrix $B \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^m$, define $\rho_{U,M,b,B}: \mathbb{R}^n \to \bar{\mathbb{R}}$ as $\theta_{U,M}(b + By)$:
\[
\rho_{U,M,b,B}(y) := \sup_{u \in U}\big\{\langle u, b + By\rangle - \tfrac12\langle u, Mu\rangle\big\}. \tag{20}
\]
All functions of the type specified in (19) are called piecewise linear-quadratic (PLQ) penalty functions, and those of the form (20) are called extended piecewise linear-quadratic (EPLQ) penalty functions.

Remark 1.
PLQ penalty functions are extensively studied by Rockafellar and Wets in [43]. In particular, they present a full duality theory for optimization problems based on these functions.

It is easily seen that the penalty functions arising from both the Gaussian and $\ell_1$-Laplace distributions come from the EPLQ class, but so do other important densities, such as the Huber and Vapnik densities.

Examples:
The $\ell_2$, $\ell_1$, Huber, and Vapnik penalties are representable in the notation of Definition 1.
Fig. 9 The Huber (left) and Vapnik (right) penalties.

1. $\ell_2$: Take $U = \mathbb{R}$, $M = 1$, $b = 0$, and $B = 1$. We obtain
\[
\rho(y) = \sup_{u \in \mathbb{R}}\Big(uy - \frac{u^2}{2}\Big).
\]
The function inside the sup is maximized at $u = y$, whence $\rho(y) = y^2/2$.
2. $\ell_1$: Take $U = [-1, 1]$, $M = 0$, $b = 0$, and $B = 1$. We obtain
\[
\rho(y) = \sup_{u \in [-1,1]} uy.
\]
The function inside the sup is maximized by taking $u = \mathrm{sign}(y)$, whence $\rho(y) = |y|$.
3. Huber: Take $U = [-K, K]$, $M = 1$, $b = 0$, and $B = 1$. We obtain
\[
\rho(y) = \sup_{u \in [-K,K]}\Big(uy - \frac{u^2}{2}\Big).
\]
Take the derivative with respect to $u$ and consider the following cases:
a. If $y < -K$, take $u = -K$ to obtain $-Ky - \tfrac12 K^2$.
b. If $-K \le y \le K$, take $u = y$ to obtain $\tfrac12 y^2$.
c. If $y > K$, take $u = K$ to obtain $Ky - \tfrac12 K^2$.
This is the Huber penalty with parameter $K$, shown in the left panel of Fig. 9.
4. Vapnik: Take $U = [0,1] \times [0,1]$, $M = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, and $b = \begin{bmatrix} -\varepsilon \\ -\varepsilon \end{bmatrix}$, for some $\varepsilon > 0$.
0. We obtain ρ ( y ) = sup u , u ∈ [ , ] (cid:28)(cid:20) y − ε − y − ε (cid:21) , (cid:20) u u (cid:21)(cid:29) . We can obtain an explicitrepresentation by considering three cases:a. If | y | < ε , take u = u =
0. Then ρ ( y ) = y > ε , take u = u =
0. Then ρ ( y ) = y − ε .c. If y < − ε , take u = u =
1. Then ρ ( y ) = − y − ε .This is the Vapnik penalty with parameter ε , shown in the right panel of Fig.5.2.1. We caution that not every EPLQ function is the negative log of a density function.For an ELQP function ρ to be associated with a density, the function exp ( − ρ ( x )) must be integrable on R n . The integrability of exp ( − ρ ( x )) can be established undera coercivity hypothesis. Definition 2.
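These case analyses are easy to sanity-check numerically: the script below (an illustrative NumPy sketch; the function name plq_bruteforce is ours, and the code is not from the chapter's package) evaluates the sup in (20) by brute force over a sampled grid of U and compares the result with the closed-form Huber and Vapnik penalties derived above.

import numpy as np

def plq_bruteforce(y, U_grid, M, b, B):
    # rho(y) = sup_{u in U} <u, b + B y> - 0.5 <u, M u>, approximated by
    # maximizing over a finite sample U_grid of points u in U.
    return max(u @ (b + B * y) - 0.5 * u @ M @ u for u in U_grid)

# Huber: U = [-K, K], M = 1, b = 0, B = 1.
K = 1.5
U_h = [np.array([u]) for u in np.linspace(-K, K, 2001)]
huber = lambda y: 0.5 * y**2 if abs(y) <= K else K * abs(y) - 0.5 * K**2

# Vapnik: U = [0, 1]^2, M = 0, B = (1, -1)^T, b = (-eps, -eps)^T.
eps = 0.3
g = np.linspace(0.0, 1.0, 201)
U_v = [np.array([u1, u2]) for u1 in g for u2 in g]
vapnik = lambda y: max(abs(y) - eps, 0.0)

for y in (-2.0, -0.2, 0.1, 0.9, 3.0):
    assert abs(plq_bruteforce(y, U_h, np.eye(1), np.zeros(1), np.ones(1))
               - huber(y)) < 1e-3
    assert abs(plq_bruteforce(y, U_v, np.zeros((2, 2)),
                              np.array([-eps, -eps]), np.array([1.0, -1.0]))
               - vapnik(y)) < 1e-3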
We caution that not every EPLQ function is the negative log of a density function. For an EPLQ function ρ to be associated with a density, the function exp(−ρ(x)) must be integrable on R^n. The integrability of exp(−ρ(x)) can be established under a coercivity hypothesis.

Definition 2. A function ρ : R^n → R ∪ {+∞} =: R̄ is said to be coercive (or 0-coercive) if lim_{‖x‖→∞} ρ(x) = +∞.

Since the functions ρ_{U,M,b,B} defined in (20) are not necessarily finite-valued, their calculus must be treated with care. An important tool in this regard is the essential domain. The essential domain of ρ : R^n → R̄ is the set

dom(ρ) := { x : ρ(x) < +∞ }.

The affine hull of dom(ρ), written aff(dom ρ), is the smallest affine set containing dom(ρ), where a set is affine if it is the translate of a subspace.

Theorem 1. [4, Theorem 6] (PLQ Integrability) Let ρ := ρ_{U,M,b,B} be defined as in (20). Suppose ρ is coercive, and let n_aff denote the dimension of aff(dom ρ). Then the function f(y) = exp(−ρ(y)) is integrable on aff(dom ρ) with respect to the n_aff-dimensional Lebesgue measure. □

Theorem 2. [4, Theorem 7] (Coercivity of ρ) The function ρ_{U,M,b,B} defined in (20) is coercive if and only if [Bᵀ cone(U)]° = {0}. □

If ρ := ρ_{U,M,b,B} is coercive, then, by Theorem 1, the function f(y) = exp(−ρ(y)) is integrable on aff(dom ρ) with respect to the n_aff-dimensional Lebesgue measure. If we define

p(y) := c⁻¹ exp(−ρ(y)) for y ∈ dom ρ, and p(y) := 0 otherwise,  (21)

where c = ∫_{y ∈ dom ρ} exp(−ρ(y)) dy, and the integral is with respect to the n_aff-dimensional Lebesgue measure, then p is a probability density on dom(ρ). We call densities of this form PLQ densities.

We now show how to build up the penalty functions V_k and J_k in (18) using PLQ densities. We do this for the linear model (1)-(2) for simplicity; the nonlinear case can be handled as before by applying the Gauss-Newton strategy to the underlying convex composite function. Using the notation given in (5), the linear model (1)-(2) can be written as

w = Gx + 𝐰,  z = Hx + 𝐯,  (22)

where 𝐰 and 𝐯 denote the stacked process and measurement noise vectors. A general Kalman smoothing problem can be specified by assuming that the noises 𝐰 and 𝐯 in the model (22) have PLQ densities with means 0 and variances Q and R (see (5)). Then, for suitable {U_k^w, M_k^w, b_k^w, B_k^w} and {U_k^v, M_k^v, b_k^v, B_k^v}, we have

p(𝐰) ∝ exp(−θ_{U^w,M^w}(b^w + B^w Q^{−1/2} 𝐰)),
p(𝐯) ∝ exp(−θ_{U^v,M^v}(b^v + B^v R^{−1/2} 𝐯)),  (23)

where

U^w = ∏_{k=1}^{N} U_k^w ⊂ R^{nN},  U^v = ∏_{k=1}^{N} U_k^v ⊂ R^M,
M^w = diag({M_k^w}),  M^v = diag({M_k^v}),
B^w = diag({B_k^w}),  B^v = diag({B_k^v}),
b^w = vec({b_k^w}),  b^v = vec({b_k^v}).

Then the MAP estimator for x in the model (22) is

arg min_{x ∈ R^{nN}} { θ_{U^w,M^w}(b^w + B^w Q^{−1/2}(Gx − w)) + θ_{U^v,M^v}(b^v + B^v R^{−1/2}(Hx − z)) }.  (24)

Note that, since the w_k and v_k are independent, problem (24) is decomposable into a sum of terms analogous to (18). This special structure follows from the block diagonal structure of H, Q, R, B^v, and B^w, the block bidiagonal structure of G, and the product structure of the sets U^w and U^v; it is the key to proving the linear complexity of the solution method we propose.
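To make the stacked notation concrete, here is a small illustrative sketch (our own NumPy code; it assumes the Section 2 stacked-system convention, with identity diagonal blocks and −G_k subdiagonal blocks in G, and it pairs the Gaussian quadratic process penalty with the ℓ1 measurement penalty, one particular instance of (24)):

import numpy as np

n, m, N = 2, 1, 4                         # state dim, measurement dim, horizon
rng = np.random.default_rng(0)
Gk = np.array([[1.0, 0.1], [0.0, 1.0]])   # single-step process model
Hk = np.array([[1.0, 0.0]])               # single-step measurement model

# Stacked G (identity diagonal blocks, -Gk on the subdiagonal) and block
# diagonal H, following the Section 2 convention assumed here.
G = np.eye(n * N)
for k in range(1, N):
    G[k*n:(k+1)*n, (k-1)*n:k*n] = -Gk
H = np.kron(np.eye(N), Hk)

Q_half = np.eye(n * N)                    # Q^{1/2}, identity for simplicity
R_half = np.eye(m * N)                    # R^{1/2}
w = np.zeros(n * N)                       # stacked process means
z = rng.standard_normal(m * N)            # stacked measurements

def map_objective(x):
    # Gaussian (quadratic) process penalty + l1 measurement penalty: a PLQ
    # MAP objective of the form (24), decomposable across time steps.
    pw = np.linalg.solve(Q_half, G @ x - w)
    pv = np.linalg.solve(R_half, H @ x - z)
    return 0.5 * pw @ pw + np.abs(pv).sum()

print(map_objective(rng.standard_normal(n * N)))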
When the sets U^w and U^v are polyhedral, (24) is an Extended Linear Quadratic program (ELQP), described in [43, Example 11.43]. We solve (24) by working directly with its associated Karush-Kuhn-Tucker (KKT) system.

Lemma 1. [4, Lemma 3.1] Suppose that the sets U_k^w and U_k^v are polyhedral, that is, they can be given the representation

U_k^w = { u | (A_k^w)ᵀ u ≤ a_k^w },  U_k^v = { u | (A_k^v)ᵀ u ≤ a_k^v }.

Then the first-order necessary and sufficient conditions for optimality in (24) are given by

0 = (A^w)ᵀ u^w + s^w − a^w;  0 = (A^v)ᵀ u^v + s^v − a^v;
0 = (s^w)ᵀ q^w;  0 = (s^v)ᵀ q^v;
0 = b̃^w + B^w Q^{−1/2} G x − M^w u^w − A^w q^w;
0 = b̃^v − B^v R^{−1/2} H x − M^v u^v − A^v q^v;
0 = Gᵀ Q^{−T/2} (B^w)ᵀ u^w − Hᵀ R^{−T/2} (B^v)ᵀ u^v;
0 ≤ s^w, s^v, q^w, q^v,  (25)

where b̃^w = b^w − B^w Q^{−1/2} w and b̃^v = b^v − B^v R^{−1/2} z. □

We propose solving the KKT conditions (25) by an Interior Point (IP) method. IP methods work by applying a damped Newton iteration to a relaxed version of (25), where the relaxation is applied to the complementarity conditions. Specifically, we replace the complementarity conditions by

(s^w)ᵀ q^w = 0  →  Q^w S^w 𝟏 − μ𝟏 = 0,
(s^v)ᵀ q^v = 0  →  Q^v S^v 𝟏 − μ𝟏 = 0,

where Q^w, S^w, Q^v, S^v are diagonal matrices with diagonals q^w, s^w, q^v, s^v, respectively, and 𝟏 denotes the vector of all ones. The parameter μ is aggressively decreased to 0 as the IP iterations proceed. Typically, no more than 10 or 20 iterations of the relaxed system are required to obtain a solution of (25), and hence an optimal solution of (24). The following theorem shows that the computational effort required (per IP iteration) is linear in the number of time steps, whatever PLQ density enters the state space model.

Theorem 3. [4, Theorem 3.2] (PLQ Kalman Smoother Theorem) Suppose that all w_k and v_k in the Kalman smoothing model (1)-(2) come from PLQ densities that satisfy Null(M) ∩ U^∞ = {0}, where U^∞ denotes the horizon (recession) cone of U. Then an IP method can be applied to solve (24) with a per-iteration computational complexity of O(Nn³ + Nm³). □

The proof, which can be found in [4], shows that IP methods for solving (24) preserve the key block tridiagonal structure of the standard smoother. General smoothing estimates can therefore be computed in O(Nn³) time, as long as the number of IP iterations is fixed (as it usually is in practice, to 10 or 20). It is important to observe that the motivating examples all satisfy the conditions of Theorem 3.

Corollary 1. [4, Corollary 3.3] The densities corresponding to the ℓ2, ℓ1, Huber, and Vapnik penalties all satisfy the hypotheses of Theorem 3.
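The structure result is what makes each Newton step cheap: the reduced system is symmetric block tridiagonal, so it can be factored and solved with one n × n factorization per time step. Below is an illustrative block tridiagonal solve in NumPy, written in the spirit of algorithm 2.1 (our own sketch, not the ckbs routine):

import numpy as np

def block_tridiag_solve(D, L, r, n):
    # Solve C x = r where C is symmetric block tridiagonal, with n-by-n
    # diagonal blocks D[0..N-1] and subdiagonal blocks L[0..N-2].
    # Forward block elimination followed by back substitution: O(N n^3).
    N = len(D)
    S, y = [None] * N, [None] * N         # Schur complements, modified rhs
    S[0], y[0] = D[0], r[0:n]
    for k in range(1, N):
        S[k] = D[k] - L[k-1] @ np.linalg.solve(S[k-1], L[k-1].T)
        y[k] = r[k*n:(k+1)*n] - L[k-1] @ np.linalg.solve(S[k-1], y[k-1])
    x = np.zeros(N * n)
    x[(N-1)*n:] = np.linalg.solve(S[N-1], y[N-1])
    for k in range(N - 2, -1, -1):
        x[k*n:(k+1)*n] = np.linalg.solve(S[k], y[k] - L[k].T @ x[(k+1)*n:(k+2)*n])
    return x

# Check against a dense solve on a random block tridiagonal positive definite C.
rng = np.random.default_rng(1)
n, N = 3, 5
A = 0.1 * rng.standard_normal((N * n, N * n))
C = A @ A.T + 10.0 * np.eye(N * n)
keep = np.kron((np.abs(np.subtract.outer(np.arange(N), np.arange(N))) <= 1)
               .astype(float), np.ones((n, n))) > 0
C = np.where(keep, C, 0.0)                # zero all blocks beyond the tridiagonal
D = [C[k*n:(k+1)*n, k*n:(k+1)*n] for k in range(N)]
L = [C[(k+1)*n:(k+2)*n, k*n:(k+1)*n] for k in range(N - 1)]
r = rng.standard_normal(N * n)
assert np.allclose(block_tridiag_solve(D, L, r, n), np.linalg.solve(C, r))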
Proof. We verify that Null(M) ∩ U^∞ = {0} in each case. In the ℓ2 case, M has full rank. For the ℓ1, Huber, and Vapnik penalties, the respective sets U are bounded, so U^∞ = {0}. □
Fig. 10 Simulation: measurements (·), with outliers plotted on the upper and lower axis limits (4 and −2); smoothed estimate using the quadratic loss (dashed line, left panel) and the Vapnik ε-insensitive loss (dashed line, right panel).

In this section we present a numerical example illustrating the use of the Vapnik penalty (see Fig. 9) in the Kalman smoothing context, for a functional recovery application. We consider the function

f(t) = exp[sin(8t)],

taken from [19]. Our aim is to reconstruct f starting from 2000 noisy samples collected uniformly over the unit interval. The measurement noise v_k was generated using a mixture of two normals,

v_k ∼ (1 − p) N(0, σ₁²) + p N(0, σ₂²),

where N denotes the normal distribution, the mixing fraction p is small, and σ₂ is much larger than σ₁. The data are displayed as dots in Fig. 10. Note that the purpose of the second, high-variance component of the normal mixture is to simulate outliers in the output data, and that all measurements exceeding the vertical axis limits are plotted on the upper and lower axis limits (4 and −2) to improve readability.
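A concrete stand-in for this data generation step is sketched below (illustrative code; the mixture parameters p, sigma1, and sigma2 are assumptions of ours, chosen only to mimic the qualitative setup):

import numpy as np

rng = np.random.default_rng(42)
N = 2000
t = np.arange(1, N + 1) / N                # 2000 uniform samples on (0, 1]
f = np.exp(np.sin(8 * t))                  # function to be recovered

# Contaminated-normal measurement noise; p, sigma1, sigma2 are assumed values.
p, sigma1, sigma2 = 0.1, 0.5, 5.0
is_outlier = rng.random(N) < p
v = np.where(is_outlier, sigma2, sigma1) * rng.standard_normal(N)
z = f + v                                  # noisy measurements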
The initial condition f(0) = 1 is assumed to be known, and the difference f(·) − 1 is modeled as a Gaussian process given by an integrated Wiener process. This model captures the Bayesian interpretation of cubic smoothing splines [48], and admits a 2-dimensional state space representation in which the first component of x(t), which models f(·) − 1, is the integral of the second. Sampling uniformly with ∆t = 1/2000, the model is defined by

G_k = [ 1  ∆t; 0  1 ],  k = 2, 3, ..., 2000,
H_k = [ 1  0 ],  k = 1, 2, ..., 2000,

with the variance of w_k given by

Q_k = λ [ ∆t³/3  ∆t²/2; ∆t²/2  ∆t ],  k = 2, 3, ..., 2000,

where λ is an unknown scale factor to be estimated from the data.

The performance of two different Kalman smoothers is compared. The first (classical) estimator uses a quadratic loss function to describe the negative log of the measurement noise density, and contains only λ as an unknown parameter. The second estimator is a Vapnik smoother relying on the ε-insensitive loss, and so depends on two unknown parameters, λ and ε. In both cases, the unknown parameters are estimated by means of a cross validation strategy in which the 2000 measurements are randomly split into a training set of 1300 and a validation set of 700 data points. The Vapnik smoother was implemented by exploiting the efficient computational strategy described in the previous section; see [8] for specific implementation details.
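For the classical (quadratic loss) estimator, the smoothed state can be obtained directly from the sparse normal equations of the least squares formulation. A minimal sketch follows, assuming the Section 2 convention C = GᵀQ⁻¹G + HᵀR⁻¹H (our own illustrative SciPy code; λ and the measurement variance are fixed to assumed values):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(42)
N, lam, r_var = 2000, 1.0, 0.25            # lam and r_var are assumed values
dt = 1.0 / N
t = np.arange(1, N + 1) * dt
z = np.exp(np.sin(8 * t)) - 1.0 + 0.5 * rng.standard_normal(N)  # data for f - 1

Gk = np.array([[1.0, dt], [0.0, 1.0]])
Qk = lam * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
Hk = np.array([[1.0, 0.0]])

# Stacked G: identity diagonal blocks, -Gk on the subdiagonal; the first
# block of Q acts as the prior on x_1 (f(0) = 1 gives mean 0 for f - 1).
G = sp.eye(2 * N, format="lil")
for k in range(1, N):
    G[2*k:2*k+2, 2*(k-1):2*k] = -Gk
G = G.tocsr()
H = sp.kron(sp.eye(N), Hk, format="csr")
Qinv = sp.kron(sp.eye(N), np.linalg.inv(Qk), format="csr")

# Normal equations C x = rhs, C = G' Qinv G + H' H / r_var (block tridiagonal).
C = (G.T @ Qinv @ G + H.T @ H / r_var).tocsc()
rhs = H.T @ z / r_var
x = spla.spsolve(C, rhs)
f_hat = 1.0 + x[0::2]                      # first state component models f - 1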
In this way, for each value of λ and ε contained in a 10 × 20 grid (with λ logarithmically spaced), the function estimate was rapidly obtained by the new smoother applied to the training set. Then, the relative average prediction error on the validation set was computed; see Fig. 11.
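The search itself is a plain grid-search cross validation loop; a schematic sketch is given below (fit_smoother is a hypothetical stand-in for the Vapnik smoother, which we do not reproduce, and the grid ranges are assumptions of ours):

import numpy as np

def fit_smoother(t_tr, z_tr, lam, eps):
    # Hypothetical stand-in for the Vapnik smoother fit; lam and eps are
    # accepted but unused here, since only the search loop is illustrated.
    order = np.argsort(t_tr)
    tt, zz = t_tr[order], z_tr[order]
    return lambda t: np.interp(t, tt, zz)

rng = np.random.default_rng(0)
t = np.sort(rng.random(2000))
z = np.exp(np.sin(8 * t)) + 0.5 * rng.standard_normal(2000)
idx = rng.permutation(2000)
tr, va = idx[:1300], idx[1300:]            # 1300 training / 700 validation points

best = (np.inf, None, None)
for lam in np.logspace(-2, 2, 10):         # 10 values of lambda, log-spaced
    for eps in np.linspace(0.05, 1.0, 20): # 20 values of epsilon
        f_hat = fit_smoother(t[tr], z[tr], lam, eps)
        err = np.mean(np.abs(f_hat(t[va]) - z[va])) / np.mean(np.abs(z[va]))
        if err < best[0]:
            best = (err, lam, eps)
print("best (relative error, lambda, eps):", best)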
The parameters leading to the best prediction were ε = 0.45, together with the corresponding optimal value of λ on the grid; these give a sparse solution defined by fewer than 400 support vectors. The value of λ for the classical Kalman smoother was then estimated following the same strategy described above. In contrast to the Vapnik penalty, the quadratic loss does not induce any sparsity, so in this case the number of support vectors equals the size of the training set.

The left and right panels of Fig. 10 display the function estimates obtained using the quadratic and the Vapnik losses, respectively. It is clear that the Gaussian estimate is heavily affected by the outliers. In contrast, as expected, the estimate coming from the Vapnik-based smoother performs well over the entire time period, and is virtually unaffected by the presence of large outliers.
Fig. 11
Estimation of the smoothing filter parameters using the Vapnik loss: average prediction error on the validation data set, as a function of the process variance scale λ and of ε.

In recent years, sparsity promoting formulations and algorithms have made a tremendous impact in signal processing, reconstruction algorithms, statistics, and inverse problems (see e.g. [13] and the references therein). In some contexts, rigorous mathematical theory is available that can guarantee recovery of sparse signals from under-sampled data [20]. In addition, for many inverse problems, sparsity promoting optimization provides a way to exploit prior knowledge of the signal class and so improve the solution of an ill-posed problem, even though conditions for recoverability have not yet been derived in these settings [36].

In the context of dynamic models, several sparse Kalman filters have recently been proposed [17, 16, 47, 1]. In the applications considered, in addition to the process and measurement models, the state is also known to be sparse, and the aim is to improve recovery by incorporating sparse optimization techniques. Reference [1] is closest to the work presented in this section, since it formulates a sparsity promoting optimization problem over the whole measurement sequence and solves it with an optimization technique shown to preserve computational efficiency.

In this section, we formulate the sparse Kalman smoothing problem as an optimization problem over the entire state sequence, and suggest two new approaches for the solution of such problems. The first approach is based on the interior point methodology, and is a natural extension of the mathematics presented in earlier sections. The second approach is geared towards problems where the dimension n of the state at a single time point is large. For this case, we propose a matrix free approach, using a different (constrained) Kalman smoothing formulation together with the projected gradient method. In both methods, the structure of the Kalman smoothing problem is exploited to achieve computational efficiency. We present the theoretical development for the two approaches, leaving applications and numerical results to future work.

We consider only the linear smoother (6). A straightforward way to impose sparsity on the state is to augment this formulation with a 1-norm penalty:

min_x f(x) := ½‖Hx − z‖²_{R⁻¹} + ½‖Gx − w‖²_{Q⁻¹} + λ‖Wx‖₁,  (1)

where W is a diagonal weighting matrix included for modeling convenience. For example, elements of W can be set to 0 to exclude certain parts of the state dimension from the sparse penalty. A straightforward constrained reformulation of (1) is

min_{x,y} ½‖Hx − z‖²_{R⁻¹} + ½‖Gx − w‖²_{Q⁻¹} + λ𝟏ᵀy  s.t.  −y ≤ Wx ≤ y.  (2)

Note that this is different from the constrained problem of Section 4, because we have introduced a new variable y, with constraints in both x and y. Nonetheless, an interior point approach may still be used to solve the resulting problem. We rewrite the constraints in (2) using non-negative slack variables s and r,

Wx − y + s = 0,  −Wx − y + r = 0,  (3)

and form the Lagrangian of the corresponding system:

L(s, r, q, p, y, x) = ½xᵀCx + cᵀx + λ𝟏ᵀy + qᵀ(Wx − y + s) + pᵀ(−Wx − y + r),  (4)

with C and c as in the least squares formulation of Section 2, and where q and p are the dual variables corresponding to the inequality constraints Wx ≤ y and −Wx ≤ y, respectively.
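The equivalence between the penalty form (1) and the reformulation (2) is easy to check numerically on a small instance. In the sketch below (our own illustrative code), a random positive definite matrix stands in for the smoother's C; (1) is solved by generic proximal gradient (soft thresholding), which is not the chapter's interior point scheme, and (2) is solved by an off-the-shelf constrained solver:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, lam = 6, 2.0
A = rng.standard_normal((d, d))
C = A @ A.T + d * np.eye(d)                # stands in for the smoother's C
c = rng.standard_normal(d)

f1 = lambda x: 0.5 * x @ C @ x + c @ x + lam * np.abs(x).sum()  # form (1), W = I

# Constrained reformulation (2): variables v = (x, y), constraints -y <= x <= y.
f2 = lambda v: 0.5 * v[:d] @ C @ v[:d] + c @ v[:d] + lam * v[d:].sum()
cons = [{"type": "ineq", "fun": lambda v: v[d:] - v[:d]},   # y - x >= 0
        {"type": "ineq", "fun": lambda v: v[d:] + v[:d]}]   # y + x >= 0
res = minimize(f2, np.zeros(2 * d), constraints=cons, method="SLSQP")

# Generic proximal gradient (soft thresholding) applied to (1) for comparison.
x, alpha = np.zeros(d), 1.0 / np.linalg.norm(C, 2)
for _ in range(5000):
    step = x - alpha * (C @ x + c)
    x = np.sign(step) * np.maximum(np.abs(step) - alpha * lam, 0.0)
print("objective values:", f1(x), res.fun)  # the two should agree closely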
The (relaxed) ptimization viewpoint on Kalman smoothing 41 KKT system is therefore given by F µ ( s , r , q , p , y , x ) : = s − y + W xr − y − W xD ( s ) D ( q ) − µ D ( r ) D ( p ) − µ λ − q − pW q − W p + Cx + c = . (5)The derivative matrix F ( ) µ is given by F ( ) µ = I − I W I − I − WD ( q ) D ( s ) D ( p ) D ( r ) − I − I W − W C , (6)and it is row equivalent to the system I − I W I − I − W D ( s ) D ( q ) − D ( q ) W D ( r ) D ( p ) D ( p ) W Φ − Ψ W C + W Φ − (cid:0) Φ − Ψ (cid:1) W where Φ = D ( s ) − D ( q ) + D ( r ) − D ( p ) Ψ = D ( s ) − D ( q ) − D ( r ) − D ( p ) , (7)and the matrix Φ − Ψ is diagonal, with the ii th entry given by 4 q i r i . Therefore,the modified system preserves the structure of C ; specifically it is symmetric, blocktridiagonal, and positive definite. The Newton iterations required by the interiorpoint method can therefore be carried out, with each iteration having complexity O ( n N ) . Consider again the linear smoother (6), but now impose a 1-norm constraint ratherthan a penalty: min x f ( x ) : = (cid:107) Hx − z (cid:107) R − + (cid:107) Gx − w (cid:107) Q − s.t. (cid:107) W x (cid:107) ≤ τ . (8)This problem, which equivalent to (1) for certain values of λ and τ , is preciselythe LASSO problem [45], and can be writtenmin 12 x T Cx + c T x s.t. (cid:107) W x (cid:107) ≤ τ . (9)with C ∈ R nN × nN as in (8) and c ∈ R nN as in (5). When n is large, the interior pointmethod proposed in the previous section may not be feasible, since it requires exactsolutions of the system ( C + W Φ − (cid:0) Φ − Ψ (cid:1) W ) x = r , and the block-tridiagonal algorithm 2.1 requires the inversion of n × n systems.The problem (9) can be solved without inverting such systems, using the spectralprojected gradient method, see e.g. [46, Algorithm 1]. Specifically, the gradient Cx + c must be repeatedly computed, and then x ν − ( Cx ν + c ) is projected onto the set (cid:107) W x (cid:107) ≤ τ . (the word ‘spectral’ refers to the fact that the Barzilai-Borwein linesearch is used to get the step length).In the case of the Kalman smoother, the gradient Cx + c can be computed in O ( n N ) time, because of the special structure of C . Thus for large systems, theprojected gradient method that exploits the structure of C affords significant savingsper iteration relative to the interior point approach, O ( n N ) vs. O ( n N ) , and relativeto a method agnostic to the structure of C , O ( n N ) vs. O ( n N ) . The projection ontothe feasible set (cid:107) W x (cid:107) ≤ τ can be done in O ( nN log ( nN )) time. In this chapter, we have presented an optimization approach to Kalman smoothing,together with a survey of applications and extensions. In Section 2.5, we showedthat the recursive Kalman filtering and smoothing algorithm is equivalent to algo-rithm 2.1, an efficient method to solve block tridiagonal positive definite systems.In the following sections, we used this algorithm as a subroutine, allowing us topresent new ideas on a high level, without needing to explicitly write down modi-fied Kalman filtering and smoothing equations.We have presented extensions to nonlinear process and measurement models inSection 3, described constrained Kalman smoothing (both the linear and nonlinearcases) in Section 4, and presented an entire class of robust Kalman smoothers (de-rived by considering log-linear-quadratic densities) in Section 5. 
In this chapter, we have presented an optimization approach to Kalman smoothing, together with a survey of applications and extensions. In Section 2.5, we showed that the recursive Kalman filtering and smoothing algorithm is equivalent to algorithm 2.1, an efficient method for solving block tridiagonal positive definite systems. In the subsequent sections, we used this algorithm as a subroutine, which allowed us to present new ideas at a high level, without needing to explicitly write down modified Kalman filtering and smoothing equations.

We presented extensions to nonlinear process and measurement models in Section 3, described constrained Kalman smoothing (in both the linear and nonlinear cases) in Section 4, and presented an entire class of robust Kalman smoothers (derived by considering piecewise linear-quadratic densities) in Section 5. For all of these applications, nonlinearity in the process, measurements, and constraints can be handled by a generalized Gauss-Newton method that exploits the convex composite structure discussed in Sections 3.1 and 4.4. The GN subproblem can be solved either in closed form or via an interior point approach; in both cases, algorithm 2.1 is used. For all of these extensions, numerical illustrations have been presented, and most are available for public release through the ckbs package [6]. In the case of the robust smoothers, it is possible to extend the density modeling approach by considering densities outside the log-concave class [3], but we do not discuss this work here.

We ended the survey of extensions by considering two novel approaches to Kalman smoothing of sparse systems, for applications where modeling the sparsity of the state sequence improves recovery. The first method built on the reader's familiarity with the interior point approach, developed as a tool for the constrained extension in Section 4. The second method is suitable for large systems, where exact solution of the linear systems is not possible. Numerical illustrations of these methods have been left to future work.
References
1. D. Angelosante, S.I. Roumeliotis, and G.B. Giannakis. Lasso-Kalman smoother for tracking sparse signals. In Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, pages 181-185, Nov. 2009.
2. C.F. Ansley and R. Kohn. A geometric derivation of the fixed interval smoothing algorithm. Biometrika, 69:486-487, 1982.
3. A. Aravkin, J.V. Burke, and G. Pillonetto. Robust and trend-following Kalman smoothers using Student's t. In International Federation of Automatic Control (IFAC), 16th Symposium on System Identification, Oct. 2011.
4. A. Aravkin, J.V. Burke, and G. Pillonetto. A statistical and computational theory for robust and sparse Kalman smoothing. In International Federation of Automatic Control (IFAC), 16th Symposium on System Identification, Oct. 2011.
5. A.Y. Aravkin. Robust Methods with Applications to Kalman Smoothing and Bundle Adjustment. PhD thesis, University of Washington, Seattle, WA, June 2010.
6. A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. Matlab/Octave package for constrained and robust Kalman smoothing.
7. A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. An ℓ1-Laplace robust Kalman smoother. IEEE Transactions on Automatic Control, 56(12):2898-2911, Dec. 2011.
8. A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. Learning using state space kernel machines. In Proc. IFAC World Congress 2011, Milan, Italy, 2011.
9. B.M. Bell. The iterated Kalman smoother as a Gauss-Newton method. SIAM Journal on Optimization, 4(3):626-636, August 1994.
10. B.M. Bell. The marginal likelihood for parameters in a discrete Gauss-Markov process. IEEE Transactions on Signal Processing, 48(3):626-636, August 2000.
11. B.M. Bell, J.V. Burke, and G. Pillonetto. An inequality constrained nonlinear Kalman-Bucy smoother by interior point likelihood maximization. Automatica, 45(1):25-33, January 2009.
12. B.M. Bell and F. Cathey. The iterated Kalman filter update as a Gauss-Newton method. IEEE Transactions on Automatic Control, 38(2):294-297, February 1993.
13. A.M. Bruckstein, D.L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34-81, February 2009.
14. J.V. Burke and S.-P. Han. A robust sequential quadratic programming method. Mathematical Programming, 43:277-303, 1989.
15. J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Mathematical Programming, 33:260-279, 1985.
16. A. Carmi, P. Gurfil, and D. Kanevsky. Methods for sparse signal recovery using Kalman filtering with embedded pseudo-measurement norms and quasi-norms. IEEE Transactions on Signal Processing, in print.
17. A. Carmi, P. Gurfil, and D. Kanevsky. A simple method for sparse signal recovery from noisy observations using Kalman filtering. Technical Report RC24709, Human Language Technologies, IBM, 2008.
18. R. Van der Merwe. Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. PhD thesis, OGI School of Science and Engineering, Oregon Health and Science University, April 2004.
19. F. Dinuzzo, M. Neve, G. De Nicolao, and U.P. Gianazza. On the representer theorem and equivalent degrees of freedom of SVR. Journal of Machine Learning Research, 8:2467-2495, 2007.
20. D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, April 2006.
21. L. Fahrmeir and H. Kaufmann. On Kalman filtering, posterior mode estimation, and Fisher scoring in dynamic exponential family regression. Metrika, pages 37-60, 1991.
22. L. Fahrmeir and R. Kunstler. Penalized likelihood smoothing in robust state space models. Metrika, 49:173-191, 1998.
23. J. Gao. Robust L1 principal component analysis and its Bayesian variational inference. Neural Computation, 20(2):555-572, February 2008.
24. S. Gillijns, O. Barrero Mendoza, J. Chandrasekar, B.L.R. De Moor, D.S. Bernstein, and A. Ridley. What is the ensemble Kalman filter and how well does it work? In Proceedings of the American Control Conference, pages 4448-4453. IEEE, 2006.
25. G.A. Hewer, R.D. Martin, and J. Zeh. Robust preprocessing for Kalman filtering of glint noise. IEEE Transactions on Aerospace and Electronic Systems, AES-23(1):120-128, January 1987.
26. A. Jazwinski. Stochastic Processes and Filtering Theory. Dover Publications, Inc., 1970.
27. J.E. Dennis, Jr. and R.B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Computational Mathematics. Prentice-Hall, 1983.
28. S. Julier, J. Uhlmann, and H. Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on Automatic Control, 45(3):477-482, March 2000.
29. R.E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82(D):35-45, 1960.
30. R. Kandepu, B. Foss, and L. Imsland. Applying the unscented Kalman filter for nonlinear state estimation. Journal of Process Control, 18:753-768, 2008.
31. S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339-360, 2009.
32. M. Kojima, N. Megiddo, T. Noma, and A. Yoshise. A Unified Approach to Interior Point Algorithms for Linear Complementarity Problems, volume 538 of Lecture Notes in Computer Science. Springer Verlag, Berlin, Germany, 1991.
33. S. Kourouklis and C.C. Paige. A constrained least squares approach to the general Gauss-Markov linear model. Journal of the American Statistical Association, 76(375):620-625, September 1981.
34. T. Lefebvre, H. Bruyninckx, and J. De Schutter. Kalman filters for nonlinear systems: a comparison of performance. International Journal of Control, 77(7):639-653, May 2004.
35. J.S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93:1032-1044, 1998.
36. H. Mansour, H. Wason, T.T.Y. Lin, and F.J. Herrmann. Randomized marine acquisition with compressive sampling matrices. Geophysical Prospecting, 60(4):648-662, 2012.
37. A. Nemirovskii and Y. Nesterov. Interior-Point Polynomial Algorithms in Convex Programming, volume 13 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, USA, 1994.
38. B. Oksendal. Stochastic Differential Equations. Springer, sixth edition, 2005.
39. C.C. Paige and M.A. Saunders. Least squares estimation of discrete linear dynamic systems using orthogonal transformations. SIAM Journal on Numerical Analysis, 14(2):180-193, April 1977.
40. C.C. Paige. Covariance matrix representation in linear filtering. Contemporary Mathematics, 47, 1985.
41. G. Pillonetto, A.Y. Aravkin, and S. Carpin. The unconstrained and inequality constrained moving horizon approach to robot localization. In IROS, pages 3830-3835, 2010.
42. H.E. Rauch, F. Tung, and C.T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1145-1150, 1965.
43. R.T. Rockafellar and R.J-B. Wets. Variational Analysis, volume 317 of A Series of Comprehensive Studies in Mathematics. Springer, 1998.
44. I.C. Schick and S.K. Mitter. Robust recursive estimation in the presence of heavy-tailed observation noise. The Annals of Statistics, 22(2):1045-1080, June 1994.
45. R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.
46. E. van den Berg and M.P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890-912, 2008.
47. N. Vaswani. Kalman filtered compressed sensing. In Proceedings of the IEEE International Conference on Image Processing (ICIP), October 2008.
48. G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
49. S.J. Wright. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, 1997.