Sparse Approximations with Interior Point Methods
Valentina De Simone, Daniela di Serafino, Jacek Gondzio, Spyridon Pougkakiotis, Marco Viola
aa r X i v : . [ m a t h . O C ] F e b SPARSE APPROXIMATIONS WITH INTERIOR POINT METHODS ∗ VALENTINA DE SIMONE † , DANIELA DI SERAFINO ‡ , JACEK GONDZIO § , SPYRIDONPOUGKAKIOTIS § , AND
MARCO VIOLA † VERSION 1 – February 24, 2021
Abstract.
Large-scale optimization problems that seek sparse solutions have become ubiquitous.They are routinely solved with various specialized first-order methods. Although such methods areoften fast, they usually struggle with not-so-well conditioned problems. In this paper, specializedvariants of an interior point-proximal method of multipliers are proposed and analyzed for problemsof this class. Computational experience on a variety of problems, namely, multi-period portfoliooptimization, classification of data coming from functional Magnetic Resonance Imaging, restorationof images corrupted by Poisson noise, and classification via regularized logistic regression, providessubstantial evidence that interior point methods, equipped with suitable linear algebra, can offer anoticeable advantage over first-order approaches.
Key words.
Sparse Approximations, Interior Point Methods, Proximal Methods of Multipliers,Nonlinear Convex Programming, Solution of KKT Systems, Portfolio Optimization, Image Restora-tion, Classification in Machine Learning.
AMS subject classifications.
1. Introduction.
We are concerned with the efficient solution of a class of prob-lems which are very large and are expected to yield sparse solutions. In practice, thesparsity is often induced by the presence of ℓ norm terms in the objective. We assumethat a general problem of the following formmin x f ( x ) + τ k x k + τ k Lx k s.t. Ax = b, (1.1)needs to be solved, where f : R n R is a twice continuously differentiable convexfunction, L ∈ R l × n , A ∈ R m × n , b ∈ R m , and τ , τ >
0. We are particularly interestedin problems for which f ( x ) displays some level of separability. The terms k x k and k Lx k induce sparsity in the vector x and/or in some (possibly redundant) dictionary Lx . Numerous real-life problems can be recast into the form (1.1). Among the variousapplication areas, one can find portfolio optimization [40], signal and image processing[16, 50], classification in statistics [54] and machine learning [57], inverse problems [55]and compressed sensing [14], to mention just a few.Optimization problems arising in these applications are usually solved by differ-ent specialized variants of first-order methods. Indeed, highly specialized and tuned ∗ Funding: this work was funded by various institutes and research programs. V. De Simone,D. di Serafino and M. Viola were supported by the Istituto Nazionale di Alta Matematica, GruppoNazionale per il Calcolo Scientifico (INdAM-GNCS), and by the V:ALERE Program of the Uni-versity of Campania “L. Vanvitelli”, Italy. S. Pougkakiotis was supported by a Principal’s CareerDevelopment scholarship from the University of Edinburgh, as well as a scholarship from A. G. Lev-entis Foundation. J. Gondzio and S. Pougkakiotis were also supported by the Google project “Fast(1 + x )-order Methods for Linear Programming”. We wish to remark that this study does not haveany conflict of interest to disclose. † Department of Mathematics and Physics, University of Campania “L. Vanvitelli”, Caserta, Italy([email protected], [email protected]). ‡ Department of Mathematics and Applications, University of Naples Federico II, Napoli, Italy(daniela.diserafi[email protected]). § School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edin-burgh, Edinburgh, United Kingdom ([email protected], [email protected]).1
V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA to a narrow class of problems, first-order methods often outperform standard off-the-shelf second-order techniques; the latter might be too expensive or might strugglewith excessive memory requirements. Such comparisons are not fair though. Withthis paper we hope to change the incorrect opinion on the second-order methods.When efficiently implemented and specialized to a particular application, interiorpoint methods (IPMs) offer an attractive alternative. They can be equally (or more)efficient than the best first-order methods available, and they deliver unmatched ro-bustness and reliability.The specializations of interior point methods proposed in this paper do not gobeyond what has been commonly exploited by the first-order methods. Namely wepropose: • to exploit special features of the problems in the linear algebra of IPM, and • to take advantage of the expected sparsity of the optimal solution.In order to achieve our goals we propose to convert sparse approximation problemsto standard smooth nonlinear convex programming ones by replacing the ℓ normterms with a usual modeling trick in which an absolute value | a | is substituted withthe sum of two non-negative parts, | a | = a + + a − , where a + = max { a, } and a − =max {− a, } . By introducing the auxiliary variable d = Lx ∈ R l , problem (1.1) is thentransformed into the following one:min x + ,x − ,d + ,d − f ( x + − x − ) + τ ( e ⊤ n x + + e ⊤ n x − ) + τ ( e ⊤ l d + + e ⊤ l d − ) , s.t. A ( x + − x − ) = b,L ( x + − x − ) = d + − d − ,x + , x − , d + , d − ≥ , (1.2)where x + , x − ∈ R n are such that x = x + − x − , d + , d − ∈ R l are such that d = d + − d − ,and e n ∈ R n , e l ∈ R l are vectors with entries equal to 1. It is worth observing that(1.1) and (1.2) are equivalent; indeed the presence of linear terms which penalize forthe sum of positive and negative parts of vectors x and d guarantee that at optimalityonly one of the split variables can take a nonzero value. Although (1.2) is larger than(1.1) because the variables have been replicated and new constraints have been added,it is in a form eligible to a direct application of an Interior Point Method (IPM). Weexpect that the well-known ability of IPMs to handle large sets of linear equality andnon-negativity constraints will compensate for this increase of the problem dimension.IPMs employ Newton method to solve a sequence of logarithmic barrier subprob-lems. In standard implementations of IPMs this requires many involved linear algebraoperations (building and inverting the Hessian matrix), and for large problems it mightbecome prohibitively expensive. In this paper we demonstrate that the use of inexactNewton method [7, 13, 33] combined with a knowledgeable choice and appropriatetuning of linear algebra solvers (see [23, 26, 32, 35] and the references therein) is thekey to success when developing an IPM specialized to a particular class of problems.We also demonstrate an attractive ability of IPMs to select important variables andprematurely drop the irrelevant ones, a feature which is very well suited to solvingsparse approximation problems in which the majority of variables are expected to bezero at optimality. 
It is worth mentioning at this point that our understanding ofthe features of IPMs applied to sparse approximation problems benefitted from theearlier studies which focused on compressed sensing problems [29, 30].Ultimately, we provide computational evidence that IPMs can be more efficientthan methods which are routinely used for the solution of sparse approximation prob-lems by exploiting only first-order information. PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS Notation.
Throughout this paper we use lowercase Roman and Greek fonts toindicate scalars and vectors (the nature is clear from the context). Capital italicizedRoman fonts are used to indicate matrices. Superscripts are used to denote thecomponents of a vector/matrix. As an example, given M ∈ R m × n , v ∈ R n R ⊆{ , . . . , m } , and C ⊆ { , . . . , n } , we set v C := ( v i ) i ∈C and M R , C := (cid:0) m ij (cid:1) i ∈R ,j ∈C ,where v i is the i -th entry of v and m ij the ( i, j )-th entry of M . We use λ min ( B )( λ max ( B ), respectively) to denote the minimum (maximum) eigenvalue of an arbitrarysquare matrix B with real eigenvalues. Similarly, σ min ( B ) ( σ max ( B ), respectively)denotes the minimum (maximum) singular value of an arbitrary rectangular matrix B . We use e n and 0 n to denote a column vector of size n with entries equal to 1 or 0,respectively. Moreover, we use I n to indicate the identity matrix of size n and 0 m,n to denote the zero matrix of size m × n . We use subscripts to denote the elements ofa sequence, e.g., { x k } . Norms k · k are ℓ , unless stated otherwise. For any finite set A , we denote by |A| its cardinality. Finally, when referring to convex programmingproblems, we implicitly assume that the problems are linearly constrained. Structure of the article.
The rest of this article is organized as follows. InSection 2 we briefly describe IPMs for convex programming, focusing in particularon the
Interior Point-Proximal Method of Multipliers (IP-PMM), which is used inthe subsequent sections. The choice of IP-PMM is motivated by the fact that itmerges an infeasible IPM with the PMM approach, in order to keep the fast andreliable convergence properties of IPMs and the strong convexity of the PMM sub-problems, thus achieving better efficiency and robustness than both methods. Inthis section we also outline the testing environment used throughout the paper. InSections 3 to 6 we present four applications formulated as optimization problemswith sparsity sought in the solutions, and recast them in the form (1.2). In detail,in Section 3 we focus on a multi-period portfolio selection strategy, in Section 4 onthe classification of data coming from functional Magnetic Resonance Imaging, inSection 5 on the restoration of images corrupted by Poisson noise, and in Section 6 onlinear classification through regularized logistic regression. The first two applicationsyield convex quadratic programming problems, while the remaining ones yield generalnonlinear convex programming problems. For each application, we provide a briefdescription of its mathematical model and explain how IP-PMM is specialized forthat case in terms of linear algebra solvers, including variable dropping strategies tohelp sparsification; we also show the results of computational experiments, includingcomparisons with state-of-the art methods widely used by the scientific communityon the selected problems.
2. Interior Point Methods for Convex Programming.
We consider thefollowing convex programming problem:(2.1) min x f ( x ) , s.t. Ax = b, x ≥ , where x ∈ R n , A ∈ R m × n , and f : R n R is a twice differentiable convex function.Using the Lagrangian duality theory [10], and defining a function F ( w ) : R n + m R n + m , we write the KKT (optimality) conditions as follows:(2.2) F ( w ) = ∇ f ( x ) − A ⊤ y − zAx − bXZe n = n m n , V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA where y ∈ R m and z ∈ R n are the Lagrange multipliers corresponding to the equal-ity and inequality constraints respectively, while X, Z ∈ R n × n denote the diagonalmatrices with diagonal entries x i and z i (resp.), ∀ i ∈ { , . . . , n } .Problem (2.1) can be solved using a primal-dual IPM. There are numerous vari-ants of IPMs and the reader is referred to [32] for an extended literature review. IPMshandle the non-negativity constraints of the problems with logarithmic barriers in theobjective. That is, at each iteration k , we choose a barrier parameter µ k and formthe logarithmic barrier problem :(2.3) min x f ( x ) − µ k n X j =1 ln x j , s.t. Ax = b. Then, Newton method (possibly an inexact variant of it [7, 33, 47]) is usuallyemployed in order to approximately solve problem (2.3). Newton method is favoreddue to the self-concordance property of the logarithmic barrier [45]. Applying it tothe optimality conditions of (2.3), and further forming the augmented system (bymeans of pivoting), we obtain a system of the following form:(2.4) (cid:20) − ( ∇ f ( x k ) + Θ − k ) A ⊤ A m,n (cid:21) (cid:20) ∆ x k ∆ y k (cid:21) = (cid:20) ∇ f ( x k ) − A ⊤ y k − σ k µ k X − k eb − Ax k (cid:21) , where Θ k = X k Z − k . One can observe that the matrix Θ k contains some very largeand some very small elements close to optimality. Hence, the matrix in (2.4) becomesincreasingly ill-conditioned, as the method progresses. Notice that as µ k →
0, theoptimal solution of (2.3) converges to the optimal solution of (2.1). Polynomial con-vergence of such methods, for various classes of problems, has been proved multipletimes in the literature (see for example [45, 63]).System like (2.4) can either be solved “exactly” (using an appropriate factoriza-tion, as in [1, 31, 48]) or iteratively (using an appropriate Krylov subspace method, asin [8, 12, 23, 35]). While the former approach is very general, it becomes problematicas the problem size increases. On the other hand, iterative methods (accompaniedby appropriate preconditioners) may be difficult to generalize. However, if appliedto specific classes of problems, they can allow one to solve huge-scale instances, byavoiding the explicit storage of the problem matrices.
In the context of IPMs, it is often beneficial toinclude some regularization, in order to improve the spectral properties of the systemmatrix in (2.4). For example, notice that if the constraint matrix A is rank-deficient,then the matrix in (2.4) might not be invertible. The latter can be immediatelyaddressed by the introduction of a dual regularization, say δ >
0, ensuring thatrank([
A δI m ]) = m . The introduction of a primal regularization, say ρ >
0, can ensurethat the matrix ∇ f ( x k )+Θ − k + ρI n , has eigenvalues that are bounded away from zero,and hence a significantly better worst-case conditioning than that of ∇ f ( x k ) + Θ − k .To produce a diagonal term in the (2,2) block Vanderbei added artificial variables toall the constraints [56]. Saunders and Tomlin [51, 52] achieved a similar result for the(1,1) and (2,2) blocks, by adding Tikhonov-type regularization terms to the originalproblem.In later works, these Tikhonov-type regularization methods were replaced by algo-rithmic regularization schemes. In particular, one can observe that a very natural wayof introducing primal regularization to problem (2.1), is through the application ofthe primal proximal point method. Similarly, dual regularization can be incorporated PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS ǫ -optimal solution in a polynomial number of it-erations for convex QP problems and for linear semidefinite programming problems(see [47]), including cases when the associated Newton systems are solved inexactly. In this subsection,we derive an IP-PMM suitable for solving convex programming problems. The methodis based on the developments in [48]. We consider the following primal problem (whichcan be equivalently formulated as (2.1), by adding some additional constraints):(2.5) min x f ( x ) , s.t. Ax = b, x I ≥ , x F free , where I ⊆ { , . . . , n } , F = { , . . . , n } \ I . In the above problem, we assume that thedimensions of the involved matrices are the same as those in (2.1). Effectively, anIP-PMM arises by merging PMM with an infeasible IPM. For that purpose, assumethat, at some iteration k of the method, we have available an estimate η k for anoptimal Lagrange multiplier vector y ∗ associated to the equality constraints of (2.5).Similarly, we denote by ζ k an estimate of a primal solution x ∗ . Now, we define theproximal penalty function that has to be minimized at the k -th iteration of the PMM,for solving (2.5), given the estimates η k , ζ k : L P MMρ k ,δ k ( x ; ζ k , η k ) = f ( x ) − η ⊤ k ( Ax − b ) + 12 δ k k Ax − b k + ρ k k x − ζ k k , with { δ k } , { ρ k } two positive non-increasing sequences of penalty parameters. Follow-ing [48], we require that these parameters decrease at the same rate as µ k ; however,in practice we never allow these values to be reduced below a certain appropriatelychosen threshold. For more details on how to choose these constants for general prob-lems, the reader is referred to [46], where a perturbation analysis of regularization isperformed. In order to solve the PMM sub-problem, we apply one (or a few) iterationsof an infeasible IPM. To do that, we alter the previous penalty function, by includinglogarithmic barriers, that is(2.6) L IP − P MMρ k ,δ k ( x ; ζ k , η k ) = L P MMδ k ,ρ k ( x ; ζ k , η k ) − µ k X j ∈I ln x j , where µ k > L IP − P MMδ k ,ρ k with respect to x to the zerovector, i.e. ∇ f ( x ) − A ⊤ η k + 1 δ k A ⊤ ( Ax − b ) + ρ k ( x − ζ k ) − (cid:20) |F| µ k ( X I ) − e |I| (cid:21) = 0 n . V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA
Next, we define the variables: y = η k − δ k ( Ax − b ) and z ∈ R n , such that z I = µ k ( X I ) − e |I| , z F = 0, to get the following (equivalent) system of equations: ∇ f ( x ) − A ⊤ y − z + ρ k ( x − ζ k ) Ax + δ k ( y − η k ) − bX I z I − µ k e |I| = n m n . To solve the previous mildly nonlinear system of equations, at every iteration k , weemploy Newton method and alter its right-hand side, using a centering parameter σ k ∈ (0 , x k , y k , z k ) and we want to solve the following system of equations:(2.7) − ( ∇ f ( x k ) + ρ k I n ) A ⊤ I n A δ k I m m,n Z k n,m X k ∆ x ∆ y (cid:20) |F| ∆ z I (cid:21) = ∇ f ( x k ) − A ⊤ y k + σ k ρ k ( x k − ζ k ) − z k b − Ax k − σ k δ k ( y k − η k ) (cid:20) |F| σ k µ k e |I| − X I k z I k (cid:21) . From the third block-equation of (2.7) we have ∆ z F = 0 and,∆ z I = ( X I k ) − ( − Z I k ∆ x I + σ k µ k e |I| − X I k z I k ) . In light of the previous computations, (2.7) reduces to: (cid:20) − ( ∇ f ( x k ) + Θ − k + ρ k I n ) A ⊤ A δ k I m (cid:21) (cid:20) ∆ x ∆ y (cid:21) = (cid:20) r ,k r ,k (cid:21) , (2.8)where(2.9) (cid:20) r ,k r ,k (cid:21) = ∇ f ( x k ) − A ⊤ y k + σ k ρ k ( x k − ζ k ) − (cid:20) |F| σ k µ k ( X I k ) − e |I| (cid:21) b − Ax k − σ k δ k ( y k − η k ) , and Θ − k = (cid:20) |F| , |F| |I| , |F| |F| , |I| ( X I k ) − ( Z I k ) (cid:21) . In the special case where ∇ f ( x k ) is a diagonal (or zero) matrix, it could be beneficialto further reduce system (2.8), by eliminating variables ∆ x . The resulting normalequations yield a positive definite system of equations that reads as follows:(2.10) (cid:0) A ( ∇ f ( x k ) + Θ − k + ρ k I n ) − A ⊤ + δ k I m (cid:1) ∆ y = r ,k + A ( ∇ f ( x k ) + Θ − k + ρ k I n ) − ( r ,k ) . The parameters η k , ζ k are tuned as in [48]. In particular, we set η = y and ζ = x , where ( x , y , z ) is the starting point of IP-PMM. Then, at the end of everyiteration k , we set ( ζ k +1 , η k +1 ) = ( x k +1 , y k +1 ), only if the primal and dual residualsare decreased sufficiently. If the latter is not the case, we set ( ζ k +1 , η k +1 ) = ( ζ k , η k ). PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS ǫ -optimal solution in a polynomial number of iterations, if f is a convex quadratic function. Furthermore, the latter holds for linear semi-definiteprogramming problems, even if one solves the Newton system inexactly (see [47]).Nevertheless, the previous is not proven to hold for the general convex (nonlinear)case. In the latter case, one would have to employ Newton method combined witha line-search or a trust-region strategy (e.g. see [58]), in order to guarantee theconvergence of the method. In all the cases analyzed in this work we make use of asimple Mehrotra-type (see [41]) predictor-corrector scheme approximating the centralpath, which in general is sufficient to produce good directions that allow the methodto converge quickly to the optimal solution. In the corrector stage, the right-handside is approximated by the linearization of the function that is being minimized (see[53]). The various specialiations of IP-PMM discussedin the following sections have all been implemented in MATLAB and compared withMATLAB implementations of state-of-the-art methods for each specific problem. Allthe tests were run with MATLAB R2019b on an Intel Xeon Platinum 8168 CPUwith 192 GB of RAM, available from the magicbox server at the Department ofMathematics and Physics of the University of Campania “L. Vanvitelli”.
3. Portfolio Selection Problem.
Portfolio selection is one of the most centraltopics in modern financial economics. It deals with the decision problem of how toallocate resources among several competing assets in accordance with the investmentobjectives. For medium- and long-time horizons, the multi-period strategy is suitable,because it allows the change of the allocation of the capital across the assets, takinginto account the evolution of the available information. In a multi-period setting, theinvestment period is partitioned into m sub-periods, delimited by m + 1 rebalancingdates t j . The decisions are taken at the beginning of each sub-period [ t j , t j +1 ), j =1 , ..., m , and kept within it. The optimal portfolio is defined by the vector w = [ w ⊤ , w ⊤ , . . . , w ⊤ m ] ⊤ , where w j ∈ R s is the portfolio of holdings at time t j and s is the number of assets.The mean-variance formulation proposed by Markowitz [40] was extended to amulti-period portfolio selection by Li and Ng [38], and in recent years there has beena significant advancement of both theory and methodologies. In a multi-period mean-variance framework, we fix a final target expected return and adopt as risk measurethe function obtained by summing the single-period variance terms [17]: ρ ( w ) = m X j =1 w ⊤ j C j w j , where C j ∈ R s × s is the covariance matrix, assumed to be positive definite, estimatedat t j . A common strategy to estimate Markowitz model parameters is to use historicaldata as predictive of the future behavior of asset returns. Different regularization tech-niques have been proposed to deal with ill-conditioning due to asset correlation; in thelast years the ℓ -regularization has been used to promote sparsity in the solution [21].It allows investors to reduce the number of positions to be monitored and the holdingand transaction costs. Another useful interpretation of the ℓ norm is related to theamount of short positions, which indicate an investment strategy where an investor V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA is selling borrowed stocks in the open market, expecting that the market will drop,in order to realize a profit. A suitable tuning of the regularization parameter permitsa short controlling in both the single- and the multi-period case [18, 21]. However,in the multi-period case, the sparsity in the solution does not guarantee the controlof the transaction costs, especially if the pattern of the active positions completelychanges across periods. In this case, sparsity must be introduced in the variation,e.g. by adding an ℓ term involving the differences of the wealth values allocated onthe assets between two contiguous rebalancing times. This acts as a penalty on theportfolio turnover, which has the effect of reducing the number of transactions andhence the transaction costs [19, 24].Thus, we consider the following fused lasso optimization problem for multi-periodportfolio selection [24]:(3.1) min w w ⊤ Cw + τ k w k + τ k Lw k , s.t. w ⊤ e s = ξ init ,w ⊤ j e s = ( e s + r j − ) ⊤ w j − , j = 2 , . . . , m, ( e s + r m ) ⊤ w m = ξ term , where n = m s , C = diag( C , C , . . . , C m ) ∈ R n × n is a block diagonal symmetricpositive definite matrix, τ , τ > L ∈ R ( n − s ) × n is the discrete difference operatorrepresenting the fused lasso regularizer, r j ∈ R s is the expected return vector at time t j , ξ init is the initial wealth, and ξ term is the target expected wealth resulting from theoverall investment. The first constraint is the initial budget constraint. 
The strategyis assumed to be self-financing, as constraints from 2 to m establish; this means thatthe value of the portfolio changes only because the asset prices change. The ( m + 1)-stconstraint defines the expected final wealth. To deal with the non-separability of theobjective function in (3.1), we introduce an auxiliary variable d , which is constrainedto be equal to Lw , and we equivalently formulate problem (3.1) as follows:(3.2) min w,d w ⊤ Cw + τ k w k + τ k d k , s.t. ¯ Aw = ¯ b,Lw = d, where the constraint matrix ¯ A ∈ R ( m +1) × n can be interpreted as a ( m + 1) × m lowerbi-diagonal block matrix, with blocks of dimension 1 × s defined by¯ A i,j = e ⊤ s if i = j, − ( e s + r i − ) ⊤ if j = i + 1 , ⊤ s otherwise , and ¯ b = ( ξ init , , , ..., ξ term ) ⊤ ∈ R m +1 . Using the standard trick described in Section 1, we split w and d into twovectors of the same size, representing the non-negative and non-positive parts of theentries of w and d respectively, i.e., w = w + − w − and d = d + − d − . Then, prob-lem (3.2) is reformulated as the following QP problem:(3.3) min x x ⊤ Qx + c ⊤ x, s.t. Ax = b, x ≥ PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS l = n − s , n = 2( n + l ) = 2 s (2 m − m = m + 1 + l = ( m + 1) + s ( m − x = [( w + ) ⊤ , ( w − ) ⊤ , ( d + ) ⊤ , ( d − ) ⊤ ] ⊤ ∈ R n , (3.4) Q = (cid:20) C − C − C C (cid:21) n, l l, n l, l ∈ R n × n , A = (cid:20) ¯ A − ¯ A ( m +1) , l L − L (cid:2) − I l I l (cid:3) (cid:21) ∈ R m × n ,c = [ τ , . . . , τ , τ , . . . , τ ] ⊤ ∈ R n , b = [¯ b , . . . , ¯ b m +1 , , . . . , ⊤ ∈ R m . The optimal solution of problem (3.2) isexpected to be sparse. On the other hand, in light of the reformulation given in (3.3),we anticipate (and verify in practice) that most of the primal variables w attain azero value close to optimality. Such variables may significantly contribute to the illconditioning of the matrix in (2.8). In order to take advantage of this special propertydisplayed by problem (3.3), we employ the following heuristic method, which aims atdropping variables x j which are sufficiently close zero, their seemingly optimal value.This results in better conditioning of the augmented system, whose dimension is alsosignificantly reduced close to optimality, thus decreasing the computational cost ofeach IPM iteration. In other words, as IP-PMM progresses, we project the problemonto a smaller space, with the assumption that the projected (smaller) problem hasthe same optimal solution as that of the original problem.In particular, we set a threshold value ǫ drop >
0, and a large constant ξ >
0. Atiteration k = 0, we define a set V = ∅ . Then, at every iteration k of IP-PMM, wecheck the following condition, for every j ∈ I \ V :(3.5) x jk ≤ ǫ drop and z jk ≥ ξ · ǫ drop and ( r d ) jk ≤ ǫ drop , where ( r d ) jk = ( c − A ⊤ y k + Qx k − z k ) j represents the dual infeasibility correspondingto the j -th variable. Any variable that satisfies the latter condition is dropped, thatis, we set x jk = 0, V = V ∪ { j } , G = F ∪ ( I \ V ), we drop z jk and solve (cid:20) − ( Q G , G + (Θ G , G k ) − + ρ k I |G| ) ( A H , G ) ⊤ A H , G δ k I m (cid:21) (cid:20) ∆ x ∆ y (cid:21) = (cid:20) r G ,k r ,k (cid:21) , (3.6)where, in the presence of free variables, (Θ F , F k ) − is considered to be the zero matrix, r ,k , r ,k are defined in (2.9) (by substituting A H , G as the constraint matrix), and H = { , . . . , m } . We should note that this is a heuristic, since once a variable isdropped, it is not considered again until the method converges. Hence, one has tomake sure that none of the nonzero variables x jk is dropped. Nevertheless, at the endof the optimization process we can test whether the variables in V are indeed nonzero.More specifically, once an optimal solution ( x ∗ , y ∗ , z ∗ ) is found, we compute: z V = c V − ( A H , V ) ⊤ y ∗ + Q V , G ( x ∗ ) G . If there exists j such that ( z V ) j ≤
0, then we would identify that a variable x j wasincorrectly dropped. Otherwise, the optimal solution of the reduced problem coincideswith the optimal solution of (3.3). We notice that this methodology is not new. Inparticular, a similar strategy was employed in [34], where a special class of linearprogramming problems was solved using a primal-dual barrier method.0 V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA
We test the effectiveness of the IP-PMMapplied to the fused lasso model on the following real-world data sets:1. FF48-FF100 (Fama & French 48-100 Industry portfolios, USA), containing48-100 portfolios considered as assets, from July 1926 to December 2015.2. ES50 (EURO STOXX 50), containing 50 stocks from 9 Eurozone countries(Belgium, Finland, France, Germany, Ireland, Italy, Luxembourg, the Nether-lands and Spain), from January 2008 to December 2013.3. FTSE100 (Financial Times Stock Exchange, UK), containing 100 assets, fromJuly 2002 to April 2016.4. SP500 (Standard & Poors, USA), containing 500 assets, from January 2008to December 2016.5. NASDAQC (National Association of Securities Dealers Automated QuotationComposite, USA), containing almost all stocks listed on the Nasdaq stockmarket, from February 2003 to April 2016.Following [19, 20], we generate 10 problems with annual or quarterly rebalancing, aftera preprocessing procedure that eliminates the elements with the highest volatilities.A rolling window (RW) for setting up the model parameters is considered. For eachdataset, the length of the RWs is fixed in order to build positive definite covariancematrices and ensure statistical significance. Different datasets require different lengthsfor the RWs. In Table 1 we summarize the information on the test problems.
Table 1
Characteristics of the portfolio test problems (y = years, m = months)
Problem Assets RW Sub-periods n FF48-10 48 5 y 10 y 1632FF48-20 48 5 y 20 y 3552FF48-30 48 5 y 30 y 5472FF100-10 96 10 y 10 y 3264FF100-20 96 10 y 20 y 7104FF100-30 96 10 y 30 y 10,944ES50 50 1 y 22 m 4300FTSE100 83 1 y 10 y 3154SP500 431 2 y 8 y 11,206NASDAQC 1203 1 y 10 y 45,714
We introduce some measures to evaluate the goodness of the optimal portfoliosversus the benchmark one, in terms of risk, sparsity and transaction costs. We consideras benchmark the multi-period naive portfolio, based on the strategy for which at eachrebalancing date the total amount is equally divided among the assets. We assumethat the investor has one unit of wealth at the beginning of the planning horizon, i.e., ξ init = 1, and we set as expected final wealth the one provided by the benchmark. Asin [19], we define:(3.7) ratio = w ⊤ naive Cw naive w ⊤ opt Cw opt , where w naive and w opt are respectively the naive portfolio and the optimal one. Thisvalue measures the risk reduction factor with respect to the benchmark. We consider PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS ratio h = w naive w opt measures the reduction factor of the holding costs with respect to the benchmark.Finally, we consider the number of variations in the weights as a measure of transactioncosts. More precisely, if w ij = w ij +1 we assume that security i has been bought or soldin the period [ t j , t j +1 ), then we estimate the number of transactions as: T = trace ( V ⊤ V ) , where V ∈ R s × ( m − , with v ij = (cid:26) | w ij − w ij +1 | ≥ ǫ, . and ǫ >
0, in order to make sense in financial terms. A measure of the transactionreduction factor with respect to the benchmark is given by(3.9) ratio t = T naive T opt . We consider a version of the presented IP-PMM algorithm in which the solutionof problem (3.6) is computed exactly, the parameter ǫ drop controlling the heuristicdescribed in Section 3.1.1 is set to 10 − , and the constant ξ , which is used to ensurethat the respective dual slack variable is bounded away from zero, is set to 10 .We compare IP-PMM with the Split Bregman method, which is known to be veryefficient for this kind of problems. In detail, we consider the Alternating Split Bregmanalgorithm used in [20], based on a further reformulation of problem (3.2) asmin w,u,d w ⊤ Cw + τ k u k + τ k d k , s.t. ¯ Aw = ¯ b,Lw = d,w = u. This algorithm splits the minimization in three parts. Given w k , u k , d k , the ( k + 1)-stiteration consists in the minimization of a quadratic function to determine w k +1 andthe application of the soft-thresholding operator[ S ( v, γ )] i = v i | v i | max( | v i | − γ, , where v is a real vector and γ >
0, to determine u k +1 and d k +1 . The optimal value w k +1 can be obtained by solving the system Hw = p k +1 , where(3.10) H = C + λ A ⊤ A + λ L ⊤ L + λ I, where λ , λ , λ > p k +1 depends on the iteration. Since H is indepen-dent of the iteration and is symmetric positive definite and sparse, in [20] the authorscompute its sparse Cholesky factorization only once and solve two triangular systemsat each iteration. We refer to this algorithm as ASB-Chol.2 V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA
Table 2
IP-PMM vs ASB-Chol
IP-PMMProblem Time (s) Iters ratio ratio h ratio t FF48-10 1.37e − − − − − − ASB-CholProblem Time (s) Iters ratio ratio h ratio t FF48-10 1.67e − − − − − The values of τ and τ in (3.1) are selected to guarantee reasonable portfolios interms of short positions, i.e., in terms of negative components in the solution. Notethat from the financial point of view, negative solutions correspond to transactions inwhich an investor sells borrowed securities in anticipation of a price decline. In ourruns we consider the smallest values of τ and τ that produce at most 2% of shortpositions. We set τ = τ = 10 − for the FF48 and FF100 data sets, τ = τ = 10 − for ES50 and SP500, τ = 10 − and τ = 10 − for FTSE, and τ = 10 − and τ = 10 − for NASDAQC.In Table 2 we present the results obtained with IP-PMM and ASB-Chol on thetest problems. The termination criteria of IP-PMM are the same as in [48], while thoseof ASB-Chol are the usual ones based on the violation of constraints. The relativetolerance for the two algorithms is tol = 10 − , which guarantees that the values of ratio differ by at most 10%, so that both algorithms produce comparable portfoliosin terms of risk. We note that the solution computed by ASB-Chol is thresholded bysetting to zero all the entries with absolute value not exceeding the same value of ǫ drop used in the IP-PMM dropping strategy. The results show that the optimal portfolioscomputed by IP-PMM and ASB-Chol outperform the benchmark ones in terms ofall the metrics. Concerning ratio h and ratio t , IP-PMM is generally able to producegreater values than ASB-Chol, which indicates a higher sparsity in the solution foundby IP-PMM. IP-PMM generally performs comparably or better than ASB-Chol interms of elapsed time. Although ASB-Chol is faster that IP-PMM on SP500 by 14 . PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS
4. Classification models for functional Magnetic Resonance Imagingdata.
The functional Magnetic Resonance Imaging (fMRI) technique measures brainspatio-temporal activity via Blood-Oxygen-Level-Dependent (BOLD) signals. Start-ing from the assumption that neuronal activity is coupled with cerebral blood flow,fMRI signals have been used to identify regions associated with functions such asspeaking, vision, movement, etc.. By analyzing the different oxigenation levels in spe-cific areas of the brain of healthy and ill patients, in the last decades fMRI signals havebeen used to investigate the effect on the brain functionality of tumors, strokes, headand brain injuries and of cognitive disorders such as schizophrenia or Alzheimer’s (see[27, 37, 42] and the references therein).In an fMRI scan, voxels representing regions of the brain of a patient are recordedat different time instances. The temporal resolution is usually in the order of a fewseconds, while the spatial resolution generally ranges from 4-5 mm (for some full brainanalyses) to 1 mm (for analyses on specific brain regions), which may amount up toabout a million voxels. Since fMRI experiments are conducted over groups of patients,the dimensionality of the data is further increased. Therefore, the interpretation offMRI results requires the analysis of huge quantities of data. To this aim, machinelearning techniques are being increasingly used in recent years, because of their capa-bility of dealing with massive amounts of data, incorporating also a-priori informationabout the problems they are targeted to [3, 4, 27, 28, 36, 42, 49].Here we focus on the problem of training a binary linear classifier to distinguishbetween different classes of patients (e.g., ill/healthy) or different kinds of stimuli(e.g., pleasant/unpleasant), and to get information about the most significant brainareas associated with the related neural activity. The two classes are identified bythe labels − s − − s
3d scans in class 1, where each 3d scan is reshaped asa row vector of size q = q × q × q , and q i is the number of voxels along the i -thcoordinate direction of the domain covering the brain. All the scans are stored asrows of a matrix D ∈ R s × q , where s = s − + s .Following [4], we use a square loss function with the aim of determining an unbi-ased hyperplane in R q that can separate the patients in the two classes. This leadsto a minimization problem of the form:(4.1) min 12 s k Dw − ˆ y k , where ˆ y is a vector containing the labels associated with each scan. Since the numberof patients is usually much smaller than the size of a scan, i.e., s ≪ q , problem (4.1) isstrongly ill posed and thus requires regularization. Recently, significant attention hasbeen given to regularization terms encouraging the presence of structured sparsity,where smoothly varying non-zero coefficients of the solution are associated with smallcontiguous regions of the brain. This is motivated by the possibility of obtainingmore interpretable solutions than those corresponding to other regularizers that do4 V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA not promote sparsity or lead to sparse solutions without any structure (see [4] andthe references therein).Structured sparsity can be promoted, e.g., by using a combination of ℓ andanisotropic Total Variation (TV) terms [2], which can be regarded as a fused lassoregularizer [54]. The regularized problem reads(4.2) min w s k Dw − ˆ y k + τ k w k + τ k Lw k , where k Lw k is the discrete anisotropic TV of w , i.e., L = [ L ⊤ x L ⊤ y L ⊤ z ] ⊤ ∈ R l × q isthe matrix representing first-order forward finite differences in the x, y, z -directions ateach voxel. By penalizing the difference between each voxel and its neighbors in eachdirection, one enforces the weights of the classification hyperplane (which share the3d structure of the scans) to assume similar values for contiguous regions of the brain,thus leading to identify whole regions of the brain involved in the decision process.The previous problem can be reformulated by introducing the variables u = Dw and d = Lw and applying the splitting w = w + − w − , d = d + − d − , ( w + , w − , d + , d − ) ≥ (0 q , q , l , l ) . Let m = l + s and n = s + 2 q + 2 l . Using the previous variables, (4.2) can beequivalently written as:min x x ⊤ Qx + c ⊤ x, s.t. Ax = b,x I ≥ , x F free , I = { s + 1 , . . . , n } , F = { , . . . , s } , (4.3)where b = 0 s + l ∈ R m , x = [ u ⊤ , ( w + ) ⊤ , ( w − ) ⊤ , ( d + ) ⊤ , ( d − ) ⊤ ] ⊤ , c = [ − ˆ y ⊤ s , τ e ⊤ w , τ e ⊤ w , τ e ⊤ d , τ e ⊤ d ] ⊤ ∈ R n , and Q = (cid:20) s I s s, ( n − s ) ( n − s ) ,s ( n − s ) , ( n − s ) (cid:21) ∈ R n × n , A = (cid:20) − I s D − D s,l s,l l,s L − L − I l I l (cid:21) ∈ R m × n . (4.4) Notice thatproblem (4.3) is in the same form as (2.5). In what follows, we present a special-ized inexact IP-PMM, suitable for solving unconstrained fused lasso least squaresproblems. The proposed specialized IP-PMM is characterized by the two followingimplementation details. Firstly, instead of factorizing system (2.8), we employ aniterative method (namely, the Preconditioned Conjugate Gradient (PCG) method) tosolve system (2.10). Secondly, as suggested in Section 3.1.1, we take advantage of thefact that the optimal solution of such problems is expected to be sparse, and use theheuristic approach that allows us to drop many of the variables of the problem, whenthe method is close to the optimal solution.
We focus on solving the normal equa-tions in (2.10). Let k denote an arbitrary iteration of IP-PMM. We re-write the matrixin (2.10) without using the succinct notation introduced earlier:(4.5) M k = (cid:20) M ,k M ⊤ ,k M ,k M ,k (cid:21) , PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS M ,k = (cid:0) ( s + ρ k ) − + δ k (cid:1) I s + D (cid:16) (Θ − w + ,k + ρ k I q ) − + (Θ − w − ,k + ρ k I q ) − (cid:17) D ⊤ ,M ,k = L (Θ − w + + ρ k I q ) − D ⊤ + L (Θ − w − ,k + ρ k I q ) − D ⊤ ,M ,k = L (cid:16) (Θ − w + ,k + ρ k I q ) − + (Θ w − ,k + ρ k I q ) − (cid:17) L ⊤ + (cid:16) (Θ − d + ,k + ρ k I l ) − + (Θ − d − ,k + ρ k I l ) − + δ k I l (cid:17) , (4.6)and (Θ I k ) − = Θ − w + ,k q,q q,l q,l q,q Θ − w − ,k q,l q,l l,q l,q Θ − d + ,k l,l l,q l,q l,l Θ − d − ,k . Notice that the matrix D in (4.1) is dense, and hence we expect M ,k and M ,k in (4.6) to also be dense. On the other hand, M ,k remains sparse, and we know that l ≫ s . As a consequence, the Cholesky factors of the matrix in (4.5) would inevitablycontain dense blocks. Hence, it might be prohibitively expensive to compute sucha decomposition. Instead, we solve the previous system using a PCG method. Inorder to do so efficiently, we must find an approximation for the coefficient matrix in(4.5). Given the fact that M ,k is sparse, while M ,k and M ,k are dense, we wouldlike to find an approximation for the dense blocks. A possible approach would be toapproximate D by a low-rank matrix. Instead, based on the assumption l ≫ s , wecan approximate M k by the following block-diagonal preconditioner:(4.7) P k = (cid:20) M ,k s,l l,s M ,k (cid:21) , where , P − k = (cid:20) M − ,k s,l l,s M − ,k (cid:21) . Notice that M ,k does have a sparse Cholesky factor, due to the sparsity displayedin the discrete anisotropic TV matrix L . On the other hand, the Cholesky factors of M ,k are dense. However, computation and storage of these dense factors is possible,as we only need to perform O ( s ) operations, and store O ( s ) elements. Spectral Analysis . Let us now further support the choice of the preconditionerin (4.7) by performing a spectral analysis of the preconditioned system R k = P − k M k .In the following, we write A × B to denote a vector space whose elements are vectors[ a ⊤ , b ⊤ ] ⊤ with a ∈ A and b ∈ B . Theorem
Let D ∈ R s × q be the matrix in (4.1) . Let also M k ∈ R ( s + l ) × ( s + l ) be the matrix defined in (4.5) and P k the preconditioner defined in (4.7) . Then, thepreconditioned matrix R k = P − k M k has l − rank( D ) eigenvalues λ = 1 , whose respec-tive eigenvectors form a basis for { s } × { Null ( M ⊤ ,k ) } . All the remaining eigenvaluesof the preconditioned matrix satisfy λ ∈ ( χ, ∪ (1 , , where χ = δ k ρ k σ ( A )+ ρ k δ k , δ k , ρ k are the regularization parameters of IP-PMM and A is defined in (4.4) .Proof. Let us consider the following generalized eigenproblem: M k p = λP k p. We partition the eigenvector p as p = [ p ⊤ s , p ⊤ l ] ⊤ . Using (4.5), and (4.7), the eigen-6 V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA problem can be written as: p s + M − ,k M ⊤ ,k p l = λp s M − ,k M ,k p s + p l = λp l . (4.8)From (4.6) and (4.7) we know that M k ≻ P k ≻ λ >
0. Let usseparate the analysis in two cases.
Case 1: λ = . This is the case for every p such that p ∈ S k = { s } × { Null( M ⊤ ,k ) } , which trivially satisfies (4.8) for λ = 1. Upon noticing thatdim( S k ) = dim(Null( M ⊤ ,k )) = l − rank( M ⊤ ,k ) , we can conclude that R k = P − k M k has an eigenvalue λ = 1 of multiplicity l − rank( M ⊤ ,k ). The respective eigenvectors form a basis of S k . Notice also, from (4.6),that rank( M ⊤ ,k ) = rank( D ) ≤ s . Case 2: λ = . There are exactly s + rank( D ) such eigenvalues. In order toanalyze this case, we have to consider two generalized eigenproblems. On the onehand, from the first block equation in (4.8), we have that: p s = 1 λ − M − ,k M ⊤ ,k p l , and hence, substituting this in the second block equation of (4.8) gives the followinggeneralized eigenproblem:(4.9) G l,k p l = νM ,k p l , where G l,k = M ,k M − ,k M ⊤ ,k and ν = ( λ − . By assumption λ = 1, and hence λ = ±√ ν + 1. However, we have M k ≻ M ,k − M ,k M − ,k M ⊤ ,k ≻
0. Letus assume that the maximum eigenvalue of M − ,k G l,k is greater than or equal to 1, i.e. ν max ≥
1. By substituting ν max in (4.9) and multiplying both sides of the previousinequality by p ⊤ l , we get p ⊤ l G l,k p l = ν max p ⊤ l M ,k p l and hence p ⊤ l ( G l,k − M ,k ) p l ≥ . The latter contradicts the fact that M ,k − G l,k ≻
0, and thus ν max <
1. In otherwords, λ ∈ (0 , ∪ (1 , p l = 1 λ − M − ,k M ,k p s , and substituting this in the first block equation in (4.8) yields G s,k p s = νM ,k p s , where G s,k = M ⊤ ,k M − ,k M ,k and ν = ( λ − . As before, we have that λ = ±√ ν + 1.Using the fact that M ,k − M ⊤ ,k M − ,k M ,k ≻
0, we can mirror the previous analysisto conclude that ν max <
1, and hence λ ∈ (0 , ∪ (1 , PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS ρ k and δ k respectively, are bounded away from zero, so are the eigenvaluesof R k = P − k M k , independently of the iteration k of the algorithm. In particular, wehave that λ min ( R k ) ≥ λ min ( M k ) λ max ( P k ) ≥ δ k ρ k σ ( A ) + ρ k δ k = χ, where we used the fact that λ min ( M k ) ≥ δ k and λ max ( P k ) ≤ σ ( A ) ρ k + δ k , where A isdefined in (4.4). The preconditioner (4.7) may be com-puted (and applied) very efficiently as we expect the Cholesky factor of M ,k topreserve sparsity and M ,k ∈ R s × s to be relatively small (recall that s ≪ l ). Wededuce from Theorem 4.1 that the preconditioner defined in (4.7) remains effective aslong as the regularization parameters ρ k and δ k are not too small. However, to attainconvergence of IP-PMM ρ k and δ k have to be reduced and then, due to the natureof IPMs, the matrix in (2.10) becomes increasingly ill conditioned as the method ap-proaches the optimal solution. This implies that the preconditioner defined in (4.7).has only a limited applicability. In particular, this means that there is a limited scopefor refining it and we may not be able to prevent degrading behaviour of PCG whenIPM gets very close to the optimal solution.However, we notice that the optimal solution of problem (4.2) is expected to besparse. Like in the portfolio optimization problem, in light of the reformulation (4.3),we know that most of the primal variables x converge to zero. Close to optimalitythe presence of such variables would adversely affect the conditioning of the matrixin (2.10). To prevent that, we employ a heuristic similar to the one introduced inSection 3.1.1 which consists of eliminating variables which approach zero and have anassociated Lagrange multiplier bounded away from zero. Given ǫ drop > ξ > V = ∅ , at every iteration k of IP-PMM, we add to V each variable j ∈ I \ V satisfyingcondition (3.5), and replace (2.10) with the reduced system(4.10) (cid:18) A H , G (cid:16) Q G , G + (Θ G , G k ) − + ρ k I |G| (cid:17) − ( A H , G ) ⊤ + δ k I m (cid:19) ∆ y = r ,k + A H , G (cid:16) Q G , G + (Θ G , G k ) − + ρ k I |G| (cid:17) − r G ,k , where H = { , . . . , m } , G = F ∪ ( I \ V ) (assuming (Θ F , F k ) − = 0 s,s ), and r ,k , r ,k are defined in (2.9) (with constraint matrix A H , G ). We consider a dataset consisting of fMRIscans for 16 male healthy US college students (age 20 to 25), with the aim of analyzingtwo active conditions: viewing unpleasant and pleasant images [44]. The preprocessedand registered data, available from https://github.com/lucabaldassarre/neurosparse,consist of 1344 scans of size 122 ,
128 voxels (only voxels with probability greater than0.5 of being in the gray matter are considered), with 42 scans considered per subjectand active condition (i.e., 84 scans per subject in total).In order to assess the performance of the IP-PMM on this type of problems,we carry out a comparison with two state-of-the-art algorithms for the solution ofproblem (4.2): • FISTA. As done for the tests in [4], problem (4.2) is reformulated asmin w s k Dw − ˆ y k + k ˆ Lw k , V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA where ˆ L = (cid:2) τ I q τ L ⊤ (cid:3) ⊤ , and solved by a version of FISTA [6] in which theproximal operator associated with k ˆ Lw k is approximated by 10 steps of aninner FISTA cycle. • ADMM. We consider the ADMM method [11] applied to the problemmin w,u,d s k Dw − ˆ y k + τ k u k + τ k d k , s.t. w − u = 0 q ,Lw − d = 0 l , in which the minimization of the quadratic function associated with the up-date of w is approximated by 10 steps of the CG algorithm.In Table 3 we show the results obtained by applying the algorithms to the solutionof the fMRI data classification problem. For each choice of the pair of regularizationparameters ( τ , τ ), we report the average results obtained in a Leave-One-Subject-Out (LOSO) cross-validation test over the full dataset of patients. Because of thissetting, for each problem the size of w is q = 122 , D is s = 1260, and the dimension of d = Lw is l = 339 , τ = τ appeared the most appropriate.Furthermore, for the IP-PMM, the parameters ǫ drop and ξ controlling the heuristicdescribed in Section 4.1.2 are set to 10 − and 10 , respectively. To perform a faircomparison between the three algorithms, we consider a stopping criterion based onthe execution time, which, after some preliminary tests, is fixed to 30 minutes. Thesolution of the normal equations system (4.10) is computed by the MATLAB PCGfunction, for which we set the maximum number of iterations to 2000 and the toleranceas tol = ( − if k r y,k k < , max n − , − k r y,k k o otherwise,where r y,k is the right-hand side of equation (4.10).For each algorithm tested, we report the mean and the standard deviation forthree quality measures of the solution: classification accuracy (ACC), solution density (DEN) and corrected pairwise overlap (CORR OVR) (see [4, Section 2.3.3]). Thelatter, which is maybe the less common in the field of machine learning, is meantto measure the “stability” of the voxel selection. The three metrics are computedafter thresholding the solution, as in [4]: after sorting the entries by their increasingmagnitude, we set to zero the entries contributing at most to 0 .
01% of the ℓ -normof the solution.By looking at Table 3, one can see that IP-PMM appears to be generally betterthan the other algorithms in enforcing the structured sparsity of the solution, pre-senting a good level of sparsity and overlap. It is worth noting that, because of itsdefinition, the corrected pairwise overlap tends to zero as the density goes towards100%. Hence, for ADMM, which seems to be unable to enforce sparsity in the solu-tion, the overlap is close to zero. As suggested in [4], one can evaluate the results interms of the distance of the pair (ACC, CORR OVR) from the pair (100 , τ = τ = 10 − , for which the average accuracy is 82 .
3% and the corrected overlap is82 .
6% with an average solution density of 2 . PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS Table 3
Comparison of IP-PMM, FISTA and ADMM in terms of the LOSO cross-validation scores
Algorithm τ = τ ACC DEN CORR OVR
IP-PMM 10 − . ± .
11 20 . ± .
63 43 . ± . · − . ± .
80 3 . ± .
84 62 . ± . − . ± .
22 2 . ± .
34 82 . ± . − . ± .
01 88 . ± .
71 5 . ± . · − . ± .
92 19 . ± .
86 65 . ± . − . ± .
58 5 . ± .
44 80 . ± . − . ± .
91 98 . ± .
03 0 . ± . · − . ± .
37 97 . ± .
05 0 . ± . − . ± .
51 97 . ± .
19 0 . ± . Fig. 1 . History of classification accuracy, solution density and corrected pairwise overlap forIP-PMM (left) and FISTA (right), in the case τ = τ = 10 − . For the three quantities we reportaverage measures with confidence intervals. of FISTA on the problem where the two methods reach the best scores, i.e., with τ = τ = 10 − . For all the 16 instances of the LOSO cross validation, we store thecurrent solution of each algorithm after every minute and, at the end of the execution,we compute the three quality measures for such intermediate solutions. The resultsare shown in Figure 1 in terms of history of the mean values (lines) together withtheir 95% confidence intervals (shaded regions). From the plots we can see that whileFISTA reaches the measures reported in Table 3 at the end of the 30-minute run, theperformance of IP-PMM stabilizes after about 20 minutes.
5. TV-based Poisson Image Restoration.
Next we consider the restorationof images corrupted by Poisson noise, which arises in many applications, such asfluorescence microscopy, computed tomography (CT) and astronomical imaging (see,e.g., [25] and the references therein). In the discrete formulation of the restorationproblem, the object to be restored is represented by a vector w ∈ R n and the measureddata are assumed to be a vector g ∈ N m , whose entries g j are samples from m independent Poisson random variables G j with probability P ( G j = g j ) = e − ( Dw + a ) j (cid:2) ( Dw + a ) j (cid:3) g j g j ! , V. DE SIMONE, D. DI SERAFINO, J. GONDZIO, S. POUGKAKIOTIS, AND M. VIOLA where a ∈ R m + models the background radiation detected by the sensors. The matrix D = ( d ij ) ∈ R m × n models the functioning of the imaging system and satisfies d ij ≥ i, j, m X i =1 d ij = 1 for all j. Here we assume that D represents a convolution operator with periodic boundary con-ditions, which implies that D has a Block-Circulant structure with Circulant Blocks(BCCB). Hence, Dw is computed expeditiously using the 2-dimensional Fast FourierTransform (FFT). The maximum-likelihood approach [9] for the estimation of u leadsto the minimization of the Kullback-Leibler (KL) divergence of Dw + a from g :(5.1) D KL ( w ) ≡ D KL ( Dw + a, g ) = m X j =1 (cid:18) g j ln g j ( Dw + a ) j + ( Dw + a ) j − g j (cid:19) , where we set g j ln( g j / ( Dw + a ) j ) = 0 if g j = 0 (we implicitly assume that g has beenconverted into a real vector with entries ranging in the same interval as the entriesof w ). Since the estimation problem is highly ill conditioned, a regularization termis added to (5.1). We consider the Total Variation (TV) [50], which has receivedconsiderable attention because of its ability of preserving edges and smoothing flatareas of the images. Notice that, while it may introduce staircase artifacts, TV isstill applied in many medical and biological applications (see, e.g., [5, 43, 62] andhttp://ranger.uta.edu/ ∼ huang/R CSMRI.htm). The feasible set of the problem isdefined by non-negativity constraints on the image intensity and the linear constraint P ni =1 w i = P mj =1 ( g j − a j ) ≡ r which guarantees preservation of the total intensity ofthe image.The resulting model is(5.2) min w D KL ( w ) + λ k Lw k s.t. e ⊤ n w = r,w ≥ , where L ∈ R l × n is the matrix arising from the discretization of the TV functional (asin [15]). By employ-ing the splitting strategy used in the previous sections, we can transform problem(5.2) to the following equivalent form:(5.3) min x f ( x ) ≡ D KL ( w ) + c ⊤ u, s.t. Ax = b,x ≥ , where, after introducing the additional constraint d = Lw , and letting m = l + 1, n = n + 2 l , we set x = [ w ⊤ , u ⊤ ] ⊤ ∈ R n , u = [( d + ) ⊤ , ( d − ) ⊤ ] ⊤ ∈ R l , c = λ e l , b = [ r, ⊤ l ] ⊤ ∈ R m , and A = (cid:20) e ⊤ n ⊤ l ⊤ l L − I l I l (cid:21) ∈ R m × n . We solve problem (5.3) by using IP-PMM combined with a perturbed compositeNewton method [53]. Following the presentation in Section 2.2, we know that at the
PARSE APPROXIMATIONS WITH INTERIOR POINT METHODS k -th iteration of the method we have to solve two linear systems of the form of (2.8).In order to avoid factorizations, every such system is solved using the preconditionedMINimum RESidual (MINRES) method. In order to accelerate the convergence ofMINRES, we employ a block-diagonal preconditioner, which uses a diagonal approxi-mation of ∇ f ( x ). More specifically, at iteration k of IP-PMM, we have the followingcoefficient matrix: M k = (cid:20) − H k A ⊤ A δ k I m (cid:21) , where H k = ( ∇ f ( x k ) + Θ − k + ρ k I n ), and we precondition it using the matrix(5.4) f M k = " e H k n,m m,n A e H − k A ⊤ + δ k I m , where e H k is a diagonal approximation of H k . In order to analyze the spectral proper-ties of the preconditioned matrix, we follow the developments in [8]. More specifically,we define b H k := e H − k H k e H k , and let: α H = λ min ( b H k ) , β H = λ max ( b H k ) , κ H = β H α H . Using this notation, we know that an arbitrary element of the numerical range of thismatrix is represented as γ H ∈ W ( b H k ) = [ α H , β H ]. Furthermore, we observe that inthe special case where e H k = diag( H k ), we have α H ≤ ≤ β H since1 n + 2 l n +2 l X i =1 λ i ( b H − k H k,j ) = 1 n + 2 l Tr( ˆ H − k H k ) = 1 . Theorem
Theorem 5.1. Let k be an arbitrary IP-PMM iteration. Then, the eigenvalues of M̃_k^{−1} M_k lie in the union of the following intervals:

I_− = [ −β_H − 1, −α_H ],    I_+ = [ 1/(1 + β_H), 1 ].
Proof. The proof follows exactly the developments in [8, Theorem 3.3].

In problem (5.3), f(x) = D_KL(w) + c^⊤ u and hence

∇f(x) = [ ∇D_KL(w) ; c ],    ∇²f(x) = [ ∇²D_KL(w)  0_{n,2l} ;  0_{2l,n}  0_{2l,2l} ],

where

∇D_KL(w) = D^⊤ ( e_m − g/(Dw + a) ),    ∇²D_KL(w) = D^⊤ U(w)² D,    with  U(w) = diag( √g / (Dw + a) ).

Here the square root and the ratio are assumed to be component-wise. Notice that D might be dense; however, as previously noted, its action can be computed via the FFT. Unfortunately, D^⊤ U(w)² D is not expected to be close to multilevel circulant. Even if it could be well-approximated by a multilevel circulant matrix, the scaling matrix of IP-PMM would destroy this structure. In other words, we use the structure of D only when applying it to a vector. As a result, we only store the first column of D and we use the FFT to apply this matrix to a vector. This allows us to compute the action of the Hessian easily.
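A minimal NumPy sketch of this matrix-free computation, assuming a square image and a point spread function given on the image grid with its centre at the middle pixel, is the following (function names are illustrative):

import numpy as np

def make_blur_operator(psf):
    # BCCB blurring operator with periodic boundary conditions: returns
    # functions computing D @ w and D^T @ v via the 2-D FFT.
    # Illustrative sketch, not the implementation used in the experiments.
    shape = psf.shape
    psf_hat = np.fft.fft2(np.fft.ifftshift(psf))        # eigenvalues of D
    D  = lambda w: np.real(np.fft.ifft2(psf_hat * np.fft.fft2(w.reshape(shape)))).ravel()
    Dt = lambda v: np.real(np.fft.ifft2(np.conj(psf_hat) * np.fft.fft2(v.reshape(shape)))).ravel()
    return D, Dt

def kl_grad_and_hessvec(w, D, Dt, a, g):
    # Gradient of D_KL and the action of its Hessian D^T U(w)^2 D on a vector.
    z = D(w) + a
    grad = Dt(1.0 - g / z)                               # D^T (e_m - g/(Dw + a))
    u2 = g / z**2                                        # diagonal of U(w)^2
    return grad, (lambda v: Dt(u2 * D(v)))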
Remark 5.2. A natural choice would be H̃_k = diag(H_k), but the structure of the problem makes this choice rather expensive. A more efficient alternative is to use U(w_k)² in place of diag(∇²D_KL(w_k)) when forming H̃_k, which is easier to compute and, as we will see in the following section, leads to good reconstruction results in practice.

To evaluate the performance of the IP-PMM on this class of problems, we consider a set of three 256 × 256 grayscale images, which are presented in Figure 2.
Fig. 2. The three 256 × 256 grayscale images of the image restoration tests: cameraman, house, and peppers.
For each of the three images we set up three restoration tests, where the images are corrupted by Poisson noise and D represents one of the following blurs: Gaussian blur (GB), motion blur (MB), and out-of-focus blur (OF).

We compare the proposed method with the state-of-the-art Primal-Dual Algorithm with Linesearch (PDAL) method proposed in [39]. By following the example of [59, Algorithm 2], problem (5.2) is reformulated as

(5.5)    min_w max_{p,y}  g^⊤ ln(1 + y) − y^⊤ (Dw + a) − λ w^⊤ L^⊤ p + χ_∞(p) + χ_F(w),

where χ_∞ is the characteristic function of the ∞-norm unit ball and χ_F the characteristic function of the feasible set F of problem (5.2). It is worth noting that the PDAL algorithm for the solution of problem (5.5) requires at each step a projection on the feasible set F, which is performed here by using the secant algorithm proposed by Dai and Fletcher in [22]. Concerning the parameters of PDAL, we use the same notation and tuning as in [39], following Section 6 of that paper (in particular, β = 25). The initial steplength τ is computed from an estimate ω of ‖M^⊤ M‖, where M = [D^⊤, L^⊤]^⊤ is the matrix linking the primal and dual variables. In the IP-PMM, we use the MINRES code by Michael Saunders and co-workers, available from https://web.stanford.edu/group/SOL/software/minres/, for which we fix the relative tolerance and set the maximum number of iterations at each call equal to 20. The regularization parameter λ is determined by trial and error to minimize the Root Mean Square Error (RMSE) obtained by IP-PMM. For all the problems, the starting point is chosen to be the noisy and blurry image, i.e., g.
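For reference, the projection onto F = {w ≥ 0, e_n^⊤ w = r} mentioned above can be sketched as follows; here plain bisection on the multiplier is used instead of the secant scheme of [22], and the names are illustrative.

import numpy as np

def project_onto_F(y, r, tol=1e-10, max_iter=100):
    # Projection of y onto F = {w >= 0, sum(w) = r}, assuming r > 0: find lam
    # such that sum(max(y - lam, 0)) = r, then return w = max(y - lam, 0).
    # Plain bisection is used here in place of the secant scheme of [22].
    phi = lambda lam: np.maximum(y - lam, 0.0).sum() - r
    lo, hi = y.min() - r / y.size, y.max()               # bracket: phi(lo) >= 0 >= phi(hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return np.maximum(y - 0.5 * (lo + hi), 0.0)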
For all 9 tests we run 20 iterations of the IP-PMM method and let PDAL run for the same amount of time. In Figure 3 we report a comparison between the two algorithms in terms of elapsed time versus Root Mean Square Error (RMSE) in the solution of the 9 instances described above.

Fig. 3. Comparison between IP-PMM and PDAL in terms of Root Mean Square Error (RMSE) vs execution time in the solution of the 9 image restoration problems. From top to bottom, the rows refer to the cameraman instances, the house instances and the peppers instances, respectively. From left to right, the columns refer to the GB, MB and OF, respectively.
As can be seen from the plots, the IP-PMM clearly outperforms PDAL on the instances with GB and OF (columns 1 and 3, respectively, of Figure 3), while on the instances characterized by MB the two algorithms perform comparably.
Table 4
Comparison between IP-PMM and PDAL in terms of RMSE, PSNR and MSSIM computed at the solutions provided by the two algorithms.
                          IP-PMM                              PDAL
Problem            RMSE       PSNR       MSSIM        RMSE       PSNR       MSSIM
cameraman - GB     4.85e-02   2.63e+01   8.33e-01     5.02e-02   2.60e+01   8.22e-01
cameraman - MB                2.52e+01   8.11e-01     5.59e-02   2.51e+01   7.77e-01
cameraman - OF                2.58e+01   7.98e-01     5.26e-02   2.56e+01   7.62e-01
house - GB                    2.03e+01   7.51e-01     9.88e-02   2.01e+01   6.92e-01
house - MB                    3.14e+01   8.67e-01     2.77e-02   3.11e+01   8.43e-01
house - OF                    2.84e+01   8.33e-01     4.09e-02   2.78e+01   7.70e-01
peppers - GB                  1.82e+01   7.46e-01     1.25e-01   1.81e+01   6.57e-01
peppers - MB                  2.12e+01   8.90e-01     8.78e-02   2.11e+01   8.72e-01
peppers - OF                  2.05e+01   8.01e-01     9.70e-02   2.03e+01   6.60e-01
To better analyze the difference between the solutions provided by the two algorithms, one can look at Table 4, where we report the value of three scores: RMSE, Peak Signal-to-Noise Ratio (PSNR), and Mean Structural SIMilarity (MSSIM) [25]. It is worth noting that for RMSE smaller values are better, while for PSNR and MSSIM, higher values indicate better noise removal and perceived similarity between the restored and original image, respectively. From the table it is clear that in all the considered cases IP-PMM is able to produce a better restored image than PDAL, having always a larger MSSIM, also when the RMSE and PSNR values are comparable.

For the sake of space, we now restrict the comparison to the cases where the two algorithms seem to have reached equivalent solutions in terms of RMSE, to understand the differences in the restored images. We focus on the three instances in which D represents MB (second column of Figure 3). In Figure 4 we report the results for cameraman, house and peppers with MB.

Fig. 4. Results on cameraman, house and peppers with MB: noisy and blurry images (left), images restored by IP-PMM (center), images restored by PDAL (right).

By looking at the images one can see that those reconstructed by IP-PMM appear to be smoother (look, for example, at the sky in cameraman and house), which somehow indicates that the IP-PMM is better than PDAL at enforcing the TV regularization. Observe that this "visual" difference is reflected by the higher values of MSSIM reported for IP-PMM in Table 4.
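For completeness, RMSE and PSNR are computed as in the following sketch, with intensities scaled so that the peak value is 1 (MSSIM is computed as in [25], e.g., via scikit-image's structural_similarity); the function name is illustrative.

import numpy as np

def rmse_psnr(restored, original, peak=1.0):
    # Root Mean Square Error and Peak Signal-to-Noise Ratio (in dB) between a
    # restored image and the ground truth; illustrative sketch only.
    rmse = np.sqrt(np.mean((restored - original) ** 2))
    return rmse, 20.0 * np.log10(peak / rmse)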
6. Linear Classification via Regularized Logistic Regression.
Finally, we deal with the problem of training a linear binary classifier. Let us consider a matrix D ∈ R^{n×s} whose rows d_i^⊤, with i ∈ {1, ..., n}, represent the training points, and a vector of labels g ∈ {−1, 1}^n. In other words, we have a training set with n binary-labeled samples and s features. According to the logistic model, the conditional probability of having the label g_i given the point d_i has the form

p_log(w)_i = P(g_i | d_i) = 1 / ( 1 + e^{−g_i d_i^⊤ w} ),

where w ∈ R^s is the vector of weights determining the unbiased linear model under consideration. By following the maximum-likelihood approach, the weight vector w can be obtained by maximizing the log-likelihood function or, equivalently, by minimizing the logistic loss function, i.e., by solving

min_w  φ(w) ≡ (1/n) Σ_{i=1}^n φ_i(w),    φ_i(w) = log( 1 + e^{−g_i d_i^⊤ w} ).

To cope with the inherent ill-posedness of the estimation process, a regularization term is usually added to the previous model. For large-scale instances, where the features tend to be redundant, an ℓ_1-regularization term is usually introduced to enforce sparsity in the solution, thus embedding feature selection in the training process. This results in the well-studied ℓ_1-regularized logistic regression model:

(6.1)    min_w  φ(w) + τ ‖w‖_1,

where τ > 0 is the regularization parameter. By proceeding as in the previous sections, problem (6.1) can be transformed into the equivalent form

(6.2)    min_x  f(x) ≡ φ(w) + c^⊤ u,    s.t.  Ax = b,  u ≥ 0,

where, after introducing the splitting w = d^+ − d^−, with u = [(d^+)^⊤, (d^−)^⊤]^⊤ ∈ R^{2s}, and letting m̄ = s, n̄ = 3s, we set x = [w^⊤, u^⊤]^⊤ ∈ R^{n̄}, c = τ e_{2s}, b = 0_{m̄}, and A ∈ R^{m̄×n̄} defined as A = [ I_s  −I_s  I_s ]. The version of IP-PMM solving problem (6.2) is very similar to the one used to solve (5.3). The only difference here lies in the preconditioner. In particular, when solving problems of the form (6.2), we use the preconditioner defined in (5.4) (and subsequently analyzed in Theorem 5.1), but we set H̃_k = diag(H_k).

The gisette, rcv1 and real-sim datasets used in our experiments are taken from the LIBSVM collection (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). The names of the datasets, together with their number of features, training points and testing points are summarized in Table 5. For real-sim there is no predetermined separation of data between train and test, hence we apply a hold-out strategy keeping 30% of the data for testing.
Table 5
Characteristics of the ℓ_1-regularized logistic regression problems.

Problem     Features    Train pts    Test pts
gisette     5000        6000         1000
rcv1        47,236      20,242       677,399
real-sim    20,958      50,617       21,692
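To make the quantities used by IP-PMM in this setting concrete, a minimal sketch of the logistic loss, its gradient, and the diagonal of its Hessian (the ingredient needed for H̃_k = diag(H_k)) is given below; the names are illustrative, and the barrier and proximal terms of H_k are added separately.

import numpy as np
import scipy.sparse as sp

def logistic_parts(w, D, g):
    # Logistic loss phi(w) of (6.1), its gradient and diag(grad^2 phi(w)), for a
    # (sparse) data matrix D whose rows are training points and labels g in {-1,+1}.
    # Illustrative sketch; barrier/proximal terms of H_k are added to the diagonal separately.
    n = D.shape[0]
    t = g * (D @ w)                                  # margins g_i d_i^T w
    loss = np.mean(np.logaddexp(0.0, -t))            # (1/n) sum_i log(1 + exp(-t_i))
    sig = np.exp(-np.logaddexp(0.0, -t))             # sigma(t_i), computed stably
    grad = -(D.T @ (g * (1.0 - sig))) / n
    d2 = sig * (1.0 - sig)
    D2 = D.multiply(D) if sp.issparse(D) else D * D  # entry-wise square of D
    return loss, grad, (D2.T @ d2) / n               # loss, gradient, Hessian diagonal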
To overcome the absence of the hyperplane bias in model (6.1), we add to the data matrices a further column with all ones, hence the resulting size of the problems is equal to s + 1. For all the problems we set τ = 1/n, which is a standard choice in the literature.

To assess the effectiveness and efficiency of the proposed method we compare it with two state-of-the-art methods:
• the Alternating Direction Method of Multipliers (ADMM) [11], in the MATLAB implementation available from https://web.stanford.edu/~boyd/papers/distr_opt_stat_learning_admm.html;
• a MATLAB implementation of the newGLMNET method [60] used in LIBSVM, developed by the authors of [61] and available from https://github.com/ZiruiZhou/IRPN.

As in the tests presented in Section 5.2, the solution of the augmented system in IP-PMM is performed by means of the MINRES implementation by Michael Saunders' team, with maximum number of iterations equal to 20 and a fixed relative tolerance. We compare the three algorithms in terms of objective function value and classification error versus execution time, on runs lasting 15 seconds. The plots are reported in Figure 5. The IP-PMM is comparable with newGLMNET on the gisette instance, which is characterized by a very dense data matrix.

Fig. 5. Results on the three ℓ_1-regularized logistic regression problems (objective function value and classification error versus execution time).
7. Conclusions.
We have presented specialized IPMs for quadratic and general convex nonlinear optimization problems that model various sparse approximation instances. We have shown that by a proper choice of linear algebra solvers, which are a key issue in IPMs, we are able to efficiently solve the larger but smooth optimization problems coming from a standard reformulation of the original ones. This confirms the ability of IPMs to handle large sets of linear and non-negativity constraints. Computational experiments have been performed on diverse applications: multi-period portfolio selection, classification of fMRI data, restoration of blurry images corrupted by Poisson noise, and linear binary classification via regularized logistic regression. Comparisons with state-of-the-art first-order methods, which are widely used to tackle sparse approximation problems, have provided evidence that the presented IPM approach can offer a noticeable advantage over those methods, especially when dealing with not-so-well conditioned problems.

We also believe that the results presented in this work may provide a basis for an in-depth analysis of the application of IPMs to many sparse approximation methods, and we plan to work in that direction in the future.
REFERENCES

[1] A. Altman and J. Gondzio, Regularized symmetric indefinite systems in interior point methods for linear and quadratic optimization, Optimization Methods and Software, 11–12 (1999), pp. 275–302, https://doi.org/10.1080/10556789908805754.
[2] A. Argyriou, L. Baldassarre, C. A. Micchelli, and M. Pontil, On sparsity inducing regularization methods for machine learning, in Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, B. Schölkopf, Z. Luo, and V. Vovk, eds., Berlin, Heidelberg, 2013, Springer, pp. 205–216, https://doi.org/10.1007/978-3-642-41136-6_18.
[3] L. Baldassarre, J. Mourão-Miranda, and M. Pontil, Structured sparsity models for brain decoding from fMRI data, in 2012 Second International Workshop on Pattern Recognition in NeuroImaging, July 2012, pp. 5–8, https://doi.org/10.1109/PRNI.2012.31.
[4] L. Baldassarre, M. Pontil, and J. Mourão-Miranda, Sparsity is better with stability: Combining accuracy and stability for model selection in brain decoding, Frontiers in Neuroscience, 11 (2017), https://doi.org/10.3389/fnins.2017.00062.
[5]
R. C. Barnard, H. Bilheux, T. Toops, E. Nafziger, C. Finney, D. Splitter, andR. Archibald , Total variation-based neutron computed tomography , Review of ScientificInstruments, 89 (2018), p. 053704, https://doi.org/10.1063/1.5037341.[6]
A. Beck and M. Teboulle , A fast iterative shrinkage-thresholding algorithm for linear inverseproblems , SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202, https://doi.org/10.1137/080716542.[7]
S. Bellavia , Inexact interior-point method , Journal of Optimization Theory and Applications,96 (1998), pp. 109–121, https://doi.org/10.1023/A:1022663100715.[8]
L. Bergamaschi, J. Gondzio, A. Martínez, J. W. Pearson, and S. Pougkakiotis, A new preconditioning approach for an interior point-proximal method of multipliers for linear and convex quadratic programming, Numerical Linear Algebra with Applications, p. e2361, https://doi.org/10.1002/nla.2361.
[9] M. Bertero, P. Boccacci, G. Desiderà, and G. Vicidomini, Image deblurring with Poisson data: from cells to galaxies, Inverse Problems, 25 (2009), p. 123006, https://doi.org/10.1088/0266-5611/25/12/123006.
[10]
D. P. Bertsekas , Nonlinear programming , Athena Scientific Optimization and ComputationSeries, Athena Scientific, Belmont, MA, second ed., 1999.[11]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein , Distributed optimization andstatistical learning via the alternating direction method of multipliers , Foundations andTrends in Machine Learning, 3 (2011), pp. 1–122, https://doi.org/10.1561/2200000016.[12]
S. Cafieri, M. D’Apuzzo, V. De Simone, and D. di Serafino , On the iterative solutionof KKT systems in potential reduction software for large-scale quadratic problems , Com-putational Optimization and Applications, 38 (2007), pp. 27–45, https://doi.org/10.1007/s10589-007-9035-y.[13]
S. Cafieri, M. D’Apuzzo, V. De Simone, D. di Serafino, and G. Toraldo , ConvergenceAnalysis of an Inexact Potential Reduction Method for Convex Quadratic Programming ,Journal of Optimization Theory and Applications, 135 (2007), pp. 355–366, https://doi.org/10.1007/s10957-007-9264-3.[14]
E. J. Candès, J. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics, 59 (2006), pp. 1207–1223, https://doi.org/10.1002/cpa.20124.
[15]
A. Chambolle , An algorithm for total variation minimization and applications , Journal ofMathematical Imaging and Vision, 20 (2004), pp. 89–97, https://doi.org/10.1023/B:JMIV.0000011320.81911.38.[16]
S. S. Chen, D. L. Donoho, and M. A. Saunders , Atomic decomposition by basis pursuit ,SIAM Review, 43 (2001), pp. 129–159, https://doi.org/10.1137/S003614450037906X.[17]
Z.-p. Chen, G. Li, and J.-e. Guo , Optimal investment policy in the time consistent mean-variance formulation , Insurance: Mathematics & Economics, 52 (2013), pp. 145–156,https://doi.org/10.1016/j.insmatheco.2012.11.007.[18]
S. Corsaro and V. De Simone , Adaptive l -regularization for short-selling control in portfolioselection , Computational Optimization and Applications, 72 (2019), pp. 457–478, https://doi.org/10.1007/s10589-018-0049-4.[19] S. Corsaro, V. De Simone, and Z. Marino , Fused lasso approach in portfolio selection ,Annals of Operations Research, (2019), https://doi.org/10.1007/s10479-019-03289-w.[20]
S. Corsaro, V. De Simone, and Z. Marino , Split Bregman iteration for multi-periodmean variance portfolio optimization , Applied Mathematics and Computation, 392 (2021),pp. 125715, 10, https://doi.org/10.1016/j.amc.2020.125715.[21]
S. Corsaro, V. De Simone, Z. Marino, and F. Perla , l -regularization for multi-periodportfolio selection , Annals of Operations Research, 294 (2020), pp. 75–86, https://doi.org/10.1007/s10479-019-03308-w.[22] Y.-H. Dai and R. Fletcher , New algorithms for singly linearly constrained quadratic programssubject to lower and upper bounds , Mathematical Programming, 106 (2006), pp. 403–421,https://doi.org/10.1007/s10107-005-0595-2.[23]
M. D’Apuzzo, V. De Simone, and D. di Serafino , On mutual impact of numerical lin-ear algebra and large-scale optimization with focus on interior point methods , Computa-tional Optimization and Applications, 45 (2010), pp. 283–310, https://doi.org/10.1007/s10589-008-9226-1.[24]
V. De Simone, D. di Serafino, and M. Viola, A subspace-accelerated split Bregman method for sparse data recovery with joint ℓ_1-type regularizers, Electronic Transactions on Numerical Analysis, 53 (2020), pp. 406–425, https://doi.org/10.1553/etna_vol53s406.
[25] D. di Serafino, G. Landi, and M. Viola, ACQUIRE: an inexact iteratively reweighted norm approach for TV-based Poisson image restoration, Applied Mathematics and Computation, 364 (2020), pp. 124678, 23, https://doi.org/10.1016/j.amc.2019.124678.
[26]
D. di Serafino and D. Orban , Constraint-preconditioned Krylov solvers for regularizedsaddle-point systems , SIAM Journal on Scientific Computing, (2021). To appear.[27]
E. D. Dohmatob, A. Gramfort, B. Thirion, and G. Varoquaux , Benchmarking solversfor TV- ℓ , in 2014 InternationalWorkshop on Pattern Recognition in Neuroimaging, June 2014, pp. 1–4, https://doi.org/10.1109/PRNI.2014.6858516.[28] M. Dubois, F. Hadj-Selem, T. L¨ofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duch-esnay , Predictive support recovery with TV-Elastic Net penalty and logistic regression: Anapplication to structural MRI , in 2014 International Workshop on Pattern Recognition inNeuroimaging, June 2014, pp. 1–4, https://doi.org/10.1109/PRNI.2014.6858517.[29]
K. Fountoulakis and J. Gondzio , A second-order method for strongly convex ℓ -regularization problems , Mathematical Programming, 156 (2016), pp. 189–219, https://doi.org/10.1007/s10107-015-0875-4.[30] K. Fountoulakis, J. Gondzio, and P. Zhlobich , Matrix-free interior point method for com-pressed sensing problems , Mathematical Programming Computation, 6 (2014), pp. 1–31,https://doi.org/10.1007/s12532-013-0063-6.[31]
M. P. Friedlander and D. Orban , A primal-dual regularized interior-point method for con-vex quadratic programs , Mathematical Programming Computation, 4 (2012), pp. 71–107,https://doi.org/10.1007/s12532-012-0035-2.[32]
J. Gondzio , Interior point methods 25 years later , European Journal of Operational Research,218 (2012), pp. 587–601, https://doi.org/10.1016/j.ejor.2011.09.017.[33]
J. Gondzio , Convergence analysis of an inexact feasible interior point method for convexquadratic programming , SIAM Journal on Optimization, 23 (2013), pp. 1510–1527, https://doi.org/10.1137/120886017.[34]
J. Gondzio and M. Makowski , Solving a class of LP problems with a primal-dual logarith-mic barrier method , European Journal of Operational Research, 80 (1995), pp. 184–192,https://doi.org/https://doi.org/10.1016/0377-2217(93)E0323-P.[35]
J. Gondzio and G. Toraldo (eds.) , Linear algebra issues arising in interior point methods ,Special issue of Computational Optimization and Applications, 36 (2007), pp. 137–341.[36]
A. Gramfort, B. Thirion, and G. Varoquaux , Identifying predictive regions from fMRI withTV-L1 prior , in 2013 International Workshop on Pattern Recognition in Neuroimaging,June 2013, pp. 17–20, https://doi.org/10.1109/PRNI.2013.14.[37]
Y. Kamitani and F. Tong , Decoding the visual and subjective contents of the human brain ,Nature Neuroscience, 8 (2005), pp. 679–685, https://doi.org/10.1038/nn1444.[38]
D. Li and W. Ng , Optimal dynamic portfolio selection: Multiperiod mean-variance formula-tion , Mathematical Finance, 10 (2000), pp. 387–406, https://doi.org/10.1111/1467-9965.00100.[39]
Y. Malitsky and T. Pock , A first-order primal-dual algorithm with linesearch , SIAM Journalon Optimization, 28 (2018), pp. 411–432, https://doi.org/10.1137/16M1092015.[40]
H. M. Markowitz , Portfolio selection: Efficient diversification of investments , Cowles Foun-dation for Research in Economics at Yale University, Monograph 16, John Wiley & Sons,Inc., New York; Chapman & Hall, Ltd., London, 1959.[41]
S. Mehrotra , On the implementation of a primal-dual interior point method , SIAM Journalon Optimization, 2 (1992), pp. 575–601, https://doi.org/10.1137/0802028.[42]
V. Michel, A. Gramfort, G. Varoquaux, E. Eger, and B. Thirion , Total variation regu-larization for fMRI-based prediction of behavior , IEEE Transactions on Medical Imaging,30 (2011), pp. 1328–1340, https://doi.org/10.1109/TMI.2011.2113378.[43]
A. M. Mota, N. Oliveira, P. Almeida, and N. Matela, 3D total variation minimization filter for breast tomosynthesis imaging, in Breast Imaging, A. Tingberg, K. Lång, and P. Timberg, eds., Cham, 2016, Springer, pp. 501–509, https://doi.org/10.1007/978-3-319-41546-8_63.
[44] J. Mourão-Miranda, E. Reynaud, F. McGlone, G. Calvert, and M. Brammer, The impact of temporal compression and space selection on SVM analysis of single-subject and multi-subject fMRI data, NeuroImage, 33 (2006), pp. 1055–1065, https://doi.org/10.1016/j.neuroimage.2006.08.016.
[45]
Y. Nesterov and A. Nemirovskii , Interior-point polynomial algorithms in convex program-ming , vol. 13 of SIAM Studies in Applied Mathematics, Society for Industrial and AppliedMathematics (SIAM), Philadelphia, PA, 1994, https://doi.org/10.1137/1.9781611970791.[46]
S. Pougkakiotis and J. Gondzio, Dynamic non-diagonal regularization in interior point methods for linear and convex quadratic programming, Journal of Optimization Theory and Applications, 181 (2019), pp. 905–945, https://doi.org/10.1007/s10957-019-01491-1.
[47]
S. Pougkakiotis and J. Gondzio , An interior point-proximal method of multipliers for convexquadratic programming , Computational Optimization and Applications, 78 (2021), pp. 307–351, https://doi.org/10.1007/s10589-020-00240-9.[49]
M. J. Rosa, L. Portugal, T. Hahn, A. J. Fallgatter, M. I. Garrido, J. Shawe-Taylor, and J. Mourão-Miranda, Sparse network-based models for patient classification using fMRI, NeuroImage, 105 (2015), pp. 493–506, https://doi.org/10.1016/j.neuroimage.2014.11.021.
[50]
L. I. Rudin, S. Osher, and E. Fatemi , Nonlinear total variation based noise removal algo-rithms , Physica D, 60 (1992), pp. 259–268, https://doi.org/10.1016/0167-2789(92)90242-F.[51]
M. Saunders and J. A. Tomlin , Solving regularized linear programs using barrier methodsand KKT systems , Tech. Report SOL 96-4, Systems Optimization Laboratory, Departmentof Operations Research, Stanford University, Stanford, CA 94305, USA, December 1996.[52]
M. A. Saunders , Cholesky-based methods for sparse least squares: the benefits of regular-ization , in Linear and nonlinear conjugate gradient-related methods (Seattle, WA, 1995),SIAM, Philadelphia, PA, 1996, pp. 92–100.[53]
R. Tapia, Y. Zhang, M. Saltzman, and A. Weiser , The Mehrotra predictor-correctorinterior-point method as a perturbed composite Newton method , SIAM Journal on Op-timization, 6 (1996), pp. 47–56, https://doi.org/10.1137/0806004.[54]
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight , Sparsity and smoothness viathe fused lasso , Journal of the Royal Statistical Society: Series B (Statistical Methodology),67 (2005), pp. 91–108, https://doi.org/10.1111/j.1467-9868.2005.00490.x.[55]
J. A. Tropp and S. J. Wright , Computational methods for sparse solution of linear in-verse problems , Proceedings of the IEEE, 98 (2010), pp. 948–958, https://doi.org/10.1109/JPROC.2010.2044010.[56]
R. J. Vanderbei , Symmetric quasidefinite matrices , SIAM Journal on Optimization, 5 (1995),pp. 100–113, https://doi.org/10.1137/0805005.[57]
V. N. Vapnik , Statistical Learning Theory , John Wiley & Sons, New York, 1998.[58]
R. A. Waltz, J. L. Morales, J. Nocedal, and D. Orban , An interior algorithm for nonlinearoptimization that combines line search and trust region steps , Mathematical Programming,107 (2006), pp. 391–408, https://doi.org/10.1007/s10107-004-0560-5.[59]
Y.-W. Wen, R. H. Chan, and T.-Y. Zeng , Primal-dual algorithms for total variation basedimage restoration under Poisson noise , Science China Mathematics, 59 (2016), pp. 141–160, https://doi.org/10.1007/s11425-015-5079-0.[60]
G.-X. Yuan, C.-H. Ho, and C.-J. Lin , An improved GLMNET for L1-regularized logisticregression , Journal of Machine Learning Research, 13 (2012), pp. 1999–2030, http://jmlr.org/papers/v13/yuan12a.html.[61]
M.-C. Yue, Z. Zhou, and A. M.-C. So , A family of inexact SQA methods for non-smoothconvex minimization with provable convergence guarantees based on the Luo-Tseng errorbound property , Mathematical Programming, 174 (2019), pp. 327–358, https://doi.org/10.1007/s10107-018-1280-6.[62]
J. Zhang, Y. Hu, and J. G. Nagy , A scaled gradient method for digital tomographic imagereconstruction , Inverse Problems & Imaging, 12 (2018), p. 239, https://doi.org/10.3934/ipi.2018010.[63]