SILVar: Single Index Latent Variable Models
Jonathan Mei and José M. F. Moura
Abstract—A semi-parametric, non-linear regression model in the presence of latent variables is introduced. These latent variables can correspond to unmodeled phenomena or unmeasured agents in a complex networked system. This new formulation allows joint estimation of certain non-linearities in the system, the direct interactions between measured variables, and the effects of unmodeled elements on the observed system. The particular form of the model adopted is justified, and learning is posed as a regularized empirical risk minimization. This leads to classes of structured convex optimization problems with a "sparse plus low-rank" flavor. Relations between the proposed model and several common model paradigms, such as those of Robust Principal Component Analysis (PCA) and Vector Autoregression (VAR), are established. Particularly in the VAR setting, the low-rank contributions can come from broad trends exhibited in the time series. Details of the algorithm for learning the model are presented. Experiments demonstrate the performance of the model and the estimation algorithm on simulated and real data.
I. INTRODUCTION
How real is this relationship? This is a ubiquitous question that presents itself not only in judging interpersonal connections but also in evaluating correlations and causality throughout science and engineering. Two reasons for reaching incorrect conclusions based on observed relationships in collected data are chance and outside influences. For example, we can flip two coins that both show heads, or observe that today's temperature measurements on the west coast of the continental USA seem to correlate with tomorrow's on the east coast throughout the year. In the first case, we might not immediately conclude that the coins are related, since the number of flips we observe is not very large relative to the possible variance of the process, and the apparent link we observed is up to chance. In the second case, we still may hesitate to use west coast weather to understand and predict east coast weather, since in reality both are closely following a seasonal trend.

Establishing interpretable relationships between entities while mitigating the effects of chance can be achieved via sparse optimization methods, such as regression (Lasso) [1] and inverse covariance estimation [2]. In addition, the extension to time series via vector autoregression [3], [4] yields interpretations related to Granger causality [5]. In each of these settings, estimated nonzero values correspond to actual relations, while zeros correspond to absence of relations. However, we are often unable to collect data to observe all relevant variables, and this leads to observing relationships
that may be caused by common links with those unobserved variables. The hidden variables in this model are fairly general; they can possibly model underlying trends in the data, or the effects of a larger network on an observed subnetwork. For example, one year of daily temperature measurements across a country could be related through a graph based on geographical and meteorological features, but all exhibit the same significant trend due to the changing seasons. We have no single sensor that directly measures this trend. In the literature, a standard pipeline is to de-trend the data as a preprocessing step, and then estimate or use a graph to describe the variations of the data on top of the underlying trends [6]–[8]. Alternatively, attempts have been made to capture the effects of hidden variables via sparse plus low-rank optimization [9]. This has been extended to time series [10], and even to a non-linear setting via Generalized Linear Models (GLMs) [11]. What if the form of the non-linearity (link function) is not known? Regression using a GLM with an unknown link function is also known as a Single Index Model (SIM). Recent results have shown good performance when using SIMs for sparse regression [12].

So far, when choosing a model, current methods impose a fixed form for the (non-)linearity, assume the absence of any underlying trends, perform separate pre-processing or partitioning in an attempt to remove or otherwise explicitly handle such trends, or take some combination of these steps. To address all of these issues, we propose a model with a non-linear function applied to a linear argument that captures the effects of latent variables, which manifest as unmodeled trends in the data. Thus, we introduce the Single Index Latent Variable (SILVar) model, which uses the SIM in a sparse plus low-rank optimization setting to enable general, interpretable multi-task regression in the presence of unknown non-linearities and unobserved variables. That is, we propose the SILVar model not only to use for regression in difficult settings, but also as a tool for uncovering hidden relationships buried in data.

First, we establish notation and review prerequisites in Section II. Next, we introduce the SILVar model and discuss several paradigms in which it can be applied in Section III. Then, we detail the numerical procedure for learning the SILVar model in Section IV. Finally, we demonstrate the performance via experiments on synthetic and real data in Section VI.

This work was partially funded by NSF grants CCF 1011903 and CCF 1513936. This paper appears in: IEEE Transactions on Signal Processing, online March 21, 2018. Print ISSN: 1053-587X, Online ISSN: 1941-0476. Digital Object Identifier: 10.1109/TSP.2018.2818075.

II. BACKGROUND
In this section, we introduce the background concepts and notation used throughout the remainder of the paper.
A. Bregman Divergence and Convex Conjugates
For a given convex function $\phi$, the Bregman Divergence [13] associated with $\phi$ between $y$ and $x$ is denoted

$$D_\phi(y \,\|\, x) = \phi(y) - \phi(x) - \nabla\phi(x)^\top (y - x). \tag{1}$$

The Bregman Divergence is a non-negative (asymmetric) quasi-metric. Two familiar special cases of the Bregman Divergence are the squared Euclidean distance when $\phi(x) = \|x\|_2^2$ and the Kullback–Leibler divergence when $\phi(x) = \sum_i x_i \log x_i$ in the case that $x$ is a valid probability distribution (i.e., $x \ge 0$ and $\sum_i x_i = 1$).

The convex conjugate $\phi^*$ of a function $\phi$ is given by

$$\phi^*(x) \triangleq \sup_y \; y^\top x - \phi(y). \tag{2}$$

The convex conjugate arises naturally in optimization problems when deriving a dual form for the original (primal) problem. For a closed, convex, differentiable, 1-D function $\phi$ with invertible gradient, the following properties hold

$$\phi^*(x) = x\,(\nabla\phi)^{-1}(x) - \phi\big((\nabla\phi)^{-1}(x)\big), \qquad (\nabla\phi)^{-1} = \nabla\phi^*, \qquad (\phi^*)^* = \phi, \tag{3}$$

where $(\cdot)^{-1}$ denotes the inverse function, not the multiplicative inverse. In words, these properties give an alternate form of the conjugate in terms of gradients of the original function, state that the function inverse of the gradient is equal to the gradient of the conjugate, and state that the conjugate of the conjugate is the original function.
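As a quick numerical illustration of these definitions (our own, not from the paper), the snippet below evaluates the Bregman divergence for two standard choices of $\phi$ and checks that they reduce to the squared Euclidean distance and the KL divergence; the helper names are ours.

```python
import numpy as np

def bregman(phi, grad_phi, y, x):
    """Bregman divergence D_phi(y || x) = phi(y) - phi(x) - <grad phi(x), y - x>."""
    return phi(y) - phi(x) - grad_phi(x) @ (y - x)

# phi(x) = ||x||_2^2 gives the squared Euclidean distance ||y - x||_2^2.
phi_sq, dphi_sq = (lambda x: x @ x), (lambda x: 2.0 * x)

# phi(x) = sum_i x_i log x_i (on the probability simplex) gives the KL divergence.
phi_ent, dphi_ent = (lambda x: np.sum(x * np.log(x))), (lambda x: np.log(x) + 1.0)

rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(4)
assert np.isclose(bregman(phi_sq, dphi_sq, y, x), np.sum((y - x) ** 2))

p, q = rng.random(4), rng.random(4)
p, q = p / p.sum(), q / q.sum()            # normalize to probability vectors
kl = np.sum(p * np.log(p / q))             # KL(p || q)
assert np.isclose(bregman(phi_ent, dphi_ent, p, q), kl)
```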
B. Generalized Linear and Single-Index Models

The Generalized Linear Model (GLM) can be described using several parameterizations. We adopt the one based on the Bregman Divergence [14]. For observations $y_i \in \mathbb{R}$ and $\mathbf{x}_i \in \mathbb{R}^p$, let $\mathbf{y} = (y_1 \ldots y_n)^\top$, $\mathbf{X} = (\mathbf{x}_1 \ldots \mathbf{x}_n)$. The model is parameterized by 1) a non-linear link function $g = (\nabla\phi)^{-1}$, where $\phi$ is a closed, convex, differentiable, invertible function; and 2) a vector $\mathbf{a} \in \mathbb{R}^p$. We write the model as

$$\mathbb{E}[y_i \,|\, \mathbf{x}_i] = g\big(\mathbf{a}^\top \mathbf{x}_i\big) \tag{4}$$

(note that some references use $g^{-1}$ as the link function where we use $g$) and the corresponding likelihood function as

$$P(y_i \,|\, \mathbf{x}_i) = \exp\big\{ -D_\phi\big( y_i \,\|\, g(\mathbf{a}^\top \mathbf{x}_i) \big) \big\}, \tag{5}$$

where the likelihood is expressed with respect to an appropriate base measure [15], which can be omitted for notational clarity.

Let $G = \phi^*$ and $g = \nabla G = (\nabla\phi)^{-1}$. Then, for data $\{\mathbf{x}_i, y_i\}$ with conditionally independent $y_i$ given $\mathbf{x}_i$ (note that this does not necessarily assume that the $\mathbf{x}_i$ are independent), learning the model $\mathbf{a}$ assuming $g$ is known can be achieved via empirical risk minimization,

$$
\begin{aligned}
\widehat{\mathbf{a}} &= \operatorname*{argmax}_{\mathbf{a}} \prod_{i=1}^n \exp\big\{ -D_\phi\big( y_i \,\|\, g(\mathbf{a}^\top \mathbf{x}_i) \big) \big\}
= \operatorname*{argmin}_{\mathbf{a}} \sum_{i=1}^n D_\phi\big( y_i \,\|\, g(\mathbf{a}^\top \mathbf{x}_i) \big) \\
&= \operatorname*{argmin}_{\mathbf{a}} \sum_{i=1}^n \Big[ \phi(y_i) - \phi\big(g(\mathbf{a}^\top \mathbf{x}_i)\big) - \nabla\phi\big(g(\mathbf{a}^\top \mathbf{x}_i)\big)\big( y_i - g(\mathbf{a}^\top \mathbf{x}_i) \big) \Big] \\
&\overset{(a)}{=} \operatorname*{argmin}_{\mathbf{a}} \sum_{i=1}^n \Big[ G^*(y_i) - y_i\big(\mathbf{a}^\top \mathbf{x}_i\big) - \phi\big(g(\mathbf{a}^\top \mathbf{x}_i)\big) + \big(\mathbf{a}^\top \mathbf{x}_i\big)\, g\big(\mathbf{a}^\top \mathbf{x}_i\big) \Big] \\
&\overset{(b)}{=} \operatorname*{argmin}_{\mathbf{a}} \sum_{i=1}^n \Big[ G^*(y_i) + G\big(\mathbf{a}^\top \mathbf{x}_i\big) - y_i\big(\mathbf{a}^\top \mathbf{x}_i\big) \Big]
= \operatorname*{argmin}_{\mathbf{a}} \widehat{F}(\mathbf{y}, \mathbf{X}, g, \mathbf{a}),
\end{aligned} \tag{6}
$$

where equality (a) arises from the second property in (3), equality (b) arises from the first property, and we introduce

$$\widehat{F}(\mathbf{y}, \mathbf{X}, g, \mathbf{a}) \triangleq \frac{1}{n}\sum_{i=1}^n \Big[ G^*(y_i) + G\big(\mathbf{a}^\top \mathbf{x}_i\big) - y_i\big(\mathbf{a}^\top \mathbf{x}_i\big) \Big] \tag{7}$$

for notational compactness.

The Single Index Model (SIM) [16] takes the same form as the GLM. The crucial difference is in the estimation of the models. When learning a GLM, the link function $g$ is known and the linear parameter $\mathbf{a}$ is estimated; however, when learning a SIM, the link function $g$ needs to be estimated along with the linear parameter $\mathbf{a}$.

Recently, it has been shown that, when the function $g$ is restricted to be monotonic increasing and Lipschitz, learning SIMs becomes computationally tractable [15], with performance guarantees in high-dimensional settings [12]. Thus, with scalar $u$ defining the set $\mathcal{C}_u = \{ g : \forall y > x,\ 0 \le g(y) - g(x) \le u(y - x) \}$ of monotonic increasing $u$-Lipschitz functions, this leads to the optimization problem,

$$(\widehat{g}, \widehat{\mathbf{a}}) = \operatorname*{argmin}_{g, \mathbf{a}} \ \widehat{F}(\mathbf{y}, \mathbf{X}, g, \mathbf{a}) \quad \text{s.t.} \quad g = \nabla G \in \mathcal{C}_1. \tag{8}$$
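For a concrete (hypothetical) instance of this objective, the sketch below evaluates $\widehat{F}$ in (7) for the Gaussian case $g(\theta) = \theta$, where $G(\theta) = \theta^2/2$ and $G^*(y) = y^2/2$, so that $\widehat{F}$ reduces to half the mean squared error; the function names and the check are ours, not from the paper.

```python
import numpy as np

def F_hat(y, X, G, G_star, a):
    """Empirical risk (7): mean of G*(y_i) + G(a^T x_i) - y_i * (a^T x_i)."""
    theta = X.T @ a                          # X has columns x_i, so theta_i = a^T x_i
    return np.mean(G_star(y) + G(theta) - y * theta)

# Gaussian GLM: g(theta) = theta, G(theta) = theta^2 / 2, G*(y) = y^2 / 2.
G, G_star = (lambda t: 0.5 * t ** 2), (lambda y: 0.5 * y ** 2)

rng = np.random.default_rng(1)
p, n = 5, 200
a_true = rng.normal(size=p)
X = rng.normal(size=(p, n))
y = a_true @ X + 0.1 * rng.normal(size=n)

a = rng.normal(size=p)
# In the Gaussian case, F_hat equals half the mean squared error.
assert np.isclose(F_hat(y, X, G, G_star, a), 0.5 * np.mean((y - a @ X) ** 2))
```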
C. Lipschitz Monotonic Regression and Estimating SIMs

The estimation of $g$ with the objective function including terms $G$ and $G^*$ at first appears to be an intractably difficult calculus of variations problem. However, there is a marginalization technique that cleverly avoids directly estimating functional gradients with respect to $G$ and $G^*$ [15] and admits gradient-based optimization algorithms for learning. The marginalization utilizes Lipschitz monotonic regression (LMR) as a subproblem. Thus, before introducing this marginalization, we first review LMR.
1) LMR:
Given ordered pairs $\{x_i, y_i\}$ and independent Gaussian $w_i$, consider the model

$$y_i = g(x_i) + w_i, \tag{9}$$

which intuitively treats $\{y_i\}$ as noisy observations of a function $g$ indexed by $x$, sampled at points $\{x_i\}$. Let $\widehat{g}_i = g(x_i)$, an estimate of the function value, with $\widehat{\mathbf{g}} = (\widehat{g}_1 \ldots \widehat{g}_n)^\top$, and let $x_{[j]}$ denote the $j$th element of the $\{x_i\}$ sorted in ascending order. Then LMR is described by the problem,

$$\widehat{\mathbf{g}} \triangleq \mathrm{LMR}(\mathbf{y}, \mathbf{x}) = \operatorname*{argmin}_{\mathbf{g}} \sum_{i=1}^n \big( g(x_i) - y_i \big)^2 \quad \text{s.t.} \quad 0 \le g\big(x_{[j+1]}\big) - g\big(x_{[j]}\big) \le x_{[j+1]} - x_{[j]} \quad \text{for } j = 1, \ldots, n-1. \tag{10}$$

While there may in fact be infinitely many possible (continuous) monotonic increasing Lipschitz functions $g$ that pass through the points $\widehat{\mathbf{g}}$, the solution vector $\widehat{\mathbf{g}}$ is unique. We will introduce a simple yet effective algorithm for solving this problem later in Section IV-A.
2) Estimating SIMs:
We now return to the objective function (8). Let $\widehat{\mathbf{g}} = \mathrm{LMR}(\mathbf{y}, \mathbf{X}^\top\mathbf{a})$. Then the gradient w.r.t. $\mathbf{a}$ can be expressed in terms of an estimate of $g$ without explicit computation of $G$ or $G^*$,

$$\nabla_{\mathbf{a}} F = \sum_{i=1}^n \big[ (\widehat{g}_i - y_i)\,\mathbf{x}_i \big]. \tag{11}$$

This allows us to apply gradient or quasi-Newton methods to solve the minimization in $\mathbf{a}$, which is itself a convex problem since the original problem was jointly convex in $g$ and $\mathbf{a}$.
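A minimal sketch of this marginalized gradient step is given below (our own illustration, not the authors' code). For brevity it substitutes plain isotonic regression (scikit-learn's IsotonicRegression) for the full Lipschitz monotonic regression; in the paper the projection would be the LMR of (10), and all variable names here are assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def sim_gradient(a, X, y):
    """One marginalized gradient (11): fit a monotone g to (a^T x_i, y_i), then
    grad_a F = sum_i (g_hat_i - y_i) x_i.  Isotonic regression stands in for LMR here."""
    theta = X.T @ a                                   # theta_i = a^T x_i
    g_hat = IsotonicRegression().fit_transform(theta, y)
    return X @ (g_hat - y), g_hat

# Tiny illustration with a logistic-like ground-truth link.
rng = np.random.default_rng(0)
p, n = 10, 500
a_true = rng.normal(size=p)
X = rng.normal(size=(p, n))
y = 1.0 / (1.0 + np.exp(-(a_true @ X))) + 0.05 * rng.normal(size=n)

a = np.zeros(p)
for _ in range(200):                                  # plain gradient descent on a
    grad, _ = sim_gradient(a, X, y)
    a -= 0.5 * grad / n                               # fixed step; backtracking or quasi-Newton in practice
```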
III. SINGLE INDEX LATENT VARIABLE MODELS

In this section, we build the Single Index Latent Variable (SILVar) model from fundamental concepts.
A. Multitask regression and latent variables
First, we extend the SIM to the multivariate case and then examine how latent variables can affect learning of the linear parameter. Let $\mathbf{y}_i = (y_{1i} \ldots y_{mi})^\top$, $\mathbf{g}(\mathbf{x}) = (g_1(x_1) \ldots g_m(x_m))^\top$, and $\mathbf{A} = (\mathbf{a}_1 \ldots \mathbf{a}_m)^\top$. Consider the vectorization,

$$\mathbb{E}[y_{ji} \,|\, \mathbf{x}_i] = g_j\big(\mathbf{a}_j^\top \mathbf{x}_i\big) \ \Rightarrow\ \mathbb{E}[\mathbf{y}_i \,|\, \mathbf{x}_i] = \mathbf{g}(\mathbf{A}\mathbf{x}_i). \tag{12}$$

For the remainder of this paper, we make the assumption that all $g_j = g$ for notational simplicity, though the same analysis readily extends to the case where the $g_j$ are distinct.

Now, let us introduce a set of latent variables $\mathbf{z}_i \in \mathbb{R}^r$ with $r \ll p$ and the corresponding linear parameter $\mathbf{B} = (\mathbf{b}_1 \ldots \mathbf{b}_m)^\top \in \mathbb{R}^{m \times r}$ (note we can incorporate a linear offset by augmenting the variable $\mathbf{z} \leftarrow (\mathbf{z}^\top\ 1)^\top$ and adding the linear offset as a column of $\mathbf{B}$). This leads to the asymptotic maximum likelihood estimate,

$$(\overline{g}, \overline{\mathbf{A}}, \overline{\mathbf{B}}) = \operatorname*{argmin}_{g, \mathbf{A}, \mathbf{B}} \ F(\mathbf{y}_i, \mathbf{x}_i, \mathbf{z}_i, g, \mathbf{A}, \mathbf{B}) \quad \text{s.t.} \quad g = \nabla G \in \mathcal{C}_1, \tag{13}$$

where

$$F(\mathbf{y}_i, \mathbf{x}_i, \mathbf{z}_i, g, \mathbf{A}, \mathbf{B}) \triangleq \mathbb{E}\Bigg[ \sum_{j=1}^m \Big[ G^*(y_{ji}) + G\big(\mathbf{a}_j^\top\mathbf{x}_i + \mathbf{b}_j^\top\mathbf{z}_i\big) \Big] - \mathbf{y}_i^\top\big( \mathbf{A}\mathbf{x}_i + \mathbf{B}\mathbf{z}_i \big) \Bigg]. \tag{14}$$

Now consider the case in which the true distribution remains the same, but we only observe $\mathbf{x}_i$ and not $\mathbf{z}_i$,

$$(\widehat{g}, \widehat{\mathbf{A}}) = \operatorname*{argmin}_{g, \mathbf{A}}\ F(\mathbf{y}_i, \mathbf{x}_i, g, \mathbf{A}) \triangleq \mathbb{E}\Bigg[ \sum_{j=1}^m \Big[ G^*(y_{ji}) + G\big(\mathbf{a}_j^\top\mathbf{x}_i\big) \Big] - \mathbf{y}_i^\top\big(\mathbf{A}\mathbf{x}_i\big) \Bigg] \quad \text{s.t.} \quad g = \nabla G \in \mathcal{C}_1. \tag{15}$$

We now propose a relation between the two models in (13) and (15), which will finally lead to the SILVar model. Here we present the abridged theorem, and relegate the full expressions and derivation to Appendix A. To establish notation, let primes ($'$) denote derivatives, hats ($\widehat{\ }$) denote a parameter estimate with only observed variables, overbars ($\overline{\ }$) denote an underlying parameter estimate when we have access to both observed and latent variables, and we drop the subscripts from the random variables $\mathbf{x}$ and $\mathbf{z}$ to reduce clutter.

Theorem 1.
Assume that $\widehat{g}'(0) \ne 0$ and that $|\widehat{g}''| \le J$ and $|\overline{g}''| \le J$ for some $J < \infty$. Furthermore, assume in models (14) and (15) that

$$\max_j \big( \|\widehat{\mathbf{a}}_j\|_1^2,\ (\|\overline{\mathbf{a}}_j\|_1 + \|\overline{\mathbf{b}}_j\|_2)^2 \big) \le k^2, \qquad \max\big( \mathbb{E}[\|\mathbf{x}\|_2\|\mathbf{x}\|_\infty^2],\ \mathbb{E}[\|\mathbf{x}\|_2\|\mathbf{x}\|_\infty\|\mathbf{z}\|_2],\ \mathbb{E}[\|\mathbf{x}\|_2\|\mathbf{z}\|_2^2] \big) \le s_{Nr},$$

where subscripts in $s_{Nr}$ indicate that the bounds may grow with the values in the subscript. Then, the parameters $\widehat{\mathbf{A}}$ and $\overline{\mathbf{A}}$ from models (14) and (15) are related as

$$\widehat{\mathbf{A}} = q\,(\overline{\mathbf{A}} + \overline{\mathbf{L}}) + \mathbf{E},$$

where $q = \overline{g}'(0)/\widehat{g}'(0)$, $\boldsymbol{\mu}_x = \mathbb{E}[\mathbf{x}_i]$,

$$\overline{\mathbf{L}} = \Big( \overline{\mathbf{B}}\, \mathbb{E}[\mathbf{z}\mathbf{x}^\top] + \big(\overline{\mathbf{g}}(\mathbf{0}) - \widehat{\mathbf{g}}(\mathbf{0})\big)\boldsymbol{\mu}_x^\top \Big) \big( \mathbb{E}[\mathbf{x}\mathbf{x}^\top] \big)^\dagger \ \Rightarrow\ \mathrm{rank}(\overline{\mathbf{L}}) \le r + 1,$$

and $\mathbf{E} = \widehat{\mathbf{A}} - q(\overline{\mathbf{A}} + \overline{\mathbf{L}})$ is bounded as

$$\frac{1}{MN}\|\mathbf{E}\|_F \le \frac{J\sigma_\ell\sqrt{N}}{\widehat{g}'(0)\,M}\, s_{Nr}\,k^2,$$

where $\sigma_\ell = \big\|\big(\mathbb{E}[\mathbf{x}\mathbf{x}^\top]\big)^\dagger\big\|_2$, the largest singular value of the pseudo-inverse of the covariance.

The proof is given in Appendix A. The assumptions require that the 2nd order Taylor series expansion is accurate around the point $\mathbf{0}$, that the model parameters $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$ are bounded, and that the distributions generating the data are not too spread out (beyond where the Taylor series expansion is performed). These are all intuitively reasonable and unsurprising assumptions. Though the theorem poses a hard constraint on $|\widehat{g}''|$, we hypothesize that this is a rather strong condition that can be weakened to be in line with similar models.

The theorem determines certain scaling regimes under which the sparse plus low-rank approximation remains reasonable. For instance, consider the case where $M \sim N$ and the moments scale as $s_{Nr} \sim \sqrt{N}$, which is reasonable given their form (i.e., very loosely, $\|\mathbf{x}\|_\infty \sim 1$ and $\|\mathbf{x}\|_2 \sim \sqrt{N}$, and small latent power $\|\mathbf{z}\|_2 \sim 1$ relative to $N$). Then, to keep the error of constant order, the power of each row of the matrix would need to stay constant, $k \sim 1$. If we see $\overline{\mathbf{A}}$ as a graph adjacency matrix, then, in rough terms, this corresponds intuitively to the case in which the node in-degrees stay constant, so that the local complexity remains the same even as the global complexity increases while the network grows. Again, we hypothesize that this overall bound can be tightened given more assumptions (e.g., on the network topology). Thus, we propose the SILVar model,

$$\widehat{\mathbf{y}} = \widehat{g}\Big( \big( \widehat{\mathbf{A}} + \widehat{\mathbf{L}} \big)\mathbf{x} \Big), \tag{16}$$

and learn it using the optimization problem,

$$(\widehat{g}, \widehat{\mathbf{A}}, \widehat{\mathbf{L}}) = \operatorname*{argmin}_{g, \mathbf{A}, \mathbf{L}} \ \widehat{F}(\mathbf{Y}, \mathbf{X}, g, \mathbf{A} + \mathbf{L}) + h_1(\mathbf{A}) + h_2(\mathbf{L}) \quad \text{s.t.} \quad g = \nabla G \in \mathcal{C}_1, \tag{17}$$

where

$$\widehat{F}(\mathbf{Y}, \mathbf{X}, g, \mathbf{A}) = \frac{1}{n} \sum_{i=1}^n \Bigg[ \sum_{j=1}^m \Big[ G^*(y_{ji}) + G\big(\mathbf{a}_j^\top\mathbf{x}_i\big) \Big] - \mathbf{y}_i^\top\big( \mathbf{A}\mathbf{x}_i \big) \Bigg], \tag{18}$$

the empirical version of $F$, and $h_1$ and $h_2$ are regularizers on $\mathbf{A}$ and $\mathbf{L}$, respectively.
Two natural choices for $h_2$ would be $h_2(\mathbf{L}) = \lambda_2\|\mathbf{L}\|_*$, the nuclear norm, and $h_2(\mathbf{L}) = \mathbb{I}\{\|\mathbf{L}\|_* \le \lambda_2\}$, the indicator of the nuclear norm ball, both relating to the nuclear norm of $\mathbf{L}$, since $\mathbf{L}$ is approximately low rank due to the influence of a relatively small number of latent variables. We may choose different forms for $h_1$ depending on our assumptions about the structure of $\mathbf{A}$. For example, if $\mathbf{A}$ is assumed sparse, we may use $h_1(\mathbf{A}) = \lambda_1\|\mathrm{v}(\mathbf{A})\|_1$, the $\ell_1$ norm applied element-wise to the vectorized $\mathbf{A}$ matrix. These examples are extensions to the "sparse and low-rank" models, which have been shown under certain geometric incoherence conditions to be identifiable [9]. In other words, if the sparse component is not too low-rank, and if the low-rank component is not too sparse, then $\mathbf{A}$ and $\mathbf{L}$ can be recovered uniquely.
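Since learning with these regularizers is typically carried out with proximal methods (Section IV), the following generic sketch shows the two proximal operators implied by this choice: element-wise soft-thresholding for $\lambda_1\|\mathrm{v}(\mathbf{A})\|_1$ and singular value soft-thresholding for $\lambda_2\|\mathbf{L}\|_*$. This is our own illustration, not code from the paper.

```python
import numpy as np

def prox_l1(A, step, lam1):
    """Proximal operator of step * lam1 * ||v(A)||_1: element-wise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - step * lam1, 0.0)

def prox_nuclear(L, step, lam2):
    """Proximal operator of step * lam2 * ||L||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ np.diag(np.maximum(s - step * lam2, 0.0)) @ Vt

# Example: shrinking two random matrices toward sparse and low-rank structure.
rng = np.random.default_rng(0)
A = prox_l1(rng.normal(size=(25, 25)), step=1.0, lam1=0.5)
L = prox_nuclear(rng.normal(size=(25, 25)), step=1.0, lam2=5.0)
print(np.count_nonzero(A), np.linalg.matrix_rank(L))
```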
B. Connection to Related Problems

In this section, we show how the SILVar model can be used in various problem settings commonly considered throughout the literature.
1) Generalized Robust PCA:
Though we posed our initial problem as a regression problem, if we take our measurement vectors to be $\mathbf{x}_i = \mathbf{e}_i$, the canonical basis vectors (i.e., so that the overall data matrix $\mathbf{X} = \mathbf{I}$), then we arrive at

$$\widehat{\mathbf{Y}} = \widehat{g}\big( \widehat{\mathbf{A}} + \widehat{\mathbf{L}} \big). \tag{19}$$

This is precisely the model behind Generalized Robust PCA [17], but with the twist of estimating the link function as well [18]. What is worth noting is that although we arrived at our model via a regression problem with latent variables, the model itself is also able to describe a problem that arises from very different assumptions on how the data is generated and structured.

We also note that the SILVar model can be modified to share a space with the Generalized Low-Rank (GLR) models [19]. The GLR framework is able to describe arbitrary types of data with an appropriate choice of convex link function $g$ determined a priori, while the SILVar model is restricted to a certain continuous class of convex link functions but aims to learn this function. The modification is simply a matrix factorization $\mathbf{L} = \mathbf{U}\mathbf{V}$ (and "infinite" penalization on $\mathbf{A}$). The explicit factorization makes the problem non-convex but instead block convex, which still allows for alternating convex steps in $\mathbf{U}$ with fixed $\mathbf{V}$ (and vice versa) to tractably reach local minima under certain conditions. Nonetheless, due to the non-convexity, further discussion of this particular extension is beyond the scope of this paper.
2) Extension to Autoregressive Models:
We can apply the SILVar model to learn from time series as well. Consider a set of $N$ time series, each of length $K$, $\mathbf{X} \in \mathbb{R}^{N \times K}$. We assume the noise at each time step is independent (note that, with this assumption, the time series are still dependent across time), and take, in our previous formulation, $\mathbf{y}_i \leftarrow \mathbf{x}_k$ and $\mathbf{x}_i \leftarrow \mathbf{x}_{k-1}^{k-M} = (\mathbf{x}_{k-1}^\top \ldots \mathbf{x}_{k-M}^\top)^\top$, so that the model of order $M$ takes the form,

$$\widehat{\mathbf{x}}_k = g\Bigg( \sum_{i=1}^M \big( \mathbf{A}^{(i)} + \mathbf{L}^{(i)} \big)\mathbf{x}_{k-i} \Bigg), \tag{20}$$

and learn it using the optimization problem,

$$(\widehat{g}, \widehat{\mathbf{A}}, \widehat{\mathbf{L}}) = \operatorname*{argmin}_{g, \mathbf{A}, \mathbf{L}} \ \widehat{F}(\mathbf{X}, g, \mathbf{A} + \mathbf{L}) + h_1(\mathbf{A}) + h_2(\mathbf{L}) \quad \text{s.t.} \quad g = \nabla G \in \mathcal{C}_1, \tag{21}$$

where $\mathbf{A} = \big( \mathbf{A}^{(1)} \ldots \mathbf{A}^{(M)} \big)$, $\mathbf{L} = \big( \mathbf{L}^{(1)} \ldots \mathbf{L}^{(M)} \big)$, and

$$\widehat{F}(\mathbf{X}, g, \mathbf{A}) = \frac{1}{K - M}\sum_{k=M+1}^{K} \Bigg[ \sum_{j=1}^N \Bigg[ G^*(x_{jk}) + G\Bigg( \sum_{i=1}^M \mathbf{a}_j^{(i)\top}\mathbf{x}_{k-i} \Bigg) \Bigg] - \mathbf{x}_k^\top\Bigg( \sum_{i=1}^M \mathbf{A}^{(i)}\mathbf{x}_{k-i} \Bigg) \Bigg],$$

where $\mathbf{A}^{(i)} = \big( \mathbf{a}_1^{(i)} \ldots \mathbf{a}_N^{(i)} \big)^\top$, similarly to before. Note that the analysis in the previous section follows naturally in this setting, so that here $\mathrm{rank}(\mathbf{L}^{(i)}) \le r + 1$. Then, the matrix $\mathbf{A}$ may be assumed to be group sparse, relating to generalized notions of Granger causality [3], [20], with one possible corresponding choice of regularizer taking the form $h_1(\mathbf{A}) = \lambda_1\sum_{i,j} \big\| \big( a_{ij}^{(1)} \ldots a_{ij}^{(M)} \big) \big\|_2$.

Another structural assumption could be that of the Causal Graph Process model [21], inspired by Signal Processing on Graphs [6], in which the $\mathbf{A}^{(i)}$ are matrix polynomials in one underlying matrix $\widetilde{\mathbf{A}}$. This framework utilizes the nonconvex regularizer $h_1(\mathbf{A}) = \lambda_1\|\mathrm{v}(\mathbf{A}^{(1)})\|_1 + \lambda_2\sum_{i \ne j}\|\mathbf{A}^{(i)}\mathbf{A}^{(j)} - \mathbf{A}^{(j)}\mathbf{A}^{(i)}\|_F^2$ to encourage both sparsity and commutativity, which is satisfied if the $\mathbf{A}^{(i)}$ are all matrix polynomials in the same matrix. Since this particular regularization is again block convex, convex steps can still be taken in each $\mathbf{A}^{(i)}$ with all other blocks $\mathbf{A}^{(j)}$ for $j \ne i$ fixed, for a computationally tractable algorithm to reach a local minimum under certain conditions. However, further detailed discussion remains outside the scope of this paper.

The hidden variables in this time series model can even possibly model underlying trends in the data. For example, one year of daily temperature measurements across a country could be related through a graph based on geographical and meteorological features, but all exhibit the same significant trend due to the changing seasons. In previous work, a standard pipeline is to detrend the data as a preprocessing step, and then estimate or use a graph to describe the variations of the data on top of the underlying trends [6]–[8]. Instead, the time series can also be modeled as a modified autoregressive process, depending on a low-rank trend $\mathbf{L}' = \big( \boldsymbol{\ell}'_1 \ldots \boldsymbol{\ell}'_K \big) \in \mathbb{R}^{N \times K}$ and the variations of the process about that trend,

$$\widehat{\mathbf{x}}_k = g\Bigg( \boldsymbol{\ell}'_k + \sum_{i=1}^M \mathbf{A}^{(i)}\big( \mathbf{x}_{k-i} - \boldsymbol{\ell}'_{k-i} \big) \Bigg). \tag{22}$$
Substituting this into Equation (20) yields

$$\boldsymbol{\ell}'_k + \sum_{i=1}^M \mathbf{A}^{(i)}\big( \mathbf{x}_{k-i} - \boldsymbol{\ell}'_{k-i} \big) = \sum_{i=1}^M \big( \mathbf{A}^{(i)} + \mathbf{L}^{(i)} \big)\mathbf{x}_{k-i} \ \Rightarrow\ \sum_{i=1}^M \mathbf{L}^{(i)}\mathbf{x}_{k-i} = \boldsymbol{\ell}'_k - \sum_{i=1}^M \mathbf{A}^{(i)}\boldsymbol{\ell}'_{k-i}. \tag{23}$$

Thus we estimate the trend using ridge regression, for numerical stability purposes, but without enforcing $\mathbf{L}'$ to be low rank. We can accomplish this via the simple optimization,

$$\widehat{\mathbf{L}}' = \operatorname*{argmin}_{\mathbf{L}'} \sum_{k=M+1}^K \Bigg\| \boldsymbol{\ell}'_k - \sum_{i=1}^M \mathbf{A}^{(i)}\boldsymbol{\ell}'_{k-i} - \sum_{i=1}^M \mathbf{L}^{(i)}\mathbf{x}_{k-i} \Bigg\|_2^2 + \lambda\|\mathbf{L}'\|_F^2, \tag{24}$$

with $\lambda$ being the regularization parameter. In this way, the extension of SILVar to autoregressive models can allow joint estimation of the effects of the trend and of these variations supported on the graph, as will be demonstrated via experiments in Section VI.
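To make the reindexing in (20)–(21) concrete, the sketch below builds the stacked lag regressors that turn a series $\mathbf{X} \in \mathbb{R}^{N\times K}$ into the multi-task regression pairs (targets $\mathbf{x}_k$, stacked predictors $(\mathbf{x}_{k-1}^\top \ldots \mathbf{x}_{k-M}^\top)^\top$). This is our own illustration of the bookkeeping, not code from the paper.

```python
import numpy as np

def build_ar_pairs(X, M):
    """Turn a time series X (N x K) into AR regression pairs.

    Returns Y (N x (K - M)) with columns x_k for k = M+1, ..., K, and
    Z (N*M x (K - M)) whose k-th column stacks x_{k-1}, ..., x_{k-M}.
    An order-M SILVar AR model then regresses Y on Z with the
    (N x N*M) parameter (A^(1) ... A^(M)) + (L^(1) ... L^(M)).
    """
    N, K = X.shape
    Y = X[:, M:]
    Z = np.vstack([X[:, M - i:K - i] for i in range(1, M + 1)])
    return Y, Z

# Example: N = 3 series, K = 10 steps, order M = 2.
X = np.arange(30, dtype=float).reshape(3, 10)
Y, Z = build_ar_pairs(X, M=2)
assert Y.shape == (3, 8) and Z.shape == (6, 8)
# First column of Z stacks x_2 (lag 1) on top of x_1 (lag 2), predicting x_3.
assert np.allclose(Z[:3, 0], X[:, 1]) and np.allclose(Z[3:, 0], X[:, 0])
```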
IV. EFFICIENTLY LEARNING SILVAR MODELS
In this section, we describe the formulation and corresponding algorithm for learning the SILVar model. Surprisingly, in the single-task setting, learning a SIM is jointly convex in $g$ and $\mathbf{a}$, as demonstrated in [15]. The pseudo-likelihood functional $\widehat{F}$ used for learning the SILVar model in (21) is thus also jointly convex in $g$, $\mathbf{A}$, and $\mathbf{L}$ by a simple extension from the single-task regression setting.

Lemma 2 (Corollary of Theorem 2 of [15]). The $\widehat{F}$ in the SILVar model learning problem (17) is jointly convex in $g$, $\mathbf{A}$, and $\mathbf{L}$.

This convexity is enough to ensure that the learning can converge and be computationally efficient. Before describing the full algorithm, one detail remains: the implementation of the LMR introduced in Section II-C.
A. Lipschitz Monotonic Regression
To tackle LMR, we first introduce the related simpler problem of monotonic regression, which is solved by a well-known algorithm, the pooled adjacent violators (PAV) [22]. The monotonic regression problem is formulated as

$$\mathrm{PAV}(\mathbf{y}, \mathbf{x}) \triangleq \operatorname*{argmin}_{\mathbf{g}} \sum_{i=1}^n \big( g(x_i) - y_i \big)^2 \quad \text{s.t.} \quad 0 \le g\big(x_{[j+1]}\big) - g\big(x_{[j]}\big). \tag{25}$$

The PAV algorithm has a complexity of $O(n)$, which is due to a single complete sweep of the vector $\mathbf{y}$. The monotonic regression problem can also be seen as a standard $\ell_2$ projection onto the convex set of monotonic functions.

We introduce a simple generalization of monotonic regression,

$$\mathrm{GPAV}_{\mathbf{t}}(\mathbf{y}, \mathbf{x}) \triangleq \operatorname*{argmin}_{\mathbf{g}} \sum_{i=1}^n \big( g(x_i) - y_i \big)^2 \quad \text{s.t.} \quad t_{[j+1]} \le g\big(x_{[j+1]}\big) - g\big(x_{[j]}\big) \quad \text{for } j = 1, \ldots, n-1. \tag{26}$$

This modification allows weaker (if $t_i < 0$) or stronger (if $t_i > 0$) local monotonicity conditions to be placed on the estimated function $g$. Let $t_{[1]} = 0$ and $s_{[i]} = \sum_{j=1}^i t_{[j]}$, or in vectorized form,

$$\mathbf{s} \triangleq \mathrm{cusum}(\mathbf{t}). \tag{27}$$

Then, we can rewrite

$$\mathrm{GPAV}_{\mathbf{t}}(\mathbf{y}, \mathbf{x}) = \operatorname*{argmin}_{\mathbf{g}} \sum_{i=1}^n \big( (g(x_i) - s_i) - (y_i - s_i) \big)^2 \quad \text{s.t.} \quad 0 \le \big( g\big(x_{[j+1]}\big) - s_{[j+1]} \big) - \big( g\big(x_{[j]}\big) - s_{[j]} \big) \quad \text{for } j = 1, \ldots, n-1, \tag{28}$$

which leads us to recognize that

$$\mathrm{GPAV}_{\mathbf{t}}(\mathbf{y}, \mathbf{x}) = \mathrm{PAV}(\mathbf{y} - \mathbf{s}, \mathbf{x}) + \mathbf{s}. \tag{29}$$

Returning to LMR (10), we pose it as the projection onto the intersection of two convex sets, the monotonic functions and the functions with upper bounded differences, for which each projection can be solved efficiently using GPAV (for the second set, we can use GPAV on the negated function and negate the result). Thus, to perform this optimization efficiently utilizing PAV as a subroutine, we can use an accelerated Dykstra's algorithm [23], which is a general method to find the projection onto the non-empty intersection of any number of convex sets. The accelerated algorithm has a geometric convergence rate, requiring $O(\log n)$ passes with each pass utilizing PAV at $O(n)$ cost, for a total cost of $O(n \log n)$. We make a brief sidenote that this is better than the $O(n^2)$ in [24] and simpler than the $O(n \log n)$ in [25] achieved using a complicated tree-like data structure, and has no learning rate parameter to tune compared to an ADMM-based implementation that would also yield $O(n \log n)$. We provide the steps of LMR in Algorithm 1 as a straightforward application of the accelerated Dykstra's algorithm, but leave the details of the derivation of the acceleration to [23]. In addition, a numerical stability parameter $\varepsilon$ is included to handle small denominators, but in practice we did not observe the denominator falling below the threshold in our simulations before convergence.

While the GPAV allows us to quickly perform Lipschitz monotonic regression with our particular choice of $\mathbf{t}$, it should be noted that using other choices for $\mathbf{t}$ can easily generalize the problem to bi-Hölder regression as well, to find functions in the set $\mathcal{C}_{\ell,\alpha}^{u,\beta} = \{ g : \forall y > x,\ \ell|y - x|^\alpha \le g(y) - g(x) \le u|y - x|^\beta \}$. This set may prove interesting for further analysis of the estimator and is left as a topic for future investigation.
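As a concrete sketch (our own, not the authors' code), the following implements a basic PAV sweep and the shifted reduction (29) of GPAV to PAV; the input `x` is assumed to be already sorted in ascending order.

```python
import numpy as np

def pav(y):
    """Pooled adjacent violators: L2 projection of y onto non-decreasing sequences."""
    n = len(y)
    vals = y.astype(float).copy()      # block values
    wts = np.ones(n)                   # block weights (sizes)
    idx = np.arange(n)                 # last index covered by each block
    m = 0                              # index of the last active block
    for i in range(1, n):
        m += 1
        vals[m], wts[m], idx[m] = y[i], 1.0, i
        while m > 0 and vals[m - 1] > vals[m]:           # merge violating blocks
            tot = wts[m - 1] + wts[m]
            vals[m - 1] = (wts[m - 1] * vals[m - 1] + wts[m] * vals[m]) / tot
            wts[m - 1], idx[m - 1] = tot, idx[m]
            m -= 1
    out, start = np.empty(n), 0
    for b in range(m + 1):
        out[start:idx[b] + 1] = vals[b]
        start = idx[b] + 1
    return out

def gpav(y, t):
    """GPAV_t via (29): shift by s = cusum(t), run PAV, shift back."""
    s = np.cumsum(t)
    return pav(y - s) + s

# With t_[1] = 0 and t_[j+1] = x_[j+1] - x_[j], GPAV enforces differences of g
# at least as large as those of x; LMR applies this to negated data to obtain
# the Lipschitz upper bound (see Algorithm 1 below).
x = np.sort(np.random.default_rng(0).uniform(0, 5, 50))
y = np.sin(x) + x + 0.1 * np.random.default_rng(1).normal(size=50)
t = np.concatenate(([0.0], np.diff(x)))
print(gpav(y, t)[:5])
```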
Algorithm 1
Lipschitz monotone regression (LMR)

Let $t_{[1]} = 0$ and compute $t_{[i+1]} = x_{[i+1]} - x_{[i]}$, $\mathbf{s} = \mathrm{cusum}(\mathbf{t})$.
Set numerical stability parameter $0 < \varepsilon \ll 1$, error $= \infty$, and tolerance $0 < \delta \ll 1$.
Initialize $\mathbf{g}^{(0)}_\ell = \mathrm{PAV}(\mathbf{y}, \mathbf{x})$, $\mathbf{g}^{(0)}_u = -\mathrm{PAV}(-\mathbf{y} + \mathbf{s}, \mathbf{x}) + \mathbf{s}$, $\mathbf{v}_1 = \mathbf{0}$, $\mathbf{v}_2 = \mathbf{g}_u - \mathbf{g}_\ell$, $\mathbf{w} = \mathbf{g}_u$, $k = 0$.
while error $\ge \delta$ do
  $\mathbf{g}^{(k+1)}_\ell \leftarrow \mathrm{PAV}\big( \mathbf{g}^{(k)}_u - \mathbf{v}_1, \mathbf{x} \big)$
  $\mathbf{g}^{(k+1)}_u \leftarrow -\mathrm{PAV}\big( -(\mathbf{g}^{(k+1)}_\ell - \mathbf{v}_2) + \mathbf{s}, \mathbf{x} \big) + \mathbf{s}$
  $\mathbf{v}_1 \leftarrow \mathbf{g}^{(k+1)}_\ell - (\mathbf{g}^{(k)}_u - \mathbf{v}_1)$
  $\mathbf{v}_2 \leftarrow \mathbf{g}^{(k+1)}_u - (\mathbf{g}^{(k+1)}_\ell - \mathbf{v}_2)$
  if $\big| \|\mathbf{g}^{(k)}_\ell - \mathbf{g}^{(k)}_u\|_2^2 + (\mathbf{g}^{(k+1)}_\ell - \mathbf{g}^{(k+1)}_u)^\top(\mathbf{g}^{(k)}_\ell - \mathbf{g}^{(k)}_u) \big| \ge \varepsilon$ then
    $\alpha^{(k+1)} \leftarrow \dfrac{\|\mathbf{g}^{(k)}_\ell - \mathbf{g}^{(k)}_u\|_2^2}{\|\mathbf{g}^{(k)}_\ell - \mathbf{g}^{(k)}_u\|_2^2 + (\mathbf{g}^{(k+1)}_\ell - \mathbf{g}^{(k+1)}_u)^\top(\mathbf{g}^{(k)}_\ell - \mathbf{g}^{(k)}_u)}$
    $\mathbf{z} \leftarrow \mathbf{g}^{(k)}_u + \alpha^{(k+1)}(\mathbf{g}^{(k+1)}_u - \mathbf{g}^{(k)}_u)$
  else
    $\mathbf{z} \leftarrow \tfrac{1}{2}(\mathbf{g}^{(k+1)}_\ell + \mathbf{g}^{(k+1)}_u)$
  end if
  error $\leftarrow \|\mathbf{z} - \mathbf{w}\|_2$
  $\mathbf{w} \leftarrow \mathbf{z}$
  $k \leftarrow k + 1$
end while
return $\mathbf{z}$
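For reference, here is a plain (unaccelerated) Dykstra version of the same projection, which is easier to verify than the accelerated listing above; it alternates the two projections of (10), using scikit-learn's isotonic regression in place of a hand-rolled PAV, and is our own sketch rather than the authors' implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def lmr(y, x, iters=200):
    """Lipschitz monotonic regression (10) by Dykstra's alternating projections.

    Projects y onto {g : 0 <= g_{j+1} - g_j <= x_{j+1} - x_j} (x assumed sorted).
    """
    s = np.concatenate(([0.0], np.cumsum(np.diff(x))))   # s = cusum(t), t = diffs of x
    iso = IsotonicRegression()
    g = y.astype(float).copy()
    p = np.zeros_like(g)
    q = np.zeros_like(g)
    for _ in range(iters):
        # Projection 1: monotone non-decreasing.
        g1 = iso.fit_transform(x, g + p)
        p = g + p - g1
        # Projection 2: differences bounded above by those of x (shift, negate, PAV, undo).
        g2 = s - iso.fit_transform(x, s - (g1 + q))
        q = g1 + q - g2
        g = g2
    return g

x = np.sort(np.random.default_rng(0).uniform(0, 5, 100))
y = np.tanh(x - 2.5) + 0.1 * np.random.default_rng(1).normal(size=100)
g = lmr(y, x)
d = np.diff(g)
print(float(d.min()), float((d - np.diff(x)).max()))   # ~0 from below, <= 0 up to tolerance
```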
B. Optimization Algorithms

With LMR in hand, we can outline the algorithm for solving the convex problem (17). This procedure can be performed for general $h_1$ and $h_2$, but proximal mappings are efficient to compute for many common regularizers. Thus, we describe the basic algorithm using gradient-based proximal methods (e.g., accelerated gradient [26] and quasi-Newton [27]), which require the ability to compute a gradient and the proximal mapping.

In addition, we can compute the objective function value for purposes of backtracking [28], evaluating algorithm convergence, or computing error (e.g., for validation or testing purposes). While our optimization outputs an estimate of $g$, the objective function depends on the value of $G$ and its conjugate $G^*$. Though we cannot necessarily determine $G$ or $G^*$ uniquely, since $G + c$ and $G$ both yield $g$ as a gradient for any $c \in \mathbb{R}$, the value of $G(x) + G^*(y)$ is unique for a fixed $g$. To see this, consider $\widetilde{G} = G + c$ for some constant $c$. Then,

$$
\widetilde{G}(x) + \widetilde{G}^*(y) = G(x) + c + \max_z\big[ zy - \widetilde{G}(z) \big]
= G(x) + c + \max_z\big[ zy - G(z) - c \big]
= G(x) + \max_z\big[ zy - G(z) \big]
= G(x) + G^*(y). \tag{30}
$$

This allows us to compute the objective function by performing the cumulative numerical integral of $\widehat{g}$ on the points $\mathrm{v}(\boldsymbol{\Theta})$ (e.g., G=cumtrapz(theta,ghat) in Matlab). Then, a discrete convex conjugation (also known as the discrete Legendre transform (DLT) [29]) computes $G^*$.

We did not notice significant differences in prediction accuracy between our implementation using quasi-Newton methods with backtracking and an accelerated proximal gradient method, possibly because both algorithms were run until convergence. Thus, we show only one set of results. However, we note that the runtime was improved by the quasi-Newton and backtracking implementation.

Algorithm 2
Single Index Latent Variable (SILVar) Learning

Initialize $\widehat{\mathbf{A}} = \mathbf{0}$, $\widehat{\mathbf{L}} = \mathbf{0}$.
while not converged do (one iteration of the chosen proximal method)
  Compute gradients:
    $\boldsymbol{\Theta} \leftarrow (\widehat{\mathbf{A}} + \widehat{\mathbf{L}})\mathbf{X}$
    $\widehat{\mathbf{g}} \leftarrow \mathrm{LMR}(\mathrm{v}(\mathbf{Y}), \mathrm{v}(\boldsymbol{\Theta}))$
    $\nabla_{\mathbf{A}} F = \nabla_{\mathbf{L}} F = \sum_{i \in \mathcal{I}} \big( \widehat{g}(\boldsymbol{\theta}_i) - \mathbf{y}_i \big)\mathbf{x}_i^\top$
  Optionally compute function value:
    $\widehat{G} \leftarrow \mathrm{cumtrapz}(\mathrm{v}(\boldsymbol{\Theta}), \widehat{\mathbf{g}})$
    $\widehat{G}^* \leftarrow \mathrm{DLT}(\mathrm{v}(\boldsymbol{\Theta}), \widehat{G}, \mathrm{v}(\mathbf{Y}))$
    $\widehat{F} = \sum_{ij} \widehat{G}^*(y_{ij}) + \widehat{G}(\theta_{ij}) - y_{ij}\theta_{ij}$
end while
return $(\widehat{g}, \widehat{\mathbf{A}}, \widehat{\mathbf{L}})$

Algorithm 2 describes the learning procedure and details the main function and gradient computations while assuming a proximal operator is given. The computation of the gradient and the update vector depends on the particular variation of proximal method utilized; with stochastic gradients, the set
$\mathcal{I} \subset \{1, \ldots, n\}$ could be pseudorandomly generated at each iteration, while with standard gradients, $\mathcal{I} = \{1, \ldots, n\}$. The cumtrapz procedure takes as input the coordinates of points describing the function $\widehat{g}$. The DLT procedure takes as its first two inputs the coordinates of points describing the function $\widehat{G}$, and as its third input the points at which to evaluate the convex conjugate $\widehat{G}^*$.

Here, an observant reader may notice the subtle point that the function $G^*$ may only be defined inside a finite or semi-infinite interval. This can occur if the function $g$ is bounded below and/or above, so that $G$ has a minimum/maximum slope, and $G^*$ is infinite for any values below that minimum or above that maximum slope of $G$ (to see this, one may refer back to the definition of conjugation (2)). Fortunately, this does not invalidate our method. It is straightforward to see that $\widehat{g}(x) = x$ is always a feasible solution and that the corresponding $\widehat{G}^*(y) = y^2/2$ is defined for all $y$; thus, starting from this solution, with appropriately chosen step sizes, gradient-based algorithms will avoid the solutions that make the objective function infinite. Furthermore, even if new data $y_j$ falls outside the valid domain for the learned $\widehat{G}^*$ and we incur an "infinite" loss using the model, evaluating $\widehat{g}\big( (\widehat{\mathbf{A}} + \widehat{\mathbf{L}})\mathbf{x}_j \big)$ is still well defined. This problem is not unique to SIMs, as assuming a fixed link function $g$ in a GLM can also incur infinite loss if new data does not conform to modeling assumptions implicit to the link function (e.g., the log-likelihood for a negative $y$ under a non-negative distribution), and making a prediction using the learned GLM is still well defined. Practically, the loss for SILVar computed using the DLT may be large, but will not be infinite [29].
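The sketch below shows one way to carry out this function-value evaluation numerically (our own illustration, not the paper's code): integrate the fitted $\widehat{g}$ values with a cumulative trapezoidal rule to get $\widehat{G}$ on the sorted points, then take a discrete Legendre transform by maximizing $zy - \widehat{G}(z)$ over the sampled $z$.

```python
import numpy as np

def objective_from_ghat(theta, g_hat, y):
    """Evaluate sum_i G*(y_i) + G(theta_i) - y_i*theta_i from sampled (theta, g_hat).

    theta, g_hat, y are 1-D arrays of equal length; G is recovered by cumulative
    trapezoidal integration of g_hat over the sorted theta (G is only defined up
    to a constant, which cancels in G + G*, as in (30))."""
    order = np.argsort(theta)
    t, g = theta[order], g_hat[order]
    G = np.concatenate(([0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * np.diff(t))))
    # Discrete Legendre transform: G*(y) ~= max_z  z*y - G(z) over sampled z = t.
    G_star = np.max(np.outer(y, t) - G[None, :], axis=1)
    G_at_theta = G[np.argsort(order)]              # map G back to the original ordering
    return np.sum(G_star + G_at_theta - y * theta)

# Sanity check with the identity link: G(t) = t^2/2, G*(y) = y^2/2,
# so the objective should be close to sum (y - theta)^2 / 2.
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
y = theta + 0.1 * rng.normal(size=2000)
print(objective_from_ghat(theta, g_hat=theta, y=y), 0.5 * np.sum((y - theta) ** 2))
```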
V. PERFORMANCE
We now provide conditions under which the optimization recovers the parameters of the model.

First, let us establish a few additional notations. Let $(\widetilde{g}, \widetilde{\mathbf{A}}, \widetilde{\mathbf{L}})$ be the best Lipschitz monotonic function, sparse matrix, and low-rank matrix that model the true data generation. Let $\widetilde{\mathbf{A}}$ be sparse on the index set $S$ of size $|S| = s_A$, and let $\widetilde{\mathbf{L}} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top$ be the SVD of $\widetilde{\mathbf{L}}$, where $\mathbf{U} \in \mathbb{R}^{M \times r_L}$, $\boldsymbol{\Lambda} \in \mathbb{R}^{r_L \times r_L}$, and $\mathbf{V} \in \mathbb{R}^{N \times r_L}$, with $r_L$ the rank of $\widetilde{\mathbf{L}}$. Then let $\mathbf{A}_S$ be equal to $\mathbf{A}$ on $S$ and $\mathbf{0}$ on $S^c$ (so that $\widetilde{\mathbf{A}}_S = \widetilde{\mathbf{A}}$ and $\mathbf{A}_S + \mathbf{A}_{S^c} = \mathbf{A}$), and let $\mathbf{L}_R = \mathbf{L} - (\mathbf{I} - \mathbf{U}\mathbf{U}^\top)\mathbf{L}(\mathbf{I} - \mathbf{V}\mathbf{V}^\top)$ and $\mathbf{L}_{R^c} = \mathbf{L} - \mathbf{L}_R$ (so that $\widetilde{\mathbf{L}}_R = \widetilde{\mathbf{L}}$ and $\mathbf{L}_{R^c} + \mathbf{L}_R = \mathbf{L}$). Consider the set of approximately $S$-sparse and $R$-low-rank matrices $\mathcal{B}_\gamma(S, R) = \{ (\boldsymbol{\Phi}, \boldsymbol{\Psi}) : \gamma\|\boldsymbol{\Phi}_{S^c}\|_1 + \|\boldsymbol{\Psi}_{R^c}\|_* \le 3( \gamma\|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_* ) \}$. Intuitively, this is the set of matrices for which the energy of $\boldsymbol{\Phi}$ is on the same sparsity set as $\widetilde{\mathbf{A}}$ and, similarly, the energy of $\boldsymbol{\Psi}$ is in the same low-rank space as $\widetilde{\mathbf{L}}$. Finally, let the marginalization of the loss functional w.r.t. $g$ be $\widehat{m}(\mathbf{Y}, \mathbf{X}, \mathbf{A}) = \min_{g \in \mathcal{C}_1} \widehat{F}(\mathbf{Y}, \mathbf{X}, g, \mathbf{A})$, let the link function achieving the marginalized value at the true matrix parameters be $\widehat{\widetilde{g}} = \operatorname*{argmin}_{g \in \mathcal{C}_1} \widehat{F}(\mathbf{Y}, \mathbf{X}, g, \widetilde{\mathbf{A}} + \widetilde{\mathbf{L}})$, let the Hessian of the marginalized functional w.r.t. the matrix parameter be $\widetilde{\mathcal{I}} = \nabla^2_{\mathrm{v}(\mathbf{A})}\,\widehat{m}(\mathbf{Y}, \mathbf{X}, \widetilde{\mathbf{A}} + \widetilde{\mathbf{L}})$, and denote the quadratic form in shorthand $\|\mathbf{v}\|_{\widetilde{\mathcal{I}}}^2 = \mathbf{v}^\top\widetilde{\mathcal{I}}\mathbf{v}$.

Now, consider the following assumptions:

1) There exists some $\alpha > 0$ such that
$$\|\boldsymbol{\Phi} + \boldsymbol{\Psi}\|_{\widetilde{\mathcal{I}}}^2 \ge \alpha\|\boldsymbol{\Phi} + \boldsymbol{\Psi}\|_F^2 \quad \forall (\boldsymbol{\Phi}, \boldsymbol{\Psi}) \in \mathcal{B}_\gamma(S, R).$$
This is essentially a Restricted Strong Convexity condition standard for structured recovery [30].

2) Let $\widehat{\widetilde{\boldsymbol{\Gamma}}} \in \mathbb{R}^{M \times K}$ be the matrix with columns given by
$$\widehat{\widetilde{\boldsymbol{\Gamma}}}_k = \widehat{\widetilde{g}}\big( (\widetilde{\mathbf{A}} + \widetilde{\mathbf{L}})\mathbf{x}_k \big) - \mathbf{y}_k, \quad k = 1, \ldots, K.$$
Then for some $\lambda > 0$ and $\gamma > 0$,
$$\frac{1}{K}\Big\| \widehat{\widetilde{\boldsymbol{\Gamma}}}\mathbf{X}^\top \Big\|_\infty \le \frac{\lambda\gamma}{2}, \qquad \frac{1}{K}\Big\| \widehat{\widetilde{\boldsymbol{\Gamma}}}\mathbf{X}^\top \Big\|_2 \le \frac{\lambda}{2},$$
where $\|\mathbf{A}\|_\infty$ is the largest element of $\mathbf{A}$ in magnitude, and $\|\mathbf{A}\|_2$ is the largest singular value of $\mathbf{A}$. This states that the error is not too powerful and is not too correlated with the regression variables, and it is also similar to standard conditions for structured recovery.

3) Let $\tau = \gamma\sqrt{s_A / r_L}$ and let $\mu$ be a corresponding constant depending only on $r_L$ and $(1 + \tau)$; then
$$\frac{\max\big( \|\boldsymbol{\Phi}\|_2,\ \|\boldsymbol{\Psi}\|_\infty/\gamma \big)}{\gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_*} \le \mu \quad \forall (\boldsymbol{\Phi}, \boldsymbol{\Psi}) \in \mathcal{B}_\gamma(S, R).$$
This is a condition on the incoherence between the sparsity set $S$ and the low-rank subspace $R$ (see [9], [17], [31], [32] for other similar variations on rank–sparsity incoherence considered previously). Basically, this states that $S$ should be such that $S$-sparse matrices are not approximately low-rank, while $R$ should be such that $R$-low-rank matrices are not approximately sparse.

Then we have the following result.

Theorem 3.
Under Assumptions 1–3, the solution to the optimization problem (21) satisfies

$$\|\widehat{\mathbf{A}} - \widetilde{\mathbf{A}}\|_F \le c_A(\tau)\,\frac{\lambda\gamma\sqrt{s_A}}{\alpha}, \tag{31}$$
$$\|\widehat{\mathbf{L}} - \widetilde{\mathbf{L}}\|_F \le c_L(\tau)\,\frac{\lambda\sqrt{r_L}}{\alpha}, \tag{32}$$

where $c_A(\tau)$ and $c_L(\tau)$ are factors depending only on $\tau$ and bounded above by absolute constants.

We give the proof for this theorem in the Appendix. The theorem relates the performance of the optimization to conditions on the problem parameters. Of course, in practice these conditions are difficult to check, since they require knowledge of ground truth. Even given ground truth, verifying Assumption 3 requires solving another separate non-convex optimization problem. Thus, future directions for work could include investigation into which kinds of sparse matrix (or network) models for $\mathbf{A}$ and low-rank models $\mathbf{L}$ produce favorable scaling regimes in the problem parameters with high probability.

VI. EXPERIMENTS
We study the performance of the algorithm via simulations on synthetically generated data as well as real data. In these experiments, we show the different regression settings, as discussed in Section III-B, under which the SILVar model can be applied.
A. Synthetic Data
The synthetic data corresponds to a multi-task regression setting. The data was generated by first creating a random sparse matrix $\mathbf{A}_f = (\mathbf{A}\ \mathbf{B})$, where $\mathbf{B} \in \mathbb{R}^{m \times H}$, $H = \lfloor hp \rfloor$ is the number of hidden variables, and $h$ is the proportion of hidden variables. The elements of $\mathbf{A}_f$ were first generated as i.i.d. Normal variables, then a sparse mask was applied to $\mathbf{A}$ to choose $\lfloor m\log(p) \rfloor$ non-zeros, and $\mathbf{B}$ was scaled by a factor of $1/\sqrt{H}$. Next, the data $\mathbf{X}_f = (\mathbf{X}^\top\ \mathbf{Z}^\top)^\top$ were generated with each vector drawn i.i.d. from a multivariate Gaussian with mean $\mathbf{0}$ and covariance matrix $\boldsymbol{\Sigma}_f \in \mathbb{R}^{(p+H)\times(p+H)}$. This was achieved by creating $\boldsymbol{\Sigma}_f^{1/2}$ by thresholding a matrix with diagonal entries of 1 and off-diagonal entries drawn from an i.i.d. uniform distribution to be greater than 0.35 in magnitude; $\boldsymbol{\Sigma}_f^{1/2}$ was then pre-multiplied with a matrix of i.i.d. Normal variables. Finally, we generated $\mathbf{Y} = \mathbf{g}(\mathbf{A}_f\mathbf{X}_f) + \mathbf{W}$ for 2 different link functions, $g_1(x) = \log(1 + e^{c_1 x})$ and $g_2(x) = 2/(1 + e^{-c_2 x}) - 1$, where $\mathbf{W}$ is added i.i.d. Gaussian noise. Then, the task is to learn $(\widehat{g}, \widehat{\mathbf{A}}, \widehat{\mathbf{L}})$, the SILVar model (16), from $(\mathbf{Y}, \mathbf{X})$ without access to $\mathbf{Z}$. As comparison, we use an Oracle GLM model, in which the true $g$ is given but the parameters $(\widehat{\mathbf{A}}_o, \widehat{\mathbf{L}}_o)$ still need to be learned. We note that while there is a slight mismatch between the true and assumed noise distributions, the task is nonetheless difficult, yet our estimation using the SILVar model can still exhibit good performance with respect to the Oracle.
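A rough sketch of this generation process is given below (our own illustration; constants that are not fully specified in the description above, such as the sparsity count, noise level, and thresholding details, are placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, h, n = 25, 50, 0.1, 200              # outputs, observed inputs, hidden proportion, samples
H = int(np.floor(h * p))                   # number of hidden variables

# Sparse observed part A and dense hidden part B, scaled by 1/sqrt(H).
A = rng.normal(size=(m, p))
mask = np.zeros(m * p, dtype=bool)
mask[rng.choice(m * p, size=int(m * np.log(p)), replace=False)] = True   # sparsity level is illustrative
A *= mask.reshape(m, p)
B = rng.normal(size=(m, H)) / np.sqrt(H)
A_f = np.hstack([A, B])

# Correlated Gaussian regressors (X; Z), via a random square-root factor.
S = rng.uniform(-0.5, 0.5, size=(p + H, p + H))
S[np.abs(S) < 0.35] = 0.0
np.fill_diagonal(S, 1.0)
Sigma_half = rng.normal(size=(p + H, p + H)) @ S
X_f = Sigma_half @ rng.normal(size=(p + H, n))

# Observations through a monotone link (softplus here), plus Gaussian noise.
g = lambda t: np.logaddexp(0.0, t)
Y = g(A_f @ X_f) + 0.1 * rng.normal(size=(m, n))
X, Z = X_f[:p], X_f[p:]                    # only (Y, X) are handed to the learner
```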
The experiments were carried out across a range of problem sizes. The dimension of $\mathbf{y}_i$ was fixed at $m = 25$. The dimension of the observed $\mathbf{x}_i$ was set to $p \in \{25, 50\}$, and the proportion of hidden $\mathbf{z}_i$ was set to $h \in \{0.1, 0.2\}$. The number of data samples $n$ was varied over a range of values (see Figure 1). Validation was performed using a separate set of independently generated samples, using the same $\mathbf{A}_f$ and $g$, and a logarithmically spaced grid of $(\lambda_S, \lambda_L)$ values. The whole process of data generation and model learning was repeated 20 times for each experimental condition. The average $\ell_2$ errors between the true $\mathbf{A}$ and the best estimated $\widehat{\mathbf{A}}$ (determined via validation) are shown in Figure 1; the first row is for the function $g_1$ and the second row for the function $g_2$, under the various experimental conditions. The empirical results show that the SILVar model and learning algorithm lose out slightly on performance for estimating the link function $g$ but overall still match the Oracle fairly well in most cases. We also see that SILVar performs better than just the sparse SIM [12], but that the purely sparse Oracle GLM can still provide an improvement over SILVar by knowing the link function. We also see several other intuitive behaviors for both the SILVar and Oracle models: as we increase the amount of data $n$, the performance increases. Note that due to the scaling of the weights $\mathbf{B}$ by the factor of $1/\sqrt{H}$ to keep the average power $\|\mathbf{B}\|_F$ roughly constant, increasing the proportion of hidden variables (from $h = 0.1$ to $h = 0.2$) did not directly lead to significant performance decreases. We also note that $g_2$ seems to be somehow more difficult to estimate than $g_1$, and additionally that the performance of the SILVar model w.r.t. the Oracle model is worse with $g_2$ than with $g_1$. This could have something to do with the 2 saturations in $g_2$ as opposed to the 1 saturation in $g_1$, corresponding in an intuitive sense to a more non-linear setting that contains more information needing to be captured while estimating $\widehat{g}$.
In this setting, we wish to learn the graph capturing relations between the weather patterns at different cities. The data is a real-world multivariate time series consisting of daily temperature measurements (in °F) for consecutive days during the year 2011, taken at cities across the continental USA.

Previously, analysis on this dataset has been performed by first fitting with a low-order polynomial and then estimating a sparse graph from an autoregressive model using a known link function $g(x) = x$, assuming Gaussian noise [8]. Here, we fit the time series using a 2nd order AR SILVar model (20) with the group sparsity regularizer $h_1(\mathbf{A}) = \lambda_1\sum_{i,j}\big\|\big(a_{ij}^{(1)} \ldots a_{ij}^{(M)}\big)\big\|_2$, where $a_{ij}^{(m)}$ is the $ij$ entry of matrix $\mathbf{A}^{(m)}$, and the nuclear norm $h_2(\mathbf{L}) = \lambda_2\sum_{i=1}^M\big\|\mathbf{L}^{(i)}\big\|_*$. We also estimate the underlying trend using (24).

In Figure 2, we plot the original time series, estimated trends, the estimated network, and residuals. The plots cover the time steps at which the trends can be reliably estimated using the simple procedure (24). In Figure 2a, we show the original time series, which clearly exhibit a seasonal trend winter–summer–winter. Figure 2b shows the estimated link function $\widehat{g}$, which turns out to be linear, though a priori we might not intuitively expect a process as complicated as weather to behave this way when sampled so sparsely. Figures 2c and 2d show the full prediction $\widehat{\mathbf{x}}_k = \sum_{i=1}^M (\widehat{\mathbf{A}}^{(i)} + \widehat{\mathbf{L}}^{(i)})\mathbf{x}_{k-i}$ and the residuals $\mathbf{x}_k - \widehat{\mathbf{x}}_k$, respectively. Note the much lower vertical scale in Figure 2d. Figures 2e and 2f show the trend $\widehat{\mathbf{L}}'$ learned using (24) and the time series with the estimated trend removed, respectively. The trend estimation procedure captures the basic shape of the seasonal effects on the temperature. Several of the faster fluctuations in the beginning of the year are captured as well, suggesting that they were caused by some larger scale phenomena. Indeed, there were several notable storm systems that affected the entire USA in the beginning of the year in short succession [33], [34].

Figure 3 compares two networks $\widehat{\mathbf{A}}'$ estimated using SILVar and using just the sparse SIM without accounting for the low-rank trends, both with the same number of non-zeros for display purposes, and where $\widehat{a}'_{ij} = \big\|\big(\widehat{a}_{ij}^{(1)} \ldots \widehat{a}_{ij}^{(M)}\big)\big\|_2$. Figure 3a shows the network $\widehat{\mathbf{A}}'$ estimated using SILVar. The connections imply predictive dependencies between the temperatures in the cities connected by the graph. It is intuitively pleasing that the patterns discovered match well previously established results based on first de-trending the data and then separately estimating a network [8]. That is, we see the effect of the Rocky Mountain chain and the overall west-to-east direction of the weather patterns, matching the prevailing winds. In contrast to that of SILVar, the graph estimated by the sparse SIM, shown in Figure 3b, has many additional connections with no basis in actual weather patterns.
Two particularly unsatisfying cities are sunny Los Angeles, California, with its multiple connections to snowy northern cities including Fargo, North Dakota; and Caribou, Maine, with its multiple connections going far westward against prevailing winds, including to Helena, Montana. These do not show in the graph estimated by SILVar and shown in Figure 3a.

C. Bike Traffic Data
The bike traffic data was obtained from HealthyRide Pittsburgh [35]. The dataset contains the timestamps and station locations of departure and arrival (among other information) for each of 127,559 trips taken between 50 stations within the city from May 31, 2015 to September 30, 2016, a total of 489 days.

We consider the task of using the total number of rides departing from and arriving in each location at 6:00AM–11:00AM to predict the number of rides departing from each location during the peak period of 11:00AM–2:00PM for each day. This corresponds to $\mathbf{Y} \in \mathbb{N}^{50 \times 489}$ and $\mathbf{X} \in \mathbb{N}^{100 \times 489}$, where $\mathbb{N}$ is the set of non-negative integers, and $\mathbf{A}, \mathbf{L} \in \mathbb{R}^{50 \times 100}$. We estimate the SILVar model (16) and compare
its performance against a sparse plus low-rank GLM model with an underlying Poisson distribution and fixed link function $g_{\mathrm{GLM}}(x) = \log(1 + e^x)$. We use several training sample sizes $n$, up to $n = 360$, compute errors on separate validation and test sets, and learn the model on a logarithmically spaced grid of $(\lambda_S, \lambda_L)$ values. We repeat this 10 times for each setting, using an independent set of training samples each time. We compute testing errors in these cases for the optimal $(\lambda_S, \lambda_L)$ with lowest validation errors for both the SILVar and GLM models.

We also demonstrate that the low-rank component of the estimated SILVar model captures something intrinsic to the data. Naturally, we expect people's behavior, and thus traffic, to be different on business days and on non-business days. A standard pre-processing step would be to segment the data along this line and learn two different models. However, as we use the full dataset to learn one single model, we hypothesize that the learned low-rank $\widehat{\mathbf{L}}$ captures some aspects of this underlying behavior.

Figure 4a shows the test Root Mean Squared Errors (RMSEs) for both the SILVar and GLM models for varying training sample sizes, averaged across the 10 trials. We see that the SILVar model outperforms the GLM model by learning the link function in addition to the sparse and low-rank regression matrices. Figure 4b shows an example of the link function learned by the SILVar model with $n = 360$ training samples. Note that the learned SILVar link function performs non-negative clipping of the output, which is consistent with the count-valued nature of the data.

Fig. 1: Different errors for estimating $\mathbf{A}$ on the synthetic data: $\|\widehat{\mathbf{A}} - \mathbf{A}\|/(mp)$ versus the number of samples $n$, for $p \in \{25, 50\}$, $h \in \{0.1, 0.2\}$, and link functions $g_1(x) = \log(1 + e^{cx})$ and $g_2(x) = 2/(1 + e^{-cx}) - 1$, comparing SILVar, the Oracle, the sparse SIM, and the sparse Oracle.

Fig. 2: Temperature time series, learned $\widehat{g}$, one-step prediction, prediction error, estimated trend, and de-trended time series computed using the learned SILVar model.
Fig. 3: Learned weather station graphs: (a) weather graph learned using SILVar; (b) weather graph learned using the sparse SIM (without low-rank trends).
Fig. 4: (a) Root mean squared errors (RMSEs) versus number of training samples $n$ for the SILVar and GLM models; (b) link function $\widehat{g}(\theta)$ learned using the SILVar model.

We also test the hypothesis that the learned $\widehat{\mathbf{L}}$ contains information about whether a day is business or non-business through its relatively low-dimensional effects on the data. We perform the singular value decomposition (SVD) on the optimally learned $\widehat{\mathbf{L}} = \widehat{\mathbf{U}}\widehat{\boldsymbol{\Sigma}}\widehat{\mathbf{V}}^\top$ for $n = 360$ and project the data onto the $r$ top singular components, $\widetilde{\mathbf{X}}_r = \widehat{\boldsymbol{\Sigma}}_r\widehat{\mathbf{V}}_r^\top\mathbf{X}$. We then use $\widetilde{\mathbf{X}}_r$ to train a linear support vector machine (SVM) to classify each day as either a business day or a non-business day, and compare the performance of this lower dimensional feature to that of using the full vector $\mathbf{X}$ to train a linear SVM. If our hypothesis is true, then the performance of the classifier trained on $\widetilde{\mathbf{X}}_r$ should be competitive with that of the classifier trained on $\mathbf{X}$. We use 50 training samples of $\widetilde{\mathbf{X}}_r$ and of $\mathbf{X}$ and test on the remainder of the data. We repeat this 50 times by drawing a new batch of samples each time. We then vary the proportion of business to non-business days in the training sample to trace out a receiver operating characteristic (ROC). In Figure 5, we see the results of training the linear SVM on $\widetilde{\mathbf{X}}_r$
for $r \in \{1, \ldots, 6\}$ and on the full data for classifying business and non-business days. We see that using only the first two singular vectors, the performance is fairly poor. However, by simply taking 3 or 4 singular vectors, the classification performance almost matches that of the full data. Surprisingly, using the top 5 or 6 singular vectors achieves performance greater than that of the full data. This suggests that the projection may even play the role of a de-noising filter in some sense. Since we understand that behavior should correlate well with the day being business or non-business, this competitive performance of the classification using the lower dimensional features strongly suggests that the low-rank $\widehat{\mathbf{L}}$ indeed captures the effects of latent behavioral factors on the data.

Fig. 5: Receiver operating characteristics (ROCs) for classifying each day as a business day or non-business day, using the low-rank embedding provided by $\widehat{\mathbf{L}}$ learned from the SILVar model and using the full data.

Finally, in Figure 6, we plot the $(i, i)$ entries of the optimal $\widehat{\mathbf{A}}$ at $n = 360$. This corresponds to locations for which incoming bike rides at 6:00AM–11:00AM are good predictors of outgoing bike rides at 11:00AM–2:00PM, beyond the effect of latent factors such as day of the week. We may intuitively expect this to correlate with locations that have restaurants open for lunch service, so that people would be likely to ride in for lunch or ride out after lunch. This is confirmed by observing that these stations are in Downtown (-80, 40.44), the Strip District (-79.975, 40.45), Lawrenceville (-79.96, 40.47), and Oakland (-79.96, 40.44), known locations of many restaurants in Pittsburgh. It is especially interesting to note that Oakland, sandwiched between the University of Pittsburgh and Carnegie Mellon University, is included. Even though the target demographic is largely within walking distance, there is a high density of restaurants open for lunch, which may explain its non-zero coefficient. The remainder of the locations with non-zero coefficients $a_{ii}$ are also near high densities of lunch spots, while the other locations with coefficients $a_{ii}$ of zero are largely either near more residential areas or near restaurants more known for dinner or nightlife rather than lunch, such as Shadyside and Southside.

Fig. 6: Intensities of the self-loop at each station.
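A compact sketch of the business-day classification test described above is given below (our own illustration; the shapes mirror the bike-data setup, while the data, labels, and classifier settings are placeholders, not the paper's).

```python
import numpy as np
from sklearn.svm import LinearSVC

def lowrank_features(L_hat, X, r):
    """Project the data onto the top-r singular components of the learned L_hat."""
    U, s, Vt = np.linalg.svd(L_hat, full_matrices=False)
    return (s[:r, None] * Vt[:r]) @ X          # X_tilde_r = Sigma_r V_r^T X

# Hypothetical shapes matching the setup: 50 stations x 2 features, 489 days.
rng = np.random.default_rng(0)
L_hat = rng.normal(size=(50, 100))
X = rng.poisson(3.0, size=(100, 489)).astype(float)
is_business = rng.integers(0, 2, size=489).astype(bool)   # placeholder labels

X_r = lowrank_features(L_hat, X, r=4)
train = rng.choice(489, size=50, replace=False)
test = np.setdiff1d(np.arange(489), train)
clf = LinearSVC().fit(X_r[:, train].T, is_business[train])
print("accuracy:", clf.score(X_r[:, test].T, is_business[test]))
```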
VII. CONCLUSION

Data exhibit complex dependencies, and it is often a challenge to deal with non-linearities and unmodeled effects when attempting to uncover meaningful relationships among various interacting entities that generate the data. We introduce the SILVar model for performing semi-parametric sparse regression and estimating sparse graphs from data under the presence of non-linearities and latent factors or trends. The SILVar model estimates a non-linear link function $g$ as well as structured regression matrices $\mathbf{A}$ and $\mathbf{L}$ in a sparse and low-rank fashion. We justify the form of the model and relate it to existing methods for general PCA, multi-task regression, and vector autoregression. We provide computationally tractable algorithms for learning the SILVar model and demonstrate its performance against existing regression models and methods on both simulated and real data sets, namely 2011 US weather sensor network data and 2015–2016 Pittsburgh bike traffic data. We see from the simulated data that the SILVar model matches the performance of an Oracle GLM that knows the true link function and only needs to estimate $\mathbf{A}$ and $\mathbf{L}$; we show empirically on the temperature data that the learned $\mathbf{L}$ can capture the effects of underlying trends in time series while $\mathbf{A}$ represents a graph consistent with US weather patterns; and we see that, in the bike data, SILVar outperforms a GLM with a fixed link function, the learned $\mathbf{L}$ encodes latent behavioral aspects of the data, and $\mathbf{A}$ discovers notable locations consistent with the restaurant landscape of Pittsburgh.

APPENDIX A
PROOF OF THEOREM 1
Theorem 1.
Assume that $\widehat{g}'(0) \ne 0$ and that $|\widehat{g}''| \le J$ and $|\overline{g}''| \le J$ for some $J < \infty$. Furthermore, assume in models (14) and (15) that

$$\max_j\big( \|\widehat{\mathbf{a}}_j\|_1^2,\ (\|\overline{\mathbf{a}}_j\|_1 + \|\overline{\mathbf{b}}_j\|_2)^2 \big) \le k^2, \qquad \max\big( \mathbb{E}[\|\mathbf{x}\|_2\|\mathbf{x}\|_\infty^2],\ \mathbb{E}[\|\mathbf{x}\|_2\|\mathbf{x}\|_\infty\|\mathbf{z}\|_2],\ \mathbb{E}[\|\mathbf{x}\|_2\|\mathbf{z}\|_2^2] \big) \le s_{Nr},$$

where subscripts in the parameters on the RHS indicate that the bounds may grow with the values in the subscript. Then, the parameters $\widehat{\mathbf{A}}$ and $\overline{\mathbf{A}}$ from models (14) and (15) are related as

$$\widehat{\mathbf{A}} = q\,(\overline{\mathbf{A}} + \overline{\mathbf{L}}) + \mathbf{E},$$

where $q = \overline{g}'(0)/\widehat{g}'(0)$, $\boldsymbol{\mu}_x = \mathbb{E}[\mathbf{x}_i]$,

$$\overline{\mathbf{L}} = \Big( \overline{\mathbf{B}}\,\mathbb{E}[\mathbf{z}\mathbf{x}^\top] + \big(\overline{\mathbf{g}}(\mathbf{0}) - \widehat{\mathbf{g}}(\mathbf{0})\big)\boldsymbol{\mu}_x^\top \Big)\big( \mathbb{E}[\mathbf{x}\mathbf{x}^\top] \big)^\dagger \ \Rightarrow\ \mathrm{rank}(\overline{\mathbf{L}}) \le r + 1,$$

and $\mathbf{E} = \widehat{\mathbf{A}} - q(\overline{\mathbf{A}} + \overline{\mathbf{L}})$ is bounded as

$$\frac{1}{MN}\|\mathbf{E}\|_F \le \frac{J\sigma_\ell\sqrt{N}}{\widehat{g}'(0)\,M}\, s_{Nr}\,k^2,$$

where $\sigma_\ell = \big\|\big(\mathbb{E}[\mathbf{x}\mathbf{x}^\top]\big)^\dagger\big\|_2$, the largest singular value of the pseudo-inverse of the covariance.

Proof. We begin with the Taylor series expansion, using the Lagrange form of the remainder,

$$\mathbb{E}\Big[ \Big( \widehat{\mathbf{g}}(\mathbf{0}) + \widehat{g}'(0)\widehat{\mathbf{A}}\mathbf{x} - \overline{\mathbf{g}}(\mathbf{0}) - \overline{g}'(0)(\overline{\mathbf{A}}\mathbf{x} + \overline{\mathbf{B}}\mathbf{z}) \Big)\mathbf{x}^\top \Big] = \mathbb{E}\Big[ \tfrac{1}{2}\Big( \widehat{\mathbf{g}}''(\boldsymbol{\xi}) \odot (\widehat{\mathbf{A}}\mathbf{x})^{.\wedge 2} - \overline{\mathbf{g}}''(\boldsymbol{\eta}) \odot (\overline{\mathbf{A}}\mathbf{x} + \overline{\mathbf{B}}\mathbf{z})^{.\wedge 2} \Big)\mathbf{x}^\top \Big], \tag{33}$$

where $|\xi_j| \le |\widehat{\mathbf{a}}_j\mathbf{x}|$, $|\eta_j| \le |\overline{\mathbf{a}}_j\mathbf{x} + \overline{\mathbf{b}}_j\mathbf{z}|$, $\overline{\mathbf{B}} = (\overline{\mathbf{b}}_1 \ldots \overline{\mathbf{b}}_m)^\top$ similarly to before, $\mathbf{x} \odot \mathbf{y}$ denotes the element-wise (Hadamard) product, and $\mathbf{x}^{.\wedge 2} = \mathbf{x} \odot \mathbf{x}$ denotes element-wise squaring. First, let us consider the quantity

$$(\overline{\mathbf{a}}_j\mathbf{x} + \overline{\mathbf{b}}_j\mathbf{z})^2 = (\overline{\mathbf{a}}_j\mathbf{x})^2 + 2\,\overline{\mathbf{a}}_j\mathbf{x}\,\overline{\mathbf{b}}_j\mathbf{z} + (\overline{\mathbf{b}}_j\mathbf{z})^2 \le \big( \|\overline{\mathbf{a}}_j\|_1\|\mathbf{x}\|_\infty + \|\overline{\mathbf{b}}_j\|_2\|\mathbf{z}\|_2 \big)^2.$$

Then,

$$
\begin{aligned}
\big\| \big( (\overline{\mathbf{A}}\mathbf{x} + \overline{\mathbf{B}}\mathbf{z})^{.\wedge 2} \big)\mathbf{x}^\top \big\|_F
&\le \big\| (\overline{\mathbf{A}}\mathbf{x} + \overline{\mathbf{B}}\mathbf{z})^{.\wedge 2} \big\|_2\,\|\mathbf{x}\|_2
\le \|\mathbf{x}\|_2\sqrt{\sum_{j=1}^N (\overline{\mathbf{a}}_j\mathbf{x} + \overline{\mathbf{b}}_j\mathbf{z})^4} \\
&\overset{(a)}{\le} \|\mathbf{x}\|_2\sum_{j=1}^N (\overline{\mathbf{a}}_j\mathbf{x} + \overline{\mathbf{b}}_j\mathbf{z})^2
\le \|\mathbf{x}\|_2\sum_{j=1}^N \big( \|\overline{\mathbf{a}}_j\|_1\|\mathbf{x}\|_\infty + \|\overline{\mathbf{b}}_j\|_2\|\mathbf{z}\|_2 \big)^2 \\
\Rightarrow\ \mathbb{E}\Big[\big\| \big( (\overline{\mathbf{A}}\mathbf{x} + \overline{\mathbf{B}}\mathbf{z})^{.\wedge 2} \big)\mathbf{x}^\top \big\|_F\Big]
&\le s_{Nr}\sum_{j=1}^N \big( \|\overline{\mathbf{a}}_j\|_1 + \|\overline{\mathbf{b}}_j\|_2 \big)^2 \le s_{Nr}\,N k^2,
\end{aligned}
$$

where (a) follows from the fact that $\|\mathbf{x}\|_2 \le \|\mathbf{x}\|_1\ \forall\,\mathbf{x} \in \mathbb{R}^N$. Similarly, we have

$$\mathbb{E}\Big[\big\| (\widehat{\mathbf{A}}\mathbf{x})^{.\wedge 2}\,\mathbf{x}^\top \big\|_F\Big] \le \mathbb{E}\Big[\|\mathbf{x}\|_2\sum_{j=1}^N \big( \|\widehat{\mathbf{a}}_j\|_1\|\mathbf{x}\|_\infty \big)^2\Big] \le s_{Nr}\,N k^2.$$

To finish, substituting the definition of $\mathbf{E}$ and $q$ into (33), we have

$$\|\mathbf{E}\|_F \le \frac{J}{2}\,\mathbb{E}\Big[\big\| \big( (\widehat{\mathbf{A}}\mathbf{x}_i)^{.\wedge 2} - (\overline{\mathbf{A}}\mathbf{x}_i + \overline{\mathbf{B}}\mathbf{z}_i)^{.\wedge 2} \big)\mathbf{x}_i^\top \big\|_F\Big]\,\Big\| \big( \widehat{g}'(0)\,\mathbb{E}[\mathbf{x}_i\mathbf{x}_i^\top] \big)^\dagger \Big\|_F \le \frac{J\sigma_\ell}{\widehat{g}'(0)}\, s_{Nr}\,k^2\,N\sqrt{N}.$$

APPENDIX B
PROOF OF THEOREM 3

Throughout this appendix, let $\boldsymbol{\Phi} = \widehat{\mathbf{A}} - \widetilde{\mathbf{A}}$ and $\boldsymbol{\Psi} = \widehat{\mathbf{L}} - \widetilde{\mathbf{L}}$.

A. Propositions
APPENDIX B
PROOF OF THEOREM 3

Let $\boldsymbol{\Phi} = \hat{\mathbf{A}} - \tilde{\mathbf{A}}$ and $\boldsymbol{\Psi} = \hat{\mathbf{L}} - \tilde{\mathbf{L}}$.

A. Propositions

Proposition 4.1.
Under Assumption 2, the solution $(\hat{\mathbf{A}}, \hat{\mathbf{L}})$ to the optimization problem (21) satisfies
$$\gamma \|\boldsymbol{\Phi}_{S^c}\|_1 + \|\boldsymbol{\Psi}_{R^c}\|_* \le 3\big( \gamma \|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_* \big),$$
$$\gamma \|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \le 4\sqrt{r_L}\,\big( \tau \|\boldsymbol{\Phi}\|_F + \|\boldsymbol{\Psi}\|_F \big).$$

Proof.
We start with the convexity of the marginalized objective functional at $(\tilde{\mathbf{A}}, \tilde{\mathbf{L}})$,
$$m(\hat{\mathbf{A}} + \hat{\mathbf{L}}) \ge m(\tilde{\mathbf{A}} + \tilde{\mathbf{L}}) + \operatorname{tr}\!\Big( \nabla m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}})^\top \big( \hat{\mathbf{A}} + \hat{\mathbf{L}} - (\tilde{\mathbf{A}} + \tilde{\mathbf{L}}) \big) \Big),$$
where $\nabla m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}})$ denotes the gradient of the marginalized objective at $\tilde{\mathbf{A}}+\tilde{\mathbf{L}}$. Then consider the optimality of the full objective functional at $(\hat{g}, \hat{\mathbf{A}}, \hat{\mathbf{L}})$,
$$m(\hat{\mathbf{A}}+\hat{\mathbf{L}}) + \lambda\big( \gamma\|\hat{\mathbf{A}}\|_1 + \|\hat{\mathbf{L}}\|_* \big)
= \hat{F}(\mathbf{Y}, \mathbf{X}, \hat{g}, \hat{\mathbf{A}}+\hat{\mathbf{L}}) + \lambda\big( \gamma\|\hat{\mathbf{A}}\|_1 + \|\hat{\mathbf{L}}\|_* \big)$$
$$\le \hat{F}(\mathbf{Y}, \mathbf{X}, \tilde{g}, \tilde{\mathbf{A}}+\tilde{\mathbf{L}}) + \lambda\big( \gamma\|\tilde{\mathbf{A}}\|_1 + \|\tilde{\mathbf{L}}\|_* \big)
= m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}}) + \lambda\big( \gamma\|\tilde{\mathbf{A}}\|_1 + \|\tilde{\mathbf{L}}\|_* \big)$$
$$\Rightarrow\; m(\hat{\mathbf{A}}+\hat{\mathbf{L}}) - m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}}) \le \lambda\big( \gamma\|\tilde{\mathbf{A}}\|_1 + \|\tilde{\mathbf{L}}\|_* \big) - \lambda\big( \gamma\|\hat{\mathbf{A}}\|_1 + \|\hat{\mathbf{L}}\|_* \big)$$
$$\Rightarrow\; \operatorname{tr}\!\Big( \nabla m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}})^\top \big( \hat{\mathbf{A}}+\hat{\mathbf{L}} - (\tilde{\mathbf{A}}+\tilde{\mathbf{L}}) \big) \Big) \le \lambda\Big( \gamma\big( \|\tilde{\mathbf{A}}\|_1 - \|\hat{\mathbf{A}}\|_1 \big) + \|\tilde{\mathbf{L}}\|_* - \|\hat{\mathbf{L}}\|_* \Big),$$
where the last inequality utilizes convexity of the marginalized objective. Then, using Assumption 2,
$$-\tfrac{\lambda\gamma}{2}\|\boldsymbol{\Phi}\|_1 - \tfrac{\lambda}{2}\|\boldsymbol{\Psi}\|_* \le \lambda\gamma\big( \|\tilde{\mathbf{A}}\|_1 - \|\hat{\mathbf{A}}\|_1 \big) + \lambda\big( \|\tilde{\mathbf{L}}\|_* - \|\hat{\mathbf{L}}\|_* \big)$$
$$\Rightarrow\; 0 \le \tfrac{\gamma}{2}\|\boldsymbol{\Phi}\|_1 + \gamma\big( \|\tilde{\mathbf{A}}\|_1 - \|\hat{\mathbf{A}}\|_1 \big) + \tfrac{1}{2}\|\boldsymbol{\Psi}\|_* + \big( \|\tilde{\mathbf{L}}\|_* - \|\hat{\mathbf{L}}\|_* \big)$$
$$\Rightarrow\; 0 \le \tfrac{\gamma}{2}\big( \|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Phi}_{S^c}\|_1 \big) + \gamma\big( \|\boldsymbol{\Phi}_S\|_1 - \|\boldsymbol{\Phi}_{S^c}\|_1 \big) + \tfrac{1}{2}\big( \|\boldsymbol{\Psi}_R\|_* + \|\boldsymbol{\Psi}_{R^c}\|_* \big) + \big( \|\boldsymbol{\Psi}_R\|_* - \|\boldsymbol{\Psi}_{R^c}\|_* \big)$$
$$\Rightarrow\; 0 \le -\tfrac{1}{2}\big( \gamma\|\boldsymbol{\Phi}_{S^c}\|_1 + \|\boldsymbol{\Psi}_{R^c}\|_* \big) + \tfrac{3}{2}\big( \gamma\|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_* \big),$$
where the penultimate inequality arises from decomposability of the norm. Specifically,
$$\big| [\tilde{\mathbf{A}}]_s \big| - \big| [\hat{\mathbf{A}}]_s \big| \le \big| [\tilde{\mathbf{A}}]_s - [\hat{\mathbf{A}}]_s \big| = \big| [\boldsymbol{\Phi}]_s \big|, \quad s \in S,$$
$$\big| [\tilde{\mathbf{A}}]_s \big| - \big| [\hat{\mathbf{A}}]_s \big| = -\big| [\hat{\mathbf{A}}]_s \big| = -\big| [\boldsymbol{\Phi}]_s \big|, \quad s \notin S.$$
Thus, we have the first inequality,
$$\gamma\|\boldsymbol{\Phi}_{S^c}\|_1 + \|\boldsymbol{\Psi}_{R^c}\|_* \le 3\big( \gamma\|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_* \big).$$
Finally, for the second inequality,
$$\gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \le 4\big( \gamma\|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_* \big) \le 4\big( \gamma\sqrt{s_A}\|\boldsymbol{\Phi}_S\|_F + \sqrt{r_L}\|\boldsymbol{\Psi}_R\|_F \big) \le 4\sqrt{r_L}\big( \tau\|\boldsymbol{\Phi}_S\|_F + \|\boldsymbol{\Psi}_R\|_F \big) \le 4\sqrt{r_L}\big( \tau\|\boldsymbol{\Phi}\|_F + \|\boldsymbol{\Psi}\|_F \big).$$
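The final chain above converts the restricted $\ell_1$ and nuclear norms into Frobenius norms. The two elementary facts being used, $\|\cdot\|_1 \le \sqrt{s_A}\,\|\cdot\|_F$ for a matrix with at most $s_A$ nonzeros and $\|\cdot\|_* \le \sqrt{r_L}\,\|\cdot\|_F$ for a matrix of rank at most $r_L$, can be spot-checked numerically. The sketch below is purely illustrative; the dimensions and the sparsity and rank levels are arbitrary choices:

```python
import numpy as np

# Spot-check (illustrative) of the norm-compatibility bounds used in the proof:
#   ||Phi_S||_1  <= sqrt(s_A) * ||Phi_S||_F   for a matrix with s_A nonzeros,
#   ||Psi_R||_*  <= sqrt(r_L) * ||Psi_R||_F   for a matrix of rank at most r_L.
rng = np.random.default_rng(1)
m, n, s_A, r_L = 30, 40, 25, 4

Phi_S = np.zeros((m, n))
support = rng.choice(m * n, size=s_A, replace=False)
Phi_S.flat[support] = rng.standard_normal(s_A)                          # s_A nonzero entries

Psi_R = rng.standard_normal((m, r_L)) @ rng.standard_normal((r_L, n))   # rank <= r_L

print(np.abs(Phi_S).sum() <= np.sqrt(s_A) * np.linalg.norm(Phi_S, "fro"))           # True
print(np.linalg.norm(Psi_R, "nuc") <= np.sqrt(r_L) * np.linalg.norm(Psi_R, "fro"))  # True
```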
Proposition 4.2. Under Assumptions 2 and 3, the solution $(\hat{\mathbf{A}}, \hat{\mathbf{L}})$ to the optimization problem (21) satisfies
$$\big| \operatorname{tr}(\boldsymbol{\Phi}^\top \boldsymbol{\Psi}) \big| \le \mu\big( \gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \big)^2.$$

Proof.
This follows directly from Proposition 4.1, Assumption 3, and applying Cauchy-Schwarz twice,
$$\big| \operatorname{tr}(\boldsymbol{\Phi}^\top \boldsymbol{\Psi}) \big| \le \|\boldsymbol{\Psi}\|_\infty \|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \|\boldsymbol{\Phi}\|_2
\le \mu\gamma\big( \gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \big) \|\boldsymbol{\Phi}\|_1 + \mu\big( \gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \big) \|\boldsymbol{\Psi}\|_*
= \mu\big( \gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_* \big)^2.$$
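The first inequality in this proof pairs the element-wise $\infty$-norm with the $\ell_1$ norm and the spectral norm with the nuclear norm. A quick numerical illustration of these two trace pairings follows; the matrices are arbitrary random stand-ins and are not tied to the model in (21):

```python
import numpy as np

# Illustrative check of the two trace pairings used above:
#   |tr(Phi^T Psi)| <= ||Psi||_inf * ||Phi||_1      (element-wise Hoelder pairing)
#   |tr(Phi^T Psi)| <= ||Phi||_2   * ||Psi||_*      (spectral / nuclear duality)
rng = np.random.default_rng(2)
Phi = rng.standard_normal((15, 25))
Psi = rng.standard_normal((15, 25))

lhs = abs(np.trace(Phi.T @ Psi))
print(lhs <= np.abs(Psi).max() * np.abs(Phi).sum())                 # True
print(lhs <= np.linalg.norm(Phi, 2) * np.linalg.norm(Psi, "nuc"))   # True
```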
B. Theorem

We restate the theorem here for convenience.
Theorem 3.
Under Assumptions 1-3, the solution to the optimization problem (21) satisfies
$$\|\hat{\mathbf{A}} - \tilde{\mathbf{A}}\|_F \le \frac{3\lambda\gamma\sqrt{s_A}}{2\alpha}\left( 1 + \sqrt{\frac{1}{2(2+\tau^2)}} \right) \le \frac{3\lambda\gamma\sqrt{s_A}}{\alpha},$$
$$\|\hat{\mathbf{L}} - \tilde{\mathbf{L}}\|_F \le \frac{3\lambda\sqrt{r_L}}{2\alpha}\left( 1 + \sqrt{\frac{\tau^2}{2(1+2\tau^2)}} \right) \le \frac{3\lambda\sqrt{r_L}}{\alpha}.$$

Proof.
Starting with Proposition 4.1, since our solution lies in this restricted set, we can use the stronger convexity condition implied by Assumption 1,
$$m(\hat{\mathbf{A}}+\hat{\mathbf{L}}) \ge m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}}) + \operatorname{tr}\!\Big( \nabla m(\tilde{\mathbf{A}}+\tilde{\mathbf{L}})^\top \big( \hat{\mathbf{A}}+\hat{\mathbf{L}} - (\tilde{\mathbf{A}}+\tilde{\mathbf{L}}) \big) \Big) + \alpha \|\boldsymbol{\Phi}+\boldsymbol{\Psi}\|_F^2.$$
Revisiting the objective functional at optimality and skipping repetitive algebra (see the proof of Proposition 4.1),
$$\alpha\|\boldsymbol{\Phi}+\boldsymbol{\Psi}\|_F^2 \le \frac{\lambda}{2}\Big( 3\big(\gamma\|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_*\big) - \big(\gamma\|\boldsymbol{\Phi}_{S^c}\|_1 + \|\boldsymbol{\Psi}_{R^c}\|_*\big) \Big)
\le \frac{3\lambda}{2}\big(\gamma\|\boldsymbol{\Phi}_S\|_1 + \|\boldsymbol{\Psi}_R\|_*\big)
\le \frac{3\lambda}{2}\sqrt{r_L}\,\big(\tau\|\boldsymbol{\Phi}\|_F + \|\boldsymbol{\Psi}\|_F\big).$$
Now, from Proposition 4.2, we have
$$\|\boldsymbol{\Phi}+\boldsymbol{\Psi}\|_F^2 \ge \|\boldsymbol{\Phi}\|_F^2 + \|\boldsymbol{\Psi}\|_F^2 - 2\big|\operatorname{tr}(\boldsymbol{\Phi}^\top\boldsymbol{\Psi})\big| \ge \|\boldsymbol{\Phi}\|_F^2 + \|\boldsymbol{\Psi}\|_F^2 - 2\mu\big(\gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_*\big)^2.$$
Combining the previous two inequalities, we have
$$\alpha\big(\|\boldsymbol{\Phi}\|_F^2 + \|\boldsymbol{\Psi}\|_F^2\big) - 2\alpha\mu\big(\gamma\|\boldsymbol{\Phi}\|_1 + \|\boldsymbol{\Psi}\|_*\big)^2 \le \frac{3\lambda}{2}\sqrt{r_L}\,\big(\tau\|\boldsymbol{\Phi}\|_F + \|\boldsymbol{\Psi}\|_F\big)$$
$$\Rightarrow\; \alpha\big(\|\boldsymbol{\Phi}\|_F^2 + \|\boldsymbol{\Psi}\|_F^2\big) - 32\alpha\mu r_L\big(\tau\|\boldsymbol{\Phi}\|_F + \|\boldsymbol{\Psi}\|_F\big)^2 \le \frac{3\lambda'}{2}\big(\tau\|\boldsymbol{\Phi}\|_F + \|\boldsymbol{\Psi}\|_F\big)$$
$$\Rightarrow\; \boldsymbol{\zeta}^\top \mathbf{Q}\, \boldsymbol{\zeta} \le 0,$$
where $\boldsymbol{\zeta} = \big(\|\boldsymbol{\Phi}\|_F \;\; \|\boldsymbol{\Psi}\|_F \;\; 1\big)^\top$,
$$\mathbf{Q} = \begin{pmatrix}
\dfrac{\alpha(2+\tau^2)}{2(1+\tau^2)} & -\dfrac{\alpha\tau}{2(1+\tau^2)} & -\dfrac{3\lambda'\tau}{4} \\[6pt]
-\dfrac{\alpha\tau}{2(1+\tau^2)} & \dfrac{\alpha(1+2\tau^2)}{2(1+\tau^2)} & -\dfrac{3\lambda'}{4} \\[6pt]
-\dfrac{3\lambda'\tau}{4} & -\dfrac{3\lambda'}{4} & 0
\end{pmatrix},$$
which is a conic in standard form, $\lambda' = \lambda\sqrt{r_L}$, and the entries of $\mathbf{Q}$ follow from taking $\mu$ as given in Assumption 3. Checking its discriminant,
$$(2+\tau^2)(1+2\tau^2) - \tau^2 = 2(1+\tau^2)^2 > 0,$$
so the conic equation describes an ellipse, and there are bounds on the values of $\|\boldsymbol{\Phi}\|_F$ and $\|\boldsymbol{\Psi}\|_F$.

For these individual bounds, we consider the points at which the gradient of $\boldsymbol{\zeta}^\top\mathbf{Q}\boldsymbol{\zeta}$ vanishes with respect to each of $\|\boldsymbol{\Phi}\|_F$ and $\|\boldsymbol{\Psi}\|_F$. For $\|\boldsymbol{\Phi}\|_F$,
$$\frac{\partial}{\partial\|\boldsymbol{\Phi}\|_F}\,\boldsymbol{\zeta}^\top\mathbf{Q}\boldsymbol{\zeta} = 0 \;\Rightarrow\; \frac{\alpha(2+\tau^2)}{2(1+\tau^2)}\,\|\boldsymbol{\Phi}\|_F^* = \frac{3\lambda'\tau}{4} + \frac{\alpha\tau}{2(1+\tau^2)}\,\|\boldsymbol{\Psi}\|_F^*.$$
Plugging this into the equation defined by $\boldsymbol{\zeta}^\top\mathbf{Q}\boldsymbol{\zeta} = 0$ yields
$$\|\boldsymbol{\Phi}\|_F^* = \frac{3\lambda'\tau}{2\alpha}\left( 1 \pm \sqrt{\frac{1}{2(2+\tau^2)}} \right).$$
Since we are seeking an upper bound for $\|\boldsymbol{\Phi}\|_F$, it can be seen that we take the positive root,
$$\|\boldsymbol{\Phi}\|_F \le \frac{3\lambda'\tau}{2\alpha}\left( 1 + \sqrt{\frac{1}{2(2+\tau^2)}} \right) \le \frac{3\lambda'\tau}{\alpha}.$$
Similarly, for $\|\boldsymbol{\Psi}\|_F$,
$$\frac{\partial}{\partial\|\boldsymbol{\Psi}\|_F}\,\boldsymbol{\zeta}^\top\mathbf{Q}\boldsymbol{\zeta} = 0 \;\Rightarrow\; \frac{\alpha(1+2\tau^2)}{2(1+\tau^2)}\,\|\boldsymbol{\Psi}\|_F^* = \frac{3\lambda'}{4} + \frac{\alpha\tau}{2(1+\tau^2)}\,\|\boldsymbol{\Phi}\|_F^*.$$
Finally, plugging this into $\boldsymbol{\zeta}^\top\mathbf{Q}\boldsymbol{\zeta} = 0$ and solving for the upper bound yields
$$\|\boldsymbol{\Psi}\|_F \le \frac{3\lambda'}{2\alpha}\left( 1 + \sqrt{\frac{\tau^2}{2(1+2\tau^2)}} \right) \le \frac{3\lambda'}{\alpha}.$$
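The ellipse condition that drives this argument can be verified symbolically. The short sketch below is illustrative only; it uses the entries of Q as written above (with the common positive factor $\alpha/(2(1+\tau^2))$ dropped) and checks that the determinant of the quadratic part is positive for every $\tau$, so the level set is indeed an ellipse and the two Frobenius norms are bounded:

```python
import sympy as sp

# Illustrative check of the ellipse condition: the 2x2 quadratic part of Q
# (up to the positive factor alpha / (2*(1 + tau^2))) has determinant proportional
# to (2 + tau^2)*(1 + 2*tau^2) - tau^2, which is positive for all real tau.
tau = sp.symbols("tau", real=True)
quad_part = sp.Matrix([[2 + tau**2, -tau],
                       [-tau, 1 + 2*tau**2]])   # common positive factor dropped
det = sp.expand(quad_part.det())
print(det)                                      # 2*tau**4 + 4*tau**2 + 2
print(sp.factor(det))                           # 2*(tau**2 + 1)**2, always >= 2 > 0
```

Because the determinant never vanishes, the quadratic part stays positive definite for every choice of $\tau$, which is what allows the coordinate-wise bounds on $\|\boldsymbol{\Phi}\|_F$ and $\|\boldsymbol{\Psi}\|_F$ to be extracted.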
Jonathan Mei received his B.S. and M.Eng. degrees in electrical engineering, both from the Massachusetts Institute of Technology in Cambridge, MA, in 2013. He is currently pursuing his Ph.D. degree in electrical and computer engineering at Carnegie Mellon University in Pittsburgh, PA. His research interests include computational photography, image processing, reinforcement learning, graph signal processing, non-stationary time series analysis, and high-dimensional optimization.
José M. F. Moura