Estimation of Monotone Multi-Index Models
David Gamarnik
Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139

Julia Gaudio
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139
Abstract
In a multi-index model with k index vectors, the input variables are transformed by taking inner products with the index vectors. A transfer function f : R^k → R is applied to these inner products to generate the output. Thus, multi-index models are a generalization of linear models. In this paper, we consider monotone multi-index models, in which the transfer function is assumed to be coordinate-wise monotone. The monotone multi-index model therefore generalizes both linear regression and isotonic regression, which is the estimation of a coordinate-wise monotone function. We consider the case of nonnegative index vectors. We provide an algorithm based on integer programming for the estimation of monotone multi-index models, and provide guarantees on the L₂ loss of the estimated function relative to the ground truth.

Let β be a d × k matrix, and let f be a function from R^k to R. The model E[Y | X] = f(βᵀX) is known as a multi-index model. The columns of β are referred to as the index vectors, and f is called a transfer function. Multi-index models therefore generalize linear models. Typically, f is assumed to lie in a particular function class. In this paper, we assume that f is coordinate-wise monotone and satisfies a mild Lipschitz condition. We treat the case where the components of X are i.i.d. and the entries of β are nonnegative.

If the index vectors were known, the estimation of the function f would reduce to isotonic regression, which is the problem of estimating an unknown coordinate-wise monotone function. Monotone multi-index models (MMI) thereby additionally generalize isotonic regression. The setting where the transfer function is known is called the Generalized Index Model, a widely applicable statistical model [2]. We are therefore considering a much more challenging model. We consider a high-dimensional setting; namely, the dimension d is possibly much larger than the sample size n.
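To make the setup concrete, the following sketch generates samples from a small monotone multi-index model. The transfer function and index matrix here are hypothetical illustrations chosen by us, not objects from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 20, 2, 500  # ambient dimension, number of index vectors, sample size

# A hypothetical nonnegative index matrix beta with s = 2 nonzero rows.
beta = np.zeros((d, k))
beta[0, 0] = 1.0
beta[1, 1] = 1.0

def f_star(u):
    """A toy transfer function on R^k, coordinate-wise monotone."""
    return np.tanh(u.sum(axis=-1))

X = rng.uniform(-1.0, 1.0, size=(n, d))  # i.i.d. bounded covariates
Z = rng.uniform(-0.1, 0.1, size=n)       # bounded, zero-mean noise
Y = f_star(X @ beta) + Z                 # the model E[Y | X] = f*(beta^T X)
```

Linear regression is the special case of an identity-like transfer function, and isotonic regression is the case where the index vectors are known.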
We solve a sparse high-dimensional model; we assume that the number of index vectors (columns of β) is constant, and that β has a constant number s of nonzero rows. Finally, we assume that β is a nonnegative matrix, which is natural in many applications. For example, consider the following finance application. Suppose there are k future time periods and d products. Let β(i, j) be the predicted monetary value of owning one unit of product i at time j. Given a vector x of product quantities, the value βᵀx is a k-dimensional vector indicating the value of the products over the k time periods. Let f be a time-discounted measure of the overall value of the goods. Taking the example further, row sparsity would model an inventory restriction where one can store only s distinct types of goods.

Work on index models has largely focused on the single-index model (k = 1) (e.g. [7], [9], [10], [13], [11]). In particular, [13] provides the first provably efficient algorithm for estimation of single-index models under monotonicity and Lipschitz assumptions. This work is further improved by [11]. To our knowledge, our paper is the first work on estimation of multi-index models under the monotone Lipschitz model.

Let x be a vector in R^d. The vector p-norm ‖x‖_p is defined by ‖x‖_p^p ≜ Σ_{i=1}^d |x_i|^p. The ∞-norm is defined as ‖x‖_∞ ≜ max_{i∈[d]} |x_i|. Let M_{d,k}(r) be the set of d × k matrices with each column having 2-norm at most r. Similarly, let M̄_{d,k}(r) be the set of d × k matrices with each column having 2-norm equal to r. Let O_{d,k} be the set of d × k orthonormal matrices. Let P_k denote the set of k × k rotation matrices, i.e. P_k = {P ∈ R^{k×k} : Pᵀ = P⁻¹, det(P) = 1}. For a d × k matrix M, let M⁺_ij ≜ max{M_ij, 0} for i ∈ [d] and j ∈ [k]; i.e., M⁺ is the matrix formed from M by replacing each negative entry by 0. Similarly, for a vector x ∈ R^k, let x⁺ denote the positive part of x, i.e. x⁺_i = max{x_i, 0}.
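The norms and the positive-part operation above can be written down directly; this sketch (helper names are ours) mirrors the definitions:

```python
import numpy as np

def p_norm(x, p):
    """Vector p-norm: ||x||_p = (sum_i |x_i|^p)^(1/p)."""
    return float((np.abs(x) ** p).sum() ** (1.0 / p))

def positive_part(M):
    """M^+: replace every negative entry of M by 0."""
    return np.maximum(M, 0.0)

x = np.array([3.0, -4.0])
print(p_norm(x, 2))  # 5.0
print(positive_part(np.array([[1.0, -2.0], [-3.0, 4.0]])))
```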
For a matrix M ∈ R^{d×k} and I ⊆ [d], let

M(I)_ij = M_ij if i ∈ I, and M(I)_ij = 0 if i ∉ I.

In other words, the matrix M(I) is formed from M by zeroing all rows with index not belonging to I. Similarly, for a vector x ∈ R^d, let x(I)_i = 1{i ∈ I} x_i. Note that (M(I))ᵀx = Mᵀ(x(I)). Let ‖·‖_p also denote the component-wise p-norm of a matrix, i.e. for a d × k matrix M, we have ‖M‖_p^p = Σ_{i=1}^d Σ_{j=1}^k |M_ij|^p. The Frobenius norm ‖M‖_F is equal to ‖M‖_2 under this notation.

We say a function f : R^k → R is l-Lipschitz if for every x, y ∈ R^k it holds that |f(x) − f(y)| ≤ l ‖x − y‖_2. We say that f : R^k → R is coordinate-wise monotone if for all x, y ∈ R^k with x_i ≤ y_i for each coordinate i, it holds that f(x) ≤ f(y). In other words, f is coordinate-wise monotone if it is monotone with respect to the Euclidean partial order. Fix b > 0. Let C(b) be the set of coordinate-wise monotone functions f : R^k → [0, b], and let L(b) be the set of 1-Lipschitz coordinate-wise monotone functions f : R^k → [0, b]. Note that L(b) ⊂ C(b). For a matrix β and function f, write (f ∘ β)(x) ≜ f(βᵀx). Finally, let L(x, y, f) ≜ (f(x) − y)² be the loss function we consider.

We now describe the model. Some of the assumptions are carried from [14]. All parameters except the dimension d are considered constant. Let β⋆ ∈ M̄_{d,k}(r) be a d × k matrix of rank k, where each column has 2-norm equal to r. Assume that β⋆ is s⋆-row sparse, meaning that β⋆ has at most s⋆ nonzero rows. Let I⋆ ⊂ [d] be the set of nonzero rows of β⋆, so that β⋆(I⋆) = β⋆. Since β⋆ has full column rank, we can write β⋆ = Q⋆R⋆ as its QR decomposition, where Q⋆ ∈ O_{d,k} and R⋆ ∈ M̄_{k,k}(r) is invertible. We further assume that β⋆ ≥ 0 entrywise.

Let p be a twice-differentiable density supported on X ⊂ R. Let p⋆ = max_{x∈R} p(x). Further assume that X ⊆ [−C, C].
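The QR decomposition β⋆ = Q⋆R⋆ is easy to inspect numerically; the matrix below is a hypothetical example of ours, not one from the paper:

```python
import numpy as np

# A hypothetical nonnegative index matrix with full column rank and one zero row.
beta = np.array([[2.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0],
                 [0.0, 0.0]])

Q, R = np.linalg.qr(beta)  # reduced QR: Q is d x k with orthonormal columns

assert np.allclose(Q.T @ Q, np.eye(2))  # Q lies in O_{d,k}
assert abs(np.linalg.det(R)) > 1e-12    # R invertible since beta has rank k
assert np.allclose(Q @ R, beta)         # beta = Q R
```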
Let X ∈ R^d be a random variable with density f_X(x) = Π_{i=1}^d p(x_i). We additionally assume that E[X] = 0. This is without loss of generality, as we could instead treat the random variable X − E[X], whose support is contained in the set [−2C, 2C]^d.

Let s(x) = −p′(x)/p(x) for x ∈ X. Let f⋆ ∈ L(b) be a twice-differentiable function. We assume that E[∇²f⋆(β⋆ᵀX)] ≻ 0, a restriction that ensures that estimation of β⋆ is information-theoretically feasible [14]. Let ρ be the smallest eigenvalue of E[∇²f⋆(β⋆ᵀX)]. Note that since β⋆ has a constant number of columns and a constant number of nonzero rows, the value ρ is itself a constant.

The model is

Y = (f⋆ ∘ β⋆)(X) + Z.  (1)

Here Z is independent from X and satisfies E[Z] = 0. We assume that |Z| ≤ η almost surely, so that Y ∈ 𝒴 ≜ [−η, b + η] almost surely. Let F(x, y) denote the joint density of X and Y. We make a mild distributional assumption, which is that there exists θ such that E[s(X₁)²] ≤ θ and E[Y²] ≤ θ. Note that since Y ∈ [−η, b + η], we have Y² ≤ (b + η)² almost surely.

Given i.i.d. samples (X₁, Y₁), ..., (Xₙ, Yₙ) drawn from the model (1), our goal is to estimate the function f⋆ ∘ β⋆, which is an element of the function class

F_{d,k} ≜ {f ∘ β(I) : f ∈ L(b), I ⊂ [d], |I| = s⋆, β(I) ∈ M̄_{d,k}(r)}.

Proposition 1.
Let F̄_{d,k} ≜ {f ∘ β(I) : f ∈ L(b), I ⊂ [d], |I| = s⋆, β(I) ∈ M_{d,k}(r)}. It holds that F̄_{d,k} = F_{d,k}.

By Proposition 1, the model also captures β⋆ ∈ M_{d,k}(r), i.e., index matrices whose columns have 2-norm at most r. Observe that for l > 0,

f ∘ β ≡ f(l·) ∘ (β/l).

By this identity, the assumption that f⋆ is 1-Lipschitz and β⋆ has columns of 2-norm r is without loss of generality; the assumption is equivalent to assuming that f⋆ is l-Lipschitz and β⋆ has columns of 2-norm r/l.

We combine the results of two recent papers in order to design an algorithm for estimation in MMI models. [14] provide an algorithm for estimation of Q⋆ up to rotation, given samples from the model (1). In other words, they find Q such that QP ≈ Q⋆ for some rotation matrix P. In Section 2, we summarize the approach of [14] to estimate the matrix Q⋆, up to rotation. Informally, observe that if QP ≈ Q⋆ and R ≈ PR⋆, then QR ≈ Q⋆R⋆. Given a Q that approximates Q⋆ up to rotation, it remains to find R ∈ M̄_{k,k}(r), an index set I of cardinality s⋆, as well as a function f. Thus, the estimation of Q⋆ up to rotation reduces the high-dimensional estimation problem to a lower-dimensional problem.

Our approach is to form a collection of candidate k × k matrices (Section 3). For each candidate matrix, we find the optimal index set and accompanying coordinate-wise monotone function (Section 4). We call the problem of finding the optimal index set I and coordinate-wise monotone function f the Sparse Matrix Isotonic Regression Problem. We extend the recent work of [5], who consider a related isotonic regression problem. In Section 5, we tie together the results of the previous three sections in order to provide an algorithm for estimation in the high-dimensional monotone multi-index model. The algorithm finds a function of the form f ∘ (QR)⁺(I) minimizing the sample loss over the candidate matrices R.
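The informal observation that the rotation cancels in the product QR can be checked numerically; in this sketch P is an exact rotation rather than an estimated one:

```python
import numpy as np

rng = np.random.default_rng(1)

k = 2
theta = 0.3
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation: P^T = P^{-1}, det P = 1

Q_star, _ = np.linalg.qr(rng.normal(size=(5, k)))  # an orthonormal d x k matrix
R_star = rng.normal(size=(k, k))

Q = Q_star @ P.T  # so that Q P = Q_star
R = P @ R_star    # and R = P R_star

# The rotation cancels: Q R = Q_star P^T P R_star = Q_star R_star.
assert np.allclose(Q @ R, Q_star @ R_star)
```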
Here I is an index set and f is a coordinate-wise monotone function obtained by solving the Sparse Matrix Isotonic Regression Problem. We give estimation guarantees for our algorithm in terms of L₂ loss. Finally, Section 6 outlines some future directions.

Let ‖f⋆ ∘ Q⋆R⋆ − g‖² denote the expected squared L₂ loss of a function g with respect to the ground truth:

‖f⋆ ∘ Q⋆R⋆ − g‖² ≜ ∫_{x ∈ X^d} [(f⋆ ∘ Q⋆R⋆)(x) − g(x)]² f_X(x) dx.

Let z(ǫ₁, ǫ₂, C) ≜ 2ηC√k (ǫ₁ + ǫ₂ r) + C²k (ǫ₁ + ǫ₂ r)². The main result of our paper is the following.
Theorem 1.
Fix ǫ > 0. Let δ = δ(ǫ) be the solution to z(δ, δ, C) = ǫ/2. Suppose d is sufficiently large relative to 1/ǫ. Given n independent samples (Xᵢ, Yᵢ)ᵢ₌₁ⁿ from the model (1), there exists an algorithm that produces an estimator fₙ ∘ Mₙ⁺(Iₙ) such that

P(‖fₙ ∘ Mₙ⁺(Iₙ) − f⋆ ∘ Q⋆R⋆‖² ≥ ǫ) ≤ ǫ

whenever n ≥ C₁ log(d) + C₂, for constants C₁ and C₂ depending on C, b, s⋆, p⋆, k, ρ, θ, and η.

In particular, Theorem 1 applies even when d is much larger than the number of samples n. The proofs are deferred to the supplementary material, with the exception of the proof of our key result, Theorem 3, which immediately implies Theorem 1.

Estimation of Q⋆

We summarize the work of [14], who estimate Q⋆ up to rotation. The approach of [14] uses the second-order Stein identity. For x ∈ R^d, let T(x) be the d × d matrix defined as follows:

T(x)_ij = s(x_i) s(x_j) for i ≠ j, and T(x)_ii = s(x_i)² − s′(x_i).

[14] show the identity E[Y · T(X)] = Q⋆ D Q⋆ᵀ, where D = R⋆ E[∇²f⋆(β⋆ᵀX)] R⋆ᵀ. Therefore, one can estimate Q⋆ from the leading eigenvectors of the sample average of the quantity Y · T(X). [14] use a robust estimator for E[Y · T(X)]. For τ > 0, define the truncated random variables Ỹᵢ ≜ sign(Yᵢ) · min{|Yᵢ|, τ} and T̃_jk(Xᵢ) ≜ sign(T_jk(Xᵢ)) · min{|T_jk(Xᵢ)|, τ}. The robust estimator is given by

Σ̃ = Σ̃(τ) ≜ (1/n) Σ_{i=1}^n Ỹᵢ · T̃(Xᵢ).

[14] propose the following approach to estimate Q⋆ up to rotation.

Algorithm 1
Estimation of Q⋆ [14]

Input: Values (X₁, Y₁), ..., (Xₙ, Yₙ), τ > 0, and λ > 0
Output: Q̂ ∈ O_{d,k}

Compute the estimator Σ̃(τ) using the samples (Xᵢ, Yᵢ)ᵢ₌₁ⁿ. Solve the following optimization problem:

max Tr(Wᵀ Σ̃(τ)) − λ‖W‖₁   (2)
s.t. 0 ≼ W ≼ I_d   (3)
Tr(W) = k.   (4)

Let Q̂ be the matrix whose columns are the k leading eigenvectors of the solution Ŵ.

Theorem 2 (Adapted from Theorem 3.3 of [14]). Let τ = (θn / log d)^{1/4} and λ = 10√(θ log d / n). Under the assumptions of Section 1.2, with probability at least 1 − d⁻¹, Algorithm 1 applied to samples (Xᵢ, Yᵢ)ᵢ₌₁ⁿ, τ, and λ produces an estimator Q̂ satisfying

inf_{P ∈ P_k} ‖Q̂P − Q⋆‖_F ≤ √s⋆ λ / ρ.

Assuming that d grows with n, Theorem 2 shows that with high probability as n → ∞, the estimate of Q⋆ is correct up to rotation, with error on the order of √(log(d)/n).

Fix δ > 0. We construct a random set of matrices 𝓡 that will serve to approximate the set of k × k matrices with column 2-norm r, with respect to the Frobenius norm. Given ǫ, δ > 0, we choose the cardinality of the set of approximating matrices so that a fixed matrix from the set M̄_{k,k}(r) is ǫ-close to some element of 𝓡 with probability 1 − δ. For this reason, we call 𝓡 a near-net. To construct 𝓡, we first construct a random set of vectors 𝓡₀ by choosing N vectors from the uniform measure on all vectors of 2-norm r. In other words, each element of 𝓡₀ is chosen from the uniform measure on the sphere of radius r in R^k, denoted by S_r^{k−1}. We may sample uniformly using k independent random variables Z₁, ..., Z_k ∼ N(0, 1): the random vector (r/√(Σ_{i=1}^k Z_i²)) (Z₁, ..., Z_k) is uniformly distributed on the surface of S_r^{k−1}. Finally, we construct the set 𝓡 as the set of all matrices with columns belonging to 𝓡₀. Then |𝓡| = N^k.

For a vector x ∈ R^k and ǫ > 0, let B(x, ǫ) ≜ {y : ‖x − y‖₂ ≤ ǫ} denote the ball of radius ǫ around x with respect to the 2-norm.

Lemma 1.
Let ǫ > 0. Consider the near-net 𝓡 = 𝓡(N) described above, and let M ∈ M̄_{k,k}(r) be a fixed matrix. With probability at least

1 − k (1 − |S_r^{k−1} ∩ B(r e₁, ǫ/√k)| · |S_r^{k−1}|⁻¹)^N,   (5)

there exists R ∈ 𝓡(N) such that ‖M − R‖_F ≤ ǫ, where e₁ = (1, 0, ..., 0) ∈ R^k and |A| denotes the measure of a set A.

Remark 1. As N → ∞, the probability (5) goes to 1.

Our random construction is simple to implement. While deterministic constructions are possible, they are much more complex (see [3]).
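The sampling step is indeed simple to implement; this sketch draws the N candidate columns by normalizing Gaussian vectors (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_on_sphere(N, k, r):
    """Draw N points uniformly from the radius-r sphere in R^k by
    normalizing i.i.d. standard Gaussian vectors."""
    Z = rng.normal(size=(N, k))
    return r * Z / np.linalg.norm(Z, axis=1, keepdims=True)

k, r, N = 3, 2.0, 1000
R0 = uniform_on_sphere(N, k, r)  # candidate columns of the near-net

# Every sample lies on the sphere of radius r.
assert np.allclose(np.linalg.norm(R0, axis=1), r)

# Distance from a fixed point on the sphere to its nearest sample.
v = np.array([r, 0.0, 0.0])
nearest = np.linalg.norm(R0 - v, axis=1).min()
```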
Recently, Gamarnik and Gaudio introduced the Sparse Isotonic Regression model [5]. We now introduce a related model, Sparse Matrix Isotonic Regression. We are given a d × k matrix M with nonnegative entries, as well as samples (Xᵢ, Yᵢ)ᵢ₌₁ⁿ. For a given sparsity level s ∈ N and bound b > 0, our goal is to find a set I ⊂ [d] with cardinality s and a coordinate-wise monotone function f : R^k → [0, b] minimizing Σ_{i=1}^n (Yᵢ − (f ∘ M(I))(Xᵢ))². Our approach is to estimate the function values at the points X₁, ..., Xₙ and interpolate. We emphasize that we do not require the function f to be 1-Lipschitz.

The Integer Programming Sparse Matrix Isotonic Regression algorithm finds the optimal index set and function values on a given set of points, given a matrix M with nonnegative entries. Binary variables v_l determine the index set I. The variables F_i represent the estimated function values at the data points X_i. Auxiliary variables z_ij and q_ijp are used to model the monotonicity constraints. The function that is returned is an interpolation of the points (M(I)ᵀXᵢ, Fᵢ)ᵢ₌₁ⁿ.

Algorithm 2
Integer Programming Matrix Isotonic Regression
Input:
Values (X₁, Y₁), ..., (Xₙ, Yₙ), sparsity level s, M ≥ 0 ∈ R^{d×k}, C > 0, b > 0
Output:
An index set I ⊂ [d] satisfying |I| = s; a coordinate-wise monotone function f : R^k → [0, b]

Let B = 2C Σ_{l=1}^d Σ_{p=1}^k M_lp. Let µ = min{M_lp : M_lp > 0, l ∈ [d], p ∈ [k]} · min_{i,j∈[n], i≠j, l∈[d]} |X_il − X_jl|. Solve the following optimization problem:

min_{v,F,z,q} Σ_{i=1}^n (Yᵢ − Fᵢ)²   (6)
s.t. Σ_{l=1}^d v_l = s   (7)
F_i − F_j ≤ b z_ij ∀ i, j ∈ [n]   (8)
Σ_{p=1}^k q_ijp ≥ z_ij ∀ i, j ∈ [n]   (9)
Σ_{l=1}^d v_l M_lp (X_il − X_jl) − µ ≥ −(B + µ)(1 − q_ijp) ∀ i, j ∈ [n], p ∈ [k]   (10)
v_l ∈ {0, 1} ∀ l ∈ [d]
F_i ∈ [0, b] ∀ i ∈ [n]
z_ij ∈ {0, 1} ∀ i, j ∈ [n]
q_ijp ∈ {0, 1} ∀ i, j ∈ [n], p ∈ [k]

Let Iₙ = {l ∈ [d] : v_l = 1}. Let f̂ₙ(x) = max{Fᵢ : M(Iₙ)ᵀXᵢ ≼ x}, and f̂ₙ(x) = 0 if {i : M(Iₙ)ᵀXᵢ ≼ x} = ∅. Return (Iₙ, f̂ₙ).

Proposition 2.
Suppose Xᵢ ∈ [−C, C]^d for i ∈ [n]. On input (Xᵢ, Yᵢ)ᵢ₌₁ⁿ, s, M, C, b, Algorithm 2 finds a function f̂ₙ ∈ C(b) and index set Iₙ that minimize the empirical loss Σ_{i=1}^n L(Xᵢ, Yᵢ, f ∘ M(I)) over functions f ∈ C(b) and index sets I with cardinality s.

The integer program in Algorithm 2 has a convex objective and linear constraints. While integer programming is NP-hard in general, modern solvers achieve excellent performance on such problems. We note that it is possible to ensure that the function f̂ₙ is 1-Lipschitz in addition to coordinate-wise monotone, by modifying the optimization problem in Algorithm 2. However, the resulting optimization problem is an integer program with nonlinear constraints, a less tractable formulation. For further details, please see Section 8 in the supplementary material.

In this section, we provide estimation guarantees for the model Y = (f⋆ ∘ Q⋆R⋆)(X) + Z. Let the total sample size be 2n. We use n samples for estimation of Q⋆ (up to rotation), obtaining a matrix Qₙ, and the other n samples to obtain a matrix Rₙ, a function fₙ, and an index set Iₙ. The final result is an estimated function fₙ ∘ (QₙRₙ)⁺(Iₙ).

We now outline the approach. First, by Theorem 2, the matrix Qₙ obtained from the semidefinite programming approach satisfies ‖QₙPₙ − Q⋆‖_F ≤ √s⋆λ/ρ with probability at least 1 − d⁻¹. We use this matrix Qₙ to estimate Rₙ, fₙ, and Iₙ, assuming that ‖QₙPₙ − Q⋆‖_F ≤ √s⋆λ/ρ for some unknown rotation matrix Pₙ. The joint estimation of (Rₙ, fₙ, Iₙ) is intractable; instead, we create a net of candidate matrices from the set M̄_{k,k}(r). For each net element R, we apply Algorithm 2 to find the optimal pair (f_R, I_R) minimizing the empirical loss Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, f ∘ (QₙR)⁺(I)). Finally, we output the best combination over the net elements. Recall that f⋆ ∘ β⋆ ∈ F_{d,k}.
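The interpolation rule returned by Algorithm 2, f̂ₙ(x) = max{Fᵢ : M(Iₙ)ᵀXᵢ ≼ x} with value 0 when no sample point is dominated, can be sketched on its own, independently of the integer program:

```python
import numpy as np

def monotone_interpolation(U, F, x):
    """Return max{F_i : U_i <= x coordinate-wise}, or 0.0 if no row of U
    is dominated by x. U holds the transformed points M(I)^T X_i."""
    dominated = np.all(U <= x, axis=1)
    return float(F[dominated].max()) if dominated.any() else 0.0

U = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
F = np.array([0.1, 0.5, 0.4])

print(monotone_interpolation(U, F, np.array([1.0, 1.0])))    # 0.5
print(monotone_interpolation(U, F, np.array([0.5, 0.5])))    # 0.1
print(monotone_interpolation(U, F, np.array([-1.0, -1.0])))  # 0.0
```

By construction, this rule is coordinate-wise monotone and interpolates the fitted values at the sample points.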
While Qₙ ∈ O_{d,k} and Rₙ ∈ M̄_{k,k}(r), the matrix (QₙRₙ)⁺(Iₙ) may not be an element of M̄_{d,k}(r). Further, the estimated function fₙ may not be 1-Lipschitz. Nevertheless, we are able to give an L₂ loss guarantee, as we will see in the proof of Theorem 3.

Algorithm 3
MMI Regression
Input: N ∈ N, values (X₁, Y₁), ..., (X_{2n}, Y_{2n}), C > 0, b > 0, τ > 0, and λ > 0
Output: fₙ ∈ C(b), Qₙ ∈ O_{d,k}, Rₙ ∈ M̄_{k,k}(r), and Iₙ ⊂ [d] with |Iₙ| = s⋆

Construct a random near-net 𝓡(N).
Produce an estimate Qₙ using Algorithm 1 applied to (Xᵢ, Yᵢ)ᵢ₌₁ⁿ, τ, and λ.
for each R ∈ 𝓡 do
  Apply Algorithm 2 to input (Xᵢ, Yᵢ)ᵢ₌ₙ₊₁²ⁿ, s⋆, (QₙR)⁺, C, and b, obtaining the function f_R and index set I_R.
end for
Return the tuple (f_R, Qₙ, R, I_R) with the smallest empirical loss.

The following result provides an upper bound on the error associated with the estimator from Algorithm 3. Our main result, Theorem 1, easily follows from Theorem 3.

Theorem 3.
Let X ∈ R^d be a random variable with independent entries of density p ≤ p⋆ and support contained within the set [−C, C]^d. Assume that f⋆ : R^k → [0, b] for b > 0. Fix n. Let τ = (θn / log d)^{1/4} and λ = 10√(θ log d / n). Let ǫ > 0, and let δ > 0 be such that ǫ > z(δ, √s⋆λ/ρ, C). Let (fₙ, Qₙ, Rₙ, Iₙ) be the result of applying Algorithm 3 on inputs N, (X₁, Y₁), ..., (X_{2n}, Y_{2n}), C, b, τ, and λ. Let Mₙ = QₙRₙ. Then

P(‖fₙ ∘ Mₙ⁺(Iₙ) − f⋆ ∘ Q⋆R⋆‖² ≥ ǫ)
≤ k (1 − |S_r^{k−1} ∩ B(r e₁, δ/√k)| · |S_r^{k−1}|⁻¹)^N + 1/d + 4 \binom{d}{s⋆} N^k exp[(bα + 2bα(4p⋆C)^{s⋆}) n^{1−1/s⋆} − ǫ₁²n/b⁴],

where ǫ₁ = ǫ − z(δ, √s⋆λ/ρ, C) and α = ǫ₁(b + η)⁻¹.

The following results are used in the proof of Theorem 3. Lemma 2 establishes a sensitivity result. The Lipschitz assumption on f⋆ is a key element in proving Lemma 2.

Lemma 2.
Let X ∈ R^d be a random variable with independent entries of density p ≤ p⋆ and support contained within the set [−C, C]^d. Suppose that R ∈ M̄_{k,k}(r) satisfies ‖PR⋆ − R‖_F ≤ ǫ₁. Suppose also that T ∈ O_{d,k} satisfies ‖TP − Q⋆‖_F ≤ ǫ₂ for some rotation matrix P ∈ P_k. Then

∫ L(x, y, f⋆ ∘ (TR)⁺(I⋆)) dF(x, y) − ∫ L(x, y, f⋆ ∘ Q⋆R⋆) dF(x, y) ≤ z(ǫ₁, ǫ₂, C).

Lemma 3 relates the squared L₂ difference of two functions to a difference of integrals.

Lemma 3.
Let g be any function from R^k to R. Then

‖g − f⋆ ∘ Q⋆R⋆‖² = ∫ L(x, y, g) dF(x, y) − ∫ L(x, y, f⋆ ∘ Q⋆R⋆) dF(x, y).

Fix b > 0. For T ∈ O_{d,k}, R ∈ M̄_{k,k}(r), and I ⊂ [d] with |I| = s⋆, let G(T, R, I) = {f ∘ (TR)⁺(I) : f ∈ C(b)} and G(T, 𝓡) ≜ ∪_{R∈𝓡} ∪_{I⊂[d]: |I|=s⋆} G(T, R, I). We see that Algorithm 3 optimizes the empirical loss over functions in G(Qₙ, 𝓡). We follow a VC entropy approach to give an L₂ loss bound for the function fₙ ∘ Mₙ⁺(Iₙ) estimated by Algorithm 3.

Definition 1.
Let F be a class of functions from R^d to R. Given (x₁, y₁), ..., (xₙ, yₙ) ∈ R^d × R, let

L_F((x₁, y₁), ..., (xₙ, yₙ)) ≜ {(L(x₁, y₁, f), ..., L(xₙ, yₙ, f)) : f ∈ F}.

In other words, L_F is the set of loss vectors formed by ranging over functions f in the class F. Let N_F((x₁, y₁), ..., (xₙ, yₙ), ǫ) denote the size of the smallest ǫ-net for L_F((x₁, y₁), ..., (xₙ, yₙ)) with respect to the ∞-norm. In other words, for every u ∈ L_F((x₁, y₁), ..., (xₙ, yₙ)), there exists v in the net such that ‖u − v‖_∞ ≤ ǫ. Finally, let

N_F(ǫ, n) ≜ E_{X,Y}[N_F((X₁, Y₁), ..., (Xₙ, Yₙ), ǫ)]

be the expected size of the net, where the expectation is over independent samples drawn from the distribution F(x, y) defined above.

Lemmas 4 and 5 together provide a probabilistic bound on the difference between expected loss and empirical loss for functions in the class G(T, 𝓡). The nonnegative matrix assumption is crucial for the proof of Lemma 5.

Lemma 4.
Let T ∈ O_{d,k} and let δ > 0. Let 𝓡 ⊂ M̄_{k,k}(r). For ǫ > 0,

P( sup_{h ∈ G(T,𝓡)} | ∫ L(x, y, h) dF(x, y) − (1/n) Σ_{i=1}^n L(Xᵢ, Yᵢ, h) | ≥ ǫ )
≤ 4 N_{G(T,𝓡)}(ǫ, n) exp(−ǫ²n/b⁴).

Lemma 5.
Under the assumptions of Lemma 4, it holds that

N_{G(T,𝓡)}(ǫ, n) ≤ \binom{d}{s⋆} N^k exp[(bα + 2bα(4p⋆C)^{s⋆}) n^{1−1/s⋆}],

where α = ǫ(b + η)⁻¹.

Proof of Theorem 3.
With probability at least 1 − d⁻¹, the matrix Qₙ satisfies ‖QₙPₙ − Q⋆‖_F ≤ √s⋆λ/ρ for some rotation matrix Pₙ (Theorem 2). For the remainder, we condition on this property of Qₙ; this is valid since we use this matrix on an independent batch of samples (Xᵢ, Yᵢ)ᵢ₌ₙ₊₁²ⁿ. By Lemma 3,

‖fₙ ∘ Mₙ⁺(Iₙ) − f⋆ ∘ Q⋆R⋆‖² = ∫ L(x, y, fₙ ∘ Mₙ⁺(Iₙ)) dF(x, y) − ∫ L(x, y, f⋆ ∘ Q⋆R⋆) dF(x, y).

Recall that I⋆ is the set of nonzero rows of β⋆. Let E be the event that the near-net 𝓡 contains an element R ∈ 𝓡 such that ‖PₙR⋆ − R‖_F ≤ δ. Conditioned on E, let R be the (random) matrix that is δ-close to PₙR⋆. By Proposition 2, the function fₙ ∘ Mₙ⁺(Iₙ) is optimal over the samples. Therefore,

Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, fₙ ∘ Mₙ⁺(Iₙ)) ≤ Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, f⋆ ∘ (QₙR)⁺(I⋆)).

We have

‖fₙ ∘ Mₙ⁺(Iₙ) − f⋆ ∘ Q⋆R⋆‖²
≤ ∫ L(x, y, fₙ ∘ Mₙ⁺(Iₙ)) dF(x, y) − (1/n) Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, fₙ ∘ Mₙ⁺(Iₙ))
+ (1/n) Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, f⋆ ∘ (QₙR)⁺(I⋆)) − ∫ L(x, y, f⋆ ∘ (QₙR)⁺(I⋆)) dF(x, y)
+ ∫ L(x, y, f⋆ ∘ (QₙR)⁺(I⋆)) dF(x, y) − ∫ L(x, y, f⋆ ∘ Q⋆R⋆) dF(x, y).

By Lemma 2 applied to T = Qₙ, ǫ₁ = δ, and ǫ₂ = √s⋆λ/ρ,

∫ L(x, y, f⋆ ∘ (QₙR)⁺(I⋆)) dF(x, y) − ∫ L(x, y, f⋆ ∘ Q⋆R⋆) dF(x, y) ≤ z(δ, √s⋆λ/ρ, C).

Therefore,

P(‖fₙ ∘ Mₙ⁺(Iₙ) − f⋆ ∘ Q⋆R⋆‖² ≥ ǫ | E)
≤ P( ∫ L(x, y, fₙ ∘ Mₙ⁺(Iₙ)) dF(x, y) − (1/n) Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, fₙ ∘ Mₙ⁺(Iₙ))
+ (1/n) Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, f⋆ ∘ (QₙR)⁺(I⋆)) − ∫ L(x, y, f⋆ ∘ (QₙR)⁺(I⋆)) dF(x, y) ≥ ǫ₁ | E ).

Observe that the functions fₙ ∘ Mₙ⁺(Iₙ) and f⋆ ∘ (QₙR)⁺(I⋆) are elements of G(Qₙ, 𝓡). Since the event E is independent from the samples (Xᵢ, Yᵢ)ᵢ₌ₙ₊₁²ⁿ, we apply Lemmas 4 and 5:

P(‖fₙ ∘ Mₙ⁺(Iₙ) − f⋆ ∘ Q⋆R⋆‖² ≥ ǫ | E)
≤ P( sup_{h ∈ G(Qₙ,𝓡)} | ∫ L(x, y, h) dF(x, y) − (1/n) Σ_{i=n+1}^{2n} L(Xᵢ, Yᵢ, h) | ≥ ǫ₁ | E )
≤ 4 \binom{d}{s⋆} |𝓡| exp[(bα + 2bα(4p⋆C)^{s⋆}) n^{1−1/s⋆} − ǫ₁²n/b⁴],

where α = ǫ₁(b + η)⁻¹ and |𝓡| = N^k. The result follows by Lemma 1, applied with ǫ = δ:

P(E^c) ≤ k (1 − |S_r^{k−1} ∩ B(r e₁, δ/√k)| · |S_r^{k−1}|⁻¹)^N.

Conclusion
In this paper, we have provided an estimation algorithm for multi-index models with a coordinate-wise monotone transfer function. Our algorithm enables future work on wide-ranging applications naturally modeled as a monotone multi-index model. Promising future directions include finding an efficient method for the sparse matrix isotonic regression problem, dropping the nonnegativity assumption on the index vectors, and studying multi-index models with other classes of transfer functions. Studying the non-sparse setting would be interesting as well, and would present several challenging technical hurdles.
References

[1] Gleb Beliakov. Monotonicity preserving approximation of multivariate scattered data. BIT Numerical Mathematics, 45(4):653–677, 2005.
[2] Annette J. Dobson and Adrian G. Barnett. An Introduction to Generalized Linear Models. CRC Press, 2018.
[3] Ilya Dumer. Covering spheres with spheres. Discrete & Computational Geometry, 38(4):665–679, 2007.
[4] David Gamarnik. Efficient learning of monotone concepts via quadratic optimization. In COLT, 1999.
[5] David Gamarnik and Julia Gaudio. Sparse high-dimensional isotonic regression. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
[6] Dimitris Bertsimas, David Gamarnik, and John N. Tsitsiklis. Estimation of time-varying parameters in statistical models: an optimization approach. Machine Learning, 35(3):225–245, 1999.
[7] Wolfgang Härdle, Peter Hall, and Hidehiko Ichimura. Optimal smoothing in single-index models. The Annals of Statistics, pages 157–178, 1993.
[8] David Haussler. Overview of the Probably Approximately Correct (PAC) learning framework. https://hausslergenomics.ucsc.edu/wp-content/uploads/2017/08/smo_0.pdf, 1995.
[9] Joel L. Horowitz and Wolfgang Härdle. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association, 91(436):1632–1640, 1996.
[10] Hidehiko Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Technical report, Center for Economic Research, Department of Economics, University of Minnesota, 1991.
[11] Sham M. Kakade, Adam Tauman Kalai, Varun Kanade, and Ohad Shamir. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems 24 (NeurIPS 2011), 2011.
[12] Guy Moshkovitz and Asaf Shapira. Ramsey theory, integer partitions and a new proof of the Erdős–Szekeres theorem. Advances in Mathematics, 262:1107–1129, 2014.
[13] Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In Conference on Learning Theory, 2009.
[14] Zhuoran Yang, Krishna Balasubramanian, Zhaoran Wang, and Han Liu. Learning non-Gaussian multi-index model via second-order Stein's method. Advances in Neural Information Processing Systems, 30:6097–6106, 2017.
Deferred proofs
Proof of Proposition 1.
Clearly F_{d,k} ⊆ F̄_{d,k}. It remains to show that F̄_{d,k} ⊆ F_{d,k}. Let g ∈ F̄_{d,k}, where g = f ∘ β(I). We will show that g ∈ F_{d,k}. Suppose the i-th column of β(I) has 2-norm t < r. Let β̃ be equal to β(I), with the i-th column scaled by a factor of r/t, so that the i-th column of β̃ has 2-norm r. Note that β̃ = β̃(I). Next, define the function f̃ by f̃(x) = f(x₁, ..., (t/r)x_i, ..., x_k). Observe that g = f̃ ∘ β̃. We verify the monotonicity property for f̃. Let x ≼ y. Then also (x₁, ..., (t/r)x_i, ..., x_k) ≼ (y₁, ..., (t/r)y_i, ..., y_k), and we have

f̃(x) = f(x₁, ..., (t/r)x_i, ..., x_k) ≤ f(y₁, ..., (t/r)y_i, ..., y_k) = f̃(y).

It remains to show that f̃ is 1-Lipschitz. Let x, y ∈ R^k. By the Lipschitz condition applied to f,

−1 ≤ [f̃(x) − f̃(y)] / ‖(x₁, ..., (t/r)x_i, ..., x_k) − (y₁, ..., (t/r)y_i, ..., y_k)‖₂ ≤ 1.

Since t/r ≤ 1, the norm in the denominator is at most ‖x − y‖₂, and therefore

−1 ≤ [f̃(x) − f̃(y)] / ‖x − y‖₂ ≤ 1.

This shows that f̃ is 1-Lipschitz. Repeating this argument for each column of β(I), we conclude that g ∈ F_{d,k}.

To prove Lemma 1, we use the following helper lemma.

Lemma 6.
Consider the near-net 𝓡₀ with N vectors described above, and let v be a fixed vector of 2-norm r. With probability 1 − (1 − |S_r^{k−1} ∩ B(r e₁, δ)| · |S_r^{k−1}|⁻¹)^N, there exists u ∈ 𝓡₀ such that ‖v − u‖₂ ≤ δ.

Proof. Let u be distributed uniformly at random on the surface of S_r^{k−1}. Then

P(‖v − u‖₂ ≤ δ) = P(u ∈ B(v, δ)) = |S_r^{k−1} ∩ B(v, δ)| · |S_r^{k−1}|⁻¹ = |S_r^{k−1} ∩ B(r e₁, δ)| · |S_r^{k−1}|⁻¹,

where the last equality holds by rotational symmetry. Therefore,

P(‖v − u‖₂ > δ) = 1 − |S_r^{k−1} ∩ B(r e₁, δ)| · |S_r^{k−1}|⁻¹.

Since the N elements of 𝓡₀ are independent, we conclude that the probability that some element of 𝓡₀ is δ-close to v is equal to 1 − (1 − |S_r^{k−1} ∩ B(r e₁, δ)| · |S_r^{k−1}|⁻¹)^N.

Proof of Lemma 1.
Let E be the event that there exists R ∈ 𝓡 such that ‖M − R‖_F ≤ ǫ. We need to lower bound the probability of the event E. Let {Mᵢ}ᵢ₌₁ᵏ denote the columns of M. For i ∈ [k], let Eᵢ be the event that there exists uᵢ ∈ 𝓡₀ such that ‖Mᵢ − uᵢ‖₂ ≤ ǫ/√k. We claim that P(∩ᵢ₌₁ᵏ Eᵢ) ≤ P(E). Indeed, suppose that the event Eᵢ occurs for each i. Let R ∈ 𝓡 be the matrix with columns {uᵢ}ᵢ₌₁ᵏ. Then

‖M − R‖_F² = Σᵢ₌₁ᵏ ‖Mᵢ − uᵢ‖₂² ≤ k (ǫ/√k)² = ǫ²,

and so ‖M − R‖_F ≤ ǫ. This shows that P(∩ᵢ₌₁ᵏ Eᵢ) ≤ P(E). We also have

P(E^c) ≤ P(∪ᵢ₌₁ᵏ Eᵢ^c) ≤ Σᵢ₌₁ᵏ P(Eᵢ^c),

so that P(E) ≥ 1 − Σᵢ₌₁ᵏ P(Eᵢ^c). For each i ∈ [k], Lemma 6 shows that there exists uᵢ ∈ 𝓡₀ such that ‖Mᵢ − uᵢ‖₂ ≤ ǫ/√k with probability

1 − (1 − |S_r^{k−1} ∩ B(r e₁, ǫ/√k)| · |S_r^{k−1}|⁻¹)^N.

Therefore,

P(Eᵢ^c) ≤ (1 − |S_r^{k−1} ∩ B(r e₁, ǫ/√k)| · |S_r^{k−1}|⁻¹)^N.

Proof of Proposition 2.
We first show that the constraints enforce the monotonicity requirement. Let I = {i : vᵢ = 1}. Consider two samples (Xᵢ, Yᵢ) and (Xⱼ, Yⱼ). The monotonicity requirement is

(M(I))ᵀXᵢ ≼ (M(I))ᵀXⱼ ⟹ Fᵢ ≤ Fⱼ.

The contrapositive of this statement is

Fᵢ > Fⱼ ⟹ ∃ p ∈ [k] : ((M(I))ᵀXᵢ)_p > ((M(I))ᵀXⱼ)_p.   (11)

The optimization encodes the contrapositive statement, as follows. There are two cases: either Fᵢ > Fⱼ or Fᵢ ≤ Fⱼ. We must ensure that if Fᵢ > Fⱼ holds, then the implication in (11) is satisfied. We must also verify that no additional constraints are introduced if Fᵢ ≤ Fⱼ.

Suppose Fᵢ > Fⱼ. Then zᵢⱼ = 1 by Constraint (8). By Constraint (9), at least one of the qᵢⱼₚ variables must be equal to 1. Then by Constraints (10), we have

Σ_{l=1}^d v_l M_lp (X_il − X_jl) > 0 ⟺ ((M(I))ᵀXᵢ)_p > ((M(I))ᵀXⱼ)_p

for at least one p ∈ [k], due to the choice of µ. Next suppose Fᵢ ≤ Fⱼ. Then zᵢⱼ is free to equal zero, and all the qᵢⱼₚ values may be set to zero as well. By the choice of B, Constraint (10) is then non-binding.

The objective minimizes the loss on the samples. Finally, we claim that the choice of f̂ₙ is a monotone interpolation. First, f̂ₙ(M(Iₙ)ᵀXᵢ) = Fᵢ, so that f̂ₙ interpolates. Next, observe that x ≼ y ⟹ f̂ₙ(x) ≤ f̂ₙ(y). Also, f̂ₙ : R^k → [0, b], by construction.

Proof of Lemma 2.
Fix $(x, y)$. We have
$$L(x, y, f^\star \circ (TR)_+(I^\star)) - L(x, y, f^\star \circ Q^\star R^\star)$$
$$= \left( y - (f^\star \circ (TR)_+(I^\star))(x) \right)^2 - \left( y - (f^\star \circ Q^\star R^\star)(x) \right)^2$$
$$= \left( (f^\star \circ (TR)_+(I^\star))(x) - (f^\star \circ Q^\star R^\star)(x) \right) \left( (f^\star \circ (TR)_+(I^\star))(x) + (f^\star \circ Q^\star R^\star)(x) - 2y \right)$$
$$\le \left| (f^\star \circ Q^\star R^\star)(x) - (f^\star \circ (TR)_+(I^\star))(x) \right| \cdot \left| (f^\star \circ Q^\star R^\star)(x) + (f^\star \circ (TR)_+(I^\star))(x) - 2y \right|. \quad (12)$$
Since $f^\star \in \mathcal{L}(b)$, it holds that
$$\left| (f^\star \circ Q^\star R^\star)(x) - (f^\star \circ (TR)_+(I^\star))(x) \right| \le \left\| (Q^\star R^\star)^T x - ((TR)_+(I^\star))^T x \right\| = \left\| \left( Q^\star R^\star - (TR)_+(I^\star) \right)^T x \right\|.$$
For the second factor in the bound (12), we have
$$\left| (f^\star \circ Q^\star R^\star)(x) + (f^\star \circ (TR)_+(I^\star))(x) - 2y \right| \le 2 \left| (f^\star \circ Q^\star R^\star)(x) - y \right| + \left| (f^\star \circ (TR)_+(I^\star))(x) - (f^\star \circ Q^\star R^\star)(x) \right|$$
$$\le 2 \left| (f^\star \circ Q^\star R^\star)(x) - y \right| + \left\| \left( Q^\star R^\star - (TR)_+(I^\star) \right)^T x \right\|.$$
Let $A = Q^\star R^\star - (TR)_+(I^\star)$. Substituting into (12), we have
$$\int L(x, y, f^\star \circ (TR)_+(I^\star))\, dF(x, y) - \int L(x, y, f^\star \circ Q^\star R^\star)\, dF(x, y)$$
$$\le \int \left\| A^T x \right\| \left( 2 \left| (f^\star \circ Q^\star R^\star)(x) - y \right| + \left\| A^T x \right\| \right) dF(x, y) \le \int \left\| A^T x \right\| \left( 2\eta + \left\| A^T x \right\| \right) dF(x, y)$$
$$= \int \left\| A^T x \right\| \left( 2\eta + \left\| A^T x \right\| \right) dF_X(x) = 2\eta\, \mathbb{E}\left[ \left\| A^T X \right\| \right] + \mathbb{E}\left[ \left\| A^T X \right\|^2 \right] \le 2\eta \sqrt{ \mathbb{E}\left[ \left\| A^T X \right\|^2 \right] } + \mathbb{E}\left[ \left\| A^T X \right\|^2 \right], \quad (13)$$
where the last inequality follows from Jensen's inequality.
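The Jensen step above, together with the second-moment computation that follows, can be checked exactly on a small example. The Python sketch below is an illustration only: it assumes a toy distribution (independent Rademacher coordinates, which are zero-mean, bounded, and have unit variance, so $\mathbb{E}\|A^T X\|^2$ should equal $\|A\|_F^2$), with an arbitrary small matrix $A$, and enumerates the distribution exhaustively.

```python
import itertools
import math

def exact_moments(A):
    """For X uniform on {-1, +1}^d (independent, zero-mean coordinates),
    compute E[||A^T X||] and E[||A^T X||^2] by exact enumeration.
    A is a d x k matrix given as a list of rows."""
    d, k = len(A), len(A[0])
    e_norm = e_norm_sq = 0.0
    outcomes = list(itertools.product((-1, 1), repeat=d))
    for x in outcomes:
        ax = [sum(A[i][j] * x[i] for i in range(d)) for j in range(k)]  # A^T x
        sq = sum(v * v for v in ax)
        e_norm += math.sqrt(sq)
        e_norm_sq += sq
    return e_norm / len(outcomes), e_norm_sq / len(outcomes)

A = [[0.5, 0.1], [0.2, 0.3], [0.0, 0.4]]        # arbitrary 3 x 2 example
m1, m2 = exact_moments(A)
frob_sq = sum(a * a for row in A for a in row)  # ||A||_F^2
assert m1 <= math.sqrt(m2) + 1e-12              # Jensen: E||.|| <= sqrt(E||.||^2)
assert abs(m2 - frob_sq) < 1e-12                # E||A^T X||^2 = ||A||_F^2 here
```
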
We now evaluate the expectation. We have
$$\mathbb{E}\left[ \left\| A^T X \right\|^2 \right] = \mathbb{E}\left[ \sum_{j=1}^k \left( \sum_{i=1}^d A_{ij} X_i \right)^2 \right] = \sum_{j=1}^k \sum_{i=1}^d \sum_{l=1}^d A_{ij} A_{lj}\, \mathbb{E}[X_i X_l].$$
Recall that the coordinates of the random variable $X$ are independent and have zero mean. Therefore,
$$\mathbb{E}\left[ \left\| A^T X \right\|^2 \right] = \sum_{j=1}^k \sum_{i=1}^d A_{ij}^2\, \mathbb{E}\left[ X_i^2 \right] \le C^2 \sum_{j=1}^k \sum_{i=1}^d A_{ij}^2 = C^2 \| A \|_F^2.$$
Using the fact that $Q^\star R^\star \ge 0$ and $Q^\star R^\star = (Q^\star R^\star)(I^\star)$, we have
$$\| A \|_F = \| Q^\star R^\star - (TR)_+(I^\star) \|_F \le \| Q^\star R^\star - TR \|_F \le \| Q^\star R^\star - TPR^\star \|_F + \| TPR^\star - TR \|_F$$
$$= \| (Q^\star - TP) R^\star \|_F + \| T(PR^\star - R) \|_F \le \| Q^\star - TP \|_F \| R^\star \|_F + \| T \|_F \| PR^\star - R \|_F$$
$$\le \epsilon_1 \| R^\star \|_F + \epsilon_2 \| T \|_F = \epsilon_1 \sqrt{k}\, r + \epsilon_2 \sqrt{k} = \sqrt{k}\, (\epsilon_2 + \epsilon_1 r).$$
Therefore,
$$\mathbb{E}\left[ \left\| A^T X \right\|^2 \right] \le C^2 \| A \|_F^2 \le C^2 k (\epsilon_2 + \epsilon_1 r)^2.$$
Substituting into (13), we conclude
$$\int L(x, y, f^\star \circ (TR)_+(I^\star))\, dF(x, y) - \int L(x, y, f^\star \circ Q^\star R^\star)\, dF(x, y) \le 2 \eta C \sqrt{k}\, (\epsilon_2 + \epsilon_1 r) + C^2 k (\epsilon_2 + \epsilon_1 r)^2 = z(\epsilon_1, \epsilon_2, C).$$

Proof of Lemma 3.
A similar result appears in [6]. We include the proof for completeness. We have
$$\int L(x, y, g)\, dF(x, y) = \int (g(x) - y)^2\, dF(x, y) = \int \left( g(x) - (f^\star \circ Q^\star R^\star)(x) + (f^\star \circ Q^\star R^\star)(x) - y \right)^2 dF(x, y)$$
$$= \int \left( g(x) - (f^\star \circ Q^\star R^\star)(x) \right)^2 + \left( (f^\star \circ Q^\star R^\star)(x) - y \right)^2 + 2 \left( g(x) - (f^\star \circ Q^\star R^\star)(x) \right) \left( (f^\star \circ Q^\star R^\star)(x) - y \right) dF(x, y)$$
$$= \| g - f^\star \circ Q^\star R^\star \|^2 + \int L(x, y, f^\star \circ Q^\star R^\star)\, dF(x, y) + 2\, \mathbb{E}\left[ \left( g(X) - (f^\star \circ Q^\star R^\star)(X) \right) \left( (f^\star \circ Q^\star R^\star)(X) - Y \right) \right].$$
Since $\mathbb{E}[Y \mid X] = (f^\star \circ Q^\star R^\star)(X)$, conditioning on $X$ shows that
$$\mathbb{E}\left[ \left( g(X) - (f^\star \circ Q^\star R^\star)(X) \right) \left( (f^\star \circ Q^\star R^\star)(X) - Y \right) \right] = 0.$$
Rearranging completes the proof.

Proof of Lemma 4.
Recalling that the range of any $h \in \mathcal{G}(T, \mathcal{R})$ is contained in $[0, b]$, the statement follows from Corollary 1 (p. 45) of [8].

We now work towards a proof of Lemma 5. Recall $\alpha = \alpha(\epsilon) = \epsilon \left( 2(b + \eta) \right)^{-1}$, and let
$$S = \left\{ 0,\, \alpha,\, 2\alpha,\, \dots,\, \left( \left\lceil b \alpha^{-1} \right\rceil - 1 \right) \alpha \right\}$$
be a discretization of the range $[0, b]$. For $f \in \mathcal{C}(b)$, let $g_f(x) = \max\{ q \in S : q \le f(x) \}$. In other words, the function $g_f$ is formed by rounding each value down to the nearest increment in $S$. Fix $I \subset [d]$ with $|I| = s^\star$. Recall the definition $\mathcal{G}(T, R, I) = \{ f \circ (TR)_+(I) : f \in \mathcal{C}(b) \}$, and let $\mathcal{H}(T, R, I) \triangleq \{ g_f \circ (TR)_+(I) : f \in \mathcal{C}(b) \}$. Proposition 3 will show that the set $\mathcal{L}_{\mathcal{H}(Q, R, I)}((x_1, y_1), \dots, (x_n, y_n))$ is an $\epsilon$-net for the set $\mathcal{L}_{\mathcal{G}(Q, R, I)}((x_1, y_1), \dots, (x_n, y_n))$. Next, we relate the cardinality of $\mathcal{L}_{\mathcal{H}(Q, R, I)}((x_1, y_1), \dots, (x_n, y_n))$ to a labeling number.

Definition 2 (Labeling Number [4]). For a sequence of points $x_1, \dots, x_n \in \mathbb{R}^k$ and a positive integer $m$, the labeling number $\Lambda(x_1, \dots, x_n; m)$ is the number of functions $\phi : \{x_1, \dots, x_n\} \to \{1, 2, \dots, m\}$ such that $\phi(x_i) \le \phi(x_j)$ whenever $x_i \preceq x_j$, for $i, j \in \{1, \dots, n\}$.

Let $M = (QR)_+$. Observe that the cardinality of $\mathcal{L}_{\mathcal{H}(Q, R, I)}((x_1, y_1), \dots, (x_n, y_n))$ is upper-bounded by the labeling number of the set $\{ M(I)^T x_1, \dots, M(I)^T x_n \}$ with $\lceil b \alpha^{-1} \rceil$ labels. Therefore, the value of $N_{\mathcal{G}(Q, R, I)}(\epsilon, n)$ is upper-bounded by the expected labeling number of the set $\{ M(I)^T X_1, \dots, M(I)^T X_n \}$ with $\lceil b \alpha^{-1} \rceil$ labels, and it is this expected labeling number that we now bound. Let $x(I)$ be the vector formed from the entries of $x$ that are indexed by the set $I$.
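For intuition on Definition 2, the labeling number can be computed by brute force on tiny instances. The Python sketch below (exponential in $n$, for illustration only; the point sets and counts are worked examples, not taken from the paper) enumerates all labelings and keeps the monotone ones.

```python
import itertools

def leq(u, v):
    """Coordinate-wise partial order: u ⪯ v iff u_p <= v_p for every coordinate p."""
    return all(a <= b for a, b in zip(u, v))

def labeling_number(points, m):
    """Count labelings phi: points -> {1, ..., m} with phi(x_i) <= phi(x_j)
    whenever x_i ⪯ x_j, as in Definition 2. Brute force over m^n labelings."""
    n = len(points)
    comparable = [(i, j) for i in range(n) for j in range(n)
                  if i != j and leq(points[i], points[j])]
    count = 0
    for phi in itertools.product(range(1, m + 1), repeat=n):
        if all(phi[i] <= phi[j] for i, j in comparable):
            count += 1
    return count

# Two comparable points in R^2 admit only the non-decreasing label pairs.
chain = [(0.0, 0.0), (1.0, 1.0)]
assert labeling_number(chain, 2) == 3        # (1,1), (1,2), (2,2)
# Two incomparable points impose no constraint: every labeling is feasible.
antichain = [(0.0, 1.0), (1.0, 0.0)]
assert labeling_number(antichain, 2) == 4
```
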
We will first show that the labeling number of the set $\{ M(I)^T x_1, \dots, M(I)^T x_n \}$ is upper-bounded by the labeling number of the set $\{ x_1(I), \dots, x_n(I) \}$ with the same number of labels. Observe that the points $\{ x_i(I) \}_{i=1}^n$ have dimension $s^\star$. We will then analyze the expected labeling number of a sequence of random variables $\{ W_1, \dots, W_n \}$ of dimension $s^\star$.

The following proposition will be used to establish the net property.

Proposition 3.
Let $T \in O_{d,k}$, $R \in M_{k,k}(r)$, and $I \subset [d]$. Fix $(f \circ (TR)_+(I)) \in \mathcal{G}(T, R, I)$ and the accompanying $(g_f \circ (TR)_+(I)) \in \mathcal{H}(T, R, I)$. Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Then
$$\left| L(x, y, f \circ (TR)_+(I)) - L(x, y, g_f \circ (TR)_+(I)) \right| \le \epsilon.$$

Proof.
Let $M = (TR)_+$. We have
$$\left| L(x, y, f \circ (TR)_+(I)) - L(x, y, g_f \circ (TR)_+(I)) \right| = \left| L(x, y, f \circ M(I)) - L(x, y, g_f \circ M(I)) \right|$$
$$= \left| \left( (f \circ M(I))(x) - y \right)^2 - \left( (g_f \circ M(I))(x) - y \right)^2 \right|$$
$$= \left| (f \circ M(I))(x)^2 - (g_f \circ M(I))(x)^2 - 2y \left( (f \circ M(I))(x) - (g_f \circ M(I))(x) \right) \right|$$
$$= \left| \left( (f \circ M(I))(x) - (g_f \circ M(I))(x) \right) \left( (f \circ M(I))(x) + (g_f \circ M(I))(x) - 2y \right) \right|$$
$$= \left| (f \circ M(I))(x) - (g_f \circ M(I))(x) \right| \cdot \left| (f \circ M(I))(x) + (g_f \circ M(I))(x) - 2y \right| \le \alpha \cdot 2(b + \eta) = \epsilon.$$

Let $x_1, \dots, x_n \in \mathbb{R}^d$. The following result will allow us to relate the labeling number of the set $\{ M(I)^T x_1, \dots, M(I)^T x_n \}$ to the labeling number of the set $\{ x_1(I), \dots, x_n(I) \}$.

Proposition 4.
Let $A$ be a $d \times k$ matrix with nonnegative entries. Let $x_1, \dots, x_n \in \mathbb{R}^d$. Then for $m \ge 2$,
$$\Lambda\left( A^T x_1, \dots, A^T x_n;\, m \right) \le \Lambda\left( x_1, \dots, x_n;\, m \right).$$

Proof.
Suppose $x_i \preceq x_j$. Then also $A^T x_i \preceq A^T x_j$, since $A$ has nonnegative entries. Therefore any labeling that is feasible for the points $\{ A^T x_1, \dots, A^T x_n \}$ is also feasible for the points $\{ x_1, \dots, x_n \}$.

We will now analyze the expected labeling number of a set of random variables $\{ W_1, \dots, W_n \}$. The concept of an integer partition is required for the labeling number analysis.

Definition 3 (Integer Partition (as stated in [5])). An integer partition of dimension $(k-1)$ with values in $\{0, 1, \dots, t\}$ is a collection of values $A_{i_1, i_2, \dots, i_{k-1}} \in \{0, 1, \dots, t\}$, where $i_l \in \{1, \dots, t\}$, such that $A_{i_1, \dots, i_{k-1}} \le A_{j_1, \dots, j_{k-1}}$ whenever $i_l \le j_l$ for all $l \in \{1, \dots, k-1\}$. The set of integer partitions of dimension $(k-1)$ with values in $\{0, 1, \dots, t\}$ is denoted by $P\left( [t]^{k-1} \right)$.

For an illustration of a partition with $k = 2$, see Figure 3 in [5]. The following result provides a bound on the expected labeling number of a set of points under certain distributional assumptions. It will be used to bound the expected labeling number of the set $\{ X_1, \dots, X_n \}$.

Lemma 7.
Let $m \in \mathbb{N}$ and $B > 0$. Let $W \in \mathbb{R}^d$ be a random variable with support contained in the set $[-B, B]^d$. Suppose that the density $f_W(w)$ is upper-bounded by $D$. Let $W_1, \dots, W_n$ be independent samples with distribution $f_W$. Then
$$\mathbb{E}\left[ \Lambda(W_1, \dots, W_n; m) \right] \le \exp\left[ \left( 2(m-1) + D\, 2^{m + 2d - 1} B^d \right) n^{\frac{d-1}{d}} \right].$$

Proof.
Note that since the labeling number is translation-invariant, the same result applies to a random variable $W$ with support contained in $[0, 2B]^d$. [5] considered the case of $W$ uniform on $[0, 1]^d$ (i.e., $D = 1$). We now adapt their proof. By Proposition 5 in [5],
$$\Lambda(w_1, \dots, w_n; m) \le \left( \Lambda(w_1, \dots, w_n; 2) \right)^{m-1}$$
for any $w_1, \dots, w_n \in \mathbb{R}^d$. For clarity of notation, write $\Lambda(w_1, \dots, w_n)$ in place of $\Lambda(w_1, \dots, w_n; 2)$. We therefore have
$$\mathbb{E}\left[ \Lambda(W_1, \dots, W_n; m) \right] \le \mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right].$$
Let $t \in \mathbb{N}$. In the uniform case, we have by Lemma 5 of [5]
$$\mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right] \le \left| P\left( [t]^{d-1} \right) \right|^{m-1} \mathbb{E}\left[ 2^{(m-1) N} \right],$$
where $N \sim \mathrm{Binom}\left( n, \frac{t^d - (t-1)^d}{t^d} \right)$. The value $\frac{t^d - (t-1)^d}{t^d}$ is the probability that a uniform random variable in $[0, 1]^d$ falls in one of $t^d - (t-1)^d$ cubes out of the $t^d$ cubes that partition $[0, 1]^d$. To adapt the proof, we instead partition $[-B, B]^d$ into $t^d$ cubes. Each cube therefore has volume $\left( \frac{2B}{t} \right)^d$, and probability mass upper-bounded by $D \left( \frac{2B}{t} \right)^d$. We conclude that
$$\mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right] \le \left| P\left( [t]^{d-1} \right) \right|^{m-1} \mathbb{E}\left[ 2^{(m-1) N'} \right], \quad (14)$$
where $N' \sim \mathrm{Binom}\left( n,\, D \left( \frac{2B}{t} \right)^d \left( t^d - (t-1)^d \right) \right)$. Let $p = D \left( \frac{2B}{t} \right)^d \left( t^d - (t-1)^d \right)$.

[12] showed that
$$\left| P\left( [t]^{d-1} \right) \right| \le \binom{2t}{t}^{t^{d-2}}.$$
We have
$$\mathbb{E}\left[ 2^{(m-1) N'} \right] = \mathbb{E}\left[ e^{\log(2)(m-1) N'} \right] = M_{N'}\left( \log(2)(m-1) \right),$$
where $M_{N'}$ is the moment-generating function of the random variable $N'$. It holds that $M_{N'}(\theta) = (1 - p + p e^{\theta})^n$. Substituting into (14),
$$\mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right] \le \left( \binom{2t}{t}^{t^{d-2}} \right)^{m-1} \left( 1 - p + p e^{\log(2)(m-1)} \right)^n = \binom{2t}{t}^{(m-1) t^{d-2}} \left( 1 - p + p\, 2^{m-1} \right)^n$$
$$\le 2^{2(m-1) t^{d-1}} \left( 1 + p\, 2^{m-1} \right)^n = \exp\left[ 2(m-1) t^{d-1} \log 2 + n \log\left( 1 + p\, 2^{m-1} \right) \right].$$
Using the facts that $\log 2 \le 1$ and $\log(1 + x) \le x$, we have
$$\mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right] \le \exp\left[ 2(m-1) t^{d-1} + n p\, 2^{m-1} \right] = \exp\left[ 2(m-1) t^{d-1} + D\, 2^{m-1} \left( \frac{2B}{t} \right)^d \left( t^d - (t-1)^d \right) n \right].$$
By the Binomial Theorem,
$$t^d - (t-1)^d = \sum_{i=1}^d \binom{d}{i} t^{d-i} (-1)^{i+1} \le \left( \sum_{i=1}^d \binom{d}{i} \right) \max_{i \in \{1, \dots, d\}} t^{d-i} = (2^d - 1)\, t^{d-1}.$$
Substituting,
$$\mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right] \le \exp\left[ 2(m-1) t^{d-1} + D\, 2^{m-1} \left( \frac{2B}{t} \right)^d (2^d - 1)\, t^{d-1} n \right]$$
$$= \exp\left[ 2(m-1) t^{d-1} + D\, 2^{m-1} (2B)^d (2^d - 1)\, t^{-1} n \right] \le \exp\left[ 2(m-1) t^{d-1} + D\, 2^{m + 2d - 1} B^d\, t^{-1} n \right].$$
Let $t = n^{1/d}$. Substituting,
$$\mathbb{E}\left[ \left( \Lambda(W_1, \dots, W_n) \right)^{m-1} \right] \le \exp\left[ 2(m-1)\, n^{\frac{d-1}{d}} + D\, 2^{m + 2d - 1} B^d\, n^{\frac{d-1}{d}} \right] = \exp\left[ \left( 2(m-1) + D\, 2^{m + 2d - 1} B^d \right) n^{\frac{d-1}{d}} \right].$$

We now prove Lemma 5.
Proof of Lemma 5.
Recall the definitions $\mathcal{G}(T, R, I) = \{ f \circ (TR)_+(I) : f \in \mathcal{C}(b) \}$ and $\mathcal{H}(T, R, I) = \{ g_f \circ (TR)_+(I) : f \in \mathcal{C}(b) \}$ for $R \in \mathcal{R}$. We have
$$N_{\mathcal{G}(T, \mathcal{R})}(\epsilon, n) \le \sum_{R \in \mathcal{R}} \sum_{I \subset [d] : |I| = s^\star} N_{\mathcal{G}(T, R, I)}(\epsilon, n) \le \binom{d}{s^\star} |\mathcal{R}| \max_{R \in \mathcal{R},\, I \subset [d] : |I| = s^\star} N_{\mathcal{G}(T, R, I)}(\epsilon, n).$$
Consider an arbitrary $R \in \mathcal{R}$ and $I \subset [d]$ with $|I| = s^\star$. By Proposition 3, the set $\mathcal{L}_{\mathcal{H}(T, R, I)}((x_1, y_1), \dots, (x_n, y_n))$ is an $\epsilon$-net for the set $\mathcal{L}_{\mathcal{G}(T, R, I)}((x_1, y_1), \dots, (x_n, y_n))$. Let $M = (TR)_+$. Observe that the cardinality of $\mathcal{L}_{\mathcal{H}(T, R, I)}((x_1, y_1), \dots, (x_n, y_n))$ is upper-bounded by the labeling number of the set $\{ (M(I))^T x_1, \dots, (M(I))^T x_n \}$ with $\lceil b \alpha^{-1} \rceil$ labels. Recall the definition of $x(I)$, and similarly let $M(I)$ be the matrix formed from the rows of $M$ that are indexed by the set $I$. For $x \in \mathbb{R}^d$, it holds that $(M(I))^T x = (M(I))^T (x(I))$. Note that $x(I)$ is an $s^\star$-dimensional vector. By Proposition 4,
$$\Lambda\left( (M(I))^T x_1, \dots, (M(I))^T x_n;\, \left\lceil b \alpha^{-1} \right\rceil \right) \le \Lambda\left( x_1(I), \dots, x_n(I);\, \left\lceil b \alpha^{-1} \right\rceil \right).$$
Therefore, $N_{\mathcal{G}(T, R, I)}(\epsilon, n)$ is upper-bounded by the expected labeling number of the set $\{ X_1(I), \dots, X_n(I) \}$ with $\lceil b \alpha^{-1} \rceil$ labels. Applying Lemma 7 (setting $d = s^\star$, $m = \lceil b \alpha^{-1} \rceil$, $B = C$, and $D = (p^\star)^{s^\star}$), we have
$$N_{\mathcal{G}(T, R, I)}(\epsilon, n) \le \exp\left[ \left( 2 \left( \left\lceil b \alpha^{-1} \right\rceil - 1 \right) + (p^\star)^{s^\star}\, 2^{\lceil b \alpha^{-1} \rceil + 2 s^\star - 1} C^{s^\star} \right) n^{\frac{s^\star - 1}{s^\star}} \right] \le \exp\left[ \left( \frac{2b}{\alpha} + 2^{b/\alpha} (4 p^\star C)^{s^\star} \right) n^{\frac{s^\star - 1}{s^\star}} \right].$$
We conclude that
$$N_{\mathcal{G}(T, \mathcal{R})}(\epsilon, n) \le \binom{d}{s^\star} |\mathcal{R}| \exp\left[ \left( \frac{2b}{\alpha} + 2^{b/\alpha} (4 p^\star C)^{s^\star} \right) n^{\frac{s^\star - 1}{s^\star}} \right].$$
Recalling that $|\mathcal{R}| = N^k$ concludes the proof.

Remark 2.
In the proof of Lemma 5, we have taken advantage of the fact that $s^\star$ is a constant. Another approach would be to analyze the labeling number directly in $\mathbb{R}^k$, since $k$ is also a constant. However, there is a technical hurdle to overcome. The grid approach in Lemma 7 works well when the distribution of the random variable is not too concentrated. Without good control over the induced distribution of $M(I)^T X$, it would be difficult to carry out a similar argument.

Proof of Theorem 1. We show that Algorithm 3 achieves the desired statistical guarantee, using the bound in Theorem 3. We choose
$$N = \frac{ \log\left( \frac{\epsilon}{k} \right) }{ \log\left( 1 - \frac{ \left| S_r^{k-1} \cap B\left( e_1, \frac{\delta}{\sqrt{k}} \right) \right| }{ \left| S_r^{k-1} \right| } \right) }.$$
The choice of $N$ leads to
$$k \left( 1 - \frac{ \left| S_r^{k-1} \cap B\left( e_1, \frac{\delta}{\sqrt{k}} \right) \right| }{ \left| S_r^{k-1} \right| } \right)^N \le \epsilon.$$
Similarly, the term involving $d$ is at most $\epsilon$, by the assumption on $d$.

Before bounding the last term, we need to control the value of $\epsilon_1$. Observe that the function $z(\epsilon_1, \epsilon_2, C)$ is increasing in both arguments. Therefore, by setting $\rho \sqrt{s^\star} \lambda \le \delta$, we ensure that $\epsilon_1 \le \delta$. Substituting in the value of $\lambda$ and solving,
$$\rho \sqrt{s^\star} \cdot \sqrt{ \frac{\theta \log(d)}{n} } \le \delta \iff n \ge \frac{ s^\star \theta \log(d)\, \rho^2 }{ \delta^2 }.$$
We now bound the last term:
$$4 \binom{d}{s^\star} N^k \exp\left[ \left( \frac{2b}{\alpha} + 2^{b/\alpha} (4 p^\star C)^{s^\star} \right) n^{\frac{s^\star - 1}{s^\star}} - \frac{\epsilon^2 n}{b^4} \right] \le \exp\left[ \log(4) + k \log(N) + s^\star \log(d) + \left( \frac{2b}{\alpha} + 2^{b/\alpha} (4 p^\star C)^{s^\star} \right) n^{\frac{s^\star - 1}{s^\star}} - \frac{\epsilon^2 n}{b^4} \right]. \quad (15)$$
Recall that $\alpha = \epsilon \left( 2(b + \eta) \right)^{-1}$. We see that there exists $t_0 = t_0(N, C, b, s^\star, p^\star, k, \eta)$ such that if $n \ge \max\left\{ t_0,\, s^\star \theta \log(d)\, \rho^2 \delta^{-2} \right\}$, then
$$\log(4) + k \log(N) + \left( \frac{2b}{\alpha} + 2^{b/\alpha} (4 p^\star C)^{s^\star} \right) n^{\frac{s^\star - 1}{s^\star}} \le \frac{\epsilon^2 n}{2 b^4}.$$
For such $n$, we then bound (15) by $\exp\left[ s^\star \log(d) - \frac{\epsilon^2 n}{2 b^4} \right]$.
Setting this quantity to be less than $\epsilon$, we obtain
$$\exp\left[ s^\star \log(d) - \frac{\epsilon^2 n}{2 b^4} \right] \le \epsilon \iff s^\star \log(d) - \frac{\epsilon^2 n}{2 b^4} \le \log(\epsilon) \iff n \ge \frac{2 b^4}{\epsilon^2} \left( s^\star \log(d) + \log\left( \frac{1}{\epsilon} \right) \right).$$
Taking
$$n = \max\left\{ \frac{2 b^4}{\epsilon^2} \left( s^\star \log(d) + \log\left( \frac{1}{\epsilon} \right) \right),\; \frac{s^\star \theta \log(d)\, \rho^2}{\delta^2},\; t_0 \right\}$$
completes the proof.

We now modify Algorithm 2 to ensure the Lipschitz property, in addition to the coordinate-wise monotone property. When we only needed to ensure monotonicity, the interpolation step was straightforward; interpolation was possible as long as the points themselves satisfied the monotonicity property. The situation is slightly more complicated for Lipschitz functions. Algorithm 4 ensures that the estimated points are interpolable with respect to the class $\mathcal{L}(b)$, as defined below.

Definition 4.
We say that a collection of points $(x_i, y_i)_{i=1}^n \in \mathbb{R}^k \times \mathbb{R}$ is interpolable with respect to a function class $\mathcal{F}$ if there exists $f \in \mathcal{F}$ such that $f(x_i) = y_i$ for each $i \in [n]$.

Algorithm 4 below finds the optimal index set and function values on a given set of points, compatible with interpolability. Binary variables $v_l$ determine the index set $I$. The variables $F_i$ represent the estimated function values at the data points $X_i$. Auxiliary variables $z_{ij}$ and $w_{ijp}$ are used to model the monotonicity and Lipschitz constraints.

Algorithm 4
Integer Programming Sparse Matrix Isotonic Regression (Lipschitz)
Input:
Values $(X_i, Y_i)_{i=1}^n \in \mathbb{R}^d \times \mathbb{R}$, sparsity level $s$, $M \ge 0 \in \mathbb{R}^{d \times k}$, $C > 0$, $b > 0$

Output:
An index set $I \subset [d]$ satisfying $|I| = s$; values $F_1, F_2, \dots, F_n \in [0, b]$ such that the points $(M(I)^T X_i, F_i)_{i=1}^n$ are interpolable by a coordinate-wise monotone $1$-Lipschitz function.

Let $B = 2 C \sum_{l=1}^d \sum_{p=1}^k M_{lp}$. Solve the following optimization problem.
$$\min_{v, F, z, w} \sum_{i=1}^n (Y_i - F_i)^2 \quad (16)$$
$$\text{s.t.} \quad \sum_{l=1}^d v_l = s \quad (17)$$
$$b z_{ij} \ge F_i - F_j \quad \forall\, i, j \in [n] \quad (18)$$
$$(F_i - F_j) \le \sum_{p=1}^k w_{ijp} \left( \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) \right) + (1 - z_{ij})\, b \quad \forall\, i, j \in [n] \quad (19)$$
$$-B (1 - w_{ijp}) \le \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) \le B w_{ijp} \quad \forall\, i, j \in [n],\, p \in [k] \quad (20)$$
$$z_{ij} \in \{0, 1\} \quad \forall\, i, j \in [n], \qquad v_l \in \{0, 1\} \quad \forall\, l \in [d], \qquad w_{ijp} \in \{0, 1\} \quad \forall\, i, j \in [n],\, p \in [k], \qquad F_i \in [0, b] \quad \forall\, i \in [n].$$
Return the set $I_n = \{ l \in [d] : v_l = 1 \}$ and the values $F_1, F_2, \dots, F_n$.

Remark 3.
Note that Constraints (19) contain products of binary variables. We may encode arbitrary products of binary variables using linear constraints, as follows. Suppose $x$ and $y$ are binary variables, and we wish to encode $z = xy$. This is equivalent to the constraints
$$x + y - 1 \le z \le \frac{1}{2}(x + y), \qquad z \in \{0, 1\}.$$
Longer products may be encoded recursively.

We apply the construction of [1] to find a coordinate-wise monotone, $1$-Lipschitz interpolation.

Proposition 5.
Suppose the points $(x_i, y_i)_{i=1}^n \in \mathbb{R}^k \times [0, b]$ satisfy
$$y_i - y_j \le \left\| (x_i - x_j)_+ \right\| \quad (21)$$
for each pair $(i, j) \in [n]^2$. Let $\hat{g}(x) = \max_i \left\{ y_i - \left\| (x_i - x)_+ \right\| \right\}$, and let $\hat{f}(x) = \max\{ \hat{g}(x), 0 \}$. Then $\hat{f} \in \mathcal{L}(b)$. Furthermore, $\hat{f}$ interpolates the points; i.e., $\hat{f}(x_i) = y_i$ for each $i \in [n]$.

The proof follows from [1]. We therefore obtain the following approach for interpolation.
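The interpolant of Proposition 5 is simple to implement. The Python sketch below is an illustration under one assumption: the excerpt writes $\|(\cdot)_+\|$ without pinning down the norm, and the sum of coordinate-wise positive parts (an $\ell_1$-type norm, matching the coordinate-wise expansion used later in the proof of Proposition 6) is assumed here. The sample points are hypothetical.

```python
def pos_part_norm(u):
    """||u_+||: sum of the coordinate-wise positive parts (assumed l1-type norm)."""
    return sum(max(c, 0.0) for c in u)

def satisfies_21(points):
    """Check condition (21): y_i - y_j <= ||(x_i - x_j)_+|| for all pairs (i, j)."""
    return all(yi - yj <= pos_part_norm([a - b for a, b in zip(xi, xj)]) + 1e-12
               for xi, yi in points for xj, yj in points)

def make_interpolant(points):
    """Proposition 5: f_hat(x) = max(g_hat(x), 0), where
    g_hat(x) = max_i { y_i - ||(x_i - x)_+|| }."""
    def f_hat(x):
        g_hat = max(yi - pos_part_norm([a - b for a, b in zip(xi, x)])
                    for xi, yi in points)
        return max(g_hat, 0.0)
    return f_hat

pts = [((0.0, 0.0), 0.5), ((1.0, 1.0), 1.0)]   # hypothetical data in R^2 x [0, b]
assert satisfies_21(pts)
f = make_interpolant(pts)
assert abs(f((0.0, 0.0)) - 0.5) < 1e-9         # interpolates the first point
assert abs(f((1.0, 1.0)) - 1.0) < 1e-9         # interpolates the second point
assert f((2.0, 2.0)) >= f((1.0, 1.0)) - 1e-9   # coordinate-wise monotone step
```

Note that taking the pointwise maximum with the zero function preserves monotonicity and the Lipschitz property, which is exactly how the proof of Proposition 5 proceeds.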
Algorithm 5
Monotone Lipschitz Interpolation
Input:
Points $(x_i, y_i)_{i=1}^n \in \mathbb{R}^k \times [0, b]$ satisfying (21) for each $i, j$.

Output:
An estimated function $\hat{f} \in \mathcal{L}(b)$ that interpolates the points.

Let $\hat{g}(x) = \max_i \left\{ y_i - \left\| (x_i - x)_+ \right\| \right\}$. Return $\hat{f}(x) = \max\{ \hat{g}(x), 0 \}$.

Algorithm 6

Input:
Values $(X_i, Y_i)_{i=1}^{2n} \in \mathbb{R}^d \times \mathbb{R}$, sparsity level $s$, $M \ge 0 \in \mathbb{R}^{d \times k}$, $C > 0$, $b > 0$

Output: $I_n \subset [d]$ with $|I_n| = s^\star$, and $f_n \in \mathcal{L}(b)$

Apply Algorithm 4 to input $(X_i, Y_i)_{i=n+1}^{2n}$, $s^\star$, $M$, $C$, and $b$, obtaining the index set $I$ and values $F_1, \dots, F_n$. Apply Algorithm 5 to input $(M(I)^T X_{n+i}, F_i)_{i=1}^n$, obtaining the function $\hat{f}$. Return $(I, \hat{f})$.

Proposition 6.
Suppose $X_i \in [-C, C]^d$ for $i \in [n]$. On input $(X_i, Y_i)_{i=1}^n$, $s$, $M$, $C$, $b$, Algorithm 6 finds a function $\hat{f}_n \in \mathcal{L}(b)$ and an index set $I_n$ that minimize the empirical loss $\sum_{i=1}^n L(X_i, Y_i, f \circ M(I))$, over functions $f \in \mathcal{L}(b)$ and index sets $I$ with cardinality $s$.

We now modify Algorithm 3 to include the Lipschitz assumption.
Algorithm 7
MMI Regression (Lipschitz)
Input: $N \in \mathbb{N}$, values $(X_1, Y_1), \dots, (X_{2n}, Y_{2n})$, $C > 0$, $b > 0$, $\tau > 0$, and $\lambda > 0$

Output: $f_n \in \mathcal{L}(b)$, $Q_n \in O_{d,k}$, $R_n \in M_{k,k}(r)$, and $I_n \subset [d]$ with $|I_n| = s^\star$

Construct a random near-net $\mathcal{R}(N)$. Produce an estimate $Q_n$ using Algorithm 1 applied to $(X_i, Y_i)_{i=1}^n$, $\tau$, and $\lambda$.

for each $R \in \mathcal{R}$ do: Let $M = (Q_n R)_+$. Apply Algorithm 6 to input $(X_i, Y_i)_{i=n+1}^{2n} \in \mathbb{R}^d \times \mathbb{R}$, $s^\star$, $M$, $C$, and $b$, obtaining the index set $I_R$ and function $f_R$. end for

Return the tuple $(f_R, Q_n, R, I_R)$ with the smallest empirical loss.

The proof of Theorem 3 carries through exactly for Algorithm 7, since $\mathcal{L}(b) \subset \mathcal{C}(b)$. We note that a tighter analysis of the estimation error incurred by using Algorithm 7 would take advantage of the Lipschitz property of the estimated function.

In order to prove Proposition 5, we need to know when a collection of points is interpolable by a coordinate-wise monotone and $1$-Lipschitz function. The following result provides a necessary and sufficient condition for interpolability.

Proposition 7 (From Proposition 3.3 and Proposition 4.1 in [1]). A collection of points $(x_i, y_i)_{i=1}^n \in \mathbb{R}^k \times \mathbb{R}$ is interpolable with respect to the class of coordinate-wise monotone and $1$-Lipschitz functions if and only if
$$y_i - y_j \le \left\| (x_i - x_j)_+ \right\| \quad \text{for all } i, j \in [n].$$
Further, if the collection is interpolable, then the function $\hat{g}(x) = \max_i \left\{ y_i - \left\| (x_i - x)_+ \right\| \right\}$ is an interpolation that is coordinate-wise monotone and $1$-Lipschitz.

Proof of Proposition 5. By Proposition 7, the data admits an interpolation by a coordinate-wise monotone $1$-Lipschitz function. Further, the function $\hat{g}$ interpolates the data, and is coordinate-wise monotone and $1$-Lipschitz. Since $y_i \ge 0$ for all $i$, the function $\hat{f}$ interpolates the data also. The zero function is $1$-Lipschitz and coordinate-wise monotone.
Therefore, $\hat{f}$, which is the point-wise maximum of two $1$-Lipschitz and coordinate-wise monotone functions ($\hat{g}$ and the zero function), is itself $1$-Lipschitz and coordinate-wise monotone.

It remains to show that $0 \le \hat{f}(x) \le b$ for all $x$. Clearly $\hat{f}(x) \ge 0$ for all $x$. Since $y_i \le b$ for all $i$, it holds that
$$\hat{g}(x) \le \max_i \left\{ b - \left\| (x_i - x)_+ \right\| \right\} \le b \implies \hat{f}(x) \le b.$$

Proof of Proposition 6.
First we show that Algorithm 4 finds an index set $I$ and values $F_1, \dots, F_n \in [0, b]$ minimizing $\sum_{i=1}^n (Y_i - F_i)^2$, such that the points $(M(I)^T X_i, F_i)_{i=1}^n$ are interpolable by a coordinate-wise monotone $1$-Lipschitz function. Later, we will show that the points are in fact interpolable by a coordinate-wise monotone $1$-Lipschitz function with range $[0, b]$, a more restrictive requirement.

Let $I = \{ l : v_l = 1 \}$. Constraint (17) ensures that exactly $s$ of the $v_l$ variables are set to $1$, so that $|I| = s$. Given this index set, we will show that the points $(M(I)^T X_i, F_i)_{i=1}^n$ are interpolable by a coordinate-wise monotone $1$-Lipschitz function. By Proposition 7, this is equivalent to
$$F_i - F_j \le \left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\|$$
for each $i, j$. First observe that this is equivalent to: either (a) $F_i \le F_j$, or (b) $(F_i - F_j) \le \left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\|$.

We now show how the constraints encode this condition. First suppose $F_i > F_j$. Then by Constraint (18), $z_{ij} = 1$, and
$$F_i - F_j \le \left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\| \iff (F_i - F_j) \le \left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\| + (1 - z_{ij})\, b.$$
Next suppose $F_i \le F_j$. Then $z_{ij}$ is free to equal $0$, so that
$$(F_i - F_j) \le \left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\| + (1 - z_{ij})\, b$$
holds regardless. We conclude that the points $(M(I)^T X_i, F_i)_{i=1}^n$ are interpolable by a coordinate-wise monotone $1$-Lipschitz function if and only if
$$(F_i - F_j) \le \left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\| + (1 - z_{ij})\, b \quad (22)$$
for each pair $(i, j)$, where $z_{ij} = 1$ if $F_i > F_j$. Expanding the right hand side of (22),
$$\left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\| = \sum_{p=1}^k \left( \sum_{l=1}^d (M(I)^T)_{pl} (X_{il} - X_{jl}) \right)_+ = \sum_{p=1}^k \left( \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) \right)_+.$$
Constraint (20) forces
$$w_{ijp} = \begin{cases} 1 & \text{if } \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) > 0, \\ 0 & \text{if } \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) < 0, \end{cases}$$
with $w_{ijp}$ free when the sum equals zero.
Therefore,
$$\left\| \left( M(I)^T X_i - M(I)^T X_j \right)_+ \right\| = \sum_{p=1}^k w_{ijp} \left( \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) \right).$$
We conclude that the points $(M(I)^T X_i, F_i)_{i=1}^n$ are interpolable by a coordinate-wise monotone $1$-Lipschitz function if and only if
$$(F_i - F_j) \le \sum_{p=1}^k w_{ijp} \left( \sum_{l=1}^d v_l M_{lp} (X_{il} - X_{jl}) \right) + (1 - z_{ij})\, b$$
for each $(i, j)$, which is exactly Constraints (19).

The objective (16) minimizes the loss on the samples. Finally, the interpolation step produces a coordinate-wise monotone $1$-Lipschitz function with range $[0, b]$.
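The product-of-binary-variables encoding from Remark 3, which underlies products such as $w_{ijp} v_l$ in Constraints (19), can be verified exhaustively. The Python sketch below checks one standard linearization of $z = xy$ over all binary assignments; it is a minimal illustration, not part of the algorithm itself.

```python
import itertools

def feasible_z(x, y):
    """Binary z satisfying x + y - 1 <= z <= (x + y)/2, a linearization of z = x*y."""
    return [z for z in (0, 1) if x + y - 1 <= z and 2 * z <= x + y]

# Over every binary assignment of (x, y), the constraints admit exactly one
# feasible z, and it equals the product x*y.
for x, y in itertools.product((0, 1), repeat=2):
    assert feasible_z(x, y) == [x * y]

# Longer products are handled recursively, e.g. x1*x2*x3 via z12 = x1*x2,
# then z123 = z12*x3.
```
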