Coordinate Methods for Matrix Games
Yair Carmon Yujia Jin Aaron Sidford Kevin Tian {yairc,yujiajin,sidford,kjtian}@stanford.edu
Abstract
We develop primal-dual coordinate methods for solving bilinear saddle-point problems of the form min_{x∈X} max_{y∈Y} y^⊤Ax, which contain linear programming, classification, and regression as special cases. Our methods push existing fully stochastic sublinear methods and variance-reduced methods towards their limits in terms of per-iteration complexity and sample complexity. We obtain nearly-constant per-iteration complexity by designing efficient data structures leveraging Taylor approximations to the exponential and a binomial heap. We improve sample complexity via low-variance gradient estimators using dynamic sampling distributions that depend on both the iterates and the magnitude of the matrix entries.

Our runtime bounds improve upon those of existing primal-dual methods by a factor depending on sparsity measures of the m by n matrix A. For example, when rows and columns have constant ℓ1/ℓ2 norm ratios, we offer improvements by a factor of m + n in the fully stochastic setting and √(m + n) in the variance-reduced setting. We apply our methods to computational geometry problems, namely minimum enclosing ball, maximum inscribed ball, and linear regression, and obtain improved complexity bounds. For linear regression with an elementwise nonnegative matrix, our guarantees improve on exact gradient methods by a factor of √(nnz(A)/(m + n)).
1 Introduction
Bilinear minimax problems of the form

  min_{x∈X} max_{y∈Y} y^⊤Ax,  where A ∈ R^{m×n},   (1)

are fundamental to machine learning, economics and theoretical computer science [26, 46, 13]. We focus on three important settings characterized by different domain geometries. When X and Y are probability simplices (which we call the ℓ1-ℓ1 setting), the problem (1) corresponds to a zero-sum matrix game and also to a linear program in canonical feasibility form. The ℓ2-ℓ1 setting, where X is a Euclidean ball and Y is a simplex, is useful for linear classification (hard-margin support vector machines) as well as problems in computational geometry [10]. Further, the ℓ2-ℓ2 setting, where both X and Y are Euclidean balls (with general center), includes linear regression.

Many problems of practical interest are sparse, i.e., the number of nonzero elements in A, which we denote by nnz, satisfies nnz ≪ mn. Examples include linear programs with constraints involving few variables, linear classification with 1-hot-encoded features, and linear systems that arise from physical models with local interactions. The problem description size nnz plays a central role in several runtime analyses of algorithms for solving the problem (1).

However, sparsity is not an entirely satisfactory measure of instance complexity: it is not continuous in the elements of A and consequently cannot reflect the simplicity of "nearly sparse" instances with many small (but nonzero) elements. Measures of numerical sparsity, such as the ℓ1-to-ℓ2 norm ratio, can fill this gap [17]. Indeed, many problems encountered in practice are numerically sparse. Examples include linear programming constraints of the form x_j ≥ (1/n)Σ_i x_i, linear classification with neural network activations as features, and linear systems arising from physical models with interactions whose strength decays with distance.

Existing bilinear minimax solvers do not exploit the numerical sparsity of A and their runtime guarantees do not depend on it: the basic limitation of these methods is that they do not directly access the large matrix entries, and instead sample the full columns and rows in which they occur. To overcome this limitation, we propose methods that access A a single entry at a time, leverage numerical sparsity by accessing larger coordinates more frequently, and enjoy runtime guarantees that depend explicitly on numerical sparsity measures. For numerically sparse large-scale instances our runtimes are substantially better than the previous state-of-the-art. Moreover, our runtimes subsume the previous state-of-the-art dependence on nnz and on rcs, the maximum number of nonzeros in any row or column.

In addition to proposing algorithms with improved runtimes, we develop two techniques that may be of broader interest. First, we design non-uniform sampling schemes that minimize regret bounds; we use a general framework that unifies the Euclidean and (local-norm) simplex geometries, possibly facilitating future extension. Second, we build a data structure capable of efficiently maintaining and sampling from multiplicative weights iterations (i.e., entropic projections) with a fixed dense component. This data structure overcomes limitations of existing techniques for maintaining entropic projections, and we believe it may prove effective in other settings where such projections appear.
1.1 Our results

Table 2 summarizes our runtime guarantees and puts them in the context of the best existing results. We consider methods that output (expected) ε-accurate solutions of the saddle-point problem (1), namely a pair x, y satisfying

  E[ max_{v∈Y} v^⊤Ax − min_{u∈X} y^⊤Au ] ≤ ε.

(Several of the methods we compare against address the more general problem min_{x∈X} max_{y∈Y} f(x, y); we state their guarantees specialized to f(x, y) = y^⊤Ax.) Each algorithm presents a different tradeoff between per-iteration complexity and the required iteration count, corresponding to the matrix access modality: exact gradient methods compute matrix-vector products in each iteration, row-column stochastic gradient methods sample a row and a column in each iteration, and our proposed coordinate methods take this tradeoff to an extreme by sampling a single coordinate of the matrix per iteration. In addition, variance reduction (VR) schemes combine fast stochastic gradient computations with infrequent exact gradient computations, maintaining the amortized per-iteration cost of the stochastic scheme while reducing the total iteration count for sufficiently small ε.

The runtimes in Table 2 depend on the numerical range of A through a matrix norm L that changes with both the problem geometry and the type of matrix access; we use L_mv, L_rc and L_co to denote the constants corresponding to matrix-vector products, row-column queries and coordinate queries, respectively. Below, we describe these runtimes in detail. In the settings we study, our results are the first theoretical demonstration of runtime gains arising from sampling a single coordinate of A at a time, as opposed to entire rows and columns.
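To make the accuracy criterion concrete, here is a minimal sketch (our own illustration, not part of the paper) of the quantity inside the expectation above for the ℓ1-ℓ1 setting, where both best responses are attained at single coordinates and the gap is therefore computable in O(nnz) time.

```python
import numpy as np

def l1_l1_gap(A, x, y):
    """Duality gap of (x, y) for min_{x in simplex} max_{y in simplex} y^T A x.

    For simplex domains the best responses are vertices, so
    gap = max_i (A x)_i - min_j (A^T y)_j.
    """
    return np.max(A @ x) - np.min(A.T @ y)

# Tiny sanity check on a random game: uniform strategies typically have a
# positive gap, which shrinks as the strategies approach a saddle point.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
x = np.full(4, 1 / 4)
y = np.full(5, 1 / 5)
print(l1_l1_gap(A, x, y))
```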
Coordinate stochastic gradient methods. We develop coordinate stochastic gradient estimators which allow per-iteration cost Õ(1) and iteration count Õ(n + m + (L_co/ε)²). We define L_co in Table 1; for each domain geometry, the quantity L_co²/L_rc² is a measure of the numerical sparsity of A, satisfying 1 ≤ L_co²/L_rc² ≤ rcs.

Every iteration of our method requires sampling an element in a row or a column with probability proportional to (a power of) the magnitude of its entries. Assuming a matrix access model that allows such sampling in time Õ(1) (similarly to [5, 41, 15]), the total runtime of our method is Õ(n + m + (L_co/ε)²). In this case, for numerically sparse problems such that L_co = O(L_rc), the proposed coordinate methods outperform row-column sampling by a factor of m + n. Moreover, the bound L_co ≤ L_rc·√(m + n) implies that our runtime is never worse than that of row-column methods. When only coordinate access to the matrix A is initially available, we may implement the required sampling access via preprocessing in time O(nnz). This changes the runtime to Õ(nnz + (L_co/ε)²), so that the comparison above holds only when (L_co/ε)² = Ω̃(nnz). In that regime, the variance reduction technique we describe below provides even stronger guarantees.
Coordinate methods with variance reduction. Using our recently proposed framework [8], we design a variance reduction algorithm with amortized per-iteration cost Õ(1), required iteration count Õ(√nnz · L_co/ε) and total running time Õ(nnz + √nnz · L_co/ε). In the numerically sparse regime L_co = O(L_rc), our runtime improves on row-column VR by a factor of √(nnz/(m + n)), and in general the bound L_co ≤ L_rc·√(m + n) guarantees it is never worse. Since variance reduction methods always require a single pass over the data to compute an exact gradient, this comparison holds regardless of the matrix access model. In the ℓ2-ℓ2 setting we note that for elementwise nonnegative matrices, L_co = max{‖A1_n‖_2, ‖A^⊤1_m‖_2} ≤ L_mv·√(m + n), and consequently our method outperforms exact gradient methods by a factor of √(nnz/(m + n)), even without any numerical or spectral sparsity in A. Notably, this is the same factor of improvement that row-column VR achieves over exact gradient methods in the ℓ1-ℓ1 and ℓ2-ℓ1 regimes.

Footnote 1: Interior point methods offer an alternative tradeoff between iteration cost and iteration count: the number of required iterations depends on 1/ε only logarithmically, but every iteration is costly, requiring the solution of a linear system, which in general takes time Ω(min{m, n}²). In the ℓ1-ℓ1 geometry, the best known runtimes for interior point methods are Õ((nnz + min{m, n}²)·√(min{m, n})) [23], Õ(max{m, n}^ω) [12], and Õ(mn + min{m, n}³) [45]. In this paper we are mainly interested in the large-scale low-accuracy regime with L/ε < min{m, n}, where the runtimes described in Table 2 are favorable (with the exception of [45] in certain cases). Our methods take only a few passes over the data, which is not the case for many interior-point methods [23, 12]. Also, our methods do not rely on a general (ill-conditioned) linear system solver, which is a key ingredient in interior point methods.

        | L_mv (matrix-vector) | L_rc (row-column) | L_co (coordinate)
ℓ1-ℓ1   | max_{i,j} |A_ij|     | max_{i,j} |A_ij|  | max{ max_i ‖A_i:‖_2, max_j ‖A_:j‖_2 }
ℓ2-ℓ1   | max_i ‖A_i:‖_2       | max_i ‖A_i:‖_2    | max{ max_i ‖A_i:‖_1, ‖A‖_F } †
ℓ2-ℓ2   | ‖A‖_op               | ‖A‖_F             | max{ √(Σ_i ‖A_i:‖_1²), √(Σ_j ‖A_:j‖_1²) }

Table 1: Dependence on A for different methods in different geometries. Comments: A_i: and A_:j denote the i-th row and j-th column of A, respectively. Numerically sparse instances satisfy L_co = O(L_rc). † In the ℓ2-ℓ1 setting we can also achieve, via alternative sampling schemes, L_co = L_rc·√rcs and L_co = max{ max_i ‖A_i:‖_2, √(max_i ‖A_i:‖_1 · max_j ‖A_:j‖_1) }.

Method                            | Iteration cost | Total runtime
Exact gradient [28, 31]           | O(nnz)         | Õ(nnz · L_mv · ε^{-1})
Row-column [16, 10, 7]            | O(n + m)       | Õ((m + n) · L_rc² · ε^{-2})
Row-column VR [7, 8]              | O(n + m)       | Õ(nnz + √(nnz·(m + n)) · L_rc · ε^{-1})
Sparse row-column (folklore)      | Õ(rcs)         | Õ(rcs · L_rc² · ε^{-2})
Sparse row-column VR (Appendix E) | Õ(rcs)         | Õ(nnz + √(nnz·rcs) · L_rc · ε^{-1})
Coordinate (Section 3.1)          | Õ(1)           | Õ(nnz + L_co² · ε^{-2})
Coordinate VR (Section 3.2)       | Õ(1)           | Õ(nnz + √nnz · L_co · ε^{-1})

Table 2: Comparison of iterative methods for bilinear problems. Comments: nnz denotes the number of nonzeros in A ∈ R^{m×n} and rcs ≤ max{m, n} denotes the maximum number of nonzeros in any row and column of A. The quantities L_mv, L_co and L_rc depend on the problem geometry (see Table 1).

Task                   | Method                  | Runtime
MaxIB                  | Allen-Zhu et al. [2]    | Õ(mn + ρ·m·√n · ε^{-1})
                       | Our method (Theorem 3)  | Õ(nnz + ρ·√(nnz·rcs) · ε^{-1}) †
MinEB (when m ≥ n)     | Allen-Zhu et al. [2]    | Õ(mn + m·√n · ε^{-1/2})
                       | Our method (Theorem 4)  | Õ(nnz + √(nnz·rcs) · ε^{-1/2}) †
Regression (A^⊤A ⪰ µI) | AGD [30]                | Õ(nnz · ‖A‖_op / √µ)
                       | Gupta and Sidford [17]  | Õ(nnz + nnz^{2/3} · (Σ_{i∈[n]} ‖A‖_F·‖A_i:‖_1·‖A_i:‖_∞)^{1/3} / √µ)
                       | Our method (Theorem 5)  | Õ(nnz + √nnz · max{ √(Σ_i ‖A_i:‖_1²), √(Σ_j ‖A_:j‖_1²) } / √µ)

Table 3: Comparison of complexity for different applications. Comments: ρ denotes the ratio of the radii of the minimum ball enclosing the rows of A and the maximum ball inscribed in them. † For MaxIB and MinEB, we refer the reader to Section 6.2 for a more fine-grained runtime bound.

Optimality of the constant L_co. For the ℓ1-ℓ1 and ℓ2-ℓ2 settings, we argue that the constant L_co in Table 1 is optimal in the restricted sense that no alternative sampling distribution for coordinate gradient estimation can have a better variance bound than L_co (a similar sense of optimality also holds for L_rc in each geometry). In the ℓ2-ℓ1 setting, a different sampling distribution produces an improved (and optimal) constant max{ max_i ‖A_i:‖_1, ‖|A|‖_op }, where A_i: is the i-th row of A and |A|_ij = |A_ij| is the elementwise absolute value of A. However, it is unclear how to efficiently sample from this distribution.
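The constants in Table 1 are simple functions of the rows and columns of A. The following sketch (ours, using dense numpy arrays purely for clarity) computes L_mv, L_rc and L_co for each geometry; on numerically sparse instances one can check directly that L_co is close to L_rc.

```python
import numpy as np

def matrix_constants(A):
    """Constants from Table 1 for a dense matrix A (rows indexed by i, columns by j)."""
    row_l1 = np.abs(A).sum(axis=1)          # ||A_{i:}||_1
    col_l1 = np.abs(A).sum(axis=0)          # ||A_{:j}||_1
    row_l2 = np.linalg.norm(A, axis=1)      # ||A_{i:}||_2
    col_l2 = np.linalg.norm(A, axis=0)      # ||A_{:j}||_2
    fro = np.linalg.norm(A)                 # ||A||_F
    return {
        "l1-l1": {"L_mv": np.abs(A).max(),
                  "L_rc": np.abs(A).max(),
                  "L_co": max(row_l2.max(), col_l2.max())},
        "l2-l1": {"L_mv": row_l2.max(),
                  "L_rc": row_l2.max(),
                  "L_co": max(row_l1.max(), fro)},
        "l2-l2": {"L_mv": np.linalg.norm(A, 2),   # spectral norm ||A||_op
                  "L_rc": fro,
                  "L_co": max(np.sqrt((row_l1 ** 2).sum()),
                              np.sqrt((col_l1 ** 2).sum()))},
    }
```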
Row-column sparse instances. Some problem instances admit a structured form of sparsity where every row and column has at most rcs nonzero elements. In all settings we have L_co ≤ L_rc·√rcs, and so our coordinate methods naturally improve when rcs is small. Specifically, the sampling distributions and data structures we develop in this paper allow us to modify previous methods for row-column VR [8] to leverage row-column sparsity, reducing the amortized per-iteration cost from O(m + n) to Õ(rcs).
Applications. We illustrate the implications of our results for two problems in computational geometry, minimum enclosing ball (MinEB) and maximum inscribed ball (MaxIB), as well as linear regression. For MinEB and MaxIB in the non-degenerate case m ≥ n, we apply our ℓ2-ℓ1 results to obtain algorithms whose runtime bounds coincide with the state-of-the-art [2] for dense problems, but can be significantly better for sparse or row-column sparse instances. For linear regression we focus on accelerated, linearly converging algorithms, i.e., those that find x such that ‖Ax − b‖_2 ≤ ε in time proportional to µ^{-1/2} log(1/ε), where µ is the smallest eigenvalue of A^⊤A. Within this class and in a number of settings, our reduced-variance coordinate method offers improvements over the state-of-the-art: for instances where ‖A_i:‖_1 = O(‖A_i:‖_2) and ‖A_:j‖_1 = O(‖A_:j‖_2) for all i, j it outperforms [17] by a factor of nnz^{1/6}, and for elementwise nonnegative instances it outperforms accelerated gradient descent by a factor of √(nnz/(m + n)). See Table 3 for a detailed runtime comparison.

1.2 Overview of our techniques

We now provide a detailed overview of our algorithm design and analysis techniques, highlighting our main technical insights. We focus on the ℓ1-ℓ1 geometry, since it showcases all of our developments. Our technical contributions have two central themes:

1. Sampling scheme design.
The key to obtaining efficient coordinate methods is carefully choosing the sampling distribution. Here, local norms analysis of stochastic mirror descent [39] on the one hand enables tight regret bounds, and on the other hand imposes an additional design constraint, since the stochastic estimators must be bounded for the analysis to apply. We achieve estimators with improved variance bounds meeting this boundedness constraint by leveraging a "clipping" operation introduced by Clarkson et al. [10]. Specifically, in the simplex geometry, we truncate large coordinates of our estimators and show that our method is robust to the resulting distortion.

2. Data structure design. Our goal is to perform iterations in Õ(1) time, but our mirror descent procedures call for updates that change m + n variables in each step. We resolve this tension via data structures that implicitly maintain the iterates. Variance reduction poses a considerable challenge here, because every reduced-variance stochastic gradient contains a dense component that changes all coordinates in a complicated way. In particular, existing data structures cannot efficiently compute the normalization factor necessary for projection to the simplex. We design a data structure that overcomes this hurdle via Taylor expansions, coordinate binning, and a binomial-heap-like construction. The data structure computes approximate mirror projections, and we modify the standard mirror descent analysis to show it is stable under the particular structure of the resulting approximation errors.

At the intersection of these two themes is a novel sampling technique we call "sampling from the sum," which addresses the same variance challenges as the "sampling from the difference" technique of [8], but is more amenable to efficient implementation with a data structure.

Our algorithm is an instance of stochastic mirror descent [29], which in the ℓ1-ℓ1 setting produces a sequence of iterates (x_0, y_0), (x_1, y_1), ... according to

  x_{t+1} = Π_Δ(x_t ∘ exp{−η g̃_x(x_t, y_t)})  and  y_{t+1} = Π_Δ(y_t ∘ exp{−η g̃_y(x_t, y_t)}),   (2)

where Π_Δ(v) = v/‖v‖_1 is the projection onto the simplex (exp and log are applied to vectors elementwise, and elementwise multiplication is denoted by ∘), η is a step size, and g̃_x, g̃_y are stochastic gradient estimators for f(x, y) = y^⊤Ax satisfying

  E g̃_x(x, y) = ∇_x f(x, y) = A^⊤y  and  E g̃_y(x, y) = −∇_y f(x, y) = −Ax.
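For concreteness, here is a minimal dense rendering (ours) of iteration (2) with exact gradients in place of the estimators g̃_x, g̃_y; the actual method replaces the O(nnz)-time gradient computations with the 1-sparse estimators described next and maintains the iterates implicitly.

```python
import numpy as np

def simplex_project(v):
    """Pi_Delta(v) = v / ||v||_1 for entrywise-positive v."""
    return v / v.sum()

def mirror_descent_l1l1(A, eta, T):
    """Deterministic version of iteration (2): entropic mirror descent
    (multiplicative weights) on min_{x in simplex} max_{y in simplex} y^T A x,
    using exact gradients A^T y and -A x in place of g_x, g_y."""
    m, n = A.shape
    x, y = np.full(n, 1 / n), np.full(m, 1 / m)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(T):
        gx, gy = A.T @ y, -A @ x
        x = simplex_project(x * np.exp(-eta * gx))
        y = simplex_project(y * np.exp(-eta * gy))
        x_avg += x / T
        y_avg += y / T
    return x_avg, y_avg
```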
We describe the computation and analysis of g̃_x; the treatment of g̃_y is analogous. To compute g̃_x(x, y), we sample (i, j) from a distribution p(x, y) on [m]×[n] and let

  g̃_x(x, y) = (y_i A_ij / p_ij(x, y)) e_j,   (3)

where p_ij(x, y) denotes the probability of drawing (i, j) from p(x, y) and e_j is the j-th standard basis vector; a simple calculation shows that E g̃_x = A^⊤y for any p. We first design p(x, y) to guarantee an Õ((L_co/ε)²) iteration complexity for finding an ε-accurate solution, and then briefly touch on how to compute the resulting iterations in Õ(1) time.
Local norms-informed distribution design. The standard stochastic mirror descent analysis [29] shows that if E‖g̃_x(x, y)‖_∞² ≤ L² for all x, y (and similarly for g̃_y), taking η = ε/L² and T = Õ((L/ε)²) suffices to ensure that the iterate average (1/T)Σ_{t=1}^T (x_t, y_t) is an ε-accurate solution in expectation. Unfortunately, this analysis demonstrably fails to yield sufficiently tight bounds for our coordinate estimator: there exist instances for which any distribution p produces L² ≥ n·L_rc². We tighten the analysis using a local norms argument [cf. 39, Section 2.8], showing that Õ((L/ε)²) iterations suffice whenever ‖η g̃_x‖_∞ ≤ 1 with probability 1 and, for all x, y,

  E‖g̃_x(x, y)‖_x² ≤ L²,  where  ‖γ‖_x² = Σ_j x_j γ_j²

is the local norm at x ∈ X. We take

  p_ij = y_i A_ij² / ‖A_i:‖_2²   (4)

(recalling that x, y are both probability vectors). Substituting into (3) gives

  E‖g̃_x(x, y)‖_x² = Σ_{i,j} y_i² A_ij² x_j / p_ij = Σ_{i,j} y_i ‖A_i:‖_2² x_j = Σ_i y_i ‖A_i:‖_2² ≤ max_i ‖A_i:‖_2² ≤ L_co²,

with L_co = max{ max_i ‖A_i:‖_2, max_j ‖A_:j‖_2 } as in Table 1.

While this is the desired bound on E‖g̃_x(x, y)‖_x², the requirement ‖η g̃_x‖_∞ ≤ 1 does not hold when A has sufficiently small elements. We address this by clipping g̃: we replace η g̃_x with clip(η g̃_x), where

  [clip(v)]_i := min{|v_i|, 1}·sign(v_i)

is the Euclidean projection to the unit box. The clipped gradient estimator clearly satisfies the desired bounds on the infinity norm and local-norm second moment, but is biased for the true gradient. Following the analysis of Clarkson et al. [10], we account for the bias by relating it to the second moment via

  |⟨γ − clip(γ), x⟩| ≤ ‖γ‖_x²,

which allows us to absorb the effect of the bias into existing terms in our error bounds. Putting together these pieces yields the desired bound on the iteration count.
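A direct (inefficient) rendering of the clipped estimator just described: sample i ∼ y, then j from row i with probability A_ij²/‖A_i:‖_2², form the 1-sparse estimate (3), and clip the scaled result. This is our own sketch with naive O(m + n) sampling; the paper's data structures perform the same draws in polylogarithmic time.

```python
import numpy as np

def clipped_coord_grad_x(A, y, eta, rng):
    """Clipped coordinate estimator of eta * grad_x f = eta * A^T y under (4),
    p_{ij} = y_i * A_{ij}^2 / ||A_{i:}||_2^2. Assumes y is a probability vector
    and that every row of A has a nonzero entry."""
    m, n = A.shape
    i = rng.choice(m, p=y)                        # i ~ y
    row_sq = A[i] ** 2
    j = rng.choice(n, p=row_sq / row_sq.sum())    # j ~ A_{ij}^2 / ||A_{i:}||_2^2
    p_ij = y[i] * A[i, j] ** 2 / row_sq.sum()
    g = np.zeros(n)
    g[j] = y[i] * A[i, j] / p_ij                  # 1-sparse, unbiased for A^T y
    return np.clip(eta * g, -1.0, 1.0)            # entrywise projection to [-1, 1]
```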
Efficient implementation. Data structures for performing the update (2) and sampling from the resulting iterates in Õ(1) time are standard in the literature [e.g., 37]. We add to these the somewhat non-standard ability to also efficiently track the running sum of the iterates. To efficiently sample (i, j) ∼ p according to (4), we first use the data structure to sample i ∼ y in Õ(1) time and then draw j ∈ [n] with probability proportional to A_ij² in time O(1), via either O(nnz) preprocessing or an appropriate assumption about the matrix access model. The "heavy lifting" of our data structure design is dedicated to supporting variance reduction, which we describe in the next section.

Sampling distributions beyond ℓ1-ℓ1. Table 4 lists the sampling distributions we develop for the various problem geometries. Note that for the ℓ2-ℓ1 setting we give three different distributions for sampling the simplex block of the gradient (i.e., g̃_y); each distribution corresponds to a different parameter L_co (see the comments following Table 1). The distribution q_ij ∝ √(y_i)·|A_ij x_j| yields a stronger bound L in the ℓ2-ℓ1 setting, but we do not know how to efficiently sample from it.

To accelerate the stochastic coordinate method we apply our recently proposed variance reduction framework [8]. This framework operates in α/ε epochs, where α is a design parameter that trades between full and stochastic gradient computations. Each epoch consists of three parts: (i) computing the exact gradient at a reference point (x_0, y_0), (ii) performing T iterations of regularized stochastic mirror descent to produce the sequence (x_1, y_1), ..., (x_T, y_T), and (iii) taking an extra-gradient step from the average of the iterates in (ii). Setting κ = 1/(1 + ηα/2), the iterates x_t follow the recursion

  x_{t+1} = Π_Δ( x_t^κ ∘ x_0^{1−κ} ∘ exp{−ηκ[g_{x_0} + δ̃_x(x_t, y_t)]} ),  where  Π_Δ(v) = v/‖v‖_1,   (5)

and g_{x_0} = A^⊤y_0 is the exact gradient at the reference point, and δ̃_x is a stochastic gradient difference estimator satisfying

  E δ̃_x(x, y) = ∇_x f(x, y) − ∇_x f(x_0, y_0) = A^⊤(y − y_0).

The iteration for y_t is similar. In [8] we show that if δ̃_x satisfies

  E‖δ̃_x(x, y)‖_∞² ≤ L²( ‖x − x_0‖_1² + ‖y − y_0‖_1² )  for all x, y,   (6)

and a similar bound holds on E‖δ̃_y(x, y)‖_∞², then T = O(L²/α²) iterations per epoch with step size η = α/L² suffice for the overall algorithm to return a point with expected error below ε.

We would like to design a coordinate-based estimator δ̃ such that the bound (6) holds for L = L_co as in Table 1 and each iteration (5) takes Õ(1) time. Since every epoch also requires O(nnz) time for matrix-vector product (exact gradient) computations, the overall runtime would be Õ((nnz + L_co²/α²) · α/ε). Choosing α = L_co/√nnz then gives the desired runtime Õ(nnz + √nnz · L_co/ε).
Distribution design (sampling from the difference). We start with a straightforward adaptation of the general estimator form (3). To compute δ̃_x(x, y), we sample (i, j) ∼ p, where p may depend on x, x_0, y and y_0, and let

  δ̃_x(x, y) = ((y_i − [y_0]_i) A_ij / p_ij) e_j,   (7)

where e_j is the j-th standard basis vector. As in the previous section, we find that the requirement (6) is too stringent for coordinate-based estimators. Here too, we address this challenge with a local norms argument and clipping of the difference estimate. Using the "sampling from the difference" technique from [8], we arrive at

  p_ij = (|y_i − [y_0]_i| / ‖y − y_0‖_1) · (A_ij² / ‖A_i:‖_2²).   (8)

This distribution satisfies the local norm relaxation of (6) with L = L_co.
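A naive rendering (ours) of the difference estimator (7) with the sampling-from-the-difference distribution (8); it assumes y ≠ y_0 and uses O(m + n)-time sampling purely for illustration.

```python
import numpy as np

def vr_grad_diff_x(A, y, y0, rng):
    """Gradient-difference estimator (7) for A^T (y - y0) under the
    sampling-from-the-difference distribution (8):
    p_{ij} = (|y_i - y0_i| / ||y - y0||_1) * (A_{ij}^2 / ||A_{i:}||_2^2).
    Assumes y != y0 and every row of A has a nonzero entry."""
    m, n = A.shape
    d = np.abs(y - y0)
    i = rng.choice(m, p=d / d.sum())
    row_sq = A[i] ** 2
    j = rng.choice(n, p=row_sq / row_sq.sum())
    p_ij = (d[i] / d.sum()) * (A[i, j] ** 2 / row_sq.sum())
    delta = np.zeros(n)
    delta[j] = (y[i] - y0[i]) * A[i, j] / p_ij   # E[delta] = A^T (y - y0)
    return delta
```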
Data structure design. Efficiently computing (5) is significantly more challenging than its counterpart (2). To clarify the difficulty and describe our solution, we write x_t = Π_Δ(x̂_t) = x̂_t/‖x̂_t‖_1 and break the recursion for the unnormalized iterates x̂_t into two steps:

  x̂'_t = x̂_t^κ ∘ exp{v},   (9)
  x̂_{t+1} = x̂'_t ∘ exp{s_t},   (10)

where v = (1−κ) log x_0 − ηκ g_{x_0} is a fixed dense vector, and s_t = −ηκ δ̃_x(x_t, y_t) is a varying 1-sparse vector. The key task of the data structures is maintaining the normalization factor ‖x̂_t‖_1 in near-constant time. Standard data structures do not suffice because they lack support for the dense step (9).

Our high-level strategy is to handle the two steps (9) and (10) separately. To handle the dense step (9), we propose the data structure ScaleMaintainer, which efficiently approximates ‖x̂_t‖_1 in the "homogeneous" case of no sparse updates (i.e., s_t = 0 for all t). We then add support for the sparse step (10) using a binomial-heap-like construction involving O(log n) instances of ScaleMaintainer.
The ScaleMaintainer data structure. When s_t = 0 for all t, the iterates x̂_t admit closed forms:

  x̂_{t+τ} = x̂_t^{κ^τ} ∘ exp{ v Σ_{t'=0}^{τ−1} κ^{t'} } = x̂_t^{κ^τ} ∘ exp{ ((1−κ^τ)/(1−κ)) v } = x̂_t ∘ exp{ (1−κ^τ) v̄ },  where  v̄ = v/(1−κ) − log x̂_t.

Consequently, we design ScaleMaintainer to take as initialization n̄-dimensional vectors x̄ ∈ R^{n̄}_{≥0} and v̄ ∈ R^{n̄}, and to provide approximations of the normalization factor

  Z_τ(x̄, v̄) = ‖ x̄ ∘ exp{(1−κ^τ) v̄} ‖_1   (11)

for arbitrary values of τ ≥ 0. We show how to implement each query of Z_τ(x̄, v̄) in amortized time Õ(1). The data structure also supports initialization in time Õ(n̄) and deletions (i.e., setting elements of x̄ to zero) in amortized time Õ(1).

To efficiently approximate the quantity Z_τ(x̄, v̄) we replace the exponential with its order-p Taylor series, for p = O(log n). That is, we would like to write

  Z_τ(x̄, v̄) = Σ_{i∈[n̄]} [x̄]_i e^{(1−κ^τ)[v̄]_i} ≈ Σ_{i∈[n̄]} [x̄]_i Σ_{q=0}^{p} (1/q!)(1−κ^τ)^q [v̄]_i^q = Σ_{q=0}^{p} ((1−κ^τ)^q / q!) ⟨x̄, v̄^q⟩.

The approximation Σ_{q=0}^{p} ((1−κ^τ)^q/q!) ⟨x̄, v̄^q⟩ is cheap to compute, since for every τ it is a linear combination of the p = Õ(1) numbers {⟨x̄, v̄^q⟩}_{q∈[p]}, which we can compute once at initialization. However, the Taylor series approximation has low multiplicative error only when |(1−κ^τ)[v̄]_i| = O(p), which may fail to hold, as we may have ‖v̄‖_∞ = poly(n) in general. To handle this, suppose that for a fixed τ we have an offset µ ∈ R and an "active set" A ⊆ [n̄] such that the following conditions hold for a threshold R = O(p): (a) the Taylor approximation is valid in A, i.e., |(1−κ^τ)(v̄_i − µ)| ≤ R for all i ∈ A; (b) entries outside A are small, (1−κ^τ)(v̄_i − µ) ≤ −R for all i ∉ A; and (c) at least one entry in the active set is large, (1−κ^τ)(v̄_i − µ) ≥ 0 for some i ∈ A. Under these conditions, the entries in A^c are negligibly small and we can truncate them, resulting in the approximation

  e^{(1−κ^τ)µ} Σ_{q=0}^{p} ((1−κ^τ)^q / q!) ⟨x̄, (v̄ − µ)^q⟩_A + e^{−R} ⟨x̄, 1⟩_{A^c},

which we show approximates Z_τ(x̄, v̄) to within e^{O(R + log n) − Ω(p)} multiplicative error, where we used ⟨a, b⟩_S := Σ_{i∈S} a_i b_i; here, we also require that log(max_i x̄_i / min_i x̄_i) = O(R), which we guarantee when choosing the initial x̄.

The challenge then becomes efficiently mapping any τ to {⟨x̄, (v̄ − µ)^q⟩_A}_{q∈[p]} for suitable µ and A. We address this by jointly bucketing τ and v̄. Specifically, we map τ to a bucket index k = ⌊log_2((1−κ^τ)/(1−κ))⌋, pick µ to be the largest integer multiple of R/((1−κ)2^k) such that µ ≤ max_i v̄_i, and set A = { i : |(1−κ)2^k (v̄_i − µ)| ≤ R }. Since k ≤ k_max = ⌊log_2(1/(1−κ))⌋ = O(log n), we argue that computing ⟨x̄, (v̄ − µ)^q⟩_A for every possible resulting µ and A takes at most O(n̄·p·log(1/(1−κ))) = Õ(n̄) time, which we can charge to initialization. We further show how to support deletions in Õ(1) time by carefully manipulating the computed quantities.
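The following self-contained sketch (ours) illustrates the core numerical idea: approximating Z_τ in (11) by a truncated Taylor series around an offset µ with truncation of the inactive set. It recomputes µ and the active set from scratch for each query instead of using the bucketing and caching that make the real data structure efficient, so it shows the approximation, not the amortized runtime.

```python
import numpy as np
from math import factorial

def z_exact(x_bar, v_bar, kappa, tau):
    """Z_tau(x_bar, v_bar) = || x_bar * exp((1 - kappa^tau) * v_bar) ||_1, eq. (11)."""
    return np.sum(x_bar * np.exp((1 - kappa ** tau) * v_bar))

def z_taylor(x_bar, v_bar, kappa, tau, p, R):
    """Order-p Taylor approximation of Z_tau with offset mu and truncation of the
    inactive set (mu chosen naively as max(v_bar) rather than via bucketing)."""
    c = 1 - kappa ** tau
    mu = v_bar.max()
    active = np.abs(c * (v_bar - mu)) <= R
    moments = [np.sum(x_bar[active] * (v_bar[active] - mu) ** q) for q in range(p + 1)]
    series = sum(c ** q / factorial(q) * moments[q] for q in range(p + 1))
    return np.exp(c * mu) * series + np.exp(-R) * x_bar[~active].sum()

rng = np.random.default_rng(1)
n = 1000
x_bar = rng.random(n)
v_bar = rng.normal(scale=3.0, size=n)
kappa, tau, p, R = 0.999, 200, 30, 20.0
print(z_exact(x_bar, v_bar, kappa, tau), z_taylor(x_bar, v_bar, kappa, tau, p, R))
```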
Supporting sparse updates. Building on ScaleMaintainer, we design the data structure ApproxExpMaintainer, which (approximately) implements the entire mirror descent step (5) in time Õ(1). The data structure maintains vectors x̄ ∈ Δ^n and v̄ ∈ R^n and K = ⌈log_2(n + 1)⌉ instances of ScaleMaintainer, denoted {ScaleMaintainer_k}_{k∈[K]}. The k-th instance tracks a coordinate subset S_k ⊆ [n] such that {S_k}_{k∈[K]} partitions [n], and has initial data [x̄]_{S_k} and [v̄]_{S_k}. We let τ_k ≥ 0 denote the "time index" parameter of the k-th instance. The data structure satisfies two invariants; first, the unnormalized iterate x̂ satisfies

  [x̂]_{S_k} = [ x̄ ∘ exp{(1−κ^{τ_k}) v̄} ]_{S_k}  for all k ∈ [K].   (12)

Second, the partition satisfies

  |S_k| ≤ 2^k − 1  for all k ∈ [K],   (13)

where at initialization we let S_K = [n] and S_k = ∅ for k < K, x̄ = x_0, v̄ = v/(1−κ) − log x_0, and τ_K = 0.

The invariant (12) allows us to efficiently (in time Õ(K) = Õ(1)) query coordinates of x_t = x̂_t/‖x̂_t‖_1, since ScaleMaintainer allows us to approximate ‖x̂_t‖_1 = Σ_{k∈[K]} Z_{τ_k}([x̄]_{S_k}, [v̄]_{S_k}), with Z as defined in (11). To implement the dense step (9), we simply increment τ_k ← τ_k + 1 for every k. Let j be the nonzero coordinate of s_t in the sparse step (10), and let k ∈ [K] be such that j ∈ S_k. To implement (10), we delete coordinate j from ScaleMaintainer_k, and create a singleton instance ScaleMaintainer_0 maintaining S_0 = {j} with initial data [x̄]_{S_0} = e^{[s_t]_j} x̂_j, [v̄]_{S_0} = v_j/(1−κ) − log(e^{[s_t]_j} x̂_j) and τ_0 = 0. Going from k = 1 to k = K, we merge ScaleMaintainer_{k−1} into ScaleMaintainer_k until the invariant (13) holds again. For example, if before the sparse step we have |S_1| = 1, |S_2| = 3 and |S_3| = 2, we perform 3 consecutive merges, so that afterwards we have |S_1| = |S_2| = 0 and |S_3| = 7.

To merge ScaleMaintainer_{k−1} into ScaleMaintainer_k, we let S'_k = S_{k−1} ∪ S_k and initialize a new ScaleMaintainer instance with [x̄]_{S'_k} = [x̂]_{S'_k}, [v̄]_{S'_k} = [v]_{S'_k}/(1−κ) − log[x̂]_{S'_k} and τ_k = 0; this takes Õ(|S'_k|) = Õ(2^k) time due to the invariant (13). Noting that a merge at level k can only happen once in every Ω(2^k) updates, we conclude that the amortized cost of merges at each level is Õ(1), and (since K = Õ(1)) so is the cost of the sparse update.

Footnote: More precisely, for every j ∈ S'_k we set x̄_j = x̂_j + ε_0 max_{i∈S'_k} x̂_i, where ε_0 is a small padding constant that ensures the bounded multiplicative range necessary for correct operation of ScaleMaintainer.

Back to distribution design (sampling from the sum). Our data structure enables us to compute the iteration (5) and query coordinates of the iterates x_t and y_t in Õ(1) amortized time. However, we cannot compute δ̃_x using the distribution (8), because we do not have an efficient way of sampling from |y_t − y_0|; Taylor approximation techniques are not effective for approximating the absolute value because it is not smooth. To overcome this final barrier, we introduce a new design which we call "sampling from the sum,"

  p_ij(x, y) = ( (1/3) y_i + (2/3) [y_0]_i ) · A_ij² / ‖A_i:‖_2².   (14)

Sampling from the modified distribution is simple, as our data structure allows us to sample from y_t. Moreover, we show that the distribution (14) satisfies a relaxed version of (6) where the LHS is replaced by a local norm as before, and the RHS is replaced by L²(V_{x_0}(x_t) + V_{y_0}(y_t)), where V_x(x') is the KL divergence between x and x'. In Table 5 we list the sampling distributions we design for variance reduction in the different domain geometries.

Footnote: The data structures ApproxExpMaintainer and ScaleMaintainer support two additional operations necessary for our algorithm: efficient approximate sampling from x_t and maintenance of a running sum of the iterates x̂_τ. Given the normalization constant approximation, the implementation of these operations is fairly straightforward, so we do not discuss them in the introduction.

      | p_ij                                                       | q_ij
ℓ1-ℓ1 | y_i · A_ij² / ‖A_i:‖_2²                                    | x_j · A_ij² / ‖A_:j‖_2²
ℓ2-ℓ1 | y_i · |A_ij| / ‖A_i:‖_1                                    | A_ij² / ‖A‖_F²
ℓ2-ℓ1 | y_i · |A_ij| / ‖A_i:‖_1                                    | ∝ x_j² · 1{A_ij ≠ 0}
ℓ2-ℓ1 | y_i · |A_ij| / ‖A_i:‖_1                                    | |A_ij| · x_j² / (Σ_{k∈[n]} ‖A_:k‖_1 x_k²)
ℓ2-ℓ2 | (‖A_i:‖_1 / Σ_{k∈[m]} ‖A_k:‖_1) · |A_ij| / ‖A_i:‖_1        | (‖A_:j‖_1 / Σ_{k∈[n]} ‖A_:k‖_1) · |A_ij| / ‖A_:j‖_1
ℓ2-ℓ2 | (y_i² / ‖y‖_2²) · |A_ij| / ‖A_i:‖_1                        | (x_j² / ‖x‖_2²) · |A_ij| / ‖A_:j‖_1

Table 4: The distributions p, q used in our coordinate gradient estimator. Comments: The estimator is of the form g̃(x, y) = ( (1/p_ij) y_i A_ij · e_j, −(1/q_lk) A_lk x_k · e_l ), where (i, j) ∼ p and (l, k) ∼ q.

      | p_ij                                                       | q_ij
ℓ1-ℓ1 | ((y_i + 2[y_0]_i)/3) · A_ij² / ‖A_i:‖_2²                   | ((x_j + 2[x_0]_j)/3) · A_ij² / ‖A_:j‖_2²
ℓ2-ℓ1 | ((y_i + 2[y_0]_i)/3) · |A_ij| / ‖A_i:‖_1                   | A_ij² / ‖A‖_F²
ℓ2-ℓ1 | ((y_i + 2[y_0]_i)/3) · |A_ij| / ‖A_i:‖_1                   | ∝ [x − x_0]_j² · 1{A_ij ≠ 0}
ℓ2-ℓ1 | ((y_i + 2[y_0]_i)/3) · |A_ij| / ‖A_i:‖_1                   | |A_ij| · [x − x_0]_j² / (Σ_{k∈[n]} ‖A_:k‖_1 [x − x_0]_k²)
ℓ2-ℓ2 | (‖A_i:‖_1 / Σ_{k∈[m]} ‖A_k:‖_1) · |A_ij| / ‖A_i:‖_1        | (‖A_:j‖_1 / Σ_{k∈[n]} ‖A_:k‖_1) · |A_ij| / ‖A_:j‖_1
ℓ2-ℓ2 | ([y − y_0]_i² / ‖y − y_0‖_2²) · |A_ij| / ‖A_i:‖_1          | ([x − x_0]_j² / ‖x − x_0‖_2²) · |A_ij| / ‖A_:j‖_1

Table 5: The distributions p, q used for our reduced-variance coordinate gradient estimator. Comments: The estimator is of the form g̃(x, y) = ( A^⊤y_0 + (1/p_ij)(y_i − [y_0]_i) A_ij · e_j, −Ax_0 − (1/q_lk) A_lk (x_k − [x_0]_k) · e_l ), where (i, j) ∼ p, (l, k) ∼ q, and (x_0, y_0) is a reference point.
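As an illustration of the ℓ1-ℓ1 row of Table 5, the sketch below (ours, with naive O(m + n) sampling) draws (i, j) from the sampling-from-the-sum distribution (14) and forms the reduced-variance X-block estimate; the exact gradient A^⊤y_0 at the reference point is passed in precomputed.

```python
import numpy as np

def vr_grad_sum_x(A, y, y0, g0x, rng):
    """Reduced-variance X-block estimator from Table 5 (l1-l1 setup):
    g0x + (y_i - y0_i) * A_{ij} / p_{ij} * e_j with (i, j) drawn from (14),
    p_{ij} = (y_i/3 + 2*y0_i/3) * A_{ij}^2 / ||A_{i:}||_2^2.
    g0x = A^T y0 is the exact gradient at the reference point."""
    m, n = A.shape
    mix = y / 3 + 2 * y0 / 3                    # a probability vector over rows
    i = rng.choice(m, p=mix)
    row_sq = A[i] ** 2
    j = rng.choice(n, p=row_sq / row_sq.sum())
    p_ij = mix[i] * A[i, j] ** 2 / row_sq.sum()
    delta = np.zeros(n)
    delta[j] = (y[i] - y0[i]) * A[i, j] / p_ij
    return g0x + delta                          # unbiased for A^T y
```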
1.3 Related work

Coordinate methods. Updating a single coordinate at a time (or, more broadly, computing only a single coordinate of the gradient at every iteration) is a well-studied and successful technique in optimization [50]. Selecting coordinates at random is key to obtaining strong performance guarantees: Strohmer and Vershynin [42] show this for linear regression, Shalev-Shwartz and Tewari [36] show this for ℓ1-regularized linear models, and Nesterov [32] shows this for general smooth minimization. Later works [22, 3, 33] propose accelerated coordinate methods. These works share two common themes: selecting the gradient coordinate from a non-uniform distribution (see also [34]), and augmenting the 1-sparse stochastic gradient with a dense momentum term. These techniques play important roles in our development as well.

To reap the full benefits of coordinate methods, iterations must be very cheap, ideally taking near-constant time. However, most coordinate methods require super-constant time, typically in the form of a vector-vector computation. Even works that consider coordinate methods in a primal-dual context [38, 2, 52, 27, 37] perform the coordinate updates only on the dual variable and require a vector-vector product (or, more generally, a component gradient computation) at every iteration.

A notable exception is the work of Wang [49, 48], which develops a primal-dual stochastic coordinate method for solving Markov decision processes, essentially viewing them as ℓ∞-ℓ1 bilinear saddle-point problems. Using a tree-based ℓ1 sampler data structure similar to the ℓ1 sampler we use for simplex domains in the sublinear case, the method allows for Õ(1) iterations and a potentially sublinear runtime scaling as ε^{-2}. Tan et al. [43] also consider bilinear saddle-point problems and variance reduction. Unlike our work, they assume a separable domain, use uniform sampling, and do not accelerate their variance reduction scheme with extra-gradient steps. The separable domain makes attaining constant iteration cost much simpler, since there is no longer a normalization factor to track, but it also rules out applications to the simplex domain. While Tan et al. [43] report promising empirical results, their theoretical guarantees do not improve upon prior work.

Our work develops coordinate methods with Õ(1) iteration cost for new types of problems. Furthermore, it maintains the iteration efficiency even in the presence of dense components arising from the update, thus allowing for acceleration via an extra-gradient scheme.

Data structures for optimization. Performing iterations in time that is asymptotically smaller than the number of variables updated at every iteration forces us to carry out the updates implicitly using data structures; several prior works employ data structures for exactly the same reason. One of the most similar examples comes from Lee and Sidford [22], who design a data structure for an accelerated coordinate method in Euclidean geometry. In our terminology, their data structure allows performing each iteration in time O(rcs) while implicitly updating variables of size O(n). Duchi et al. [14] design a data structure based on balanced search trees that supports efficient Euclidean projection to the ℓ1 ball of vectors of the form u + s, where u is in the ℓ1 ball and s is sparse. They apply it in a stochastic gradient method for learning ℓ1-regularized linear classifiers with sparse features. Among the many applications of this data structure, Namkoong and Duchi [27] adapt it to efficiently compute Euclidean projections onto the intersection of the simplex and a χ² ball for 1-sparse updates. Shalev-Shwartz and Wexler [37] and Wang [49, 48], among others, use binary tree data structures to perform multiplicative weights projection to the simplex and sampling from the iterates.

A recent work of Sidford and Tian [40] develops a data structure which is somewhat similar to our ApproxExpMaintainer data structure, for updates arising from a primal-dual method to efficiently solve ℓ∞ regression. Their data structure was also designed to handle updates to a simplex variable which sum a structured dense component and a sparse component. However, the data structure design of that work specifically exploited the structure of the maximum flow problem in a number of ways, such as bounding the sizes of the update components and relating these bounds to how often the entire data structure should be restarted. Our data structure can handle a broader range of structured updates to simplex variables and has a much more flexible interface, which is crucial to the development of our variance-reduced methods as well as our applications.

Another notable use of data structures in optimization appears in second-order methods, where a long line of work uses them to efficiently solve sequences of linear systems and approximately compute iterates [20, 4, 23, 12, 25, 44, 45]. Finally, several works on low-rank optimization make use of sketches to efficiently represent their iterates and solutions [9, 51].

Numerical sparsity. Measures of numerical sparsity, such as the ℓ1/ℓ∞ or ℓ1/ℓ2 ratios, are continuous and dimensionless relaxations of the ℓ0 norm. The stable rank of a matrix A measures the numerical sparsity of its singular values (specifically, their squared ℓ2/ℓ∞ ratio) [11]. For linear regression, stochastic methods generally outperform exact gradient methods only when A has low stable rank (cf. the discussion in [8, Section 4.3]), i.e., numerically sparse singular values. In recent work, Gupta and Sidford [17] develop algorithms for linear regression and eigenvector problems for matrices with numerically sparse entries (as opposed to singular values). Our paper further broadens the scope of matrix problems for which we can benefit from numerical sparsity. Moreover, our results have implications for regression as well, improving on [17] in certain numerically sparse regimes.

In recent work by Babichev et al. [6], the authors develop primal-dual sublinear methods for ℓ1-regularized linear multi-class classification (bilinear games in ℓ1-ℓ∞ geometry), and obtain complexity improvements depending on the numerical sparsity of the problem. Similarly to our work, careful design of the sampling distribution plays a central role in [6]. They also develop a data structure that allows iteration cost independent of the number of classes. However, unlike our work, Babichev et al. [6] rely on sampling entire rows and columns, have iteration costs linear in n + m, and do not utilize variance reduction. We believe that our techniques can yield improvements in their setting.

1.4 Paper organization

In Section 2, we set up our terminology, notation, the interfaces of our data structures, and the different matrix access models we consider. In Section 3 we develop our algorithmic framework: we present coordinate stochastic gradient methods in Section 3.1 and their reduced-variance counterparts in Section 3.2. In Section 4 we apply both methods to solving ℓ1-ℓ1 matrix games; we show how to implement the method using our data structures and analyze the runtime. In Section 5, we discuss in detail the implementation and analysis of our data structures. Finally, in Section 6 we specialize our results to obtain algorithms for the minimum enclosing ball and maximum inscribed ball problems as well as linear regression. Many proof details, as well as our algorithms for the other domain setups (ℓ2-ℓ1 and ℓ2-ℓ2), are deferred to the appendix.

2 Preliminaries
In Section 2.1, we abstract the properties of the different domains we handle into a general notion ofa “local norm” setup under which we develop our results. In Section 2.2, we give the definition andoptimality criterion of the bilinear saddle-point problem we study. In Section 2.3, we give the matrixaccess models used in the algorithms we design. In Section 2.4, we summarize the interfaces andcomplexity of the data structures we design, deferring their detailed implementations to Section 5.
2.1 Local norm setups

The analyses of our algorithms cater to the geometry of each specific domain. To express our results generically, for each pair of domains Z = X × Y, we define an associated "local norm setup," which contains various data tailored to our analyses. While initially this notation may appear complicated or cumbersome, later it helps avoid redundancy in the paper. Further, it clarifies the structure necessary to generalize our methods to additional domain geometries.
Definition 1. A local norm setup is the quintuple (Z, ‖·‖_·, r, Θ, clip), where

1. Z is a compact and convex subset of Z* := R^n × R^m.
2. ‖·‖_· is a local norm: for every z ∈ Z, the function ‖·‖_z : Z* → R_{≥0} is a norm on Z*.
3. r : Z → R is a convex distance generating function: its induced Bregman divergence

     V_z(z') := r(z') − r(z) − ⟨∇r(z), z' − z⟩

   satisfies

     ⟨γ, z − z'⟩ − V_z(z') ≤ ‖γ‖_*² := (1/2) max_{s∈Z} ‖γ‖_s²  for all z, z' ∈ Z and γ ∈ Z*.   (15)

4. Θ = max_{z,z'∈Z} { r(z) − r(z') } is the range of r. For z* ∈ argmin_{z∈Z} r(z), Θ is an upper bound on the divergence from z*: V_{z*}(z) ≤ Θ for all z ∈ Z.
5. clip : Z* → Z* is a mapping that enforces a local version of (15),

     |⟨clip(γ), z − z'⟩| − V_z(z') ≤ ‖γ‖_z²  for all z, z' ∈ Z and γ ∈ Z*,   (16)

   and satisfies the distortion guarantee

     |⟨γ − clip(γ), z⟩| ≤ ‖γ‖_z²  for all z ∈ Z and γ ∈ Z*.   (17)

Table 6 summarizes the three local norm setups we consider. Throughout the paper, for a vector z ∈ X × Y we denote its X and Y blocks by z^x and z^y. In addition, we write coordinate i of any vector v as [v]_i.
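For reference, here is a small sketch (ours) of the ℓ1-ℓ1 instance of Definition 1, per the ℓ1-ℓ1 column of Table 6 below: the local norm, the entropic Bregman divergence, and the entrywise clip map.

```python
import numpy as np

def local_norm_sq(z, delta):
    """||delta||_z^2 = sum_k z_k * delta_k^2 (l1-l1 setup; z lies on a product of simplices)."""
    return np.sum(z * delta ** 2)

def breg_div(z, z_new):
    """Bregman divergence of the entropic r(z) = sum_k z_k log z_k.
    Assumes strictly positive entries; on a product of simplices this is the KL divergence."""
    return np.sum(z_new * np.log(z_new / z)) - z_new.sum() + z.sum()

def clip_vec(delta):
    """clip from Table 6 (l1-l1 setup): entrywise projection onto [-1, 1]."""
    return np.sign(delta) * np.minimum(1.0, np.abs(delta))
```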
Proposition 1. The quintuples (Z, ‖·‖_·, r, Θ, clip) in Table 6 satisfy the local norm setup requirements in Definition 1.
Local norm setups.
Comments: In each case, Z = X × Y , ∆ n is the probability simplex { x | x ∈ R n ≥ , (cid:62) n x = 1 } , B n is the Euclidean ball { x | x ∈ R n , (cid:107) x (cid:107) ≤ } , the operations sign , min , and | · | are performed entrywise on a vector, and ◦ stands for the entrywise product betweenvectors.While Proposition 1 is not new, for completeness and compatibility with our notation we prove itin Appendix A.In each local norm setup, we slightly overload notation and use (cid:107)·(cid:107) (without a subscript) todenote the dual norm of (cid:107)·(cid:107) ∗ , i.e., (cid:107) η (cid:107) := max δ : (cid:107) δ (cid:107) ∗ ≤ δ (cid:62) η . In each domain geometry (cid:107)·(cid:107) and (cid:107)·(cid:107) ∗ are as follows: (cid:107) η (cid:107) = (cid:113) (cid:107) η x (cid:107) + (cid:107) η y (cid:107) (cid:107) δ (cid:107) ∗ = (cid:113) (cid:107) δ x (cid:107) ∞ + (cid:107) δ y (cid:107) ∞ for (cid:96) - (cid:96) (cid:107) η (cid:107) = (cid:113) (cid:107) η x (cid:107) + (cid:107) η y (cid:107) (cid:107) δ (cid:107) ∗ = (cid:113) (cid:107) δ x (cid:107) + (cid:107) δ y (cid:107) ∞ for (cid:96) - (cid:96) (cid:107) η (cid:107) = (cid:107) η (cid:107) (cid:107) δ (cid:107) ∗ = (cid:107) δ (cid:107) for (cid:96) - (cid:96) . (18) Throughout, we consider the bilinear saddle point problem min x ∈X max y ∈Y f ( x, y ) , where f ( x, y ) := y (cid:62) Ax + b (cid:62) x − c (cid:62) y, for A ∈ R m × n , b ∈ R n and c ∈ R m . (19)We will always assume that every row and column of A has at least one nonzero entry (else removingsaid row or column does not affect the problem value), so that the number of nonzeros nnz is atleast m + n − . To simplify the exposition of the (cid:96) - (cid:96) and (cid:96) - (cid:96) setups we will assume b = n and c = m as is standard in the literature. Adding linear terms to these setups is fairly straightforwardand does not affect the complexity (up to logarithmic factors) of our designed algorithms using datastructures designed in this paper (specifically ApproxExpMaintainer in Section 2.4); see Section 6for an example. The gradient mapping associated with (19) for z = ( z x , z y ) ∈ Z = X × Y is g ( z ) := ( ∇ x f ( z ) , −∇ y f ( z )) = ( A (cid:62) z y + b, − Az x + c ) . (20)The mapping g is continuous and monotone, where we call g monotone if and only if (cid:10) g ( z (cid:48) ) − g ( z ) , z (cid:48) − z (cid:11) ≥ , ∀ z, z (cid:48) ∈ Z . f . Our goal is to de-sign randomized algorithms for finding an (expected) (cid:15) -accurate saddle point z ∈ Z such that, inexpectation, E Gap( z ) := E (cid:20) max y (cid:48) ∈Y f ( z x , y (cid:48) ) − min x (cid:48) ∈X f ( x (cid:48) , z y ) (cid:21) ≤ (cid:15). (21)In order to do so, we aim to find a sequence z , z , . . . , z K with (expected) low average regret ,i.e., such that E max u ∈Z (cid:110) K (cid:80) Kk =1 (cid:104) g ( z k ) , z k − u (cid:105) (cid:111) ≤ (cid:15) . Due to bilinearity of f we have E Gap (cid:32) K K (cid:88) k =1 z k (cid:33) = E max u ∈Z (cid:40) K K (cid:88) k =1 (cid:104) g ( z k ) , z k − u (cid:105) (cid:41) ≤ (cid:15). Finally, we make the explicit assumption that whenever we are discussing an algorithm in thispaper with a simplex domain (e.g. in (cid:96) - (cid:96) or (cid:96) - (cid:96) case), the quantity L co /(cid:15) is bounded by ( m + n ) ,as otherwise we are in the high-accuracy regime where the runtimes of interior point methods orcutting-plane methods [24, 18] are favorable. 
Specifically, for ℓ1-ℓ1 matrix games in this regime, interior-point methods [23, 12, 45] are always faster; see the footnote in Section 1.1. We make this assumption for notational convenience when discussing logarithmic factors depending on multiple quantities, such as m, n, L_co, and ε^{-1}.

2.3 Matrix access model

We design randomized algorithms which require accessing and sampling from the matrix A ∈ R^{m×n} in a variety of ways. Here, we list these operations, where we assume each takes constant time. Specific algorithms only require access to a subset of this list; we make a note of each algorithm's requirements when presenting it.

A1. For (i, j) ∈ [m]×[n], return A_ij.
A2. For i ∈ [m] and p ∈ {1, 2}, draw j ∈ [n] with probability |A_ij|^p / ‖A_i:‖_p^p.
A3. For j ∈ [n] and p ∈ {0, 1, 2}, draw i ∈ [m] with probability |A_ij|^p / ‖A_:j‖_p^p.
A4. For i ∈ [m] (respectively, j ∈ [n]) and p ∈ {1, 2}, return ‖A_i:‖_p (respectively, ‖A_:j‖_p).
A5. For p ∈ {1, 2}, return max_{i∈[m]} ‖A_i:‖_p, max_{j∈[n]} ‖A_:j‖_p, nnz, rcs, and ‖A‖_F.

Given any representation of the matrix as a list of nonzero entries and their indices, we can always implement the access modes above (in the assumed constant time) with O(nnz)-time preprocessing; see, e.g., Vose [47] for an implementation of the sampling (in a unit-cost RAM model). Our variance-reduced algorithms have an additive O(nnz) term in their runtimes due to the need to compute at least one matrix-vector product to implement gradient estimators. Thus, their stated runtime bounds hold independently of matrix access assumptions.
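The constant-time access assumption can be met after O(nnz) preprocessing. Below is a minimal sketch (ours) of access mode A2 using per-row cumulative sums and binary search, which gives O(log n) rather than O(1) per draw; Vose's alias method [47] achieves O(1).

```python
import numpy as np
from bisect import bisect_left

class RowSampler:
    """O(nnz) preprocessing supporting draws of j with probability
    |A_{ij}|^p / ||A_{i:}||_p^p (access mode A2) in O(log n) time per draw.
    Assumes every row has at least one nonzero entry."""
    def __init__(self, A, p=2):
        self.cols, self.cdfs, self.norms = [], [], []
        for row in A:                                # store nonzeros only
            idx = np.flatnonzero(row)
            w = np.abs(row[idx]) ** p
            self.cols.append(idx)
            self.cdfs.append(np.cumsum(w))
            self.norms.append(w.sum() ** (1 / p))    # ||A_{i:}||_p, for access mode A4

    def sample(self, i, rng):
        cdf = self.cdfs[i]
        k = bisect_left(cdf, rng.random() * cdf[-1])
        return self.cols[i][k]
```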
2.4 Data structures

We rely on data structures to maintain and sample from the iterates of our algorithms. Below, we give a summary of the operations supported by our data structures and their runtime guarantees. We show how to implement these data structures in Section 5.

2.4.1 IterateMaintainer_p

Given p ∈ {1, 2}, we design a data structure IterateMaintainer_p which maintains an implicit representation of the current iterate x ∈ R^n and a running sum s of all iterates. At initialization, this data structure takes as input the initial iterate x_0 to be maintained; for p = 2, the data structure also takes as input a fixed vector v. It then supports the following operations.

Category   | Function        | Semantics                                            | Runtime
initialize | Init(x_0, v)    | x ← x_0, s ← 0                                       | O(n)
update     | Scale(c)        | x ← cx                                               | O(1)
           | AddSparse(j, c) | [x]_j ← [x]_j + c (if p = 1, we require c ≥ −[x]_j)  | O(log n) †
           | AddDense(c)     | x ← x + cv (supported if p = 2)                      | O(1)
           | UpdateSum()     | s ← s + x                                            | O(1)
query      | Get(j)          | return [x]_j                                         | O(1)
           | GetSum(j)       | return [s]_j                                         | O(1)
           | GetNorm()       | return ‖x‖_p                                         | O(1)
sample     | Sample()        | return j with probability [x]_j^p / ‖x‖_p^p          | O(log n)

† An alternative implementation does not support Sample, but performs AddSparse in time O(1).

The implementation of IterateMaintainer_p is given in Section 5.1. In Sections C.2 and D.3 we use the variants WeightedIterateMaintainer and CenteredIterateMaintainer of this data structure, and defer the detailed discussion of their implementations to Appendix G.
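A compact sketch (ours) of the main idea behind IterateMaintainer_2: represent x = a·u + b·v and maintain a few scalar inner products so that Scale, AddDense, AddSparse, Get and GetNorm all take constant time. It omits Sample() and the running-sum bookkeeping, and assumes nonzero scaling factors.

```python
import numpy as np

class IterateMaintainer2:
    """Simplified sketch of IterateMaintainer_2 (p = 2). The iterate is kept as
    x = a*u + b*v, where u is stored explicitly and v is the fixed dense vector."""
    def __init__(self, x0, v):
        self.u = np.array(x0, dtype=float)
        self.v = np.array(v, dtype=float)
        self.a, self.b = 1.0, 0.0
        self.uu = float(self.u @ self.u)     # ||u||^2
        self.uv = float(self.u @ self.v)     # <u, v>
        self.vv = float(self.v @ self.v)     # ||v||^2

    def scale(self, c):          # x <- c*x (assumes c != 0)
        self.a *= c; self.b *= c

    def add_dense(self, c):      # x <- x + c*v
        self.b += c

    def add_sparse(self, j, c):  # [x]_j <- [x]_j + c
        d = c / self.a
        self.uu += 2 * d * self.u[j] + d * d
        self.uv += d * self.v[j]
        self.u[j] += d

    def get(self, j):            # [x]_j
        return self.a * self.u[j] + self.b * self.v[j]

    def get_norm(self):          # ||x||_2
        return np.sqrt(self.a ** 2 * self.uu + 2 * self.a * self.b * self.uv
                       + self.b ** 2 * self.vv)
```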
2.4.2 ApproxExpMaintainer

To maintain multiplicative weights updates with a fixed dense component, we design a data structure ApproxExpMaintainer, initialized with an arbitrary point x_0 ∈ Δ^n, a direction v ∈ R^n, a decay constant κ ∈ [0, 1) and an approximation error parameter ε. In order to specify the implementation of our data structure, we require the following definition.

Definition 2 (β-padding). For x, x' ∈ Δ^n, we say x' is a β-padding of x if x' = x̃/‖x̃‖_1 for a point x̃ ∈ R^n_{≥0} with x̃ ≥ x entrywise and ‖x̃ − x‖_1 ≤ β.

Notions similar to β-padding appear in previous literature [e.g., 21]. A key technical property of β-paddings is that they do not increase entropy significantly (see Lemma 5). ApproxExpMaintainer maintains two vectors x, x̂ ∈ Δ^n that, for the error tolerance parameter ε, satisfy the invariant

  x̂ is an ε-padding of x.   (22)

We now specify the interface, where ∘ denotes the elementwise product, [x^κ]_j = [x]_j^κ denotes the elementwise power, Π_Δ(z) = z/‖z‖_1 normalizes z ∈ R^n_{≥0} to lie in the simplex, and ‖s‖_0 denotes the number of nonzeros in the vector s. To state our runtimes, we define

  ω := max( 1/(1−κ), n/(λε) ).

Category   | Function               | Semantics                                      | Runtime
initialize | Init(x_0, v, κ, ε, λ)  | κ ∈ [0, 1), ε > 0, min_j [x_0]_j ≥ λ           | O(n log n log ω)
update     | MultSparse(g)          | x ← ε-padding of Π_Δ(x ∘ exp(g))               | O(‖g‖_0 log n log ω)
           | DenseStep()            | x ← Π_Δ(x^κ ∘ exp(v))                          | O(log n)
           | UpdateSum()            | s ← s + x̂ (recall the invariant (22))          | O(log n log ω)
query      | Get(j)                 | return [x̂]_j                                   | O(log n log ω)
           | GetSum(j)              | return [s]_j                                   | O(log ω)
sample     | Sample()               | return j with probability [x̂]_j                | O(log n log ω)

For most of our applications of ApproxExpMaintainer, ω is polynomial in m and n (our iterate dimensions), so log(ω) = O(log(mn)); the exception is our maximum inscribed ball application, where our runtimes additionally depend polylogarithmically on the size of the hyperplane shifts b (see Remark 2). We defer a more fine-grained runtime discussion to Section 5.3.

The role of ApproxExpMaintainer is to efficiently implement regularized and reduced-variance stochastic mirror descent steps of the form (5). To do this, we initialize the data structure with v = (1−κ) log x_0 − ηκ g_{x_0}. Then, the iteration (5) consists of calling DenseStep() followed by MultSparse(−ηκ δ̃_x).
In Section 3.1.1 we introduce the concept of a local gradient estimator, which allows stronger guarantees for stochastic mirror descent with clipping (Algorithm 1) via local norms analysis. Then, in Section 3.1.2 we state the form of the specific local gradient estimators we use in our coordinate methods, and motivate the values of L_co in Table 1.

For a local norm setup (Z, ‖·‖_·, r, Θ, clip), we call a stochastic gradient estimator g̃ : Z → Z* an L-local estimator if it satisfies the following properties for all z ∈ Z:
1. Unbiasedness: E[g̃(z)] = g(z).
2. Second moment bound: for all w ∈ Z, E[‖g̃(z)‖²_w] ≤ L².

The following lemma shows that a gradient mapping admitting an L-local estimator is necessarily L-bounded. Lemma 1.
A gradient mapping that admits an L-local estimator satisfies ‖g(z)‖_* ≤ L for all z ∈ Z.

Proof. For every w ∈ Z, the function ‖·‖²_w is convex. Thus by Jensen's inequality, ‖g(z)‖²_w = ‖E g̃(z)‖²_w ≤ E‖g̃(z)‖²_w ≤ L². Taking a supremum over w ∈ Z gives ‖g(z)‖_* ≤ L.

We note that the same result does not hold for g̃ because maximum and expectation do not commute; that is, E‖g̃(z)‖²_* is not necessarily bounded by L². This fact motivates our use of local norms analysis.

Below, we state Algorithm 1, stochastic mirror descent with clipping, and a guarantee on its rate of convergence using local gradient estimators. We defer the proof to Appendix B and note here that it uses the "ghost iterates" technique due to Nemirovski et al. [29] in order to rigorously bound the expected regret with respect to the best response to our iterates, rather than a pre-specified point. This technique is purely analytical and does not affect the algorithm. We also note that the second inequality in Proposition 2 holds with any convex-concave function f, similarly to [8, Corollary 1]; the first uses bilinearity of our problem structure.

Algorithm 1:
Stochastic mirror descent
Input: Matrix A ∈ R^{m×n}, L-local gradient estimator g̃, clipping function clip(·)
Output: A point with O(Θ/(ηT) + ηL²) expected duality gap
Parameters: Step-size η, number of iterations T
    z₀ ← arg min_{z∈Z} r(z)
    for t = 1, ..., T do
        z_t ← arg min_{z∈Z} { ⟨clip(η g̃(z_{t−1})), z⟩ + V_{z_{t−1}}(z) }
    return (1/(T+1)) Σ_{t=0}^{T} z_t

Proposition 2.
Let ( Z , (cid:107)·(cid:107) · , r , Θ , clip ) be a local norm setup, let L, (cid:15) > , and let ˜ g be an L -localestimator. Then, for η ≤ (cid:15) L and T ≥ η(cid:15) ≥ L Θ (cid:15) , Algorithm 1 outputs a point ¯ z such that E Gap(¯ z ) ≤ E (cid:34) sup u ∈Z T + 1 T (cid:88) t =0 (cid:104) g ( z t ) , z t − u (cid:105) (cid:35) ≤ (cid:15). We now state the general form which our local gradient estimators ˜ g take. At a point z ∈ Z , forspecified sampling distributions p ( z ) , q ( z ) , sample i x , j x ∼ p ( z ) and i y , j y ∼ q ( z ) . Then, define ˜ g ( z ) := (cid:18) A i x j x [ z y ] i x p i x j x ( z ) e j x , − A i y j y [ z x ] j y q i y j y ( z ) e i y (cid:19) + g (0) where g (0) = ( b, c ) . (23)It is clear that regardless of the distributions p ( z ) , q ( z ) , for the gradient operator in (20), ˜ g ( z ) is anunbiased gradient estimator ( E [˜ g ( z )] = g ( z ) ) and ˜ g ( z ) − g (0) is 2-sparse. Optimal values of L co . In the remainder of this section we assume for simplicity the g (0) = 0 (i.e. the objective f in (19) has not linear terms). Here we compute the optimal values of L forlocal gradient estimators (see Definition 3) of the form (23) for each of the local norm setups weconsider. This motivates the values of L co we derive in the following sections. First, in the (cid:96) - (cid:96) case, the second moment of (cid:107) ˜ g x ( z ) (cid:107) w x (the local norm of the X block of ˜ g ( z ) at point w ) is E (cid:34) [ w x ] j x (cid:18) A i x j x [ z y ] i x p i x j x ( z ) (cid:19) (cid:35) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y ] i [ w x ] j p ij ( z ) ≥ (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij | [ z y ] i (cid:113) [ w x ] j . z y ∈ ∆ m and √ w x ∈ B n with (cid:13)(cid:13) √ w x (cid:13)(cid:13) = 1 , the above lower bound is in the worst case max i (cid:107) A i : (cid:107) . Similarly, the best possible bound on the Y is max j (cid:107) A : j (cid:107) . Therefore, in the (cid:96) - (cid:96) setup, no local estimator has parameter L smaller than L co in Table 1.Next, in the (cid:96) - (cid:96) case, the ( (cid:96) ) second moment of the X block is E (cid:34)(cid:18) A i x j x [ z y ] i x p i x j x ( z ) (cid:19) (cid:35) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y ] i p ij ( z ) ≥ (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij | [ z y ] i = (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) [ z y ] i . In the worst case, this is at least ( (cid:80) i ∈ [ m ] (cid:107) A i : (cid:107) ) ; similarly, the best second moment bound for the Y block is ( (cid:80) j ∈ [ n ] (cid:107) A : j (cid:107) ) , which means that L co is similarly unimprovable in the (cid:96) - (cid:96) setup.Finally, in the (cid:96) - (cid:96) case, where X = B n and Y = ∆ m , we again have that the (cid:96) second momentof the X (ball) block is at least E (cid:34)(cid:18) A i x j x [ z y ] i x p i x j x ( z ) (cid:19) (cid:35) ≥ (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij | [ z y ] i . Here, since z y ∈ ∆ m , the worst-case lower bound of the variance is max i (cid:107) A i : (cid:107) . Further, the localnorm (at w ) second moment of the Y (simplex) block is at least E (cid:34) [ w y ] i y (cid:18) A i y j y [ z x ] j y q i y j y ( z ) (cid:19) (cid:35) ≥ (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij | [ z x ] j (cid:112) [ w y ] i . Since z x ∈ B n and √ w y ∈ B m , in the worst case this second moment can be as high as (cid:107)| A |(cid:107) op , wherewe use | A | to denote the elementwise absolute value of A . This is better than the L co in Table 1,suggesting there is room for improvement here. 
However, the sampling probabilities inducing this optimal variance bound are of the form

    q_ij(z; w) ∝ |A_ij| √([w^y]_i) · [z^x]_j,

and it is unclear how to efficiently sample from this distribution. Improving our ℓ₂-ℓ₁ gradient estimator (or proving that no improvement is possible) remains an open problem.

In this section, we develop the algorithmic framework we use in our variance-reduced methods. We first define a type of "centered-local" gradient estimator, modifying the local gradient estimators of the previous section. We then give the general form of a variance-reduced method and analyze it in the context of our gradient estimators and the error incurred by our data structure maintenance.
For local norm setup ( Z , (cid:107)·(cid:107) · , r , Θ , clip ), and given a reference point w = ( w x , w y ) ,we call a stochastic gradient estimator ˜ g w : Z → Z ∗ an L -centered-local estimator if it satisfies thefollowing properties:1. Unbiasedness: E [˜ g w ( z )] = g ( z ) .2. Relative variance bound: for all w ∈ Z , E [ (cid:107) ˜ g w ( z ) − g ( w ) (cid:107) w ] ≤ L V w ( z ) . emark 1. Similarly to Lemma 1, a gradient mapping that admits an L -centered-local estimatoralso satisfies (cid:107) g ( z ) − g ( w ) (cid:107) ∗ ≤ L V w ( z ) , by Jensen’s inequality.Algorithm 2 below is an approximate variant of the variance reduction algorithm in our earlierwork [8] which closely builds upon the “conceptual prox-method” of Nemirovski [28]. The algorithmrepeatedly calls a stochastic oracle O : Z → Z to produce intermediate iterates, and then performsan extragradient (linearized) proximal step using the intermediate iterate. The main modificationcompared to [8] is Line 5, which accommodates slight perturbations to the extra-gradient stepresults. These perturbations arise due to input requirements of our data structures: we slightly padcoordinates in simplex blocks to ensure they are bounded away from zero.
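The padding just mentioned is the truncation map formalized in Definition 6 below. As a point of reference, its semantics are simply the following (a hypothetical numpy sketch for illustration, not the in-place data structure operation):

```python
import numpy as np

def truncate(x_prime, delta):
    """Pad small simplex coordinates: return x with x_j proportional to
    max(x'_j, delta), so every entry is bounded away from zero
    (cf. Definition 6 and its use in Section 4.2)."""
    x = np.maximum(x_prime, delta)
    return x / x.sum()
```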
Algorithm 2: OuterLoop(O) (conceptual prox-method [28])
Input: Target approximation quality ε_outer, (α, ε_inner)-relaxed proximal oracle O(z) for gradient mapping g and some ε_inner < ε_outer, distance-generating r
Parameters: Number of iterations K
Output: Point z̄_K with E Gap(z̄_K) ≤ αΘ/K + ε_outer
    z₀ ← arg min_{z∈Z} r(z)
    for k = 1, ..., K do
        z_{k−1/2} ← O(z_{k−1})    ▷ we implement O(z_{k−1}) by calling InnerLoop(z_{k−1}, g̃_{z_{k−1}}, α)
        z*_k := Prox^α_{z_{k−1}}(g(z_{k−1/2})) = arg min_{z∈Z} { ⟨g(z_{k−1/2}), z⟩ + αV_{z_{k−1}}(z) }
        z_k ← any point satisfying V_{z_k}(u) − V_{z*_k}(u) ≤ (ε_outer − ε_inner)/α for all u ∈ Z
    return z̄_K = (1/K) Σ_{k=1}^{K} z_{k−1/2}

The following definition summarizes the key property of the oracle O.

Definition 5 ([8, Definition 1]). Let the operator g be monotone and α, ε_inner > 0. An (α, ε_inner)-relaxed proximal oracle for g is a (possibly randomized) map O : Z → Z such that z′ = O(z) satisfies

    E[ max_{u∈Z} { ⟨g(z′), z′ − u⟩ − αV_z(u) } ] ≤ ε_inner.

The following proposition, a variant of [8, Proposition 1], shows that despite the error tolerated in Line 5, the algorithm still converges with rate 1/K. We defer its proof to Appendix B.

Proposition 3.
Let O be an ( α , ε inner )-relaxed proximal oracle with respect to gradient mapping g ,distance-generating function r with range at most Θ and some ε inner ≤ ε outer . Let z / , z / , . . . , z K − / be iterates of Algorithm 2 and let ¯ z K be its output. Then E Gap(¯ z K ) ≤ E max u ∈Z K K (cid:88) k =1 (cid:10) g ( z k − / ) , z k − / − u (cid:11) ≤ α Θ K + ε outer . Algorithm 3 is a variant of the variance-reduced inner loop of [8], adapted for local norms andinexact iterates (again, due to approximations made by the data structure). It tolerates error inthree places:1. Instead of estimating the gradient at the previous iterate w t − , we estimate it at a point ˆ w t − such that w t − − ˆ w t − has small norm and similar divergence from the reference point w (Line 2).2. Instead of letting the next iterate be the exact mirror descent step w (cid:63)t , we let be a point w T thatis close to w (cid:63)t in norm and has similar divergences to from w and to any any point in Z (Line 4).20. The output ˜ w can be an approximation of the average of the iterates, as long as its difference tothe true average has bounded norm (Line 5).We quantify the effect of these approximations in Proposition 4, which gives a runtime guaranteefor Algorithm 3 (where we recall the definition of (cid:107)·(cid:107) as the dual norm of (cid:107)·(cid:107) ∗ , see (18)). The proofis deferred to Appendix B. Algorithm 3:
InnerLoop(w₀, g̃_{w₀}, ϕ)
Input: Initial w₀ ∈ Z, L-centered-local gradient estimator g̃_{w₀}, oracle quality α > 0
Parameters: Step size η, number of iterations T, approximation tolerance ϕ
Output:
Point ˜ w satisfying Definition 5 for t = 1 , . . . , T do ˆ w t − ≈ w t − satisfying (a) V w ( ˆ w t − ) − V w ( w t − ) ≤ ϕα and (b) (cid:107) ˆ w t − − w t − (cid:107) ≤ ϕLD w (cid:63)t ← arg min w ∈Z (cid:8) (cid:104) w, clip( η ˜ g w ( ˆ w t − ) − ηg ( w )) + ηg ( w ) (cid:105) + αη V w ( w ) + V w t − ( w ) (cid:9) w t ≈ w (cid:63)t satisfying(a) max u (cid:2) V w t ( u ) − V w (cid:63)t ( u ) (cid:3) ≤ ηϕ ,(b) V w ( w t ) − V w ( w (cid:63)t ) ≤ ϕα , and(c) (cid:107) w t − w (cid:63)t (cid:107) ≤ ϕ LD return ˜ w ≈ T (cid:80) Tt =1 w t satisfying (cid:13)(cid:13)(cid:13) ˜ w − T (cid:80) Tt =1 w t (cid:13)(cid:13)(cid:13) ≤ ϕLD . Proposition 4.
Let ( Z , (cid:107)·(cid:107) · , r , Θ , clip ) be any local norm setup. Let w ∈ Z , α ≥ ε inner > ,and ˜ g w be an L -centered-local estimator for some L ≥ α . Assume the domain is bounded by max z ∈Z (cid:107) z (cid:107) ≤ D , that g is L -Lipschitz, i.e. (cid:107) g ( z ) − g ( z (cid:48) ) (cid:107) ∗ ≤ L (cid:107) z − z (cid:48) (cid:107) , that g is LD -bounded,i.e. max z ∈Z (cid:107) g ( z ) (cid:107) ∗ ≤ LD , and that ˆ w = w . Then, for η = α L , T ≥ ηα ≥ L α , and ϕ = ε inner ,Algorithm 3 outputs a point ˆ w ∈ Z such that E max u ∈Z [ (cid:104) g ( ˜ w ) , ˜ w − u (cid:105) − αV w ( u )] ≤ ε inner , (24) i.e. Algorithm 3 is an ( α, ε inner ) -relaxed proximal oracle. Remark 2 (Assumption of boundedness on g ) . The assumption that g is LD -bounded in the dualnorm is immediate from other assumptions used in Proposition 4 in the case of the applicationsin Section 4, where we develop methods for solving (cid:96) - (cid:96) matrix games and assume that g (0) = 0 .In applications in Section 6, due to the existence of extra linear terms b , c (cid:54) = 0 , all complexitybounds will have an additional dependence on log( (cid:107) [ b ; c ] (cid:107) ∗ ) which we pay in the implementation ofdata structure ApproxExpMaintainer (i.e. the parameter L in the bound on g is larger if (cid:107) [ b ; c ] (cid:107) ∗ is large). We hide this extra polylogarithmic factor in the (cid:101) O notation.We also remark that (up to constants) the bounds on the range of ε inner ≤ α ≤ L in the statementof Proposition 4 correspond to the cases where the inner and outer loop consist of a single iteration. We now state the general form which our centered-local estimators ˜ g w take, given a referencepoint w ∈ Z . At a point z , for sampling distributions p ( z ; w ) , q ( z ; w ) to be specified, sample i x , j x ∼ p ( z ; w ) and i y , j y ∼ q ( z ; w ) . Then, define ˜ g w ( z ) = (cid:18) A i x j x [ z y − w y ] i x p i x j x ( z ; w ) e j x , − A i y j y [ z x − w x ] j y q i y j y ( z ; w ) e i y (cid:19) + g ( w ) . (25)21t is clear that regardless of the distributions p ( z ; w ) , q ( z ; w ) , this is an unbiased gradientestimator ( E [˜ g w ( z )] = g ( z ) ). Furthermore, ˜ g w ( z ) − g ( w ) is always 2-sparse. In this section we instantiate the algorithmic framework of Section 3 in (cid:96) - (cid:96) setup without linearterms, i.e. b = c = 0 in the objective (19). This is the fundamental “matrix game” problem min x ∈ ∆ m max y ∈ ∆ n y (cid:62) Ax.
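Throughout, the success criterion is the expected duality gap. For the matrix game above it has a closed form over the simplices, since each player's best response is a vertex; the following small numpy snippet (an illustration of the criterion only, using the dimension conventions of (1)) evaluates it:

```python
import numpy as np

def matrix_game_gap(A, x, y):
    """Duality gap of (x, y) for min over the n-simplex, max over the m-simplex
    of y^T A x, with A of shape (m, n): each best response is a vertex, so
    Gap(x, y) = max_i (A x)_i - min_j (A^T y)_j."""
    return np.max(A @ x) - np.min(A.T @ y)

# example: uniform strategies on a random payoff matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
x = np.ones(4) / 4
y = np.ones(5) / 5
print(matrix_game_gap(A, x, y))   # any feasible pair has nonnegative gap
```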
We give two algorithms for approximately solving matrix games. In Section 4.1 we develop a stochastic coordinate method based on Algorithm 1 with potentially sublinear runtime Õ((L_{1,co}/ε)²). In Section 4.2 we develop a coordinate variance-reduction method based on Algorithm 2 with runtime Õ(nnz + √nnz · L_{1,co}/ε) that improves on the former runtime whenever it is Ω(nnz). In both cases we have

    L_{1,co} := max{ max_i ‖A_{i:}‖₂ , max_j ‖A_{:j}‖₂ }    (26)

as in Table 1. Instantiations for the ℓ₂-ℓ₁ and ℓ₂-ℓ₂ setups follow similarly. We carry them out in Appendices C (for stochastic coordinate methods) and D (for variance reduction methods).

Remark 3.
For simplicity in this section (and the remaining implementations in Appendices C, D),we will set g (0) = 0 whenever the setup is not (cid:96) - (cid:96) , as is standard in the literature. We defer adiscussion of how to incorporate arbitrary linear terms in simplex domains to Section 6; up toadditional logarithmic terms in the runtime, this extension is supported by ApproxExpMaintainer . Assumptions.
Throughout (for both Sections 4.1 and 4.2), we assume access to entry queries, ℓ₂ norms of rows and columns, and ℓ₂² sampling distributions for all rows and columns. We use the ℓ₁-ℓ₁ local norm setup (Table 6). We also define

    L_max := ‖A‖_max = max_{i∈[m], j∈[n]} |A_ij|.

4.1 ℓ₁-ℓ₁ sublinear coordinate method

For z ∈ ∆^n × ∆^m and desired accuracy ε > 0, we specify the sampling distributions p(z), q(z):

    p_ij(z) := [z^y]_i · A_ij² / ‖A_{i:}‖₂²   and   q_ij(z) := [z^x]_j · A_ij² / ‖A_{:j}‖₂².    (27)

We first state and prove the local properties of this estimator.

Lemma 2.
In the (cid:96) - (cid:96) setup, estimator (23) using the sampling distribution in (27) is a √ L , co -local estimator.Proof. Unbiasedness holds by definition. For arbitrary w x , we have the variance bound: E (cid:104) (cid:107) ˜ g x ( z ) (cid:107) w x (cid:105) ≤ (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ) · (cid:32) [ w x ] j · (cid:18) A ij [ z y ] i p ij ( z ) (cid:19) (cid:33) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w x ] j A ij [ z y ] i p ij ( z ) ≤ (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w x ] j [ z y ] i (cid:107) A i : (cid:107) ≤ max i ∈ [ m ] (cid:107) A i : (cid:107) ≤ ( L co , ) . E (cid:104) (cid:107) ˜ g y ( z ) (cid:107) w y (cid:105) ≤ ( L co , ) . The definition (cid:107) ˜ g ( z ) (cid:107) w = (cid:107) ˜ g x ( z ) (cid:107) w x + (cid:107) ˜ g y ( z ) (cid:107) w y yields the claimed variance bound. In this section, we discuss the details of how to leverage the
IterateMaintainer data structureto implement the iterations of our algorithm. The algorithm we analyze is Algorithm 1, using thelocal estimator defined in (23), and the distribution (27). We choose η = (cid:15) (cid:16) L , co (cid:17) and T = (cid:24) η(cid:15) (cid:25) ≥ (cid:16) L , co (cid:17) log( mn ) (cid:15) . Lemma 2 implies that our estimator satisfies the remaining requirements for Proposition 2, givingthe duality gap guarantee in T iterations. In order to give a runtime bound, we claim that eachiteration can be implemented in log( mn ) time, with O ( m + n ) additional runtime. Data structure initializations and invariants.
At the start of the algorithm, we spend O ( m + n ) time initializing data structures via IM x . Init ( n n , n ) and IM y . Init ( m m , m ) , where IM x , IM y are appropriate instantiations of IterateMaintainer data structures. Throughout, we preservethe invariant that the points maintained by IM x , IM y correspond to the x and y blocks of the currentiterate z t at iteration t of the algorithm. Iterations.
For simplicity, we only discuss the runtime of updating the x block as the y blockfollows symmetrically. We divide each iteration into the following substeps, each of which we showruns in time O (log mn ) . We refer to the current iterate by z = ( z x , z y ) , and the next iterate by w = ( w x , w y ) . Sampling.
Recall that p ij ( z ) := [ z y ] i A ij (cid:107) A i : (cid:107) . We first sample coordinate i via IM y . Sample () in O (log m ) . Next, we sample j ∈ [ n ] with probabil-ity proportional to A ij using the data structure corresponding to A i : in O (1) by assumption of thematrix access model. Computing the gradient estimator.
To compute c := clip( A ij [ z y ] i /p ij ) , it suffices to compute A ij , [ z y ] i , and p ij . Using an entry oracle for A we obtain A ij , and we get [ z y ] i by calling IM y . Get ( i ) .Computing p ij using the precomputed (cid:107) A i : (cid:107) and the values of A ij , [ z y ] i therefore takes O (1) time. Performing the update.
For the update corresponding to a proximal step, we have w x ← Π X ( z x ◦ exp( − η ˜ g x ( z ))) = z x ◦ exp( − η ˜ g x ( z )) (cid:107) z x ◦ exp( − η ˜ g x ( z )) (cid:107) .
We have computed g̃^x(z), so to perform this update, we call

    ξ ← IM_x.Get(j);  IM_x.AddSparse(j, ξ·(exp(−ηc) − 1));  IM_x.Scale(IM_x.GetNorm()⁻¹);  IM_x.UpdateSum().

By assumption, each operation takes time O(log n), giving the desired iteration complexity. It is clear that at the end of performing these operations, the invariant that IM_x maintains the x block of the iterate is preserved.

Averaging.
After T iterations, we compute the average point z̄^x:

    [z̄^x]_j ← (1/T) · IM_x.GetSum(j),  for all j ∈ [n].

By assumption, this takes O(n) time.

In the ℓ₁-ℓ₁ setup, the implementation in Section 4.1.2 has runtime

    O( (L_{1,co})² log²(mn) / ε² + m + n ),

and outputs a point z̄ ∈ Z such that E Gap(z̄) ≤ ε.

Proof. The runtime follows from the discussion in Section 4.1.2. The correctness follows from Proposition 2.
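For reference, the following Python sketch traces the procedure just described, with the data-structure calls replaced by their dense equivalents. It is a semantic illustration under simplifying assumptions (clipping omitted, no all-zero rows or columns of A, O(m + n) work per iteration instead of O(log(mn))), not the sublinear implementation.

```python
import numpy as np

def l1_l1_sublinear_reference(A, eta, T, seed=0):
    """Dense reference of the Section 4.1 method.  Each step mirrors the
    IterateMaintainer calls (Sample, Get, AddSparse, Scale, UpdateSum) but is
    executed on full vectors; clipping is omitted for brevity."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    row_sq = (A ** 2).sum(axis=1)            # ||A_i:||_2^2 (precomputed)
    col_sq = (A ** 2).sum(axis=0)            # ||A_:j||_2^2 (precomputed)
    x, y = np.ones(n) / n, np.ones(m) / m    # IM_x.Init, IM_y.Init
    sum_x, sum_y = np.zeros(n), np.zeros(m)  # running sums
    for _ in range(T):
        # x block: sample i ~ z^y, then j ~ A_ij^2 / ||A_i:||_2^2, cf. (27)
        i = rng.choice(m, p=y)                           # IM_y.Sample()
        j = rng.choice(n, p=A[i] ** 2 / row_sq[i])
        gx = A[i, j] * y[i] / (y[i] * A[i, j] ** 2 / row_sq[i])      # x-block estimate
        # y block: sample j' ~ z^x, then i' ~ A_ij'^2 / ||A_:j'||_2^2
        jp = rng.choice(n, p=x)                          # IM_x.Sample()
        ip = rng.choice(m, p=A[:, jp] ** 2 / col_sq[jp])
        gy = -A[ip, jp] * x[jp] / (x[jp] * A[ip, jp] ** 2 / col_sq[jp])  # y-block estimate
        # multiplicative-weights steps: AddSparse followed by Scale(1/GetNorm())
        x[j] *= np.exp(-eta * gx)
        x /= x.sum()
        y[ip] *= np.exp(-eta * gy)
        y /= y.sum()
        sum_x += x                                       # UpdateSum()
        sum_y += y
    return sum_x / T, sum_y / T                          # averaging via GetSum
```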
Remark 4.
Using our
IterateMaintainer data structure, the ℓ₁-ℓ₁ algorithm of Grigoriadis and Khachiyan [16] runs in time O(rcs · ‖A‖²_max · log(mn)/ε²), where rcs is the maximum number of nonzeros in any row or column. Our runtime universally improves upon it since (L_{1,co})² ≤ rcs · ‖A‖²_max.

4.2 ℓ₁-ℓ₁ variance-reduced coordinate method

Given a reference point w ∈ ∆^n × ∆^m, for z ∈ ∆^n × ∆^m and a parameter α > 0, we specify the sampling distributions p(z; w), q(z; w):

    p_ij(z; w) := ( [z^y]_i + 2[w^y]_i )/3 · A_ij²/‖A_{i:}‖₂²   and   q_ij(z; w) := ( [z^x]_j + 2[w^x]_j )/3 · A_ij²/‖A_{:j}‖₂².    (28)

We remark that this choice of sampling distribution, which we term "sampling from the sum" (of the current iterate and reference point), may be viewed as a computationally-efficient alternative to the distribution specified in [8], which was based on "sampling from the difference". In particular, sampling from the difference is an operation which, to the best of our knowledge, is difficult to implement in sublinear time, so we believe that demonstrating that this alternative distribution suffices may be of independent interest. In order to show its correctness, we need the following claim, whose proof we defer to Appendix D.1.

Lemma 3. For y, y′ ∈ ∆^m, the divergence V_y(y′) generated by r(y) = Σ_{i∈[m]} [y]_i log[y]_i − [y]_i satisfies

    V_y(y′) ≥ ‖y′ − y‖²_{y+y′} = (1/2) Σ_{i∈[m]} ([y]_i − [y′]_i)² / ([y]_i + [y′]_i).

We now show the local properties of this estimator.
Lemma 4.
In the (cid:96) - (cid:96) setup, estimator (25) using the sampling distribution in (28) is a √ L , co -centered-local estimator.Proof. Unbiasedness holds by definition. For arbitrary w x , we have the variance bound: E (cid:104)(cid:13)(cid:13) ˜ g x w ( z ) − g x ( w ) (cid:13)(cid:13) w x (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ; w ) · [ w x ] j · (cid:32) A ij (cid:0) [ z y ] i − [ w y ] i (cid:1) p ij ( z ; w ) (cid:33) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w x ] j A ij (cid:0) [ z y ] i − [ w y ] i (cid:1) p ij ( z ; w ) ≤ (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w x ] j ([ z y ] i − [ w y ] i ) [ z y ] i + [ w y ] i (cid:107) A i : (cid:107) ≤ (cid:18) max i (cid:107) A i : (cid:107) (cid:19) V w y ( z y ) , where in the last inequality we used Lemma 3. Similarly, we have for arbitrary w y , E (cid:104)(cid:13)(cid:13) ˜ g y w ( z ) − g y ( w ) (cid:13)(cid:13) w y (cid:105) ≤ (cid:18) max j (cid:107) A : j (cid:107) (cid:19) V w x ( z x ) . Combining these and using (cid:107) ˜ g w ( z ) − g ( w ) (cid:107) w := (cid:13)(cid:13) ˜ g x w ( z ) − g x ( w ) (cid:13)(cid:13) w x + (cid:13)(cid:13) ˜ g y w ( z ) − g y ( w ) (cid:13)(cid:13) w y yields the desired variance bound. In this section, we discuss the details of how to leverage the
ApproxExpMaintainer data structureto implement the iterations of our algorithm. We first state one technical lemma on the effectof β -padding (Definition 2) on increasing entropy, used in conjunction with the requirements ofProposition 4 to bound the error tolerance required by our ApproxExpMaintainer data structure.The proof is deferred to Appendix D.1.
Lemma 5.
Let x′ ∈ ∆^n be a β-padding of x ∈ ∆^n. Then,

    Σ_{j∈[n]} x′_j log x′_j − Σ_{j∈[n]} x_j log x_j ≤ β n/e + β(1 + β).

This leads to the following divergence bounds which will be used in this section.

Lemma 6.
Let x′ ∈ ∆^n be a β-padding of x ∈ ∆^n. Then

    V_{x′}(u) − V_x(u) ≤ β  for all u ∈ Z,

and, for a point x₀ with ‖log(x₀)‖_∞ ≤ M,

    V_{x₀}(x′) − V_{x₀}(x) ≤ β( M + n/e + 1 + β ).

Proof.
Throughout this proof, let ˜ x be the point in Definition 2 such that (cid:107) ˜ x − x (cid:107) ≤ β and x (cid:48) = ˜ x/ (cid:107) ˜ x (cid:107) .The first claim follows from expanding V x (cid:48) ( u ) − V x ( u ) = (cid:88) j ∈ [ n ] u j log x j x (cid:48) j = (cid:88) j ∈ [ n ] u j log (cid:18) x j ˜ x j · (cid:107) ˜ x (cid:107) (cid:19) ≤ log( (cid:107) ˜ x (cid:107) ) ≤ β. The first inequality used u ∈ ∆ n and ˜ x ≥ x entrywise, and the last inequality used log(1 + β ) ≤ β .For the second claim, we have by the triangle inequality (cid:13)(cid:13) x − x (cid:48) (cid:13)(cid:13) ≤ (cid:107) x − ˜ x (cid:107) + (cid:13)(cid:13) ˜ x − x (cid:48) (cid:13)(cid:13) ≤ β + ( (cid:107) ˜ x (cid:107) − (cid:13)(cid:13) x (cid:48) (cid:13)(cid:13) ≤ β. The claim then follows from expanding V x ( x (cid:48) ) − V x ( x ) = (cid:88) j ∈ [ n ] x (cid:48) j log x (cid:48) j − (cid:88) j ∈ [ n ] x j log x j + (cid:10) log x , x − x (cid:48) (cid:11) , and applying Lemma 5.The algorithm we analyze is Algorithm 2 with K = 3 α Θ /(cid:15) , ε outer = 2 (cid:15)/ , ε inner = (cid:15)/ usingAlgorithm 3 as an ( α, ε inner ) -relaxed proximal oracle. The specific modification we perform todefine the iterates { z k } of Algorithm 2 as modifications of the ideal iterates { z (cid:63)k } uses the followingdefinition. Definition 6.
For a simplex variable x (cid:48) ∈ ∆ n , we define truncate ( x (cid:48) , δ ) to be the point x ∈ ∆ n with x j ∝ max( x (cid:48) j , δ ) for all j ∈ [ n ] . For a variable z on two blocks, we overload notation and define truncate ( z, δ ) to be the result of applying truncate ( · , δ ) to each simplex block of z . For our implementation, in each step of Algorithm 2, we will compute the point z (cid:63)k exactly,and apply the operation z k ← truncate ( z (cid:63)k , δ ) , for δ = ε outer − ε inner α ( m + n ) . We now quantify the effect oftruncation in terms of Bregman divergence to an arbitrary point. Lemma 7 (Effect of truncation) . Let x (cid:48) ∈ ∆ n , and let x = truncate ( x (cid:48) , δ ) . Then, for any u ∈ ∆ n ,and where divergences are with respect to entropy, V x ( u ) − V x (cid:48) ( u ) ≤ δn. Proof.
Note that x is a δn -padding of x (cid:48) , as it is the result of adding at most δ to each coordinateand renormalizing to lie in the simplex. Consequently, the result follows from Lemma 6.Lemma 7 thus implies our iterates satisfy the requirements of Algorithm 2. Our implementationof Algorithm 3 will use approximation tolerance ϕ = ε inner / (cid:15)/ , where we always set L , co ≥ α ≥ ε inner . (29)26his matches the requirements of Proposition 4. In the implementation of Algorithm 3, we use thecentered-local gradient estimator defined in (25), using the sampling distribution (28). For each useof Algorithm 3, we choose η = α (cid:16) L , co (cid:17) and T = (cid:24) ηα (cid:25) ≥ (cid:16) L , co (cid:17) α . Our discussion will follow in four steps: first, we discuss the complexity of all executions inAlgorithm 2 other than calls to the oracles. Next, we discuss the complexity of all initializations of
ApproxExpMaintainer data structures. Then, we discuss the complexity of all other iterations ofAlgorithm 3. For simplicity, when discussing Algorithm 3, we will only discuss implementation ofthe x -block, and the y -block will follow symmetrically, while most runtimes are given consideringboth blocks. Lastly, we discuss complexity of computing the average iterate in the end of the innerloop. Altogether, the guarantees of Proposition 3 and Proposition 4 imply that if the guaranteesrequired by the algorithm hold, the expected gap of the output is bounded by (cid:15) . Outer loop extragradient steps.
Overall, we execute K = 3 α Θ /(cid:15) iterations of Algorithm 2,with ε outer = 2 (cid:15)/ , ε inner = (cid:15)/ to obtain the desired gap, where Θ = log( mn ) in the (cid:96) - (cid:96) setup. Wespend O ( nnz ) time executing each extragradient step in Algorithm 2 exactly to compute iterates z (cid:63)k ,where the dominant term in the runtime is computing each g ( z k − / ) , for k ∈ [ K ] . We can maintainthe average point ¯ z throughout the duration of the algorithm, in O ( m + n ) time per iteration.Finally, we spend an additional O ( m + n ) time per iteration applying truncate to each iterate z (cid:63)k . Data structure initializations and invariants.
We consider the initialization of data structuresfor implementing an ( α, ε inner = (cid:15)/ -relaxed proximal oracle with error tolerance ϕ = (cid:15)/ . First,note that the point w x used in the initialization of every inner loop, by the guarantees of truncate operation, has no two coordinates with multiplicative ratio larger than δ = (cid:15)/ (3 α ( m + n )) ≥ ( m + n ) − , by our choice α ≤ L , co (29) and our assumptions on L , co /(cid:15) (cf. Section 2.2). Sinceclearly a simplex variable in ∆ n has a coordinate at least /n , the entries of w are lower boundedby λ = ( m + n ) − .Next, we discuss the initial parameters given to AEM x , an instance of ApproxExpMaintainer which will support necessary operations for maintaining the x variable (we will similarly initializean instance AEM y ). Specifically, the invariant that we maintain throughout the inner loop is that initeration t , the“exact vector” x maintained by AEM x corresponds to the x block of the current iterate w t , and the “approximate vector” ˆ x maintained corresponds to the x block of the approximate iterate ˆ w t , as defined in Algorithm 3.We will now choose ˜ ε so that if AEM x is initialized with error tolerance ˜ ε , all requirements ofProposition 4 (e.g. the bounds stipulated in Algorithm 3) are met. We first handle all divergencerequirements. In a given iteration, denote the x blocks of w (cid:63)t , w t and ˆ w t by x (cid:63)t , x t and ˆ x t respectively,and recall AEM x guarantees x t is a ˜ ε -padding of x (cid:63)t , and ˆ x t is a ˜ ε -padding of x (cid:63)t . The former of theseguarantees is true by the specification of MultSparse (which will be used in the implementation ofthe step, see “Performing the update” below), and the latter is true by the invariant on the pointssupported by
AEM x . Lines 2 and 4 of Algorithm 3 stipulate the divergence requirements, where27 := w x , max { V x ( x t ) − V x ( x (cid:63)t ) , V x (ˆ x t ) − V x ( x t ) } ≤ ϕ α = (cid:15) α (30)and max u (cid:2) V x t ( u ) − V x (cid:63)t ( u ) (cid:3) ≤ ηϕ η(cid:15) . (31)Clearly, combining this guarantee with a similar guarantee on the y blocks yields the desired bound.Since we derived (cid:107) log w (cid:107) ∞ ≤ mn ) , we claim that choosing ˜ ε ≤ (cid:15) α ( m + n ) suffices for the guarantees in (30). By the first part of Lemma 6, for all sufficiently large m + n , max { V x ( x t ) − V x ( x (cid:63)t ) , V x (ˆ x t ) − V x ( x t ) } ≤ ˜ ε (cid:16)
10 log( mn ) + ne + 1 + ˜ ε (cid:17) ≤ (cid:15) α . Similarly for guarantees in (31), by the second part of Lemma 6 we know it suffices to choose ˜ ε ≤ (cid:15) (cid:16) L , co (cid:17) ≤ (cid:15)α (cid:16) L , co (cid:17) = η(cid:15) . Here, we used the restriction ε inner ≤ α ≤ L , co . Next, the norm requirements of Algorithm 3 (theguarantees in Lines 2, 4, and 5) imply we require ˜ ε ≤ (cid:15) √ L , co , where we used that g is (cid:107) A (cid:107) max ≤ L , co -Lipschitz and the diameter of Z is bounded by √ . Usingour assumptions on the size of parameters in Section 2.2, it suffices to set the error tolerance ˜ ε = ( m + n ) − . To give the remainder of specified parameters,
AEM x is initialized via Init ( w x , v, κ, ˜ ε ) for κ := 11 + ηα/ , v := (1 − κ ) log w x − ηκg x ( w ) . To motivate this form of updates, note that each iteration of Algorithm 3 requires us to compute arg min (cid:26) (cid:104) c t e j + g x ( w ) , x (cid:105) + α V w x ( x ) + 1 η V w x t ( x ) (cid:27) . We can see that the solution to this update is given by (cid:2) w (cid:63)t +1 (cid:3) x ← Π ∆ ([ w x t ] κ ◦ exp ((1 − κ ) log w x − ηκg x ( w )) ◦ exp( − ηκc t e j )) . (32)This form of update is precisely supported by our choice of κ and v , as well as the DenseStep and
MultSparse operations. By the choice of parameters, we note that − κ ≥ ( m + n ) − .Finally, in order to support our sampling distributions and gradient computations, we computeand store the vectors w and g ( w ) in full using O ( nnz ( A )) time at the beginning of the inner loop.In O ( m + n ) time, we also build two data structures which allow us to sample from entries of thegiven fixed vectors w x , and w y , in constant time respectively.Following Section 2.4.2, we defined the parameter ω := max (cid:18) − κ , nλ ˜ (cid:15) (cid:19) ≤ ( m + n ) , so that log( ω ) = O (log( mn )) . Altogether, these initializations take time O ( nnz + ( m + n ) log ( mn )) , following Section 2.4.2.28 nner loop iterations. We discuss how to make appropriate modifications to the x -block. Forsimplicity we denote our current iterate as z , and the next iterate as w . Also, we denote ˆ z asthe concatenation of implicit iterates that the two ApproxExpMaintainer copies maintain (seeSection 2.4.2 for more details), which is ˜ ε close in (cid:96) distance to z , the prior iterate, so that we canquery or sample entries from ˆ z using AEM x and AEM y . Each inner loop iteration consists of usinga gradient estimator at ˆ z satisfying (cid:107) ˆ z − z (cid:107) ≤ ˜ ε , sampling indices for the computation of ˜ g w (ˆ z ) ,computing the sparse part of ˜ g w (ˆ z ) , and performing the approximate update to the iterate. Weshow that we can run each substep using data structure AEM x in time O (log ( mn )) , within theerror tolerance of Proposition 4 due to the definition of ˜ ε . Combining with our discussion of thecomplexity of initialization, this implies that the total complexity of the inner loop, other thanoutputting the average iterate, is O ( T log ( mn ) + nnz + ( m + n ) log ( mn )) = O (cid:16) L , co (cid:17) · log ( mn ) α + nnz + ( m + n ) log ( mn ) . Sampling.
Recall that the distribution we sample from is given by p ij (ˆ z ; w ) := [ˆ z y ] i + 2[ w y ] i · A ij (cid:107) A i : (cid:107) . First, with probability / , we sample a coordinate i from the precomputed data structure forsampling from w y in constant time; otherwise, we sample i via AEM y . Sample () . Then, we samplean entry of A i : proportional to its square via the precomputed data structure (cf. Section 2.3) inconstant time. This takes in total O (log mn ) time. Computing the gradient estimator.
Proposition 4 requires us to compute the sparse componentof the gradient estimator (25) at point ˆ z . To do this for the x block, we first query [ w x ] j and [ˆ z y ] i ← AEM y . Get ( i ) , and then access the precomputed norm (cid:107) A i : (cid:107) and entry A ij . We then compute c = clip (cid:32) A ij (cid:2) ˆ z y − w y (cid:3) i · z y ] i + 2[ w y ] i · (cid:107) A i : (cid:107) A ij (cid:33) . By the guarantees of
ApproxExpMaintainer , this takes total time bounded by O (log( mn )) . Performing the update.
To perform the update, by observing the form of steps in Algorithm 3 withour choice of entropy regularizer, the update form given by the regularized mirror-descent step is(as derived in the discussion of the initialization of
AEM x , see (32)) [ w (cid:63) x ← Π ∆ (( w x ) κ ◦ exp((1 − κ ) log w x − ηκg x ( w ) − ηκce j )) . To implement this, recalling our choice of the vector v in the initialization of AEM x , it suffices to call AEM x . DenseStep ();
AEM x . MultSparse ( − ηκce j ); AEM x . UpdateSum () . By assumption, each operation takes time bounded by O (log ( mn )) , where we note the vector usedin the MultSparse operation is -sparse. The implementation of this update is correct up to a ˜ ε -padding, whose error we handled previously. By the discussion in the data structure initializationsection, this preserves the invariant that the x block of the current iterate is maintained by AEM x .29 verage iterate computation. At the end of each run of Algorithm 3, we compute and returnthe average iterate via calls
AEM x . GetSum ( j ) for each j ∈ [ n ] , and scaling by /T , and similarly query AEM y . The overall complexity of this step is O (( m + n ) log ( mn )) . The correctness guarantee, i.e.that the output approximates the average in (cid:96) norm up to ϕ/LD , is given by the choice of ˜ ε andthe guarantees of ApproxExpMaintainer , where in this case the domain size D is bounded by √ .This is never the dominant factor in the runtime, as it is dominated by the cost of initializations. In the (cid:96) - (cid:96) setup, let nnz (cid:48) := nnz + ( m + n ) log ( mn ) . The implementation inSection 4.2.2 with the optimal choice of α = max (cid:16) (cid:15)/ , L , co log ( mn ) / √ nnz (cid:48) (cid:17) has runtime O nnz (cid:48) + (cid:16) L , co (cid:17) log ( mn ) α α log( mn ) (cid:15) = O (cid:32) nnz (cid:48) + √ nnz (cid:48) L , co log ( mn ) (cid:15) (cid:33) and outputs a point ¯ z ∈ Z such that E Gap(¯ z ) ≤ (cid:15). Proof.
The correctness of the algorithm is given by the discussion in Section 4.2.2 and the guaranteesof Proposition 3 with K = 3 α Θ /(cid:15) , ε outer = 2 (cid:15)/ , ε inner = (cid:15)/ , Proposition 4 with ϕ = (cid:15)/ , and thedata structure ApproxExpMaintainer with our choice of ˜ ε := ( m + n ) − , to meet the approximation conditions in Line 2, 4 and 5 of Algorithm 3. The runtime bound isgiven by the discussion in Section 4.2.2, and the optimal choice of α is clear. In this section, we give implementations of our data structures, fulfilling the interface and runtimeguarantees of Section 2.4. In Section 5.1 we provide the implementation of
IterateMaintainer_p for p ∈ {1, 2}, used for sublinear coordinate methods. In Section 5.2, we provide an implementation of ApproxExpMaintainer, used in variance-reduced coordinate methods for simplex domains, provided we have an implementation of a simpler data structure,
ScaleMaintainer , which we then providein Section 5.3.
5.1 IterateMaintainer_p

The
UpdateSum are clear. Regarding correctness of
AddSparse ,note that (ignoring the v terms when p = 1 ) ξ u (cid:18) u + cξ u e j (cid:19) + ξ v v = ξ u u + ξ v v + ce j , (cid:18) u (cid:48) − cσ u ξ u e j (cid:19) + σ u (cid:18) u + cξ u e j (cid:19) + σ v v = u (cid:48) + σ u u + σ v v. When p = 1 , the update to ν is clearly correct. When p = 2 , because only [ x ] j changes, [ ξ u u + ξ v v + ce j ] j = [ ξ u u + ξ v v ] j + 2 c [ ξ u u + ξ v v ] j + c , ([ ξ u u + ξ v v + ce j ] j ) · [ v ] j = ([ ξ u u + ξ v v ] j ) · [ v ] j + c [ v ] j . Thus, the updates to the norm and inner product are correct. Regarding correctness of
AddDense when p = 2 , we have ξ u u + ( ξ v + c ) v = ξ u u + ξ v v + cv, (cid:107) x + cv (cid:107) = ν + 2 cι + c (cid:107) v (cid:107) , (cid:104) x + cv, v (cid:105) = ι + c (cid:107) v (cid:107) . Here, we use the invariants that ν = (cid:107) x (cid:107) and ι = (cid:104) x, v (cid:105) . • Get ( j ) : Return [ x ] j . Runs in time O (1) .Return ξ u [ u ] j + ξ v [ v ] j .• GetSum ( j ) : Return [ s ] j . Runs in time O (1) .Return [ u (cid:48) ] j + σ u [ u ] j + σ v [ v ] j .• Norm () : Return (cid:107) x (cid:107) p . Runs in time O (1) .Return ν .By our invariants, each of these operations is correct. The method
Sample returns a coordinate j with probability proportional to [ x ] pj in time O (log n ) .To implement it, we recursively perform the following procedure, where the recursion depth is atmost O (log n ) , starting at the root node and setting S = [ n ] :1. Let S , S be the subsets of coordinates corresponding to the children of the current node.2. Using scalars ξ u , ξ v , and the maintained (cid:80) j ∈ S i [ u ] pj , (cid:80) j ∈ S i [ u ] j [ v ] j , (cid:80) j ∈ S i [ v ] pj when appropriate,compute (cid:80) j ∈ S i [ x ] pj = (cid:80) j ∈ S i [ ξ u u + ξ v v ] pj for i ∈ { , } .3. Sample a child i ∈ { , } of the current node proportional to (cid:80) j ∈ S i [ x ] pj by flipping an appro-priately biased coin. Set S ← S i .It is clear that this procedure samples according to the correct probabilities. Furthermore, step2 can be implemented in O (1) time using precomputed values, so the overall complexity is O (log n ) .32 .2 ApproxExpMaintainer
In this section, we give the implementation of
ApproxExpMaintainer which supports dense updateto simplex mirror descent iterates. For convenience, we restate its interface, where we recall thenotation (cid:107) g (cid:107) for the number of nonzero entries in g , Definition 2 of an ε -padding, the invariant ˆ x is a ε -padding of x, (33)and the notation ω := max (cid:18) − κ , nλ(cid:15) (cid:19) . Category Function Runtime initialize
Init ( x , v, κ, ε, λ ) : κ ∈ [0 , , ε > , min j [ x ] j ≥ λ O ( n log n log ω ) update MultSparse ( g ) : x ← ε -padding of Π ∆ ( x ◦ exp( g )) O ( (cid:107) g (cid:107) log n log ω ) DenseStep () : x ← Π ∆ ( x κ ◦ exp( v )) O (log n ) UpdateSum () : s ← s + ˆ x (recall invariant (22)) O (log n log ω ) query Get ( j ) : Return [ˆ x ] j O (log n log ω ) GetSum ( j ) : Return [ s ] j O (log ω ) sample Sample () : Return j with probability [ˆ x ] j O (log n log ω ) We build
ApproxExpMaintainer out of a simpler data structure called
ScaleMaintainer , whichmaintains the simplex projection of fixed vectors raised elementwise to arbitrary powers; this sufficesto support consecutive
DenseStep calls without
MultSparse calls between them. To add supportfor
MultSparse , we combine O (log n ) instances of ScaleMaintainer in a formation resemblinga binomial heap: for every entry updated by
MultSparse we delete it from the
ScaleMaintainer instance currently holding it, put it in a new singleton
ScaleMaintainer instance (after appropriatescaling due to
MultSparse ), and merge this singleton into existing instances. We now give a briefdescription of the
ScaleMaintainer interface, and based on it, describe the implementation of
ApproxExpMaintainer . We will provide the implementation of
ScaleMaintainer in Section 5.3.
ScaleMaintainer is initialized with vectors ¯ x ∈ R n (cid:48) ≥ and ¯ δ ∈ R n (cid:48) (with n (cid:48) ≤ n ) and supportsefficient approximate queries on vectors of the form x [ σ ] := ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1) , for any scalar σ ∈ [ σ min , . More specifically, the data structure allows efficient computation of (cid:107) x [ σ ] (cid:107) (to within small multiplicative error ε scm ), as well as entry queries, sampling and runningsum accumulation from a vector ˆ x [ σ ] satisfying ˆ x [ σ ] is a ε scm -padding of Π ∆ (cid:0) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:1) = x [ σ ] (cid:107) x [ σ ] (cid:107) . (34)We make the following assumptions on the input to the data structure: λ scm ≤ [¯ x ] i ≤ for all i ∈ [ n (cid:48) ] and σ ∈ ( σ min , . The upper bounds on ¯ x and σ are arbitrary, and we may choose λ scm and σ min to be very small sincethe data structure runtime depends on them only logarithmically. To summarize this dependence,we define ω scm := max (cid:26) σ min , nλ scm ε scm (cid:27) . With these assumptions and notation, we define the formal interface of
ScaleMaintainer .33 ategory Function Runtime initialize
Init (¯ x, ¯ δ, σ min , ε scm , λ scm ) O ( n (cid:48) log n log ω scm ) update Del ( j ) : Remove coordinate j from ¯ x , ¯ δ O (1) UpdateSum ( γ, σ ) : s ← s + γ ˆ x [ σ ] , with ˆ x [ σ ] defined in (34) O (log ω scm ) query Get ( j ) : Return [ˆ x [ σ ]] j O (log ω scm ) GetSum ( j ) : Return [ s ] j . O (log ω scm ) GetNorm ( σ ) : Return ± ε approx. of (cid:13)(cid:13) ¯ x ◦ exp( σ ¯ δ ) (cid:13)(cid:13) O (log ω scm ) sample Sample ( σ ) : Return j with probability [ˆ x [ σ ]] j O (log n log ω scm ) ApproxExpMaintainer state
Throughout this section, we denote K := (cid:100) log n (cid:101) . ApproxExpMaintainer maintains a partition of [ n ] into K sets S . . . , S K (some of them possibly empty) that satisfy the invariant | S k | ≤ k for all k ∈ [ K ] . (35)We refer to the index k as “rank” and associate with each rank k ∈ [ K ] the following data1. Scalar γ k ≥ and nonnegative integer τ k .2. Vectors ¯ x k , ¯ δ k ∈ R | S k | such that λ scm ≤ [¯ x k ] i ≤ for all i ∈ [ | S k | ] , where λ scm = min( ε/n, λ ) .3. A ScaleMaintainer instance, denoted
ScaleMaintainer k , initialized with ¯ x k , ¯ δ k and λ scm defined above, σ min = 1 − κ and ε scm = ε/ , so that log ω scm = O (log ω ) . ApproxExpMaintainer also maintains a vector u ∈ R n for auxiliary running sum storage.Define the vector δ ∈ R n by [ δ ] S k = log (cid:0) γ k (cid:2) ¯ x k ◦ exp (cid:0) (1 − κ τ k )¯ δ k (cid:1)(cid:3)(cid:1) , k ∈ { , . . . , K } , (36)where [ δ ] S k denotes the coordinates of δ in S k . Recall that x denotes the point in ∆ n maintainedthroughout the operations of ApproxExpMaintainer ; we maintain the key invariant that the point x is proportional to exp( δ ) , i.e., x = exp( δ ) (cid:107) exp( δ ) (cid:107) . (37)Specifically, we show in Section 5.2.3 that our implementation of DenseStep modifies ¯ x , ¯ δ , { τ k } Kk =0 , { γ k } Kk =0 so that the resulting effect on δ , per definition (37), is δ ← κδ + v. (38)Similarly, our implementation of MultSparse modifies the state so that the resulting effect on δ is exp( δ ) (cid:107) exp( δ ) (cid:107) ← ε -padding of exp( δ + v ) (cid:107) exp( δ + v ) (cid:107) . (39)We remark that the role of γ k is to scale ¯ x k so that it lies coordinatewise in the range [ λ scm , ,conforming to the ScaleMaintainer input requirement. This is also the reason we require the ε -padding operation in the definition of MultSparse .34 .2.2 ε -padding point ˆ x We now concretely define the point ˆ x , which is the ε -padding of x that ApproxExpMaintainer maintains. Let
Γ := K (cid:88) k =1 γ k ScaleMaintainer k . GetNorm (1 − κ τ k ) , (40)be the approximation of exp( δ ) derived from the ScaleMaintainer instances. For any j ∈ [ n ] , let k j be such that j ∈ S k j , and let i j be the index of j in S k j . The j th coordinate of ˆ x is [ˆ x ] j := γ k j ScaleMaintainer k j . GetNorm (1 − κ τ kj )Γ · ScaleMaintainer k j . Get ( i j , − κ τ kj ) . (41)Since for each k , (cid:80) j ∈ S k ScaleMaintainer k . Get ( j, − κ τ k ) = ScaleMaintainer k . GetNorm (1 − κ τ k ) we have that ˆ x ∈ ∆ n . We now prove that ˆ x is a ε -padding of x . To do so, we prove the followinglemma. Lemma 8.
Let ε scm ≤ and { S k } Kk =1 be a partition of [ n ] . Suppose for each k ∈ [ K ] , ˆ x k ∈ ∆ | S k | isan ε scm -padding of x k ∈ ∆ | S k | . Further, suppose we have positive scalars { ν k } Kk =1 , { ˆ ν k } Kk =1 satisfying (1 − ε scm ) ν k ≤ ˆ ν k ≤ (1 + ε scm ) ν k , for all ≤ k ≤ K. Then, for N = (cid:80) Kk =1 ν k and ˆ N = (cid:80) Kk =1 ˆ ν k , we have that ˆ x := (cid:80) Kk =1 ˆ ν k ˆ N ˆ x k is a ε scm -padding of x := (cid:80) Kk =1 ν k N x k .Proof. For every k ∈ [ K ] , let ˜ x k to be such that ˜ x k ≥ x k elementwise, ˆ x k = ˜ x k / (cid:107) ˜ x k (cid:107) , and (cid:107) ˜ x k − x k (cid:107) ≤ ε scm . Consider the point ˜ x := K (cid:88) k =1 ˜ ν k ˆ N ˜ x k , where ˜ ν k := ˆ ν k max k ∈ [ K ] (cid:107) ˜ x k (cid:107) (cid:107) ˜ x k (cid:107) · ε scm − ε scm , so that ˆ x = ˜ x/ (cid:107) ˜ x (cid:107) . Since ˜ x k ≥ x k elementwise, ˆ ν k ≥ (1 − ε scm ) ν k and ˆ N ≤ (1 + ε scm ) N , we havethat ˜ x ≥ x elementwise. Furthermore, we have ˆ ν k ≤ (1 + ε scm ) ν k and ˆ N ≥ (1 − ε scm ) N , and theproperties ˜ x k ≥ x k and (cid:107) ˜ x k − x k (cid:107) imply ≤ (cid:107) ˜ x k (cid:107) ≤ ε scm as well as max k ∈ [ K ] (cid:107) ˜ x k (cid:107) (cid:107) ˜ x k (cid:107) ≤ ε scm .Therefore (cid:107) ˜ x − x (cid:107) ≤ K (cid:88) k =1 ν k N (cid:13)(cid:13)(cid:13)(cid:13) (1 + ε scm ) (1 − ε scm ) ˜ x k − x k (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:18) (1 + ε scm ) (1 − ε scm ) − (cid:19) (1 + ε scm ) + ε scm ≤ ε scm , where the final bound is verified numerically for ε scm ≤ / .The ScaleMaintainer interface guarantees that calls to
ScaleMaintainer k . GetNorm return (cid:13)(cid:13) ¯ x k ◦ exp((1 − κ τ k )¯ δ k ) (cid:13)(cid:13) to within a ± ε scm multiplicative factor, and moreover that Get returnsentries from an ε scm -padding of Π ∆ (¯ x k ◦ exp((1 − κ τ k )¯ δ k )) . Thus, applying Lemma 8 with ourdefinition of ˆ x in (41) yields that ˆ x is a ε scm = ε -padding of x . ApproxExpMaintainer initialization and updates
We give the implementation and prove runtimes of
Init , MultSparse , DenseStep , and
UpdateSum .35 nit . Upon initialization of
ApproxExpMaintainer , we set γ K = max j ∈ [ n ] [ x ] j and τ k = 0 forall k . We let S K = [ n ] (so that S k = ∅ for all k < K ) and instantiate a single instance of ScaleMaintainer of rank K with parameters ¯ x K = x γ K , ¯ δ K = v − κ − log x , ε scm = ε , λ scm = min (cid:16) εn , λ (cid:17) . It is clear that the invariant (37) holds at initialization, and that the coordinates of ¯ x K lie in theappropriate range, since we assume that x ∈ [ λ, n . We will use the same choices of ε scm , λ scm forevery ScaleMaintainer instance. The overall complexity of this operation is O ( n log n log ω ) . MultSparse . We state the implementation of
MultSparse , prove that the resulting update is(39), and finally give its runtime analysis. We perform
MultSparse ( g ) in sequence for each nonzerocoordinate of g . Let j denote such nonzero coordinate and let k j be such that j ∈ S k j ; the operationconsists of the following steps.1. Remove j from S k j and delete the corresponding coordinate from ScaleMaintainer k j (via acall to Del ).2. Let S = j and initialize ScaleMaintainer with initial data ¯ x and ¯ δ described below.3. For k going from to K , set S k ← S k ∪ S k − and S k − = ∅ , merging ScaleMaintainer k and ScaleMaintainer k − as described below. If the new set S k satisfies | S k | ≤ k , break the loop;else, proceed to the next k .We now state the initial data given to each ScaleMaintainer upon initialization in the stepsabove. Whenever a
ScaleMaintainer k is created supporting S k ⊆ [ n ] , we first compute δ i for each i ∈ S k according to (37). When creating the singleton instance ScaleMaintainer we perform theupdate δ j ← δ j + g j ; (42)this implements multiplication of the j th coordinate by exp( g j ) . To instantiate ScaleMaintainer k ,we set τ k = 0 , γ k ← max i ∈ S k exp([ δ ] i ) and modify δ according to [ δ ] S k ← max { [ δ ] S k , log ( λ scm · γ k ) } . (43)In other words, we raise very small entries of [ δ ] S k to ensure that the ratio between any two entriesof [exp( δ )] S k is in the range [ λ − scm , λ scm ] . We then give ScaleMaintainer k the initial data ¯ x k = 1 γ k [exp ( δ )] S k , ¯ δ k = (cid:20) v − κ − δ (cid:21) S k . (44)It is clear that entries of ¯ x k are in the range [ λ scm , , and invariant (37) holds at initialization, as τ k = 0 . Therefore, the operation (42) implements x ← Π ∆ ( x ◦ exp( g )) exactly; it remains to showthat the operation (43) amounts to an ε -padding of exp( δ ) / (cid:107) exp( δ ) (cid:107) . We do so by invoking thefollowing lemma, substituting the values of δ before (42) for δ − and δ + respectively, and λ scm ≤ ε/n for ρ . Lemma 9.
Let δ − , δ + ∈ R n satisfy [ δ + ] i = max (cid:26) [ δ − ] i , max j [ δ − ] j + log ρ (cid:27) for all j ∈ [ n ] and ρ ≤ . Then, exp( δ + ) / (cid:107) exp( δ + ) (cid:107) is a ρn -padding of exp( δ − ) / (cid:107) exp( δ − ) (cid:107) . roof. Let x = exp( δ − ) / (cid:107) exp( δ − ) (cid:107) , x (cid:48) = exp( δ + ) / (cid:107) exp( δ + ) (cid:107) , and ˜ x = exp( δ + ) / (cid:107) exp( δ − ) (cid:107) .Clearly x (cid:48) = ˜ x/ (cid:107) ˜ x (cid:107) and ˜ x ≥ x element-wise. Moreover, letting M = max j exp([ δ − ] j ) , we have (cid:107) ˜ x − x (cid:107) = (cid:107) exp( δ + ) − exp( δ − ) (cid:107) (cid:107) exp( δ − ) (cid:107) ≤ ρM |{ i | [ δ + ] i (cid:54) = [ δ − ] i }|(cid:107) exp( δ − ) (cid:107) ≤ ρM · n (cid:107) exp( δ − ) (cid:107) ≤ ρn, establishing the ρn -padding property.Finally, we discuss runtime. Recall that the cost of initializing a ScaleMaintainer with | S | elements is O ( | S | log n log ω ) . So, step 1 of our implementation of MultSparse , i.e., calling
Del once and initializing a rank-1
ScaleMaintainer per nonzero element, costs O ( (cid:107) g (cid:107) log n log ω ) . Wenow discuss costs of merging in step 2. We show these merges cost an amoritized O (log n log ω ) per nonzero coordinate of g , leading to the claimed bound. Specifically, we show the cost of T deletions and initializations due to nonzero entries of MultSparse arguments is O ( T log n log ω ) .Consider the number of times a rank- k set can be created through merges: we claim it is upperbounded by O (cid:0) T / k (cid:1) . It follows that the overall complexity of step 2 is O (cid:32) K (cid:88) k =0 k T k log n log ω (cid:33) = O ( T log n log ω ) . The claimed bound on the number of rank- k merges holds because at least k − deletions (andhence that many MultSparse calls) must occur between consecutive rank- k merges. To see this,for each k , maintain a potential Φ k for the sum of cardinalities of all rank (cid:96) sets for (cid:96) < k . Eachdeletion increases Φ k by at most 1. For an insertion merge to create a rank- k set, Φ k must havebeen at least k − + 2 ; after the merge, it is 0, as in its creation, all rank- (cid:96) sets for (cid:96) < k must havebeen merged. So, there must have been at least k − deletions in between merges. DenseStep . To implement
DenseStep we simply increment τ k ← τ k + 1 for all k ; clearly, thistakes time O (log n ) . We now show that (37) is maintained, i.e., that the resulting update to thevariable δ under DenseStep is (38). Recall that
ScaleMaintainer k is initialized according (44).Clearly, (37) holds at initialization for the set S k , as − κ = 0 . We now show that it continues tohold after any number of DenseStep calls. Let δ be the value of [ δ ] S k when ScaleMaintainer k isinitialized, and let δ τ be the value of [ δ ] S k after τ calls to DenseStep , each performing the update x ← Π ∆ ( x κ ◦ exp( v )) . This is consistent with the update δ τ +1 = κδ τ + [ v ] S k , which requires δ τ = κ τ δ + τ − (cid:88) τ (cid:48) =0 κ τ (cid:48) [ v ] S k = κ τ δ + 1 − κ τ − κ [ v ] S k = log( γ k ¯ x k ) + (1 − κ τ )¯ δ k , where in the final transition we substituted δ = log( γ k ¯ x k ) and [ v ] S k = (1 − κ )[¯ δ k + δ ] accordingto (44). We see that the required of form of δ τ is identical to its definition (36) and consequentlythat (37) holds. UpdateSum . We maintain the running sum s via the invariant [ s ] S k = [ u ] S k + ScaleMaintainer k . GetSum () , ∀ k ∈ [ K ] , (45)which we preserve in two separate procedures. First, whenever UpdateSum () is called, we computethe quantity Γ defined in (40), and for each k ∈ [ K ] call ScaleMaintainer k . UpdateSum (cid:18) γ k ScaleMaintainer k . GetNorm (1 − κ τ k )Γ (cid:19) .
It is straightforward to see that this indeed preserves the invariant (45) for our definition of x̂ in (41), and takes time O(log n log ω). Next, whenever a coordinate is deleted from a ScaleMaintainer_k instance, or an entire ScaleMaintainer_k instance is deleted due to a merge operation, we update

u_j ← u_j + ScaleMaintainer_k.GetSum(j)

for every deleted coordinate j, or every j involved in the merge, respectively. We charge the cost of these operations to that of the new ScaleMaintainer instances, which we accounted for in the analysis of
MultSparse.

Get(j). Recalling our definition of x̂ (41), we compute Γ in time O(log n log ω) by obtaining GetNorm(1 − κ^{τ_k}) for each k, and then call Get(j, 1 − κ^{τ_k}) in time O(log ω) for the relevant k.

GetSum(j). Recalling our definition of s (45), we implement GetSum(j) in O(log ω) time via a single call to GetSum on the relevant
ScaleMaintainer instance, and querying a coordinate of u.

Sample. Recalling (41), we first compute Γ, as well as all γ_k · ScaleMaintainer_k.GetNorm(1 − κ^{τ_k}), in O(log n log ω) time. We then sample an instance ScaleMaintainer_k, for 1 ≤ k ≤ K, proportional to the value γ_k · ScaleMaintainer_k.GetNorm(1 − κ^{τ_k}), in O(log n) time. Finally, for the sampled instance, we call ScaleMaintainer_k.Sample(1 − κ^{τ_k}) to output a coordinate in O(log n log ω) time. By the definition of x̂ used by ApproxExpMaintainer, as well as the definitions used by each
ScaleMaintainer_k instance, it is clear this preserves the correct sampling probabilities.

5.3 ScaleMaintainer
Finally, we provide a self-contained treatment of
ScaleMaintainer , the main building block in theimplementation of
ApproxExpMaintainer described above.
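Before the formal restatement, it may help to see written out naively the quantities that ScaleMaintainer only needs to approximate. The short Python reference below is our own illustration (not part of the paper's pseudocode); it answers the norm and coordinate queries exactly in O(n) time per call, using the convention that projecting a nonnegative vector onto the simplex amounts to ℓ1 normalization.

    import numpy as np

    def exact_norm(x_bar, delta_bar, sigma):
        # The quantity GetNorm(sigma) approximates: || x_bar * exp(sigma * delta_bar) ||_1.
        # This naive version can overflow or underflow for extreme delta_bar, which is
        # precisely the numerical-range issue the data structure is designed to avoid.
        return float(np.sum(x_bar * np.exp(sigma * delta_bar)))

    def exact_xhat(x_bar, delta_bar, sigma):
        # The normalized vector of which x_hat[sigma] is an eps-padding:
        # Pi_Delta(x_bar * exp(sigma * delta_bar)), with the max exponent factored out.
        w = x_bar * np.exp(sigma * delta_bar - np.max(sigma * delta_bar))
        return w / float(np.sum(w))

Get(j, σ), GetSum(j), and Sample(σ) are then coordinate queries, accumulated weighted sums, and sampling with respect to this vector; the interface restated next lists the runtimes the actual data structure achieves for each operation.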
For ease of reference we restate the interface of
ScaleMaintainer, where for the sake of brevity we drop the subscript scm from ε and λ, and use n rather than n′ to denote the input dimension. Recall that for the vectors x̄ and δ̄ given at initialization, the data structure keeps track of vectors of the form

x̂[σ] := an ε-padding of Π_∆( x̄ ∘ exp(σδ̄) ), (46)

where σ is any scalar in the range {0} ∪ [σ_min, 1]. The implementation of the data structure relies on three internal parameters: polynomial approximation order p ∈ N, truncation threshold R ≥ 0, and σ discretization level K ∈ N. To satisfy the accuracy requirements we set these as

R = Θ(1) · log(1/(ελ)), p = Θ(1) · log(1/(ελ)), and K = ⌈log(1/σ_min)⌉;

we give the runtime analysis in terms of these parameters. We now outline our design of
ScaleMaintainer , where the main challenge is supporting efficient
GetNorm operations under no assumptions on the numerical range of the input δ̄.

Category     Function                                                                 Runtime
initialize   Init(x̄, δ̄, σ_min, ε, λ): require x̄ ∈ [λ, 1]^n                           O(npK log n)
update       Del(j): Remove coordinate j from x̄, δ̄                                   O(1)
             UpdateSum(γ, σ): s ← s + γ · x̂[σ]                                        O(p)
query        Get(j, σ): Return [x̂[σ]]_j                                               O(p)
             GetSum(j): Return [s]_j                                                   O(pK)
             GetNorm(σ): Return a ±ε approximation of ‖x̄ ∘ exp(σδ̄)‖₁                 O(p)
sample       Sample(σ): Return j with probability [x̂[σ]]_j                            O(p log n)

Exponential approximation via Taylor expansion.
Our main strategy is to replace the exponential in the definition of x̂[σ] with its Taylor expansion of order p = O(log(n/(ελ))), giving the following approximation to GetNorm(σ):

‖x̄ ∘ exp(σδ̄)‖₁ ≈ ⟨ x̄ , Σ_{q=0}^{p} (1/q!) (σδ̄)^q ⟩ = Σ_{q=0}^{p} (σ^q/q!) ⟨ x̄ , δ̄^q ⟩,

where qth powers are applied to δ̄ elementwise. By pre-computing all the inner products {⟨x̄, δ̄^q⟩}_{q=0}^{p} at initialization, we may evaluate this Taylor approximation of GetNorm in time O(p). The validity of the approximation relies on the following well-known fact.

Fact 1 (Theorem 4.1 in [35]). Let ε′, R ≥ 0. A Taylor series f_p(t) = Σ_{q=0}^{p} t^q/q! of degree p = O(R + log(1/ε′)) satisfies |exp(t) − f_p(t)| ≤ exp(t) · ε′ for all t ∈ [−R, 0].

Truncating small coordinates and σ discretization. For Fact 1 to directly imply the desired approximation guarantee for
GetNorm , the entries of σ ¯ δ must all lie in [ − R, for some R = (cid:101) O (1) .However, this will not hold in general, as our data structure must support any value of ¯ δ . For afixed value of σ , we can work instead with a shifted and truncated version of ¯ δ , i.e., ˜ δ [ σ, µ ] := max { ¯ δ − µ, − R/σ } , where the offset µ is roughly the maximum element of ¯ δ . Fact 1 allows us to approximate theexponential of σ ˜ δ [ σ, µ ] , and for R = Θ(log( nελ )) we argue that the truncation of the smallest entriesof δ results in small multiplicative error. Unfortunately, the dependence of ˜ δ [ σ, µ ] on σ would defeatthe purpose of efficient computation, because it is impossible to precompute { (cid:10) ¯ x, ˜ δ [ σ, µ ] q (cid:11) } pq =0 forevery σ ∈ [ σ min , . To address this, we argue that truncation of the form ˜ δ [ˆ σ, µ ] is accurate enoughfor any σ ∈ [ˆ σ/ , ˆ σ ] . Therefore, it suffices to to discretize [ σ min , into K = (cid:100) log σ min (cid:101) levels ˆ σ k := 2 k − σ min and precompute (cid:10) ¯ x, ˜ δ [ˆ σ k , µ ] q (cid:11) for every k ∈ [ K ] in q ≤ p . This allows us to compute GetNorm ( σ ) in O ( p ) = (cid:101) O (1) time, with O ( npK ) = (cid:101) O ( n ) preprocessing time.39 upporting deletion via lazy offset selection. Had the dataset not supported deletions, wecould have simply set µ to be the largest entry of ¯ δ (independent of k ). However, with deletionsthe largest entry of ¯ δ could change, potentially invalidating the truncation. To address this, wemaintain a different threshold µ k for every k ∈ [ K ] , and argue that the approximation remains validif the invariant ¯ δ max ≤ µ k ≤ ¯ δ max + R σ k for every k ∈ [ K ] (47)holds, where ¯ δ max := max j ¯ δ j . Writing ˜ δ [ k ] := ˜ δ [ˆ σ k , µ k ] = max (cid:26) ¯ δ − µ k , − R ˆ σ k (cid:27) for every k ∈ [ K ] , (48)the data structure only needs to maintain µ k and (cid:10) ¯ x, ˜ δ [ k ] q (cid:11) for every k ∈ [ K ] in q ≤ p .When deleting coordinate j , for every k we test whether the invariant (47) remains valid. Ifit does, we keep µ k the same and implement deletion (for this value of k ) in time O ( p ) = (cid:101) O (1) by subtracting [¯ x ] j [˜ δ [ k ]] qj from (cid:10) ¯ x, ˜ δ [ k ] q (cid:11) for every q ≤ p . If the invariant is no longer valid, wereset µ k to the new value of ¯ δ max and recompute (cid:10) ¯ x, ˜ δ [ k ] q (cid:11) for every q ≤ p . Note that the re-computation time is proportional to the number of un-truncated coordinates in the newly defined ˜ δ [ k ] . The key observation here is that every re-computation decreases µ k by at least R/ (2ˆ σ k ) andso no element of ¯ δ can remain un-truncated for more than two re-computation. Therefore, the costof recomputing inner products due to deletions, for the entire lifetime of the data structure, is atmost O ( npK ) = (cid:101) O ( n ) , which we charge to the cost of initialization. Explicit expression for ˆ x [ σ ] . Following the preceding discussion, for any σ ≥ σ min we set k (cid:63) = (cid:24) log σσ min (cid:25) , so that σ ∈ (cid:20) ˆ σ k (cid:63) , ˆ σ k (cid:63) (cid:21) , and define Z [ σ ] := e σµ k(cid:63) p (cid:88) q =0 σ q q ! (cid:68) ¯ x, ˜ δ [ k (cid:63) ] q (cid:69) ≈ (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) (49) ˆ x [ σ ] := e σµ k(cid:63) Z [ σ ] p (cid:88) q =0 σ q q ! ¯ x ◦ ˜ δ [ k (cid:63) ] q ≈ Π ∆ (cid:0) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:1) , (50)with ˜ δ as defined in (48). We now prove that the approximation guarantees of
ScaleMaintainer hold.
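To make the preceding outline concrete, the following Python sketch (ours, with simplified indexing conventions and illustrative constants, so an aid to the analysis below rather than a faithful implementation) shows how the precomputed moments ⟨x̄, δ̃[k]^q⟩ yield O(p)-time GetNorm and Get queries in the spirit of (49) and (50); deletions, running sums, and sampling are omitted.

    import math
    import numpy as np

    class ScaleMaintainerSketch:
        def __init__(self, x_bar, delta_bar, sigma_min, p=40, R=15.0):
            # p and R are illustrative; the analysis takes them Theta(log(n / (eps * lambda))).
            self.x_bar = np.asarray(x_bar, dtype=float)
            self.delta_bar = np.asarray(delta_bar, dtype=float)
            self.p, self.R, self.sigma_min = p, R, float(sigma_min)
            self.K = max(0, math.ceil(math.log2(1.0 / sigma_min)))
            self.sigma_hat = [2.0 ** k * sigma_min for k in range(self.K + 1)]
            self.mu = [float(np.max(self.delta_bar))] * (self.K + 1)  # lazily updated offsets
            # Moments A[k][q] = <x_bar, delta_tilde[k]**q> of the shifted, truncated vector
            # delta_tilde[k] = max(delta_bar - mu_k, -R / sigma_hat_k); O(n p K) preprocessing.
            self.A = []
            for k in range(self.K + 1):
                d = np.maximum(self.delta_bar - self.mu[k], -self.R / self.sigma_hat[k])
                self.A.append([float(np.dot(self.x_bar, d ** q)) for q in range(self.p + 1)])

        def _level(self, sigma):
            # Level k* whose truncation is valid: sigma_hat[k*]/2 < sigma <= sigma_hat[k*].
            if sigma <= self.sigma_min:
                return 0
            return min(self.K, math.ceil(math.log2(sigma / self.sigma_min)))

        def get_norm(self, sigma):
            # Z[sigma] = exp(sigma * mu_{k*}) * sum_q (sigma**q / q!) * A[k*][q], as in (49).
            # (For very large mu_{k*} one would work with log Z; we keep the plain form.)
            k = self._level(sigma)
            taylor = sum(sigma ** q / math.factorial(q) * self.A[k][q] for q in range(self.p + 1))
            return math.exp(sigma * self.mu[k]) * taylor

        def get(self, j, sigma):
            # [x_hat[sigma]]_j as in (50); the exp(sigma * mu_{k*}) factors cancel against Z[sigma].
            k = self._level(sigma)
            dj = max(self.delta_bar[j] - self.mu[k], -self.R / self.sigma_hat[k])
            num = sum(sigma ** q / math.factorial(q) * self.x_bar[j] * dj ** q
                      for q in range(self.p + 1))
            den = sum(sigma ** q / math.factorial(q) * self.A[k][q] for q in range(self.p + 1))
            return num / den

Supporting Del while keeping each µ_k within the window (47), maintaining the running sum, and sampling in O(p log n) time via a binary tree over these moments are what the implementation described below adds on top of this skeleton.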
Proposition 5.
There exist R = O (1) · log nελ and p = O (1) · log nελ such that for all σ ∈ [ σ min , , ifthe invariant (47) holds we have that Z [ σ ] is an ε multiplicative approximation of (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) and ˆ x [ σ ] is a ε -padding of Π ∆ (cid:0) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:1) , with Z [ σ ] and ˆ x [ σ ] defined in Eq.s (49) and (50) respectively. We can query the maximum entry of ¯ δ under deletions in O (1) time via a standard data structure, e.g. adoubly-linked list of the sorted entries of ¯ δ . roof. To simplify notation, we write µ = µ k (cid:63) and ˆ σ = ˆ σ k (cid:63) . We begin by noting that the inequali-ties (47) and σ ≤ ˆ σ imply that σ ˜ δ i [ k (cid:63) ] ∈ [ − R, for every i ∈ [ n ] and we may therefore apply Fact 1to obtain p (cid:88) q =0 σ q q ! ¯ x j ˜ δ i [ k (cid:63) ] q ≥ (1 − ε (cid:48) )¯ x i exp( σ ˜ δ i [ k ]) ≥ (1 − ε (cid:48) ) e − σµ ¯ x i exp( σ ¯ δ i [ k ]) (51)for every i ∈ [ n ] . Therefore, we have Z [ σ ] ≥ (1 − ε (cid:48) ) (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) . (52)Similarly, we have p (cid:88) q =0 σ q q ! ¯ x j ˜ δ i [ k (cid:63) ] q ≤ (1 + ε (cid:48) )¯ x i exp( σ ˜ δ i [ k ]) ≤ (1 + ε (cid:48) ) e − σµ ¯ x i (exp( σ ¯ δ i [ k ]) + exp( − σR/ ˆ σ )) Note that the condition (47) also implies that ˜ δ j [ k (cid:63) ] ≥ − R/ (2ˆ σ ) for some j ∈ [ n ] (namely themaximal element of ¯ δ ). Using also ¯ x j ≥ λ , we have e σµ exp( − σR/ ˆ σ ) ≤ exp( − σR/ (2ˆ σ )) ¯ x j λ exp( σ ¯ δ j ) . Taking R ≥ nλε (cid:48) and recalling that σ ≥ ˆ σ/ , we have exp( − σR/ (2ˆ σ )) ≤ λε (cid:48) / (2 n ) and conse-quently e σµ exp( − σR/ ˆ σ ) ≤ ε (cid:48) ¯ x j exp( σ ¯ δ j ) ≤ ε (cid:48) n (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) . Substituting back and using ¯ x i ≤ and ε (cid:48) < gives e σµ p (cid:88) q =0 σ q q ! ¯ x j ˜ δ i [ k (cid:63) ] q ≤ (1 + ε (cid:48) )¯ x i exp( σ ¯ δ i ) + ε (cid:48) n (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) . (53)Summing over i ∈ [ n ] , we obtain Z [ σ ] ≤ (1 + 2 ε (cid:48) ) (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) (54)Therefore, Z [ σ ] is a ε (cid:48) -multiplicative approximation of (cid:13)(cid:13) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:13)(cid:13) .It remains to show that ˆ x [ σ ] is a ε -padding of x [ σ ] := Π ∆ (cid:0) ¯ x ◦ exp (cid:0) σ ¯ δ (cid:1)(cid:1) . First, if we define ˜ x = ε (cid:48) − ε (cid:48) ˆ x [ σ ] then the bounds (51) and (54) imply that ˜ x ≥ x [ σ ] elementwise. Also, the bounds (52)and (53) imply that ˆ x i [ σ ] − x i [ σ ] ≤ (1 + ε (cid:48) ) x i [ σ ] + ε (cid:48) /n − ε (cid:48) for every i ∈ [ n ] . Therefore, for ε (cid:48) < / , (cid:107) ˜ x − x [ σ ] (cid:107) ≤ (cid:18) ε (cid:48) − ε (cid:48) (cid:19) − ≤ ε (cid:48) , so that ˆ x [ σ ] is a ε (cid:48) padding of x [ σ ] . Taking ε (cid:48) = ε/ concludes the proof.41 .3.4 Implementation: data structure state and initialization Besides storing ¯ x and ¯ δ , the data structure maintains the following fields.1. An offset µ k ∈ R for every k ≤ K = (cid:108) log σ min (cid:109) , initialized as µ k = max j [¯ δ ] j for all k .2. A balanced binary tree with n leaves. For node v in the tree, k ∈ [ K ] and q ∈ { , . . . , p } , westore A v [ k, q ] := (cid:68) ¯ x, ˜ δ [ k ] q (cid:69) S v , where ˜ δ [ k ] = max { ¯ δ − µ k , − R/ ˆ σ k } as before, the set S v contains the leaves in the subtreerooted in v , and (cid:104) a, b (cid:105) S := (cid:80) i ∈ S a i b i . 
When referring to the root of the tree we omit the subscript, i.e., we write A[k, q] := ⟨ x̄ , δ̃[k]^q ⟩.
3. A vector u ∈ R n and coefficients c k,q ∈ R for every k ∈ [ K ] and q ∈ { , . . . , p } , for maintainingthe running sum. We initialize them all to be . The running sum obeys the followinginvariant: s = u + K (cid:88) k =1 p (cid:88) q =0 c k,q q ! ¯ x ◦ ˜ δ [ k ] q . (55)4. A doubly linked list of the sorted entries of ¯ δ , with a pointer to the maximal element of ¯ δ aswell as pointers to the largest element smaller than µ k − R/ ˆ σ k for every k ∈ [ K ] .Initializing the data structure for maintaining the maximum element takes time O ( n log n ) dueto the need to sort ¯ δ . With it, initializing µ k is trivial and so is the initialization of u and c q,k .Initializing the data stored in the binary tree takes time O ( npK ) , since for every value k and q andinternal node v with children v (cid:48) , v (cid:48)(cid:48) we can recursively compute A v [ k, q ] as A v (cid:48) [ k, q ] + A v (cid:48)(cid:48) [ k, q ] . Wewill also charge some additional deletion costs to the initialization runtime, resulting in the overallcomplexity O ( npK log n ) . GetNorm ( σ ) . We compute k (cid:63) = (cid:108) log σσ min (cid:109) and return Z [ σ ] = e σk (cid:63) (cid:80) pq =0 σ q q ! A [ k (cid:63) , q ] . Clearly, thistakes O ( p ) time and Proposition 5 provides the claimed approximation guarantee. Get ( j, σ ) . We compute Z [ σ ] and k (cid:63) as described above and return e σµk(cid:63) Z [ σ ] (cid:80) pq =0 σ q q ! ¯ x j ˜ δ j [ k (cid:63) ] q in ac-cordance with the form (50) of ˆ x [ σ ] . Again, this takes O ( p ) time and Proposition 5 provides theclaimed approximation guarantee. GetSum ( j ) . Recalling the invariant (55), we return u j + (cid:80) Kk =1 (cid:80) pq =0 c k,q q ! ¯ x j ˜ δ j [ k ] q in time O ( pK ) . Sample ( σ ) . We perform a random walk from the root of our binary tree data structure to a leaf.A each internal node v with children v (cid:48) and v (cid:48)(cid:48) , we select node v (cid:48) with probability (cid:104) , ˆ x [ σ ] (cid:105) S v (cid:48) (cid:104) , ˆ x [ σ ] (cid:105) S v = (cid:80) pq =0 σ q q ! A v (cid:48) [ k (cid:63) , q ] (cid:80) pq =0 σ q q ! A v [ k (cid:63) , q ] , v (cid:48)(cid:48) , where k (cid:63) = (cid:108) log σσ min (cid:109) . We return the index associated with the leaf inwhich we end the walk; the probability of returning index j is exactly [ˆ x [ σ ] j . Each step in the walktakes time O ( p ) and there are O (log n ) steps, so the total time is O ( p log n ) . UpdateSum ( σ ) . Recalling the invariant (55) and the form (50) of ˆ x [ σ ] , we compute k (cid:63) and Z [ σ ] asin the GetNorm implementation, and update update c k (cid:63) ,q ← c k (cid:63) ,q + e σµ k(cid:63) σ q Z [ σ ] for every q ∈ { , . . . , p } . This takes time O ( p ) . Del ( j ) . We set [¯ δ j ] ← −∞ , remove the element corresponding to index j from the doubly linkedlist, and perform the following operations for each k ∈ [ K ] separately. First, we check if the newmaximum element of ¯ δ is a least µ k − R/ (2ˆ σ k ) . If it is, we leave µ k unchanged and we simply update A v [ k, q ] ← A v [ k, q ] − ¯ x j ˜ δ j [ k ] q for every q ≤ p and node v on the path from the root to the leaf corresponding to index j . Sincethe length of the path is O (log n ) , this update takes time O ( p log n ) .Otherwise, the new maximum element is less than µ k − R/ (2ˆ σ k ) , and we must change µ k inorder to maintain the invariant (47). 
Let µ new k be the new maximum element of ¯ δ , and let U k = (cid:26) i (cid:12)(cid:12)(cid:12)(cid:12) [¯ δ ] i ≥ µ new k + R ˆ σ k (cid:27) be the new set of un-truncated indices. (We find the elements in this set when we update the pointerto the first element smaller than µ k − R/ ˆ σ k ). We recompute A v [ k, q ] = (cid:68) ¯ x, ˜ δ [ k ] q (cid:69) S v for every q ≤ p and every node v with a child in U k . Performing the computation recursively from leaf to root, thistake at most O ( | U k | p log n ) time. To maintain the invariant (55) as the definition of ˜ δ [ k ] changes,we update u j ← u j + p (cid:88) q =0 c k,q q ! ¯ x j (cid:0) [¯ δ j − µ new k ] q − [max (cid:8) ¯ δ j − µ k , − R/ ˆ σ k (cid:9) ] q (cid:1) for every j ∈ U k ; this update takes O ( | U k | p ) time. Finally, we update µ k ← µ new k .Summing over k ∈ [ K ] , deletion operations of the first kind (with µ k unchanged) take at most O ( Kp log n ) time per call to Del . Operations of the second kind (with µ k decreased) take time O ( N p log n ) throughout the data structure lifetime, where N = (cid:80) t ≥ (cid:80) Kk =1 | U ( t ) k | and for each k ∈ [ K ] we write U (1) k , U (2) k , . . . to denote the different sets U k generated by all calls to Del . Foreach k , if µ k is decreased at all then it must decrease by at least R/ (2ˆ σ k ) . Therefore, by definitionof U k , an index j can belong to U ( t ) k for at most 2 values of t . Consequently, we have N = O ( nK ) .Therefore, deletion operations of the second kind contribute at most O ( nKp log n ) to the totalruntime, which we charge to initialization. 43 Applications
In this section, we leverage the techniques of this paper to obtain improved runtimes for solvingcertain structured optimization problems.In Sections 6.1 and 6.2, we use a variant of our variance-reduced coordinate method in the (cid:96) - (cid:96) setup to obtain algorithms for solving the maximum inscribed ball (Max-IB) and minimumenclosing ball (Min-EB) problems. Our algorithms improve upon the runtimes of those in Allen-Zhu et al. [2] by a factor depending on the sparsity of the matrix. This improvement stems from apreprocessing step in [2] where the input is randomly rotated to improve a norm dependence of thealgorithm. Our methods avoid this preprocessing and obtain runtimes dependent on the both thesparsity and numerical sparsity of the data, providing universal improvements in the sparse regime,in the non-degenerate case where the span of the points is full-rank.In Section 6.3, we use the results of our variance-reduced algorithm in the (cid:96) - (cid:96) setup (cf.Section D.2) to obtain improved regression algorithms for a variety of data matrices, includingwhen the matrix is numerically sparse or entrywise nonnegative.Our methods in this section rely on an extension of the outer loop of this paper (Algorithm 2)for strongly monotone minimax problems , developed in our previous work [8]. Specifically, for aseparable regularizer r ( x, y ) = r x ( x ) + r y ( y ) on a joint space for any of our setups, consider thefollowing composite bilinear minimax problem: min x ∈X max y ∈Y f ( x, y ) := y (cid:62) Ax + µ x φ ( x ) − µ y ψ ( y ) , where φ = V x x (cid:48) , ψ = V y y (cid:48) . (56)We call such problem a ( µ x , µ y ) -strongly monotone problem; this is a special case of a gen-eralization of the notion of strong convexity, in the case of convex minimization. For generalstrongly-monotone problems, Carmon et al. [8] provided a variant of Algorithm 2 with the followingguarantee. Proposition 6 (Proposition 5, Carmon et al. [8]) . For problem (56) , denote µ := √ µ x µ y and ρ := (cid:112) µ x /µ y . Let O be an ( α , ε )-relaxed proximal oracle for operator g ( x, y ) := ( ∇ x f ( x, y ) , −∇ y f ( x, y )) ,let Θ be the range of r , and let (cid:107) ( ∇ x f ( z ) , −∇ y f ( z (cid:48) )) (cid:107) ∗ ≤ G , for all z, z (cid:48) ∈ Z . Let z K be the outputof K iterations of OuterLoopStronglyMonotone , Algorithm 7 of Carmon et al. [8]. Then E Gap( z K ) ≤ √ G (cid:118)(cid:117)(cid:117)(cid:116)(cid:32)(cid:18) αµ + α (cid:19) K (cid:18) ρ + 1 ρ (cid:19) Θ + εµ (cid:33) . Each iteration k ∈ [ K ] consists of one call to O , producing a point z k − / , and one step of the form z k ← (cid:110)(cid:10) g ( z k − / ) , z (cid:11) + α ˆ V z k − ( z ) + µ ˆ V z k − / ( z ) (cid:111) , (57) where ˆ V := ρV x + ρ − V y . In particular, by setting ε = µ(cid:15) G , using K = (cid:101) O ( α/µ ) iterations, we have the guarantee E Gap( z K ) ≤ (cid:15) . The ( α , ε )-relaxed proximal oracle works similarly as in Algorithm 3 except for the additionalcomposite terms. For completeness we include the algorithm with its theoretical guarantees andimplementation in Section E.2 (see Algorithm 4, Proposition 1 and Section E.2.2).44n all of our applications discussed in this section, the cost of each step (57) is O ( nnz ) , stemmingfrom the computation of g ( x, y ) . The resulting algorithms therefore have runtime ˜ O (cid:18) ( nnz + ( cost of implementing O )) · αµ (cid:19) . 
6.1 Maximum inscribed ball
In the maximum inscribed ball (Max-IB) problem, we are given a polyhedron P ⊂ R n defined by m halfspaces { H i } i ∈ [ m ] , each characterized by a linear constraint H i = { x ∈ R n : (cid:104) a i , x (cid:105) + b i ≥ } ,i.e. P = ∩ i ∈ [ n ] H i . The goal is to (approximately) find a point x ∗ ∈ P that maximizes the smallestdistance to any of the bounding hyperplanes H i , i.e. x ∗ ∈ arg max x ∈ P min i ∈ [ n ] (cid:104) a i , x (cid:105) + b i (cid:107) a i (cid:107) . More formally, if the optimal radius of the maximum inscribed ball is r ∗ , the goal is to find an (cid:15) -accurate solution, i.e. a point in P which has minimum distance to all bounding hyperplanes atleast (1 − (cid:15) ) r ∗ .Given halfspace information A , b where A i : = a i for all i ∈ [ m ] , the polytope is defined by P = { x | Ax + b ≥ } . We use the following notation in this section: B := (cid:107) b (cid:107) ∞ , r ∗ is the valueof the maximum inscribed ball problem, R is the radius of the minimum enclosing ball, which isdefined as the Euclidean ball containing P with smallest radius possible, x ∗ is the center of themaximum inscribed ball, and ρ is an upper bound on the aspect ratio R/r ∗ . As in Allen-Zhu et al.[2], we will make the following assumptions:1. The polytope is bounded, and thus m ≥ n . This is without loss of generality since when thepolytope is unbounded, the aspect ratio ρ = ∞ , and our runtime result holds trivially.2. (cid:107) A i : (cid:107) = 1 for all i ∈ [ m ] , so (cid:107) A (cid:107) →∞ = 1 , by properly scaling A (one can consider the trivialcase when for some i , a i = 0 separately).3. The origin is inside polytope P , i.e. O ∈ P , by properly shifting P .We also define the following constant (see Appendix D.3) in this section with respect to therescaled matrix A , L , co := min (cid:110) L , , (1) co , L , , (2) co , L , , (3) co (cid:111) ≤ √ rcs · L , rc ≤ √ rcs , given the definitions of L , , (1) co , L , , (2) co , L , , (3) co as in (94), (95), and (96), and the second assumptionabove (namely, that L , rc = max i ∈ [ m ] (cid:107) A i : (cid:107) = 1 ).Allen-Zhu et al. [2] show that solving Max-IB is equivalent to the following minimax problem: r ∗ := max x ∈ R n min y ∈ ∆ m f ( x, y ) := y (cid:62) Ax + y (cid:62) b, (58)and moreover, to solve the problem to (cid:15) -multiplicative accuracy, it suffices to find x ∗ (cid:15) that solves theminimax problem to (cid:15) -multiplicative accuracy in terms of the one-sided gap of the x block, i.e. min y ∈ ∆ m f ( x ∗ (cid:15) , y ) ≥ (1 − (cid:15) ) f ( x ∗ , y ∗ ) , where ( x ∗ , y ∗ ) is the optimal saddle point of problem (58). We first state several bounds on theparameters of the problem from Allen-Zhu et al. [2].45 act 2 (Geometric properties of Max-IB) . We have (cid:107) x ∗ (cid:107) ≤ R , and r ∗ = max x ∈ R n min y ∈ ∆ m f ( x, y ) := y (cid:62) Ax + y (cid:62) b ≤ B ≤ R. These facts imply that we can instead consider the constrained minimax problem (where weoverload our definition of f for the rest of the section): r ∗ := max x ∈ B n min y ∈ ∆ m f ( x, y ) = y (cid:62) ˜ Ax + y (cid:62) b, where ˜ A = 2 R · A. (59)We first use a “warm start” procedure to find a constant multiplicative estimate of r ∗ , which usesthe strongly monotone algorithm OuterLoopStronglyMonotone of Carmon et al. 
[8] together withAlgorithm 4 of Section E.2 as a relaxed proximal oracle on the ( µ, µ ) -strongly monotone problem max x ∈ B n min y ∈ ∆ m f µ ( x, y ) := y (cid:62) ˜ Ax + y (cid:62) b + µ (cid:88) i ∈ [ m ] [ y ] i log[ y ] i − µ (cid:107) x (cid:107) , and a line search over parameter µ . The following lemma is an immediate consequence of Proposi-tion 6 and Corollary 1, whose proof we defer to Appendix F.1. Lemma 10.
We can spend (cid:101) O (cid:16) nnz + ρ √ nnz · L , co (cid:17) time preprocessing to obtain a -multiplicativeapproximation ˆ r of r ∗ , i.e. ˆ r ≤ r ∗ ≤ ˆ r. Finally, we use our variance-reduced coordinate algorithm, namely Algorithm 4 as a relaxed prox-imal oracle in
OuterLoopStronglyMonotone together with Proposition 6 once more to solve (59) to the desired accuracy. The implementation in Section E.2.2 and complexity results in Section D.3 yield the runtime. This implementation crucially uses our development of the
ApproxExpMaintainer data structure in order to obtain a runtime depending directly on rcs rather than the dimensions of the matrix, as well as independence of B. For completeness, a proof can be found in Appendix F.1.

Theorem 3.
The algorithm of Section D.3 can be used to find an (cid:15) -accurate solution x ∗ (cid:15) to Max-IBsatisfying min y ∈ ∆ m f ( x ∗ (cid:15) , y ) ≥ (1 − (cid:15) ) r ∗ with high probability in time (cid:101) O (cid:32) nnz + ρ √ nnz · L , co (cid:15) (cid:33) = (cid:101) O (cid:18) nnz + ρ √ nnz · rcs (cid:15) (cid:19) . Remark 5.
Because we assumed m ≥ n , in the case A is dense, up to logarithmic terms our runtimeimproves upon the runtime of (cid:101) O ( ρm √ n/(cid:15) ) in Allen-Zhu et al. [2] by a factor of at least (cid:114) mn nnz · m rcs generically. This is an improvement when A is sparse or column-sparse, i.e. nnz (cid:28) mn , or rcs (cid:28) m .Such a saving is larger when A has numerical sparsity so that e.g. (cid:16) L , co (cid:17) ≤ max i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:0) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:1) (cid:0) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:1) < rcs · max i ∈ [ m ] (cid:107) A i (cid:107) . Here (cid:101) O is hiding an additional factor of polylog ( (cid:107) b (cid:107) ∞ ) due to the additional cost in the runtime of ApproxExpMaintainer , caused by the linear term b (see Remark 2). .2 Minimum enclosing ball In the minimum enclosing ball (Min-EB) problem, we are given a set of data points { a , . . . , a m } with a = 0 , max i ∈ [ m ] (cid:107) a i (cid:107) = 1 . The goal is to find the minimum radius R ∗ such that there existsa point x with distance at most R ∗ to all points. Following the presentation of Allen-Zhu et al.[2], we consider Min-EB in an equivalent form. Define the vector b to have b i = (cid:107) a i (cid:107) entrywise.Then, Min-EB is equivalent to the minimax problem R ∗ := min x ∈ R n max y ∈ ∆ m (cid:88) i y i (cid:107) x − a i (cid:107) = min x ∈ R n max y ∈ ∆ m f ( x, y ) , where f ( x, y ) := y (cid:62) Ax + y (cid:62) b + 12 (cid:107) x (cid:107) . (60)By assumption, (cid:107) A (cid:107) →∞ = 1 . We let ( x ∗ , y ∗ ) be the optimal solution to the saddle point problem.We first state several bounds on the quantities of the problem. These bounds were derived inAllen-Zhu et al. [2] and obtained by examining the geometric properties of the problem. Fact 3.
The following bounds hold: (cid:107) x ∗ (cid:107) ≤ , and R ∗ ≥ / . To achieve a multiplicative approximation, since R ∗ ≥ / by Fact 3, it suffices to obtain a pair ( x ∗ (cid:15) , y ∗ (cid:15) ) achieving max y f ( x ∗ (cid:15) , y ) − min x f ( x, y ∗ (cid:15) ) ≤ (cid:15)/ . In light of minimax optimality, Lemma 11(proved in Section F.2) shows that it suffices to consider, for (cid:15) (cid:48) = Θ( (cid:15)/ log m ) , solving the following (1 , (cid:15) (cid:48) ) -strongly monotone problem to sufficient accuracy: min x ∈ R n max y ∈ ∆ m f (cid:15) (cid:48) ( x, y ) := y (cid:62) Ax + y (cid:62) b − (cid:15) (cid:48) (cid:88) i ∈ [ m ] [ y ] i log[ y ] i + 12 (cid:107) x (cid:107) . (61) Lemma 11.
Setting (cid:15) (cid:48) = (cid:15)/ (32 log m ) , an (cid:15)/ -accurate solution or (61) is an (cid:15)/ -accurate solutionto the original problem (60). As an immediate result of the above lemma, the runtime in Section D.3 and the correctnessproofs of Proposition 6 and Corollary 1, we obtain the following guarantee.
Theorem 4.
The strongly monotone algorithm
OuterLoopStronglyMonotone of Carmon et al. [8],using Algorithm 4 of Section E.2 and the estimator of Section D.3 as a relaxed proximal oracle, findsan (cid:15) -accurate solution x ∗ (cid:15) to Min-EB satisfying R ∗ ≤ max y f ( x ∗ (cid:15) , y ) ≤ (1 + (cid:15) ) R ∗ with high probabilityin time (cid:101) O (cid:32) nnz + √ nnz · L , co √ (cid:15) (cid:33) = (cid:101) O (cid:18) nnz + √ nnz · rcs √ (cid:15) (cid:19) . Remark 6.
When m ≥ n , up to logarithmic terms our runtime improves the (cid:101) O ( m √ n/ √ (cid:15) ) runtimeof Allen-Zhu et al. [2] by a factor of (cid:114) mn nnz · m rcs generically. This is an improvement when A is sparse or column-sparse, i.e. nnz (cid:28) mn , or rcs (cid:28) m .As in Section 6.1, the improvement is larger when A is numerically sparse, i.e. when (cid:16) L , co (cid:17) ≤ max i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:0) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:1) (cid:0) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:1) < rcs · max i ∈ [ m ] (cid:107) A i (cid:107) . This can be assumed without loss of generality by shifting and rescaling as in Allen-Zhu et al. [2] and consideringthe trivial case when all a i , i ∈ [ m ] are equal. When m < n , the runtime of the algorithm in Allen-Zhu et al. [2] still holds and is sometimes faster than ours. .3 Regression We consider the standard (cid:96) linear regression problem in a data matrix A ∈ R m × n and vector b ∈ R m , i.e. min x ∈ R n (cid:107) Ax − b (cid:107) . In particular, we consider the equivalent primal-dual form, min x ∈ R n max y ∈ B m f ( x, y ) := y (cid:62) ( Ax − b ) . (62)Throughout, we assume the smallest eigenvalue of A (cid:62) A is µ > and denote an optimal solutionto (62) by z ∗ = ( x ∗ , y ∗ ) (where x ∗ is the unique solution to the regression problem). Our strategyis to consider a sequence of modified problems, parameterized by β > , x (cid:48) ∈ R n : min x ∈ R n max y ∈ B m f βx (cid:48) ( x, y ) := y (cid:62) ( Ax − b ) + β (cid:13)(cid:13) x − x (cid:48) (cid:13)(cid:13) − β (cid:107) y (cid:107) . (63)We denote the optimal solution to (63) by z ∗ ( β,x (cid:48) ) = ( x ∗ ( β,x (cid:48) ) , y ∗ ( β,x (cid:48) ) ) ; when clear from context, forsimplicity we drop β and write z ∗ x (cid:48) = ( x ∗ x (cid:48) , y ∗ x (cid:48) ) (as β = √ µ throughout our algorithm). Lemma 12(proved in Section F.3) states a known relation between the optimal solutions for (62) and (63). Lemma 12.
Letting (x∗, y∗) be the optimal solution for (62) and (x∗_{x′}, y∗_{x′}) be the optimal solution for (63), the following relation holds:

‖x∗_{x′} − x∗‖ ≤ (1/(1 + µ/β²)) · ‖x′ − x∗‖.

We give a full implementation of the regression algorithm in Algorithm 5 (see Section F.3), and state its correctness and runtime in Theorem 5. The algorithm repeatedly solves problems of the form (63) in phases, each time using Lemma 12 to ensure progress towards x∗. Observing that each subproblem is (β, β)-strongly monotone, each phase is conducted via OuterLoopStronglyMonotone, an algorithm of Carmon et al. [8], using the ℓ2-ℓ2 algorithms of Section D.2 as a proximal oracle. Due to the existence of composite terms, our inner loop steps are slightly different than in Section D.2; we give a more formal algorithm for the relaxed proximal oracle and its implementation in Algorithm 4 and Appendix E.2. We remark that by a logarithmic number of restarts per phase, a standard argument boosts Theorem 5 to a high-probability claim.

Theorem 5.
Given data matrix A ∈ R^{m×n}, vector b ∈ R^m, and desired accuracy ε ∈ (0, 1), assuming A^⊤A ⪰ µI for µ > 0, Algorithm 5 outputs an expected ε-accurate solution x̃, i.e.

E[‖x̃ − x∗‖] ≤ ε,

and runs in time

Õ( nnz + √nnz · max{ (Σ_{i} ‖A_{i:}‖₁²)^{1/2} , (Σ_{j} ‖A_{:j}‖₁²)^{1/2} } / √µ ).

We give two settings where the runtime of Algorithm 5 improves upon the state of the art.
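Before describing these two settings, the phase structure behind Theorem 5 can be summarized schematically. In the sketch below (ours, not the paper's Algorithm 5), solve_subproblem is a hypothetical stand-in for the strongly monotone solver (OuterLoopStronglyMonotone with the relaxed proximal oracle of Section E.2); it is assumed to return a point within the requested distance of the minimizer of the regularized problem (63) anchored at the current iterate, and R0 is an assumed bound on the initial distance to x∗.

    import numpy as np

    def regression_phases(A, b, mu, eps, R0, solve_subproblem):
        beta = np.sqrt(mu)            # beta = sqrt(mu) throughout, as in the text above
        x = np.zeros(A.shape[1])      # current anchor x'
        dist = R0                     # current bound on ||x - x*||
        while dist > eps:
            # By Lemma 12, the minimizer of the subproblem anchored at x lies within
            # dist / (1 + mu / beta**2) = dist / 2 of x*; solving it to accuracy dist / 4
            # therefore shrinks the distance bound by a constant factor each phase.
            x = solve_subproblem(A, b, beta, x, dist / 4.0)
            dist = 0.75 * dist
        return x

Only O(log(R0/ε)) phases are needed, so the cost is dominated by the per-phase strongly monotone solve, and the phase count is absorbed into the Õ(·) above.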
Entrywise nonnegative A . In the particular setting when all entries of A are nonnegative, byProposition 7 our complexity as stated in Theorem 5 improves by a factor of (cid:112) nnz / ( m + n ) theruntime of accelerated gradient descent [30], which is the previous state-of-the-art in certain regimeswith runtime O (cid:0) nnz · (cid:107) A (cid:107) op / √ µ (cid:1) . This speedup is most beneficial when A is dense. More generally, this holds for arbitrary A ∈ R m × n satisfying (cid:107)| A |(cid:107) op ≤ (cid:107) A (cid:107) op . umerically sparse A . For numerically sparse A with (cid:107) A i : (cid:107) / (cid:107) A i : (cid:107) = O (1) , (cid:107) A : j (cid:107) / (cid:107) A : j (cid:107) = O (1) for all i ∈ [ m ] , j ∈ [ n ] , we can choose α = √ µ in Algorithm 5 and obtain the runtime O (cid:32) nnz + max (cid:8)(cid:80) i (cid:107) A i : (cid:107) , (cid:80) j (cid:107) A : j (cid:107) (cid:9) µ (cid:33) = O (cid:32) nnz + (cid:107) A (cid:107) µ (cid:33) using the argument in Theorem 5. Under a (similar, but weaker) numerically sparse condition (cid:107) A i : (cid:107) / (cid:107) A i : (cid:107) = O (1) , the prior state-of-the-art stochastic algorithm [19] obtains a runtime of O ( nnz + rcs · (cid:107) A (cid:107) /µ ) , and the recent state-of-the-art result in the numerically sparse regime [17]improves this to O ( nnz + ( (cid:107) A (cid:107) /µ ) . ) when rcs = Ω( (cid:107) A (cid:107) F / √ µ ) . Improving universally over both,our method gives O ( nnz + (cid:107) A (cid:107) /µ ) in this setting. Acknowledgements
This research was supported in part by Stanford Graduate Fellowships, NSF CAREER Award CCF-1844855, NSF Graduate Fellowship DGE-1656518, and a PayPal research gift. We thank the anonymous reviewers who helped improve the completeness and readability of this paper by providing many helpful comments.

References

[1] Z. Allen-Zhu, Y. T. Lee, and L. Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In
Proceedings of the twenty-seventh annualACM-SIAM symposium on Discrete algorithms , pages 1824–1831. Society for Industrial andApplied Mathematics, 2016.[2] Z. Allen-Zhu, Z. Liao, and Y. Yuan. Optimization algorithms for faster computational geometry.In , pages 53:1–53:6,2016.[3] Z. Allen-Zhu, Z. Qu, P. Richtárik, and Y. Yuan. Even faster accelerated coordinate descentusing non-uniform sampling. In
International Conference on Machine Learning , pages 1110–1119, 2016.[4] E. Andersen, C. Roos, T. Terlaky, T. Trafalis, and J. Warners. The use of low-rank updates ininterior-point methods.
Numerical Linear Algebra and Optimization , pages 1–12, 1996.[5] M. G. Azar, R. Munos, and H. J. Kappen. Minimax pac bounds on the sample complexity ofreinforcement learning with a generative model.
Machine learning , 91(3):325–349, 2013.[6] D. Babichev, D. Ostrovskii, and F. Bach. Efficient primal-dual algorithms for large-scale mul-ticlass classification. arXiv preprint arXiv:1902.03755 , 2019.[7] P. Balamurugan and F. R. Bach. Stochastic variance reduction methods for saddle-point prob-lems. In
Advances in Neural Information Processing Systems , 2016.[8] Y. Carmon, Y. Jin, A. Sidford, and K. Tian. Variance reduction for matrix games. In
Advancesin Neural Information Processing Systems , 2019.[9] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsitytime. In
Proceedings of the 45th annual ACM symposium on Symposium on theory of computing ,pages 81–90. ACM, 2013.[10] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning.In , pages 449–457, 2010.[11] M. B. Cohen, J. Nelson, and D. P. Woodruff. Optimal approximate matrix product in termsof stable rank. In . Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.[12] M. B. Cohen, Y. T. Lee, and Z. Song. Solving linear programs in the current matrix multipli-cation time. arXiv preprint arXiv:1810.07896 , 2018.[13] G. B. Dantzig.
Linear Programming and Extensions . Princeton University Press, Princeton,NJ, 1953.[14] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l 1-ballfor learning in high dimensions. In
Proceedings of the 25th international conference on Machinelearning , pages 272–279. ACM, 2008.[15] A. Gilyén, S. Lloyd, and E. Tang. Quantum-inspired low-rank stochastic regression with loga-rithmic dependence on the dimension. arXiv preprint arXiv:1811.04909 , 2018.[16] M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithmfor matrix games.
Operations Research Letters, 18(2):53–58, 1995.
[17] N. Gupta and A. Sidford. Exploiting numerical sparsity for efficient learning: faster eigenvector computation and regression. In
Advances in Neural Information Processing Systems , pages5269–5278, 2018.[18] H. Jiang, Y. T. Lee, Z. Song, and S. C.-w. Wong. An improved cutting plane method for convexoptimization, convex-concave games, and its applications. In
Proceedings of the 52nd AnnualACM SIGACT Symposium on Theory of Computing , pages 944–953, 2020.[19] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variancereduction. In
Advances in Neural Information Processing Systems , 2013.[20] N. Karmarkar. A new polynomial-time algorithm for linear programming. In
Proceedings ofthe sixteenth annual ACM symposium on Theory of computing , pages 302–311. ACM, 1984.[21] I. Koutis, G. L. Miller, and R. Peng. Approaching optimality for solving SDD linear systems.In , pages 235–244, 2010.[22] Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algo-rithms for solving linear systems. In , 2013.[23] Y. T. Lee and A. Sidford. Efficient inverse maintenance and faster algorithms for linear pro-gramming. In
IEEE 56th Annual Symposium on Foundations of Computer Science , pages230–249, 2015.[24] Y. T. Lee, A. Sidford, and S. C.-w. Wong. A faster cutting plane method and its implications forcombinatorial and convex optimization. In , pages 1049–1065. IEEE, 2015.[25] Y. T. Lee, Z. Song, and Q. Zhang. Solving empirical risk minimization in the current matrixmultiplication time. In
Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix,AZ, USA , pages 2140–2157, 2019.[26] M. Minsky and S. Papert.
Perceptrons—an introduction to computational geometry . MIT Press,1987.[27] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust op-timization with f-divergences. In
Advances in Neural Information Processing Systems , pages2208–2216, 2016.[28] A. Nemirovski. Prox-method with rate of convergence O (1 /t ) for variational inequalities withlipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization , 15(1):229–251, 2004.[29] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approachto stochastic programming.
SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[30] Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Doklady AN SSSR, 269:543–547, 1983.
[31] Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems.
Mathematical Programing , 109(2-3):319–344, 2007.[32] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.
SIAM Journal on Optimization, 22(2):341–362, 2012.
[33] Y. Nesterov and S. U. Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems.
SIAM Journal on Optimization , 27(1):110–123, 2017.[34] P. Richtárik and M. Takáč. On optimal probabilities in stochastic coordinate descent methods.
Optimization Letters , 10(6):1233–1243, 2016.[35] S. Sachdeva and N. K. Vishnoi. Faster algorithms via approximation theory.
Foundations andTrends in Theoretical Computer Science , 9(2):125–210, 2014.[36] S. Shalev-Shwartz and A. Tewari. Stochastic methods for (cid:96) -regularized loss minimization. Journal of Machine Learning Research , 12:1865–1892, 2011.[37] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why. In
ICML ,pages 793–801, 2016.[38] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularizedloss minimization.
Journal of Machine Learning Research , 14:567–599, 2013.[39] S. Shalev-Shwartz et al. Online learning and online convex optimization.
Foundations andTrends in Machine Learning , 4(2):107–194, 2012.[40] A. Sidford and K. Tian. Coordinate methods for accelerating (cid:96) ∞ regression and faster approx-imate maximum flow. In , pages 922–933, 2018.[41] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithmsfor solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAMSymposium on Discrete Algorithms , pages 770–787. Society for Industrial and Applied Mathe-matics, 2018.[42] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential conver-gence.
Journal of Fourier Analysis and Applications , 15(2):262, 2009.[43] C. Tan, T. Zhang, S. Ma, and J. Liu. Stochastic primal-dual method for empirical risk min-imization with o (1) per-iteration complexity. In Advances in Neural Information ProcessingSystems , 2018.[44] J. van den Brand. A deterministic linear program solver in current matrix multiplication time. arXiv preprint arXiv:1910.11957 , 2019.[45] J. van den Brand, Y. T. Lee, A. Sidford, and Z. Song. Solving tall dense linear programs innearly linear time. In
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theoryof Computing , 2020. To appear.[46] J. Von Neumann and O. Morgenstern.
Theory of games and economic behavior (commemorativeedition) . Princeton university press, 1944.[47] M. D. Vose. A linear algorithm for generating random numbers with a given distribution.
IEEE Transactions on Software Engineering, 17(9):972–975, 1991.
[48] M. Wang. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.
[49] M. Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear (sometimes sublinear) running time. arXiv preprint arXiv:1704.01869, 2017.
[50] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[51] A. Yurtsever, M. Udell, J. A. Tropp, and V. Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. In
International Conference on Artificial Intelligence andStatistics (AISTATS) , pages 1188–1196, 2017.[52] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical riskminimization.
The Journal of Machine Learning Research, 18(1):2939–2980, 2017.

Appendix
A Deferred proofs from Section 2
Proof of Proposition 1.
It is clear that our choices of X , Y are compact and convex, and that thelocal norms we defined are indeed norms (in all cases, they are quadratic norms). Validity of ourchoices of Θ follow from the well-known facts that for x ∈ ∆ n , the entropy function (cid:80) j ∈ [ n ] x j log x j is convex with range log n , and that for x ∈ B n , (cid:107) x (cid:107) is convex with range .In the Euclidean case we have that V x ( x (cid:48) ) = (cid:107) x − x (cid:48) (cid:107) and (15) follows from the Cauchy-Schwarz and Young inequalities: (cid:10) γ, x (cid:48) − x (cid:11) ≤ (cid:107) γ (cid:107) + 12 (cid:13)(cid:13) x − x (cid:48) (cid:13)(cid:13) . Similarly, for the simplex we have that entropy is 1-strongly-convex with respect to (cid:107)·(cid:107) and therefore V y ( y (cid:48) ) ≥ (cid:107) y − y (cid:48) (cid:107) , and we obtain (15) from the Hölder and Young inequalities, (cid:10) γ, y (cid:48) − y (cid:11) ≤ (cid:107) γ (cid:107) ∞ + 12 (cid:13)(cid:13) y − y (cid:48) (cid:13)(cid:13) , where we note that (cid:107) γ (cid:107) ∞ = (cid:107) γ (cid:107) ∗ in this case.Finally, clip( · ) is not the identity only when the corresponding domain is the simplex, in whichcase entropy satisfies the local norms bound [cf. 8, Lemma 13], (cid:10) γ, y − y (cid:48) (cid:11) − V y ( y (cid:48) ) ≤ (cid:88) i ∈ [ m ] γ i y i for all y, y (cid:48) ∈ ∆ m and γ ∈ R m such that (cid:107) γ (cid:107) ∞ ≤ . Noting that (cid:107) clip( γ ) (cid:107) ∞ ≤ and that (cid:107) clip( γ ) (cid:107) y ≤ (cid:107) γ (cid:107) y for all y ∈ ∆ m , we have the desired localbound (16). Finally, for every coordinate i ∈ [ m ] we have | γ i − [clip( γ )] i | = || γ i | − | I {| γ i | > } ≤ | γ i | I {| γ i | > } ≤ | γ i | . Consequently, | (cid:104) γ − clip( γ ) , z (cid:105) | ≤ (cid:80) i ∈ [ m ] γ i z i , giving the distortion bound (17). B Deferred proofs from Section 3
B.1 Proof of Proposition 2
In this section, we provide a convergence result for mirror descent under local norms. We require the following well-known regret bound for mirror descent.
Lemma 13 ([8, Lemma 12]). Let Q : Z → R be convex, let T ∈ N, z_0 ∈ Z and γ_0, γ_1, ..., γ_T ∈ Z∗. The sequence z_1, ..., z_T defined by

z_t = argmin_{z ∈ Z} { ⟨γ_{t−1}, z⟩ + Q(z) + V_{z_{t−1}}(z) }

satisfies for all u ∈ Z (denoting z_{T+1} := u),

Σ_{t=0}^{T} ⟨γ_t, z_t − u⟩ + Σ_{t=1}^{T} ⟨∇Q(z_t), z_t − u⟩ ≤ V_{z_0}(u) + Σ_{t=0}^{T} { ⟨γ_t, z_t − z_{t+1}⟩ − V_{z_t}(z_{t+1}) }. (64)

The proposition follows from this regret bound, the properties of the local norm setup, and the “ghost iterate” argument due to [29].

Proposition 2.
Let ( Z , (cid:107)·(cid:107) · , r , Θ , clip ) be a local norm setup, let L, (cid:15) > , and let ˜ g be an L -localestimator. Then, for η ≤ (cid:15) L and T ≥ η(cid:15) ≥ L Θ (cid:15) , Algorithm 1 outputs a point ¯ z such that E Gap(¯ z ) ≤ E (cid:34) sup u ∈Z T + 1 T (cid:88) t =0 (cid:104) g ( z t ) , z t − u (cid:105) (cid:35) ≤ (cid:15). Proof.
Defining ˜∆ t := g ( z t ) − η clip( η ˜ g ( z t )) and the ghost iterates s t = arg min s ∈Z (cid:26)(cid:28) η ˜∆ t − , s (cid:29) + V s t − ( s ) (cid:27) with s = w , we rearrange the regret as η T (cid:88) t =0 (cid:104) g ( z t ) , z t − u (cid:105) ≤ T (cid:88) t =0 (cid:104) clip( η ˜ g ( z t )) , z t − u (cid:105) + T (cid:88) t =0 (cid:68) η ˜∆ t , s t − u (cid:69) + T (cid:88) t =0 (cid:68) η ˜∆ t , z t − s t (cid:69) , (65)and bound each term in turn.We first apply Lemma 13 with Q = 0 and γ t = clip( η ˜ g ( z t )) , using (16) to conclude that T (cid:88) t =0 (cid:104) clip( η ˜ g ( z t )) , z t − u (cid:105) ≤ V z ( u ) + η T (cid:88) t =0 (cid:107) ˜ g ( z t ) (cid:107) z t , for all u ∈ Z . (66)Next, we apply Lemma 13 again, this time with γ t = η ˜∆ t , to obtain the regret bound T (cid:88) t =0 (cid:10) η ˜∆ t , s t − u (cid:11) ≤ V z ( u ) + T (cid:88) t =0 (cid:110)(cid:68) η ˜∆ t , s t − s t +1 (cid:69) − V s t ( s t +1 ) (cid:111) ≤ V z ( u ) + η T (cid:88) t =0 (cid:26) (cid:107) ˜ g ( z t ) (cid:107) s t + 12 (cid:107) g ( z t ) (cid:107) ∗ (cid:27) , (67)for all u ∈ Z , where we used (cid:68) η ˜∆ t , s t − s t +1 (cid:69) − V s t ( s t +1 ) ≤ (cid:104) ηg ( z t ) , s t − s t +1 (cid:105) − V s t ( s t +1 )+ | (cid:104) clip( η ˜ g ( z t )) , s t − s t +1 (cid:105) | − V s t ( s t +1 ) , and then appealed to the bounds (15) and (16) in the definition of the local norm setup. Now,substituting (66) and (67) into (65), maximizing over u , and taking an expectation, we obtain E sup u ∈Z T (cid:88) t =0 (cid:104) ηg ( z t ) , z t − u (cid:105) ≤
3Θ + η E T (cid:88) t =0 (cid:110) (cid:107) ˜ g ( z t ) (cid:107) z t + (cid:107) ˜ g ( z t ) (cid:107) s t + (cid:107) g ( z t ) (cid:107) ∗ (cid:111) + E T (cid:88) t =0 (cid:68) η ˜∆ t , z t − s t (cid:69) . (68)55o bound the last term we use the fact that g ( z t ) = E [˜ g ( z t ) | z t , s t ] (which follows from the firstpart of Definition 3). We then write (cid:12)(cid:12)(cid:12) E (cid:68) η ˜∆ t , z t − s t (cid:69)(cid:12)(cid:12)(cid:12) ≤ E |(cid:104) η ˜ g ( z t ) − clip( η ˜ g ( z t )) , z t − s t (cid:105)| ≤ η (cid:107) ˜ g ( z t ) (cid:107) z t + η (cid:107) ˜ g ( z t ) (cid:107) s t , (69)where the first inequality is by Jensen’s inequality, and the last is due to the property (17) of thelocal norm setup. Substituting (69) into (68), we obtain E sup u ∈Z T (cid:88) t =0 (cid:104) ηg ( z t ) , z t − u (cid:105) ≤
3Θ + η E T (cid:88) t =0 (cid:26) (cid:107) ˜ g ( z t ) (cid:107) z t + 2 (cid:107) ˜ g ( z t ) (cid:107) s t + 12 (cid:107) g ( z t ) (cid:107) ∗ (cid:27) . Finally, using the second moment bound of local gradient estimator (Definition 3) and its conse-quence Lemma 1, we may bound each of the expected squared norm terms by L . Dividing throughby η ( T + 1) gives E sup u ∈Z (cid:34) T + 1 T (cid:88) t =0 (cid:104) g ( z t ) , z t − u (cid:105) (cid:35) ≤ η ( T + 1) + 9 ηL . Our choices η = (cid:15) L and T ≥ η(cid:15) imply that the right hand side is at most (cid:15) , as required. B.2 Proof of Proposition 3
Proposition 3.
Let O be an ( α , ε inner )-relaxed proximal oracle with respect to gradient mapping g ,distance-generating function r with range at most Θ and some ε inner ≤ ε outer . Let z / , z / , . . . , z K − / be iterates of Algorithm 2 and let ¯ z K be its output. Then E Gap(¯ z K ) ≤ E max u ∈Z K K (cid:88) k =1 (cid:10) g ( z k − / ) , z k − / − u (cid:11) ≤ α Θ K + ε outer . Proof.
For some iteration k , we have by the optimality conditions on z (cid:63)k that (cid:10) g ( z k − / ) , z (cid:63)k − u (cid:11) ≤ α (cid:16) V z k − ( u ) − V z (cid:63)k ( u ) − V z k − ( z (cid:63)k ) (cid:17) ∀ u ∈ Z . Summing over k , writing (cid:10) g ( z k − / ) , z (cid:63)k − u (cid:11) = (cid:10) g ( z k − / ) , z k − / − u (cid:11) − (cid:10) g ( z k − / ) , z k − / − z (cid:63)k (cid:11) ,and rearranging yields K (cid:88) k =1 (cid:10) g ( z k − / ) , z k − / − u (cid:11) ≤ αV z ( u ) + K (cid:88) k =1 α (cid:16) V z k ( u ) − V z (cid:63)k ( u ) (cid:17) + K (cid:88) k =1 (cid:0)(cid:10) g ( z k − / ) , z k − / − z (cid:63)k (cid:11) − αV z k − ( z (cid:63)k ) (cid:1) , (70)for all u ∈ Z . Since z minimizes r , the first term is bounded by V z ( u ) ≤ r ( u ) − r ( z ) ≤ Θ . Thesecond term is bounded by the definition of z k in Algorithm 2: K (cid:88) k =1 α (cid:16) V z k ( u ) − V z (cid:63)k ( u ) (cid:17) ≤ K ( ε outer − ε inner ) . u and then taking an expectation yields E max u ∈Z K (cid:88) k =1 (cid:10) g ( z k − / ) , z k − / − u (cid:11) ≤ α Θ + K ( ε outer − ε inner )+ K (cid:88) k =1 E (cid:2)(cid:10) g ( z k − / ) , z k − / − z (cid:63)k (cid:11) − αV z k − ( z (cid:63)k ) (cid:3) . Finally, by Definition 5, E (cid:2)(cid:10) g ( z k − / ) , z k − / − z (cid:63)k (cid:11) − αV z k − ( z (cid:63)k ) (cid:3) ≤ ε inner for every k , and theresult follows by dividing by K . B.3 Proof of Proposition 4
We provide a convergence result for the variance-reduced stochastic mirror descent scheme in Algorithm 3. We first state the following helper bound, which is an application of Lemma 15. It is immediate from the variance bound of local-centered estimators (Property 2 of Definition 4) and the fact that all local norms (whether the domains are balls or simplices) are quadratic.
Lemma 14.
For any w ∈ Z, an (L, ε)-centered-local estimator g̃_w satisfies E‖g̃_w(z) − g(z)‖²_w ≤ L² V_w(z).

Lemma 15.
Let ‖·‖_D be a quadratic norm in a diagonal matrix, e.g. for some D = diag(d) with d ≥ 0 entrywise, let ‖x‖²_D = Σ_i d_i x_i². Then, if X is a random vector, we have E‖X − E[X]‖²_D ≤ E‖X‖²_D.

Proof.
This follows from the definition of variance: E‖X − E[X]‖²_D = E‖X‖²_D − ‖E[X]‖²_D ≤ E‖X‖²_D.

Proposition 4.
Let ( Z , (cid:107)·(cid:107) · , r , Θ , clip ) be any local norm setup. Let w ∈ Z , α ≥ ε inner > ,and ˜ g w be an L -centered-local estimator for some L ≥ α . Assume the domain is bounded by max z ∈Z (cid:107) z (cid:107) ≤ D , that g is L -Lipschitz, i.e. (cid:107) g ( z ) − g ( z (cid:48) ) (cid:107) ∗ ≤ L (cid:107) z − z (cid:48) (cid:107) , that g is LD -bounded,i.e. max z ∈Z (cid:107) g ( z ) (cid:107) ∗ ≤ LD , and that ˆ w = w . Then, for η = α L , T ≥ ηα ≥ L α , and ϕ = ε inner ,Algorithm 3 outputs a point ˆ w ∈ Z such that E max u ∈Z [ (cid:104) g ( ˜ w ) , ˜ w − u (cid:105) − αV w ( u )] ≤ ε inner , (24) i.e. Algorithm 3 is an ( α, ε inner ) -relaxed proximal oracle.Proof. For any u ∈ Z , and defining ˜∆ t := ˜ g ( ˆ w t ) − g ( w ) and ∆ t := g ( ˆ w t ) − g ( w ) , we have (cid:88) t ∈ [ T ] (cid:104) ηg ( w t ) , w t − u (cid:105) = (cid:88) t ∈ [ T ] (cid:68) clip( η ˜∆ t ) + ηg ( w ) , w t − u (cid:69) + (cid:88) t ∈ [ T ] (cid:68) η ∆ t − clip( η ˜∆ t ) , w t − u (cid:69) + (cid:88) t ∈ [ T ] (cid:104) ηg ( w t ) − ηg ( ˆ w t ) , w t − u (cid:105) . (71)57e proceed to bound the three terms on the right hand side of (71) in turn. For the first term,recall the guarantees for the “ideal” iterates of Algorithm 3, w (cid:63)t = arg min w ∈Z (cid:110)(cid:68) clip( η ˜∆ t ) + ηg ( w ) , w (cid:69) + αη V w ( w ) + V w t − ( w ) (cid:111) . By using the optimality conditions of these iterates, defining Q ( z ) := (cid:104) ηg ( w ) , z (cid:105) + αη V w ( z ) , γ t :=clip( η ˜∆ t ) , and defining for notational convenience w (cid:63)T +1 := u , (cid:88) t ∈ [ T ] (cid:104) γ t − + ∇ Q ( w (cid:63)t ) , w (cid:63)t − u (cid:105) ≤ (cid:88) t ∈ [ T ] (cid:10) −∇ V w t − ( w (cid:63)t ) , w (cid:63)t − u (cid:11) = (cid:88) t ∈ [ T ] (cid:0) V w t − ( u ) − V w (cid:63)t ( u ) − V w t − ( w (cid:63)t ) (cid:1) = V w ( u ) + (cid:88) t ∈ [ T ] (cid:0) V w t ( u ) − V w (cid:63)t ( u ) (cid:1) − T (cid:88) t =0 V w t ( w (cid:63)t +1 ) . (72)We thus have the chain of inequalities, recalling γ = 0 , (cid:88) t ∈ [ T ] (cid:68) clip( η ˜∆ t ) + ηg ( w ) , w t − u (cid:69) + αη (cid:88) t ∈ [ T ] (cid:104)∇ V w ( w (cid:63)t ) , w (cid:63)t − u (cid:105) = T (cid:88) t =0 (cid:104) γ t , w t − u (cid:105) + (cid:88) t ∈ [ T ] (cid:104)∇ Q ( w (cid:63)t ) , w (cid:63)t − u (cid:105) + (cid:88) t ∈ [ T ] (cid:104) ηg ( w ) , w t − w (cid:63)t (cid:105) ( i ) ≤ V w ( u ) + (cid:88) t ∈ [ T ] (cid:0) V w t ( u ) − V w (cid:63)t ( u ) (cid:1) + T (cid:88) t =0 (cid:0) (cid:104) γ t , w t − w (cid:63)t +1 (cid:105) − V w t ( w (cid:63)t +1 ) (cid:1) + (cid:88) t ∈ [ T ] (cid:104) ηg ( w ) , w t − w (cid:63)t (cid:105) ( ii ) ≤ V w ( u ) + 2 ηϕT + T (cid:88) t =0 (cid:16) (cid:104) clip( η ˜∆ t ) , w t − w (cid:63)t +1 (cid:105) − V w t ( w (cid:63)t +1 ) (cid:17) ( iii ) ≤ V w ( u ) + 2 ηϕT + T (cid:88) t =0 η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t . (73)Here, ( i ) was by rearranging (72) via the equality (cid:88) t ∈ [ T ] (cid:104) γ t − , w (cid:63)t − u (cid:105) = T (cid:88) t =0 (cid:104) γ t , w t − u (cid:105) − T (cid:88) t =0 (cid:10) γ t , w t − w (cid:63)t +1 (cid:11) , ( ii ) was by the conditions max u (cid:2) V w t ( u ) − V w (cid:63)t ( u ) (cid:3) ≤ ηϕ and (cid:107) w t − w (cid:63)t (cid:107) ≤ ϕLD satisfied by theiterates, and ( iii ) was by the property of clipping (15), as defined in the problem setup. Now byrearranging and using the three-point property of Bregman divergence (i.e. 
(cid:104)−∇ V w (cid:48) ( w ) , w − u (cid:105) = V w (cid:48) ( u ) − V w ( u ) − V w (cid:48) ( w ) ), it holds that (cid:88) t ∈ [ T ] (cid:68) clip( η ˜∆ t ) + ηg ( w ) , w t − u (cid:69) ≤ V w ( u ) + 2 ηϕT + η T (cid:88) t =0 (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t + αη (cid:88) t ∈ [ T ] ( V w ( u ) − V w ( w (cid:63)t )) ≤ V w ( u ) + 3 ηϕT + η T (cid:88) t =0 (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t + αη (cid:88) t ∈ [ T ] ( V w ( u ) − V w ( ˆ w t )) , (74)58here the second inequality follows from the condition V w ( ˆ w t ) − V w ( w (cid:63)t ) ≤ ϕα satisfied by iteratesof Algorithm 3. To bound the second term of (71), we define the ghost iterate sequnce { s t } by s t = arg min s ∈Z (cid:26) (cid:10) η ∆ t − − clip( η ˜∆ t − ) , s (cid:11) + V s t − ( s ) (cid:27) with s = w . Applying Lemma 13 with Q = 0 and γ t = ( η ∆ t − clip( η ˜∆ t )) , and observing that again γ = 0 , (cid:88) t ∈ [ T ] (cid:10) η ∆ t − clip( η ˜∆ t ) , s t − u (cid:11) ≤ V w ( u ) + T (cid:88) t =0 (cid:110) (cid:104) η ∆ t − clip( η ˜∆ t ) , s t − s t +1 (cid:105) − V s t ( s t +1 ) (cid:111) ≤ V w ( u ) + η T (cid:88) t =0 (cid:18) (cid:107) ∆ t (cid:107) ∗ + (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) s t (cid:19) . Here, we used properties (15) and (16). Consequently, (cid:88) t ∈ [ T ] (cid:68) η ∆ t − clip( η ˜∆ t ) , w t − u (cid:69) = (cid:88) t ∈ [ T ] (cid:68) η ∆ t − clip( η ˜∆ t ) , w t − s t (cid:69) + (cid:88) t ∈ [ T ] (cid:68) η ∆ t − clip( η ˜∆ t ) , s t − u (cid:69) ≤ V w ( u ) + η T (cid:88) t =0 (cid:18) (cid:107) ∆ t (cid:107) ∗ + (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) s t (cid:19) + (cid:88) t ∈ [ T ] (cid:68) η ∆ t − clip( η ˜∆ t ) , w t − s t (cid:69) . (75)To bound the third term of (71), we use the condition (cid:107) w t − ˆ w t (cid:107) ≤ ϕLD which implies (cid:88) t ∈ [ T ] (cid:104) ηg ( w t ) − ηg ( ˆ w t ) , w t − u (cid:105) ≤ (cid:88) t ∈ [ T ] (cid:107) ηg ( w t ) − ηg ( ˆ w t ) (cid:107) ∗ (cid:107) w t − u (cid:107) ≤ ηϕT. (76)Combining our three bounds (74), (75), and (76) in the context of (71), using ˆ w = w and ˜ g ( w ) = g ( w ) , and finally dividing through by ηT , we obtain T (cid:88) t ∈ [ T ] (cid:104) g ( w t ) , w t − u (cid:105) − (cid:18) ηT + α (cid:19) V w ( u ) ≤ ϕ + 1 T (cid:88) t ∈ [ T ] (cid:18) η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t + η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) s t + η (cid:107) ∆ t (cid:107) ∗ + (cid:28) ∆ t − η clip( η ˜∆ t ) , w t − s t (cid:29) − α V w ( ˆ w t ) (cid:19) . (77)Since T ≥ αη , taking a supremum over u ∈ Z in (77) and then an expectation yields E sup u ∈Z T (cid:88) t ∈ [ T ] (cid:104) g ( w t ) , w t − u (cid:105) − αV w ( u ) ≤ ϕ + 1 T E (cid:88) t ∈ [ T ] η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t + η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) s t + η (cid:107) ∆ t (cid:107) ∗ + (cid:28) ∆ t − η clip( η ˜∆ t ) , w t − s t (cid:29) − α V w ( ˆ w t ) . (78)59e will show the second line of (78) is nonpositive. 
To do so, observe for each t ∈ [ T ] , by theproperty (17) of clip( · ) , since conditional on w t , s t , ˜∆ t is unbiased for deterministic ∆ t , (cid:12)(cid:12)(cid:12)(cid:12) E (cid:28) ∆ t − η clip( η ˜∆ t ) , w t − s t (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) E (cid:28) ˜∆ t − η clip( η ˜∆ t ) , w t − s t (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t + η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) s t . (79)Finally, by using property 2 of the centered-local estimator ˜∆ t , as well as Remark 1, we have foreach t ∈ [ T ] , E (cid:20) η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) w t (cid:21) ≤ ηL V w ( ˆ w t ) , E (cid:20) η (cid:13)(cid:13)(cid:13) ˜∆ t (cid:13)(cid:13)(cid:13) s t (cid:21) ≤ ηL V w ( ˆ w t ) , and η (cid:107) ∆ t (cid:107) ∗ ≤ ηL V w ( ˆ w t ) . (80)Using bounds (79) and (80) in (78), as well as η ≤ α L , E sup u ∈Z T (cid:88) t ∈ [ T ] (cid:104) g ( w t ) , w t − u (cid:105) − αV w ( u ) ≤ ϕ. (81)For the final claim, denote the true average iterate by ¯ w := T (cid:80) t ∈ [ T ] w t . We have ∀ u ∈ Z , (cid:104) g ( ˜ w ) , ˜ w − u (cid:105) ( i ) = − (cid:104) g ( ˜ w ) , u (cid:105) = (cid:104) g ( ¯ w ) − g ( ˜ w ) , u (cid:105) + (cid:104) g ( ¯ w ) , ¯ w − u (cid:105) ( ii ) ≤ ϕ + (cid:104) g ( ¯ w ) , ¯ w − u (cid:105) = 1 T (cid:88) t ∈ [ T ] (cid:104) g ( w t ) , w t − u (cid:105) + ϕ. Here, ( i ) used the fact that linearity of g gives (cid:104) g ( z ) , z (cid:105) = 0 , ∀ z ∈ Z , and ( ii ) used Hölder’s inequality (cid:104) g ( ¯ w ) − g ( ˜ w ) , u (cid:105) ≤ (cid:107) g ( ¯ w ) − g ( ˜ w ) (cid:107) ∗ (cid:107) u (cid:107) ≤ LD (cid:107) ˜ w − ¯ w (cid:107) ≤ ϕ following from the approximationguarantee (cid:107) ˜ w − ¯ w (cid:107) ≤ ϕ LD . Combining with (81) yields the conclusion, as ϕ = ε inner . C Deferred proofs for sublinear methods
C.1 $\ell_2$-$\ell_2$ sublinear coordinate method

Assumptions.
The algorithm in this section will assume access to entry queries, $\ell_1$ norms of rows and columns, and $\ell_1$ sampling distributions for rows and columns. Further, it assumes the ability to sample a row or column proportional to its squared $\ell_1$ norm; given access to all $\ell_1$ norms, the algorithm may construct these sampling oracles in $O(m+n)$ time, which does not affect its asymptotic runtime. We use the $\ell_2$-$\ell_2$ local norm setup (Table 6). We define

$$L^{2,2}_{\textup{co}} := \sqrt{\sum_{i \in [m]} \|A_{i:}\|_1^2 + \sum_{j \in [n]} \|A_{:j}\|_1^2}. \quad (82)$$

C.1.1 Gradient estimator
For z ∈ B n × B m , we specify two distinct choices of sampling distributions p ( z ) , q ( z ) which obtainthe optimal Lipschitz constant. The first one is an oblivious distribution: p ij ( z ) := (cid:107) A i : (cid:107) (cid:80) k ∈ [ m ] (cid:107) A k : (cid:107) · | A ij |(cid:107) A i : (cid:107) and q ij ( z ) := (cid:107) A : j (cid:107) (cid:80) k ∈ [ n ] (cid:107) A : k (cid:107) · | A ij |(cid:107) A : j (cid:107) . (83)60he second one is a dynamic distribution: p ij ( z ) := [ z y ] i (cid:107) z y (cid:107) · | A ij |(cid:107) A i : (cid:107) and q ij ( z ) := [ z x ] j (cid:107) z x (cid:107) · | A ij |(cid:107) A : j (cid:107) . (84)We now state the local properties of each estimator. Lemma 16.
In the (cid:96) - (cid:96) setup, estimator (23) using the sampling distribution in (83) or (84) isan L , co -local estimator.Proof. For convenience, we restate the distributions here: they are respectively p ij ( z ) := (cid:107) A i : (cid:107) (cid:80) k ∈ [ m ] (cid:107) A k : (cid:107) · | A ij |(cid:107) A i : (cid:107) and q ij ( z ) := (cid:107) A : j (cid:107) (cid:80) k ∈ [ n ] (cid:107) A : k (cid:107) · | A ij |(cid:107) A : j (cid:107) and p ij ( z ) := [ z y ] i (cid:107) z y (cid:107) · | A ij |(cid:107) A i : (cid:107) and q ij ( z ) := [ z x ] j (cid:107) z x (cid:107) · | A ij |(cid:107) A : j (cid:107) . Unbiasedness holds by definition. We first show the variance bound on the x block for distribu-tion (83): E (cid:104) (cid:107) ˜ g x ( z ) (cid:107) (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ) · (cid:18) A ij [ z y ] i p ij ( z ) (cid:19) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y ] i p ij ( z )= (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij |(cid:107) A i : (cid:107) [ z y ] i · (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) = (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) . Similarly, we have E (cid:104) (cid:107) ˜ g y ( z ) (cid:107) (cid:105) ≤ (cid:88) j ∈ [ n ] (cid:107) A : j (cid:107) . Now, we show the variance bound on the x block for distribution (84): E (cid:104) (cid:107) ˜ g x ( z ) (cid:107) (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ) · (cid:18) A ij [ z y ] i p ij ( z ) (cid:19) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y ] i p ij ( z )= (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij |(cid:107) A i : (cid:107) (cid:107) z y (cid:107) ≤ (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) , and a similar bound holds on the y block.We remark that using the oblivious distribution (83) saves a logarithmic factor in the runtimecompared to the dynamic distribution, so for the implementation of all of our (cid:96) - (cid:96) algorithms wewill use the oblivious distribution. 61 .1.2 Implementation details In this section, we discuss the details of how to leverage the
IterateMaintainer data structureto implement the iterations of our algorithm. The algorithm we analyze is Algorithm 1, using thelocal estimator defined in (23), and the distribution (83). We choose η = (cid:15) (cid:16) L , co (cid:17) and T = (cid:24) η(cid:15) (cid:25) ≥ (cid:16) L , co (cid:17) (cid:15) . Lemma 16 implies that our estimator satisfies the remaining requirements for Proposition 2, givingthe duality gap guarantee in T iterations. In order to give a runtime bound, we claim that eachiteration can be implemented in constant time, with O ( m + n ) additional runtime. Data structure initializations and invariants.
At the start of the algorithm, we spend O ( m + n ) time initializing data structures via IM x . Init ( n , b ) , IM y . Init ( m , c ) , where IM x , IM y are instan-tiations of IterateMaintainer data structures. Throughout, we preserve the invariant that thepoints maintained by IM x , IM y correspond to the x and y blocks of the current iterate z t at iteration t of the algorithm. We note that we instantiate data structures which do not support Sample () . Iterations.
For simplicity, we only discuss the runtime of updating the x block as the y blockfollows symmetrically. We divide each iteration into the following substeps, each of which we showrun in constant time. We refer to the current iterate by z = ( z x , z y ) , and the next iterate by w = ( w x , w y ) . Sampling.
Because the distribution is oblivious, sampling both i and j | i using precomputed datastructures takes constant time. Computing the gradient estimator.
To compute c := A ij [ z y ] i /p ij , it suffices to compute A ij , [ z y ] i ,and p ij . Using an entry oracle for A obtains A ij in constant time, and calling IM y . Get ( i ) takesconstant time. Computing p ij using the precomputed row norms and the values of A ij , [ z y ] i takesconstant time. Performing the update.
For the update corresponding to a proximal step, we have w x ← Π X ( z x − η ˜ g x ( z )) = z x − η ˜ g x ( z )max {(cid:107) z x − η ˜ g x ( z ) (cid:107) , } . We have computed ˜ g x ( z ) , so to perform this update, we call IM x . AddSparse ( j, − ηc ); IM x . AddDense ( − η ); IM x . Scale (max { IM x . GetNorm () , } − ); IM x . UpdateSum () . By assumption, each operation takes constant time because we do not support
Sample in our instances of IM, giving the desired iteration complexity. It is clear that at the end of performing these operations, the invariant that IM$_x$ maintains the $x$ block of the iterate is preserved.

Averaging. After $T$ iterations, we compute the average point $\bar{z}^x$: $[\bar{z}^x]_j \leftarrow \frac{1}{T}\cdot \textup{IM}_x.\textup{GetSum}(j)$ for all $j \in [n]$. By assumption, this takes $O(n)$ time.
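To make the preceding substeps concrete, the following sketch performs one full iteration with plain NumPy arrays standing in for the IterateMaintainer calls, so it costs $O(m+n)$ per iteration rather than the $O(1)$ achieved with the data structure. The function name is ours, linear terms are omitted, and the sampling weights follow the oblivious distribution (83) as reconstructed above.

```python
import numpy as np

def l22_sublinear_iteration(A, z_x, z_y, eta, rng):
    """One iteration of the l2-l2 sublinear coordinate method, written with
    dense NumPy arrays in place of the IterateMaintainer calls (O(m + n) per
    step here, versus O(1) with the data structure). Uses distribution (83)."""
    m, n = A.shape
    row_l1 = np.abs(A).sum(axis=1)   # ||A_i:||_1, precomputed in the real method
    col_l1 = np.abs(A).sum(axis=0)   # ||A_:j||_1

    # x-block estimator of A^T z_y: sample i ~ ||A_i:||_1^2, then j | i ~ |A_ij|.
    p_row = row_l1 ** 2 / (row_l1 ** 2).sum()
    i = rng.choice(m, p=p_row)
    p_col = np.abs(A[i]) / row_l1[i]
    j = rng.choice(n, p=p_col)
    g_x = np.zeros(n)
    g_x[j] = A[i, j] * z_y[i] / (p_row[i] * p_col[j])   # unbiased, 1-sparse

    # y-block estimator of -A z_x: symmetric, sampling columns first.
    q_col = col_l1 ** 2 / (col_l1 ** 2).sum()
    jj = rng.choice(n, p=q_col)
    q_row = np.abs(A[:, jj]) / col_l1[jj]
    ii = rng.choice(m, p=q_row)
    g_y = np.zeros(m)
    g_y[ii] = -A[ii, jj] * z_x[jj] / (q_col[jj] * q_row[ii])

    # Proximal (projected gradient) step onto the Euclidean balls.
    def proj(v):
        return v / max(np.linalg.norm(v), 1.0)

    return proj(z_x - eta * g_x), proj(z_y - eta * g_y)
```

Sampling rows proportional to their squared $\ell_1$ norms and entries within a row proportional to their magnitudes is what yields the variance bound of Lemma 16, independently of where the large entries of $A$ sit.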
C.1.3 Algorithm guarantee

Theorem 6. In the $\ell_2$-$\ell_2$ setup, the implementation in Section C.1.2 has runtime

$$O\left(\frac{\big(L^{2,2}_{\textup{co}}\big)^2}{\epsilon^2} + m + n\right)$$

and outputs a point $\bar{z} \in \mathcal{Z}$ such that $\mathbb{E}\,\textup{Gap}(\bar{z}) \le \epsilon$.

Proof.
The runtime bound follows from the discussion in Section C.1.2. The correctness follows from Proposition 2.
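To illustrate the numerical-sparsity dependence in Theorem 6, consider a small example of our own (not from the paper): the dense but numerically sparse matrix $A = I_n + \frac{1}{n}\mathbf{1}\mathbf{1}^\top \in \mathbb{R}^{n\times n}$. Every row and column has $\|A_{i:}\|_1 = 2$, while $\|A\|_F^2 \approx n$ and $\textup{rcs} = n$, so

$$\big(L^{2,2}_{\textup{co}}\big)^2 = \sum_{i \in [n]} \|A_{i:}\|_1^2 + \sum_{j \in [n]} \|A_{:j}\|_1^2 = 8n \qquad \text{whereas} \qquad \textup{rcs}\,\|A\|_F^2 \approx n^2,$$

an improvement by a factor of order $m+n$, matching the regime of constant $\ell_1/\ell_2$ norm ratios discussed in the introduction.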
Remark 7.
Using our
IterateMaintainer data structure, the $\ell_2$-$\ell_2$ algorithm of Balamurugan and Bach [7] runs in time $O\big(\textup{rcs}\,\|A\|_F^2/\epsilon^2\big)$. Our runtime universally improves upon it since

$$\sum_{i \in [m]} \|A_{i:}\|_1^2 + \sum_{j \in [n]} \|A_{:j}\|_1^2 \le 2\,\textup{rcs}\,\|A\|_F^2.$$

C.2 $\ell_1$-$\ell_2$ sublinear coordinate method

Assumptions
The algorithm in this section will assume access to every oracle listed in Section 2.3.However, for a specific matrix A , only one of three sampling distributions will be used in thealgorithm; we describe the specific oracle requirements of each distribution following their definition.We use the (cid:96) - (cid:96) local norm setup (Table 6). Throughout this section, we will assume that the linearterm in (23) is g (0) = 0 uniformly.Finally, in this section we assume access to a weighted variant of IterateMaintainer , whichtakes a nonnegative weight vector w as a static parameter. WeightedIterateMaintainer supportstwo modified operations compared to the data structure IterateMaintainer : its GetNorm () op-eration returns (cid:113)(cid:80) j [ w ] j [ x ] j , and its Sample () returns coordinate j with probability proportionalto [ w ] j [ x ] j (cf. Section 2.4.1). We give the implementation of this extension in Appendix G. C.2.1 Gradient estimator
For z ∈ B n × ∆ m and desired accuracy (cid:15) > , we specify three distinct choices of sampling distri-butions p ( z ) , q ( z ) . Each of our distributions induces an estimator with different properties.The first one is p ij ( z ) := | A ij |(cid:107) A i : (cid:107) · [ z y ] i and q ij ( z ) := A ij (cid:107) A (cid:107) . (85)The second one is p ij ( z ) := | A ij |(cid:107) A i : (cid:107) · [ z y ] i and q ij ( z ) := [ z x ] j · { A ij (cid:54) =0 } (cid:80) l ∈ [ n ] cs l · [ z x ] l . (86)63ere, we let cs j ≤ rcs denote the number of nonzero elements in column A : j . The third one is p ij ( z ) := | A ij |(cid:107) A i : (cid:107) · [ z y ] i and q ij ( z ) := | A ij | · [ z x ] j (cid:80) l ∈ [ n ] (cid:107) A : l (cid:107) · [ z x ] l . (87)For L , , (1) co , L , , (2) co , and L , , (3) co to be defined, the estimators induced by these distributionsare local estimators whose guarantees depend on these constants respectively. Furthermore, theseLipschitz constants are in general incomparable and depend on specific properties of the matrix.Therefore, we may choose our definition of L , co to be the minimum of these constants, by choosingan appropriate estimator. We now state the local properties of each estimator. Lemma 17.
In the (cid:96) - (cid:96) setup, estimator (23) using the sampling distributions in (85) , (86) , or (87) is respectively a L , , ( k ) co -local estimator, for k ∈ { , , } , and L , , (1) co := (cid:114) max i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:107) A (cid:107) ,L , , (2) co := (cid:114) rcs max i ∈ [ m ] (cid:107) A i : (cid:107) ,L , , (3) co := (cid:115) max i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:18) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:19) (cid:18) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:19) . Proof.
First, we give the proof for the sampling distribution (85). Unbiasedness holds by definition.For the x block, we have the variance bound: E (cid:104) (cid:107) ˜ g x ( z ) (cid:107) (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ) · (cid:18) A ij [ z y ] i p ij ( z ) (cid:19) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij |(cid:107) A i : (cid:107) [ z y ] i = (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) [ z y ] i ≤ max i ∈ [ m ] (cid:107) A i : (cid:107) . For arbitrary w y , we have the variance bound on the y block: E (cid:104) (cid:107) ˜ g y ( z ) (cid:107) w y (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] q ij ( z ) · (cid:32) [ w y ] i · (cid:18) A ij [ z x ] j q ij ( z ) (cid:19) (cid:33) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i A ij [ z x ] j q ij ( z )= (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i [ z x ] j (cid:107) A (cid:107) ≤ (cid:107) A (cid:107) . Next, we give the proof for the sampling distribution (86). Unbiasedness holds by definition. ByCauchy-Schwarz and our earlier proof, we have the variance bound for the x block: E (cid:104) (cid:107) ˜ g x ( z ) (cid:107) (cid:105) ≤ max i ∈ [ m ] (cid:107) A i : (cid:107) ≤ rcs max i ∈ [ m ] (cid:107) A i : (cid:107) . For arbitrary w y , we have the variance bound on the y block, where S i := (cid:8) j | A ij (cid:54) =0 = 1 (cid:9) : E (cid:104) (cid:107) ˜ g y ( z ) (cid:107) w y (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ S i q ij ( z ) · (cid:32) [ w y ] i · (cid:18) A ij [ z x ] j q ij ( z ) (cid:19) (cid:33) = (cid:88) i ∈ [ m ] ,j ∈ S i [ w y ] i A ij [ z x ] j q ij ( z ) ≤ (cid:88) i ∈ [ m ] ,j ∈ S i [ w y ] i A ij rcs ≤ rcs max k ∈ [ m ] (cid:107) A k : (cid:107) . x block again hold. For arbitrary w y , we have the variance bound on the y block: E (cid:104) (cid:107) ˜ g y ( z ) (cid:107) w y (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] q ij ( z ) · (cid:32) [ w y ] i · (cid:18) A ij [ z x ] j q ij ( z ) (cid:19) (cid:33) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i A ij [ z x ] j q ij ( z ) ≤ (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i | A ij | (cid:88) l ∈ [ n ] (cid:107) A : l (cid:107) [ z x ] l ≤ (cid:18) max k ∈ [ m ] (cid:107) A k : (cid:107) (cid:19) (cid:18) max l ∈ [ n ] (cid:107) A : l (cid:107) (cid:19) . By using the definitions of L , , (1) co , L , , (2) co , and L , , (3) co , we define the constant L , co := (cid:115) max i ∈ [ m ] (cid:107) A i : (cid:107) + min (cid:18) (cid:107) A (cid:107) F , rcs max i ∈ [ m ] (cid:107) A i : (cid:107) , (cid:18) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:19) (cid:18) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:19)(cid:19) . (88)In particular, by choosing whichever of the distributions (85), (86), or (87) yields the minimialLipschitz constant, we may always ensure we have a L , co -local estimator. We now discuss the specificprecomputed quantities each estimator requires, among those listed in Section 2.3. All distributionsrequire access to entry queries, (cid:96) norms of rows, and (cid:96) sampling distributions for rows.• Using the sampling distribution (85) requires additional access to (cid:96) sampling distributionsfor rows and columns and the Frobenius norm of A .• Using the sampling distribution (86) requires additional access to uniform sampling nonzeroentries of columns.• Using the sampling distribution (87) requires additional access to (cid:96) norms of columns and (cid:96) sampling distributions for columns. C.2.2 Implementation details
In this section, we discuss the details of how to leverage the appropriate
IterateMaintainer and IterateMaintainer data structures to implement the iterations of our algorithm. The algo-rithm we analyze is Algorithm 1, using the local estimator defined in (23), and the best choice ofdistribution among (85), (86), (87). We choose η = (cid:15) (cid:16) L , co (cid:17) and T = (cid:24) η(cid:15) (cid:25) ≥ (cid:16) L , co (cid:17) log(2 m ) (cid:15) . Lemma 17 implies that our estimator satisfies the remaining requirements for Proposition 2, givingthe duality gap guarantee in T iterations. In order to give a runtime bound, we claim that eachiteration can be implemented in O (log mn ) time, with O ( m + n ) additional runtime. For simplicity,because most of the algorithm implementation details are exactly same as the discussion of Sec-tion 4.1.2 for the simplex block y ∈ Y , and exactly the same as the discussion of Section C.1.2 for65he ball block x ∈ X , we discuss the differences here, namely the implementations of sampling andgradient computation.We assume that we have initialized IM y , an instantiation of IterateMaintainer , and IM x , aninstantiation of IterateMaintainer . When the choice of distribution is (86), we also assumeaccess to WIM x , an instantiation of WeightedIterateMaintainer initialized with the weight vec-tor of nonzero counts of columns of the matrix; similarly, for distribution (87) we instantiate a WeightedIterateMaintainer with the weight vector of (cid:96) norms of each column. Sampling.
Recall that p ij ( z ) := | A ij |(cid:107) A i : (cid:107) · [ z y ] i . We first sample coordinate i via IM y . Sample () in O (log m ) , and then sample j using the datastructure corresponding to A i : in O (1) . Next, to sample from the distribution q ij ( z ) := A ij (cid:107) A (cid:107) required by (85), we can sample a coordinate of the matrix proportional to its square in constanttime using our matrix access. To sample from the distribution q ij ( z ) := [ z x ] j · { A ij (cid:54) =0 } (cid:80) l ∈ [ n ] cs l · [ z x ] l required by (86), we first sample coordinate j via WIM x . Sample () in O (log n ) , and then uniformlysample a coordinate i amongst the entries of A : j for which the indicator labels as nonzero. Finally,to sample from the distribution q ij ( z ) := | A ij | · [ z x ] j (cid:80) l ∈ [ n ] (cid:107) A : l (cid:107) · [ z x ] l required by (87), we sample coordinate j via WIM x . Sample () , and then sample a coordinate i pro-portional to its absolute value using a column sampling oracle. Computing the gradient estimator.
By the proofs of Theorem 1 and Theorem 6, it suffices to compute $p_{i_x j_x}(z)$ and $q_{i_y j_y}(z)$ in constant time. Calling IM$_x$.Get($j$), IM$_y$.Get($i$), IM$_x$.GetNorm(), and WIM$_x$.GetNorm() when appropriate, and using access to precomputation, allows us to obtain all relevant quantities for these computations in $O(1)$.
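The Sample() calls above all reduce to drawing an index with probability proportional to a maintained nonnegative weight (e.g. $[z^y]_i$ for IM$_y$, or $[w]_j[x]_j^2$ for WIM$_x$). A minimal sketch of such a primitive follows; it is not the paper's data structure (which additionally supports the $O(1)$ Scale/AddDense/AddSparse bookkeeping of Section 5 and Appendix G), only the tree-based sampling idea it relies on, with illustrative names.

```python
import numpy as np

class SamplingTree:
    """Samples an index with probability proportional to a nonnegative weight.

    A complete binary tree (stored flat) keeps subtree sums, so set() and
    sample() are O(log n). Only the sampling primitive is shown here; the
    paper's data structures also maintain weights implicitly under scaling
    and dense updates."""

    def __init__(self, weights):
        self.n = len(weights)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = np.zeros(2 * self.size)
        self.tree[self.size:self.size + self.n] = weights
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def set(self, j, w):
        node = self.size + j
        self.tree[node] = w
        node //= 2
        while node:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self, rng):
        u = rng.random() * self.tree[1]
        node = 1
        while node < self.size:
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return node - self.size


# Example: sample j with probability proportional to w_j * x_j^2,
# the weighting used by WeightedIterateMaintainer's Sample().
rng = np.random.default_rng(0)
w, x = np.array([3.0, 1.0, 2.0]), np.array([0.5, -1.0, 0.25])
tree = SamplingTree(w * x ** 2)
print(tree.sample(rng))
```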
C.2.3 Algorithm guarantee

Theorem 7. In the $\ell_1$-$\ell_2$ setup, the implementation in Section C.2.2 has runtime

$$O\left(\frac{\big(L^{1,2}_{\textup{co}}\big)^2 \log m \log(mn)}{\epsilon^2} + m + n\right)$$

and outputs a point $\bar{z} \in \mathcal{Z}$ such that $\mathbb{E}\,\textup{Gap}(\bar{z}) \le \epsilon$.

Proof. The runtime bound follows from the discussion in Section C.2.2. The correctness follows from Proposition 2.
Remark 8.
Using our
IterateMaintainer and IterateMaintainer data structures, the (cid:96) - (cid:96) algorithm of Clarkson et al. [10] runs in time O ( rcs max i ∈ [ m ] (cid:107) A i : (cid:107) log ( mn ) /(cid:15) ) . By noting thedefinition of L , , (2) co , our runtime universally improves upon it since (cid:16) L , co (cid:17) ≤ rcs max i ∈ [ m ] (cid:107) A i : (cid:107) . D Deferred proofs for variance-reduced methods
D.1 Helper proofs
Lemma 3.
For y, y (cid:48) ∈ ∆ m , divergence V y ( y (cid:48) ) generated by r ( y ) = (cid:80) i ∈ [ m ] [ y ] i log[ y ] i − [ y ] i satisfies V y ( y (cid:48) ) ≥ (cid:13)(cid:13) y (cid:48) − y (cid:13)(cid:13) y + y (cid:48) = 12 (cid:88) i ∈ [ m ] ([ y ] i − [ y (cid:48) ] i ) [ y ] i + [ y (cid:48) ] i . Proof.
Let γ ∈ R m . Note that for every τ ∈ [0 , (with elementwise multiplication, divisionand square root), (cid:104) γ, y − y (cid:48) (cid:105) = (cid:28) γ (cid:112) (1 − τ ) y + τ y (cid:48) , y − y (cid:48) √ (1 − τ ) y + τy (cid:48) (cid:29) . Therefore, using (cid:104) u, w (cid:105) ≤(cid:107) u (cid:107) + (cid:107) w (cid:107) , we have for every τ ∈ [0 , , (cid:10) γ, y − y (cid:48) (cid:11) ≤ (cid:88) i ∈ [ m ] (cid:0) (1 − τ )[ y ] i + τ [ y (cid:48) ] i (cid:1) [ γ ] i + (cid:88) i ∈ [ m ] ([ y ] i − [ y (cid:48) ] i ) (1 − τ )[ y ] i + τ [ y (cid:48) ] i . Applying the double integral (cid:82) dt (cid:82) t dτ to both sides of the inequality, and using (cid:82) dt (cid:82) t · dτ = and (cid:82) dt (cid:82) t τ · dτ = gives (cid:10) γ, y − y (cid:48) (cid:11) ≤ (cid:88) i ∈ [ m ] (cid:18)
13 [ y ] i + 16 [ y (cid:48) ] i (cid:19) [ γ ] i + (cid:90) dt (cid:90) t (cid:88) i ∈ [ m ] ([ y ] i − [ y (cid:48) ] i ) (1 − τ )[ y ] i + τ [ y (cid:48) ] i dτ. Identifying the double integral with the expression V y ( y (cid:48) ) = (cid:88) i ∈ [ m ] (cid:18) y (cid:48) i log y (cid:48) i y i + y i − y (cid:48) i (cid:19) = (cid:90) dt (cid:90) t (cid:88) i ∈ [ m ] ( y i − y (cid:48) i ) (1 − τ ) y i + τ y (cid:48) i dτ. (89)for the divergence induced by entropy, the result follows by choosing [ γ ] i = [ y ] i − [ y (cid:48) ] i [ y ] i + [ y (cid:48) ] i . Lemma 5.
Let $x' \in \Delta^n$ be a $\beta$-padding of $x \in \Delta^n$. Then,

$$\sum_{j \in [n]} x'_j \log x'_j - \sum_{j \in [n]} x_j \log x_j \le \frac{\beta n}{e} + \beta(1+\beta).$$

Proof.
Letting ˜ x be the point inducing x (cid:48) in Definition 2, we have (cid:88) j ∈ [ n ] x (cid:48) j log x (cid:48) j − (cid:88) j ∈ [ n ] x j log x j = (cid:88) j ∈ [ n ] x (cid:48) j log x (cid:48) j − (cid:88) j ∈ [ n ] ˜ x j log ˜ x j + (cid:88) j ∈ [ n ] ˜ x j log ˜ x j − (cid:88) j ∈ [ n ] x j log x j .
67e bound these two terms separately. For the first term, let (cid:107) ˜ x (cid:107) = 1 + b , for some b ≤ β ; wesee that entrywise, (1 + b ) x (cid:48) j = ˜ x j . For each j ∈ [ n ] , x (cid:48) j log x (cid:48) j − ˜ x j log ˜ x j = x (cid:48) j log x (cid:48) j − (1 + b ) x (cid:48) j log (cid:0) (1 + b ) x (cid:48) j (cid:1) = bx (cid:48) j log 1 x (cid:48) j − (1 + b ) x (cid:48) j log(1 + b ) ≤ bx (cid:48) j log 1 x (cid:48) j ≤ βe . The first inequality was due to nonnegativity of (1 + b ) log(1 + b ) and x (cid:48) j , and the second was dueto the maximum value of the scalar function z log z over the nonnegative reals being /e . Summingover all coordinates yields that the first term is bounded by βn/e .For the second term, we have by integration that entrywise ˜ x j log ˜ x j − x j log x j = (cid:90) α =0 (1 + log( x j + α (˜ x j − x j )))(˜ x j − x j ) dα ≤ (cid:90) α =0 (1 + log(˜ x j ))(˜ x j − x j ) dα ≤ (cid:90) α =0 ˜ x j (˜ x j − x j ) dα ≤ (1 + β ) | ˜ x j − x j | . The first inequality is by ˜ x j ≥ x j for all j ∈ [ n ] and log( x ) is monotone in x > ; the second is by log( x ) ≤ x − for all x > ; the third again uses ˜ x j ≥ x j and that ˜ x j ≤ (cid:107) ˜ x (cid:107) ≤ β , and thesecond condition in Definition 2. Finally, combining yields the desired (cid:88) j ∈ [ n ] x (cid:48) j log x (cid:48) j − (cid:88) j ∈ [ n ] x j log x j ≤ βne + β (1 + β ) . D.2 (cid:96) - (cid:96) variance-reduced coordinate method Assumptions.
As in Section C.1, the algorithm in this section will assume access to entry queries, (cid:96) norms of rows and columns, and (cid:96) sampling distributions for rows and columns, and the abilityto sample a row or column proportional to its squared (cid:96) norm. We use the (cid:96) - (cid:96) local norm setup(cf. Table 6). Again, we define L , co := (cid:115) (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:88) j ∈ [ n ] (cid:107) A : j (cid:107) . D.2.1 Gradient estimator
Given reference point w ∈ B n × B m , for z ∈ B n × B m , we specify two distinct sampling distribu-tions p ( z ; w ) , q ( z ; w ) which obtain the optimal Lipschitz constant. The first one is an obliviousdistribution: p ij ( z ; w ) := (cid:107) A i : (cid:107) (cid:80) k ∈ [ m ] (cid:107) A k : (cid:107) · | A ij |(cid:107) A i : (cid:107) and q ij ( z ; w ) := (cid:107) A : j (cid:107) (cid:80) k ∈ [ n ] (cid:107) A : k (cid:107) · | A ij |(cid:107) A : j (cid:107) . (90)68he second one is a dynamic distribution: p ij ( z ; w ) := [ w y − z y ] i (cid:13)(cid:13) w y − z y (cid:13)(cid:13) · | A ij |(cid:107) A i : (cid:107) and q ij ( z ; w ) := [ w x − z x ] j (cid:107) w x − z x (cid:107) · | A ij |(cid:107) A : j (cid:107) . (91)We now state the local properties of each estimator. Lemma 18.
In the (cid:96) - (cid:96) setup, estimator (25) using the sampling distribution in (90) or (91) is a √ L , co -centered-local estimator.Proof. Unbiasedness holds by definition in both cases. We first show the variance bound on the x block for distribution (90): E (cid:104)(cid:13)(cid:13) ˜ g x w ( z ) − g x ( w ) (cid:13)(cid:13) (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ; w ) · (cid:18) A ij [ z y − w y ] i p ij ( z ; w ) (cid:19) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y − w y ] i p ij ( z ; w )= (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij |(cid:107) A i : (cid:107) [ z y − w y ] i · (cid:88) k ∈ [ m ] (cid:107) A k : (cid:107) = (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) (cid:13)(cid:13) z y − w y (cid:13)(cid:13) . Similarly, we have E (cid:104)(cid:13)(cid:13) ˜ g y w ( z ) − g y ( w ) (cid:13)(cid:13) (cid:105) ≤ (cid:88) j ∈ [ n ] (cid:107) A : j (cid:107) (cid:107) z x − w x (cid:107) . Combining these and using (cid:107) z x − w x (cid:107) + (cid:13)(cid:13) z y − w y (cid:13)(cid:13) = 2 V w ( z ) yields the desired variance bound.Now, we show the variance bound on the x block for distribution (91): E (cid:104)(cid:13)(cid:13) ˜ g x w ( z ) − g x ( w ) (cid:13)(cid:13) (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ; w ) · (cid:18) A ij [ z y − w y ] i p ij ( z ; w ) (cid:19) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y − w y ] i p ij ( z ; w )= (cid:88) i ∈ [ m ] ,j ∈ [ n ] | A ij |(cid:107) A i : (cid:107) (cid:13)(cid:13) z y − w y (cid:13)(cid:13) = (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) (cid:13)(cid:13) z y − w y (cid:13)(cid:13) . and a similar bound holds on the y block.Again, for algorithmic considerations (i.e. an additional logarithmic factor in the complexity ofsampling from (91)), we will only discuss using the oblivious distribution (90) in our algorithm. D.2.2 Implementation details
In this section, we discuss the details of how to leverage the
IterateMaintainer data structure toimplement the iterations of our algorithm. The algorithm we analyze is Algorithm 2 with K = α Θ /(cid:15) ,69sing Algorithm 3 as an ( α, -relaxed proximal oracle. In the implementation of Algorithm 3, weuse the centered-local gradient estimator defined in (90). For each use of Algorithm 3, we choose η = α (cid:16) L , co (cid:17) and T = (cid:24) ηα (cid:25) ≥ (cid:16) L , co (cid:17) α . (92)Our discussion will follow in three steps: first, we discuss the complexity of all executions in Al-gorithm 2 other than the calls to the oracles, as well as the initialization procedure for each innerloop. Next, we discuss the complexity of each iteration of Algorithm 3. Finally, we discuss thecomplexity of computing the average iterate in each run of Algorithm 3. For simplicity, when dis-cussing Algorithm 3, we will only discuss implementation of the x -block, and the y -block will followsymmetrically. Altogether, the guarantees of Proposition 3 and Proposition 4 imply that if theguarantees required by the algorithm hold, the expected gap of the output is bounded by (cid:15) . Outer loop extragradient steps and inner loop data structures.
Overall, we execute K = α Θ /(cid:15) iterations of Algorithm 2, and let ε outer = ε inner = 0 to obtain the desired gap, where Θ = 1 in the (cid:96) - (cid:96) setup. We spend O ( nnz ) time executing each extragradient step in Algorithm 2 exactly,where the dominant term in the runtime is the computation of each g ( z k − / ) , for k ∈ [ K ] . Also,we can maintain the average point ¯ z throughout the duration of the algorithm, in O ( m + n ) timeper iteration. At the beginning of each inner loop, we initialize a data structure IM x which does notsupport sampling, an instance of IterateMaintainer , with IM x . Init ( w x , v ) , for v = (1 − κ ) w x − ηκg x ( w ) , where κ := ηα/ . The inner loop will preserve the invariant that the point maintained by IM x isthe x block of the current inner loop iterate w t in each iteration t . To motivate this initialization,we recall the form of the updates, w x t +1 ← Π X (cid:18) κ (cid:18) w x t + (cid:18) κ − (cid:19) w x − η ˜ g x w ( w t ) (cid:19)(cid:19) , (93)where Π X ( w ) = w max { , (cid:107) w (cid:107) } , and the fixed dense part of ˜ g x w ( w t ) is g x ( w ) . Therefore, in thefollowing discussion we will be able to maintain this difference via a scaling by κ , an appropriateaddition of the scaled dense vector, and a sparse update.Finally, we also store the vector w in full, supporting entry queries. Inner loop iterations.
Each inner loop iteration consists of sampling indices for the computationof ˜ g w , computing the sparse part of ˜ g w , and performing the update to the iterate. We show thatwe can run each substep in constant time. Then, this implies that the total complexity of the innerloop, other than initializing the data structures and outputting the average iterate, is O ( T ) = O (cid:16) L , co (cid:17) α . We discuss how to make appropriate modifications to the x -block. For simplicity we denote ourcurrent iterate as z , and the next iterate as w . Recall that the distribution is given by p ij ( z ; w ) := (cid:107) A i : (cid:107) (cid:80) k ∈ [ m ] (cid:107) A k : (cid:107) · | A ij |(cid:107) A i : (cid:107) . ampling. By using precomputed distributions, we can sample i ∝ (cid:107) A i : (cid:107) and then j | i ∝ | A ij | inconstant time. Computing the gradient estimator.
Computing the sparse component of the gradient estimator 25requires computing A ij , [ z y − w y ] i , and p ij ( z ; w ) . Using appropriate use of precomputed access toentries and row norms (it is clear we may pay O ( m + n ) at the beginning of the algorithm to storethe sum (cid:80) k ∈ [ m ] (cid:107) A k : (cid:107) ), entry [ w y ] i , and IM y . Get ( i ) allows us to perform the required computationof the sparse component c := [˜ g x w ( z ) − g ( w )] j in constant time, by assumption. Performing the update.
In order to perform the update, we recall the form of the update given by (93). Thus, it suffices to call

IM$_x$.Scale($\kappa$); IM$_x$.AddDense($1$); IM$_x$.AddSparse($j, -\kappa\eta c$); IM$_x$.Scale($\max\{\textup{IM}_x.\textup{GetNorm}(), 1\}^{-1}$); IM$_x$.UpdateSum().

By assumption, each operation takes constant time. By the discussion in the data structure initialization section, it is clear that we preserve the invariant that the point maintained by IM$_x$ is the $x$ block of the current iterate.
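For concreteness, here is a minimal sketch of this $x$-block update with dense vectors in place of IM$_x$ (so it runs in $O(n)$ rather than $O(1)$ per step). The function and argument names are ours, and the sampled entry $(i,j)$ with probability $p_{ij}$ is assumed given.

```python
import numpy as np

def vr_inner_step_x(w_t_x, w_t_y, w0_x, w0_y, g0_x, A, i, j, p_ij, eta, kappa):
    """x-block update (93) of the variance-reduced inner loop, written with
    dense vectors in place of IM_x (O(n) here, O(1) with the data structure).
    g0_x stands for the dense reference gradient A^T w0_y."""
    # Sparse part of the centered estimator: a single nonzero coordinate j.
    c = A[i, j] * (w_t_y[i] - w0_y[i]) / p_ij

    # Mirrors Scale(kappa); AddDense(1); AddSparse(j, -kappa*eta*c); Scale(1/max{norm,1}).
    v = (1.0 - kappa) * w0_x - eta * kappa * g0_x   # dense vector fixed at initialization
    x = kappa * w_t_x + v
    x[j] -= kappa * eta * c
    return x / max(np.linalg.norm(x), 1.0)          # projection onto the unit ball
```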
Average iterate computation. At the end of each run of Algorithm 3, we spend $O(n)$ time computing and returning the average iterate via appropriate calls to IM$_x$.GetSum($j$) for each $j \in [n]$, and scaling by $1/T$. This operation is asymptotically dominated by the $O(\textup{nnz}(A))$ cost of the extragradient step.

D.2.3 Algorithm guarantee

Theorem 8.
In the (cid:96) - (cid:96) setup, the implementation in Section D.2.2 with the optimal choice of α = max { (cid:15), L , co (cid:112) / nnz } has runtime O nnz + (cid:16) L , co (cid:17) α α(cid:15) = O (cid:32) nnz + √ nnz L , co (cid:15) (cid:33) and outputs a point ¯ z ∈ Z such that E Gap(¯ z ) ≤ (cid:15). Proof.
The correctness of the algorithm is given by the discussion in Section D.2.2 and the guaranteesof Proposition 3 and Proposition 4. The runtime bound is given by the discussion in Section D.2.2,and the optimal choice of α is clear.To better understand the strengths of our runtime guarantee, Proposition 7 shows that Theo-rem 8 implies a universal improvement for (cid:96) - (cid:96) games compared to accelerated gradient descent formatrices A with nonnegative entries (or more generally, for A with (cid:107)| A |(cid:107) op = O ( (cid:107) A (cid:107) op ) ).71 roposition 7. For any A ∈ R m × n , we have L , co := max (cid:115)(cid:88) i (cid:107) A i : (cid:107) , (cid:115)(cid:88) j (cid:107) A : j (cid:107) ≤ √ m + n · (cid:107)| A |(cid:107) op . Proof.
Denote k as the all vector in R k . We have the following sequence of inequalities: (cid:115) (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) = (cid:13)(cid:13)(cid:13) | A | (cid:62) m (cid:13)(cid:13)(cid:13) = max x ∈ B n (cid:62) m | A | x ≤ (cid:107) m (cid:107) max x ∈ B n (cid:107)| A | x (cid:107) ≤ √ m (cid:107)| A |(cid:107) op . Similarly, bounding max y ∈ B n y (cid:62) | A | n implies (cid:113)(cid:80) j ∈ [ n ] (cid:107) A : j (cid:107) ≤ √ n (cid:107)| A |(cid:107) op . Taking a maxi-mum and using max {√ m, √ n } ≤ √ m + n implies the result. Remark 9.
For matrix A ∈ R m × n , combining the guarantees of Theorem 8 with the bound fromProposition 7 implies a runtime bounded by O (cid:32) nnz + (cid:112) nnz · ( m + n ) (cid:107)| A |(cid:107) op (cid:15) (cid:33) . Whenever (cid:107) A (cid:107) op ≥ (cid:107)| A |(cid:107) op , this is an improvement by a factor of (cid:112) nnz / ( m + n ) compared to theaccelerated full-gradient method (c.f. Table 2), which obtains a runtime of O ( nnz · (cid:107) A (cid:107) op /(cid:15) ) . Thisapplies without any sparsity or numerical sparsity assumptions, and is the same speedup factor aswe obtained for (cid:96) - (cid:96) and (cid:96) - (cid:96) games using a variance reduction framework with row and columnbased gradient estimators in Carmon et al. [8]. The (cid:96) - (cid:96) variance reduction algorithms of Carmonet al. [8] and Balamurugan and Bach [7] do not offer such improvements, and our improvementstems from our coordinate-based gradient estimators and our data structure design. D.3 (cid:96) - (cid:96) variance-reduced coordinate method Assumptions.
The algorithm in this section will assume access to entry queries, (cid:96) norms of rows, (cid:96) sampling distributions for rows and columns, and the Frobenius norm of A . We use the (cid:96) - (cid:96) local norm setup (cf. Table 6). Again, we define L , , (1) co := (cid:114) max i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:107) A (cid:107) , (94) L , , (2) co := (cid:114) rcs max i ∈ [ m ] (cid:107) A i : (cid:107) , (95) L , , (3) co := (cid:115) max i ∈ [ m ] (cid:107) A i : (cid:107) + (cid:18) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:19) (cid:18) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:19) . (96)Finally, in this section we assume access to a centered variant of WeightedIterateMaintainer ,which takes a point x as a static parameter, where x is in the space as the iterates x maintained. CenteredIterateMaintainer supports two additional operations compared to the data structure WeightedIterateMaintainer : Sample () returns coordinate j with probability proportional to [ w ] j [ x − x ] j (cf. Section 2.4.1) in O (log n ) time, and we may query (cid:107) x − x (cid:107) w in constant time,where w is a specified weight vector. We give the implementation of this extension in Appendix G.72 .3.1 Gradient estimator Given reference point w ∈ B n × ∆ m , for z ∈ B n × ∆ m and a parameter α > , as in Section C.2,we specify three distinct choices of sampling distributions p ( z ; w ) , q ( z ; w ) .The first one is p ij ( z ; w ) := [ z y ] i + 2[ w y ] i · | A ij |(cid:107) A i : (cid:107) and q ij ( z ; w ) := A ij (cid:107) A (cid:107) . (97)The second one is p ij ( z ; w ) := [ z y ] i + 2[ w y ] i · | A ij |(cid:107) A i : (cid:107) and q ij ( z ) := [ z x − w x ] j · { A ij (cid:54) =0 } (cid:80) l ∈ [ n ] cs l · [ z x − w x ] l . (98)As in Section C.2, cs j is the number of nonzeros of A : j . The third one is p ij ( z ; w ) := [ z y ] i + 2[ w y ] i · | A ij |(cid:107) A i : (cid:107) and q ij ( z ) := | A ij | · [ z x − w x ] j (cid:80) l ∈ [ n ] (cid:107) A : l (cid:107) · [ z x − w x ] l . (99)We now state the local properties of each estimator. Lemma 19.
In the (cid:96) - (cid:96) setup, estimator (25) using the sampling distributions in (97) , (98) , or (99) is respectively a √ L , , ( k ) co -centered-local estimator, for k ∈ { , , } .Proof. First, we give the proof for the sampling distribution (97). Unbiasedness holds by definition.For the x block, we have the variance bound: E (cid:104)(cid:13)(cid:13) ˜ g x w ( z ) − g x ( w ) (cid:13)(cid:13) (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] p ij ( z ; w ) (cid:18) A ij [ z y − w y ] i p ij ( z ; w ) (cid:19) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] A ij [ z y − w y ] i p ij ( z ; w ) ≤ i ∈ [ m ] (cid:107) A i : (cid:107) V w y ( z y ) , where in the last inequality we used Lemma 3.For arbitrary w y , we have the variance bound on the y block: E (cid:104)(cid:13)(cid:13) ˜ g y w ( z ) − g y ( w ) (cid:13)(cid:13) w y (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i A ij [ z x − w x ] j q ij ( z ; w )= (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i [ z x − w x ] j (cid:107) A (cid:107) ≤ (cid:107) A (cid:107) V w x ( z x ) . Combining these and using (cid:107) ˜ g w ( z ) − g ( w ) (cid:107) w := (cid:107) ˜ g w ( z ) x − g ( w ) x (cid:107) + (cid:107) ˜ g w ( z ) y − g ( w ) y (cid:107) w y yields the desired variance bound. For the remaining two distributions, the same argument demon-strates unbiasedness and the variance bound for the x block. For sampling distribution (98) andarbitrary w y , we have the variance bound on the y block: E (cid:104)(cid:13)(cid:13) ˜ g y w ( z ) − g y ( w ) (cid:13)(cid:13) w y (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i A ij [ z x − w x ] j q ij ( z ; w ) ≤ (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i A ij rcs (cid:88) j ∈ [ n ] [ z x − w x ] j ≤ rcs max i ∈ [ m ] (cid:107) A i : (cid:107) V w x ( z x ) . y block: E (cid:104)(cid:13)(cid:13) ˜ g y w ( z ) − g y ( w ) (cid:13)(cid:13) w y (cid:105) = (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i A ij [ z x − w x ] j q ij ( z ; w ) ≤ (cid:88) i ∈ [ m ] ,j ∈ [ n ] [ w y ] i | A ij | (cid:88) l ∈ [ n ] (cid:107) A : l (cid:107) · [ z x − w x ] l ≤ (cid:18) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:19) (cid:18) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:19) V w x ( z x ) . Finally, as in Section C.2, we define the constant L , co := (cid:115) max i ∈ [ m ] (cid:107) A i : (cid:107) + min (cid:18) (cid:107) A (cid:107) F , rcs max i ∈ [ m ] (cid:107) A i : (cid:107) , (cid:18) max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:19) (cid:18) max j ∈ [ n ] (cid:107) A : j (cid:107) (cid:19)(cid:19) , and note that Lemma 19 implies that we can obtain a √ L , co -centered-local estimator by appro-priately choosing a sampling distribution depending on the minimizing parameter. D.3.2 Implementation details
The algorithm we analyze is Algorithm 2 with K = 3 α Θ /(cid:15) , ε outer = 2 (cid:15)/ using Algorithm 3 as an ( α, ε inner = (cid:15)/ -relaxed proximal oracle with ϕ = (cid:15)/ . In the implementation of Algorithm 2,we again apply the truncate ( · , δ ) operation to each iterate z (cid:63)k , where the truncate operation onlyaffects the y block; choosing δ = ε outer − ε inner αm suffices for its guarantees (see Section 4.2.2 for therelevant discussion). In the implementation of Algorithm 3, we use the centered-local gradientestimator defined in (25), using the sampling distribution amongst (97), (98), or (99) which attainsthe variance bound L , co . For each use of Algorithm 3, we choose η = α (cid:16) L , co (cid:17) and T = (cid:24) ηα (cid:25) = 120 (cid:16) L , co (cid:17) α . For simplicity, because most of the algorithm implementation details are exactly the same as thediscussion of Section 4.2.2 for the simplex block y ∈ Y , and exactly the same as the discussion ofSection D.2.2 for the ball block x ∈ X , we discuss the differences here. Outer loop extragradient steps.
We execute α log(2 m ) /(cid:15) iterations of Algorithm 2 to obtainthe desired gap. We spend O ( nnz ) time executing each extragradient step exactly, and then O ( m + n ) time applying the truncate operation and maintaining the average point ¯ z . When we initialize theinner loop, we also create a data structure supporting sampling from w y in constant time. Data structure initializations and invariants.
On the simplex block, we follow the strategyoutlined in Section 4.2.2. We initialize our simplex maintenance data structure
AEM y ( w y , v, κ, ˜ ε ) with parameters κ := 11 + ηα/ , v := (1 − κ ) log w y − ηκg y ( w ) , ˜ ε := ( m + n ) − .
74e will again maintain the invariant that the data structures maintain “exact” and “approximate”points corresponding to the iterates of our algorithm. The correctness of this setting with respect tothe requirements of Proposition 4, i.e. the approximation conditions in Line 2, 4 and 5 in Algorithm 3,follows from the discussion of Section 4.2.2; we note that the condition min j [ w x ] j ≥ ( m + n ) − = λ again holds, and that − κ ≥ ( m + n ) − . Thus, for the parameter ω used in the interface of ApproxExpMaintainer , we have log( ω ) = log (cid:18) max (cid:18) − κ , mλ ˜ ε (cid:19)(cid:19) = O (log( mn )) . On the ball block, we follow the strategy outlined in Section D.2.2, but instead of using an
IterateMaintainer on the x -block, we use CIM x , an instance of CenteredIterateMaintainer data structure initialized with the point w x , supporting the required sampling operation. For thesampling distribution (98), we use the weight vector of column nonzero counts, and for (99) we usethe weight vector of column (cid:96) norms. Overall, the complexity of the initializations on both blocksis bounded by O ( n + m log ( m ) log ( mn )) . Inner loop iterations.
We discuss how to sample from each of the distributions (97), (98),and (99) in O (log( m ) log( mn )) . Combining with the discussions of implementing the inner loop inSections 4.2.2 and D.2.2, the total complexity of the inner loop, other than outputting the averageiterate, is O (cid:0) T log ( m ) log ( mn ) + nnz + m log( m ) log ( mn ) (cid:1) = O (cid:16) L , , (1) co (cid:17) log ( m ) log ( mn ) α + nnz + m log( m ) log ( mn ) . As in the variance-reduced (cid:96) - (cid:96) setting, the dominant term in the runtime is the complexity ofcalling AEM y . AddSparse in each iteration. Recall that the distribution p in every case is given by p ij ( z ; w ) := [ z y ] i + 2[ w y ] i · | A ij |(cid:107) A i : (cid:107) With probability / we sample a coordinate i from the precomputed data structure for samplingfrom w y , and otherwise we sample i via AEM y . Sample () . Then, we sample an entry j proportionalto its magnitude from the (cid:96) sampling oracle for A i : in constant time. The runtime is dominatedby O (log( m ) log( mn )) .To sample from the distribution q in (97), we follow the outline in Section D.3. Similarly, forsampling from distributions (98) and (99), we follow the outline in Section D.3 but replace all callsto an IterateMaintainer instance with a call to
CIM x initialized with an appropriate weight vector.In all cases, the runtime is O (log m ) which does not dominate the iteration complexity.Finally, it is clear from discussions in previous sections that the iterate maintenance invariantsof our data structures are preserved by the updates used in this implementation.75 .3.3 Algorithm guaranteeTheorem 9. In the (cid:96) - (cid:96) setup, let nnz (cid:48) := nnz + m log( m ) log ( mn ) . The implementation inSection D.3.2 with the optimal choice of α = max (cid:16) (cid:15)/ , L , co log( m ) log ( mn ) / √ nnz (cid:48) (cid:17) has runtime O nnz (cid:48) + (cid:16) L , co (cid:17) log ( m ) log ( mn ) α α log( m ) (cid:15) = O (cid:32) nnz (cid:48) + √ nnz (cid:48) L , co log( mn ) log ( m ) (cid:15) (cid:33) and outputs a point ¯ z ∈ Z such that E [Gap( z )] ≤ (cid:15). Proof.
The correctness of the algorithm is given by the discussion in Section D.3.2 and the guaranteesof Proposition 3 with K = 3 α Θ /(cid:15) , ε outer = 2 (cid:15)/ , ε inner = (cid:15)/ , Proposition 4 with ϕ = (cid:15)/ anddata structure ApproxExpMaintainer with our choice of ˜ ε := ( m + n ) − to meet the approximation conditions in Line 2, 4 and 5 in Algorithm 3. The runtime bound isgiven by the discussion in Section D.3.2, and the optimal choice of α is clear. E Additional results on variance-reduced methods
E.1 Row-column sparsity variance-reduced methods
By instantiating relaxed proximal oracles with row-column based gradient estimators developedin [8], implemented with the data structures we develop in Section 5, we obtain the improvedcomplexities as stated in Table 2. Namely, up to logarithmic factors, we generically replace adependence on O ( m + n ) with O ( rcs ) , where rcs is defined as the maximum number of nonzeroentries for any row or column. In this section, we give implementation details.The estimators ˜ g w of [8], parameterized by reference point w , sample a full column or rowof the matrix (rather than a coordinate). To compute ˜ g w ( z ) we sample i ∼ p ( z ) and j ∼ q ( z ) independently according to a specified distribution depending on the setup, and use the estimator ˜ g w ( z ) := (cid:18) A (cid:62) w y + A i : [ z y ] i − [ w y ] i p i ( w ) , − Aw x − A : j [ z x ] j − [ w x ] j q j ( w ) (cid:19) , (100)The key difference between this estimator with that of Section 3.2.2 is that its difference with g ( w ) is O ( rcs ) -sparse rather than O (1) -sparse, requiring MultSparse steps with O ( rcs ) -sparse vec-tors. In all other respects, the implementation details are exactly the same as those in Section 4.2and Appendix D, so we omit them for brevity. We now state our sampling distributions used withthe estimator form (100), and the corresponding centered local variance bounds.In the (cid:96) - (cid:96) setup, we use the sampling distribution (from reference point w ∈ ∆ m × ∆ n ) p i ( z ) := [ z y ] i + 2[ w y ] i and q j ( z ) := [ z x ] j + 2[ w x ] j . (101) Lemma 20.
In the (cid:96) - (cid:96) setup, gradient estimator (100) using the sampling distribution in (101) is a √ (cid:107) A (cid:107) max -centered-local estimator. roof. Unbiasedness holds by definition. For the variance bound, it suffices to show that E (cid:107) ˜ g w ( z ) − g ( w ) (cid:107) ∞ ≤ (cid:107) A (cid:107) V w x ( z x ); clearly this implies the weaker relative variance bound statement (along with an analogous boundon the y block). To this end, we have E (cid:107) ˜ g w ( z ) − g ( w ) (cid:107) ∞ ≤ (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) ∞ [ z y − w y ] i p i ( z ) ≤ (cid:107) A (cid:107) V w x ( z x ) , where the last inequality used Lemma 3.In the (cid:96) - (cid:96) setup, we use the oblivious sampling distribution p i = (cid:107) A i : (cid:107) (cid:107) A (cid:107) and q j = (cid:107) A : j (cid:107) (cid:107) A (cid:107) . (102)We proved that gradient estimator (100) using the sampling distribution in (102) admits a (cid:107) A (cid:107) F -centered estimator in [8], which is an equivalent definition to Definition 4 in the (cid:96) - (cid:96) setup. In the (cid:96) - (cid:96) setup, we use the sampling distribution (from reference point w ∈ B n × ∆ m ) p i ( z ) = [ z y ] i + 2[ w y ] i and q j ( z ) = ([ z x ] j − [ w x ] j ) (cid:107) z x − w x (cid:107) . (103) Lemma 21.
In the (cid:96) - (cid:96) setup, gradient estimator (100) using the sampling distribution in (103) is a √ L -centered-local estimator with L = max i ∈ [ m ] (cid:107) A i : (cid:107) = (cid:107) A (cid:107) →∞ .Proof. Unbiasedness holds by definition. For the variance bound, we first note E (cid:104)(cid:13)(cid:13) ˜ g x w ( z ) − g x ( w ) (cid:13)(cid:13) (cid:105) ≤ (cid:88) i ∈ [ m ] (cid:107) A i : (cid:107) (cid:0) [ z y ] i − [ w y ] i (cid:1) [ z y ] i + [ w y ] i ≤ max i ∈ [ m ] (cid:107) A i : (cid:107) (cid:88) i ∈ [ m ] (cid:0) [ z y ] i − [ w y ] i (cid:1) [ z y ] i + [ w y ] i ≤ i ∈ [ m ] (cid:107) A i : (cid:107) V w y ( z y ) , where for the last inequality we use Lemma 3. On the other block, we have max i ∈ [ m ] E (cid:2) ˜ g y w ( w ) − g y ( w ) (cid:3) i ≤ max i ∈ [ m ] (cid:88) j ∈ [ n ] A ij [ w x − w x ] j q j ( w ) = 2 max i ∈ [ m ] (cid:107) A i : (cid:107) V w x ( w x ) . Summing these two bounds concludes the proof.
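A minimal sketch of the row-column estimator (100) follows, written with dense NumPy arrays for readability; the actual implementation applies the $O(\textup{rcs})$-sparse corrections through the data structures of Section 5, and the names below are illustrative. The distributions $p$ and $q$ may be any of (101)-(103), depending on the setup.

```python
import numpy as np

def row_column_estimator(A, z_x, z_y, w0_x, w0_y, g0_x, g0_y, p, q, rng):
    """Centered gradient estimator (100): the fixed dense part g(w0) plus an
    O(rcs)-sparse correction from one sampled row i ~ p and column j ~ q.

    g0_x = A^T w0_y and g0_y = -A w0_x are precomputed once per reference
    point w0; dense arrays are used here purely for readability."""
    m, n = A.shape
    i = rng.choice(m, p=p)
    j = rng.choice(n, p=q)
    g_x = g0_x + A[i, :] * (z_y[i] - w0_y[i]) / p[i]   # row correction on the x block
    g_y = g0_y - A[:, j] * (z_x[j] - w0_x[j]) / q[j]   # column correction on the y block
    return g_x, g_y
```

Taking expectations over $i$ and $j$ recovers $(A^\top z^y, -A z^x)$, so the estimator is unbiased; its variance is controlled by how close $z$ is to the reference point $w_0$, as in Lemmas 20 and 21.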
E.2 Extensions with composite terms
In this section, we give a brief discussion of how to change Proposition 4 and implementations of theprocedures in Sections 3.1 and 3.2 to handle modified regularization in the context of Proposition 6,and composite regularization terms in the objective in the methods of Section 6. Specifically weconsider a composite optimization problem of the form: min x ∈X max y ∈Y y (cid:62) Ax + µ x φ ( x ) − µ y ψ ( y ) where φ = V x x (cid:48) and ψ = V y y (cid:48) . For simplicity of notation we define Υ( x, y ) := µ x φ ( x ) + µ y ψ ( y ) . We remark that x (cid:48) = 0 recoversthe case of φ = r x when X = B n , and x (cid:48) = n recovers the case of φ = r x when X = ∆ n (similarlysetting y (cid:48) allows us to recover this for the y block).77 .2.1 Changes to inner loop In this section, we first discuss the necessary changes to Algorithm 3 and Proposition 4. Forsimplicity of notation, we denote ρ := (cid:112) µ x /µ y , ˆ V x := ρV x , ˆ V y := ρ V y , ˆ V := ˆ V x + ˆ V y . Algorithm 4:
InnerLoop ( w , ˜ g w , ϕ ) Input:
Initial w ∈ Z , ( L, α ) -centered-local gradient estimator ˜ g w , oracle quality α > Parameters:
Step size η , number of iterations T , approximation tolerance ϕ Output:
Point ˜ w satisfying Definition 5 for t = 1 , . . . , T do ˆ w t − ≈ w t − satisfying ˆ V w ( ˆ w t − ) − ˆ V w ( w t − ) ≤ ϕα and (cid:107) ˆ w t − − w t − (cid:107) ≤ ϕLD w (cid:63)t ← arg min (cid:110) (cid:104) clip( η ˜ g w ( ˆ w t − ) − ηg ( w )) , w (cid:105) + η Υ( w ) + ηα ˆ V w ( w ) + ˆ V w t − ( w ) (cid:111) w t ≈ w (cid:63)t satisfying max u (cid:104) ˆ V w t ( u ) − ˆ V w (cid:63)t ( u ) (cid:105) ≤ ϕ √ µ x µ y , ˆ V w ( w t ) − ˆ V w ( w (cid:63)t ) ≤ ϕα , and ˆ V z (cid:48) ( w t ) − ˆ V z (cid:48) ( w (cid:63)t ) ≤ ϕ √ µ x µ y return ˜ w ≈ T (cid:80) Tt =1 w t satisfying (cid:13)(cid:13)(cid:13) ˜ w − T (cid:80) Tt =1 w t (cid:13)(cid:13)(cid:13) ≤ ϕLD , max u (cid:104) ˆ V ˜ w ( u ) − ˆ V ¯ w ( u ) (cid:105) ≤ ϕ √ µ x µ y , ˆ V w (cid:48) ( ˜ w ) − ˆ V w (cid:48) ( ¯ w ) ≤ ϕ √ µ x µ y , and (cid:107) w t − w (cid:63)t (cid:107) ≤ ϕ LD Corollary 1.
Let ( Z , (cid:107)·(cid:107) · , r , Θ , clip ) be any local norm setup. Let w ∈ Z , ε inner > , and ˜ g w be an L -centered-local estimator for some L ≥ α ≥ ε inner . Assume the problem has bounded domainsize max z ∈Z (cid:107) z (cid:107) ≤ D , g is L -Lipschitz, i.e. (cid:107) g ( z ) − g ( z (cid:48) ) (cid:107) ∗ ≤ L (cid:107) z − z (cid:48) (cid:107) , that g is LD -bounded, i.e. max z ∈Z (cid:107) g ( z ) (cid:107) ∗ ≤ LD , and ˆ w = w . Then, for η = α L , T ≥ ηα ≥ L α , ϕ = ε inner , Algorithm 4outputs a point ˆ w ∈ Z such that E max u ∈Z [ (cid:104) g ( ˜ w ) + ∇ Υ( ˜ w ) , ˜ w − u (cid:105) − αV w ( u )] ≤ ε inner , (104) i.e. Algorithm 4 is an ( α, ε inner ) -relaxed proximal oracle.Proof sketch. Note that the only change is in the definition of the regularized mirror descent stepwith extra composite terms w (cid:63)t ← arg min (cid:110) (cid:104) clip( η ˜ g w ( ˆ w t − ) − ηg ( w )) , w (cid:105) + η Υ( w ) + ηα V w ( w ) + ˆ V w t − ( w ) (cid:111) . Denote ∇ Υ( w ) = ( µ x ∇ φ ( w x ) , µ y ∇ ψ ( w y )) , so that for the final regret bound there are two additionalerror terms. The first term comes from the error in regularized mirror descent steps via (denoting z (cid:48) = ( x (cid:48) , y (cid:48) ) ) T (cid:88) t ∈ [ T ] [ −(cid:104)∇ Υ( w (cid:63)t ) , w (cid:63)t − u (cid:105) + (cid:104)∇ Υ( w t ) , w t − u (cid:105) ] ≤ √ µ x µ y T (cid:88) t ∈ [ T ] (cid:16) ˆ V z (cid:48) ( w t ) − ˆ V z (cid:48) ( w (cid:63)t ) + ˆ V w t ( u ) − ˆ V w (cid:63)t ( u ) (cid:17) ≤ ϕ following the approximation guarantee in Line 4. The other term comes from averaging error.Denote the true average iterate by ¯ w := T (cid:80) t ∈ [ T ] w t . We have ∀ u ∈ Z , (cid:104) g ( ˜ w ) , ˜ w − u (cid:105) − T (cid:88) t ∈ [ T ] (cid:104) g ( w t ) , w t − u (cid:105) = − (cid:104) g ( ˜ w ) , u (cid:105) − (cid:104) g ( ¯ w ) , ¯ w − u (cid:105) = (cid:104) g ( ¯ w ) − g ( ˜ w ) , u (cid:105) ≤ ϕ, (cid:104)∇ Υ( ˜ w ) , ˜ w − u (cid:105) = (cid:104)∇ Υ( ˜ w ) − ∇ Υ( ¯ w ) , ˜ w − u (cid:105) + (cid:104)∇ Υ( ¯ w ) , ˜ w − ¯ w (cid:105) + (cid:104)∇ Υ( ¯ w ) , ¯ w − u (cid:105) ( i ) = √ µ x µ y (cid:16) − ˆ V ¯ w ( u ) + ˆ V ˜ w ( u ) + ˆ V ¯ w ( ˜ w ) (cid:17) + (cid:104)∇ Υ( ¯ w ) , ˜ w − ¯ w (cid:105) + (cid:104)∇ Υ( ¯ w ) , ¯ w − u (cid:105) ( ii ) = √ µ x µ y (cid:16) − ˆ V ¯ w ( u ) + ˆ V ˜ w ( u ) + ˆ V w (cid:48) ( ˜ w ) − ˆ V w (cid:48) ( ¯ w ) (cid:17) + (cid:104)∇ Υ( ¯ w ) , ¯ w − u (cid:105) , ( iii ) ≤ ϕ + (cid:104)∇ Υ( ¯ w ) , ¯ w − u (cid:105) ( iv ) ≤ ϕ + 1 T (cid:88) t ∈ [ T ] (cid:104)∇ Υ( w t ) , w t − u (cid:105) . where we use ( i ) the three-point property of Bregman divergence, ( ii ) the fact that ˆ V ¯ w ( ˜ w ) + (cid:104)∇ Υ( ¯ w ) , ˜ w − ¯ w (cid:105) = ˆ V w (cid:48) ( ˜ w ) − ˆ V w (cid:48) ( ¯ w ) again by the three-point property, ( iii ) the approximationguarantee of Line 5, and ( iv ) the fact that (cid:104)∇ Υ( w ) , w − u (cid:105) is convex in w for our choices of Υ .Hence incorporating the above extra error terms into the regret bound yields the conclusion, as ϕ = ε inner by our choice of ϕ . E.2.2 Changes to implementation
Broadly speaking, all of these modifications can easily be handled via appropriate changes to theinitial data given to our data structures
CenteredIterateMaintainer and ApproxExpMaintainer .We discuss general formulations of iterations with these modifications in both simplices and Eu-clidean balls, and provide appropriate modifications to the inital data given to our data structures.Finally, it is simple to check that all relevant parameters are still bounded by a polynomial in the di-mensions of variables, so no additional cost due to the data structure is incurred. For simplicity herewe only considerfor the x -block when φ x ( x ) = µr ( x ) and remark that the case when φ x ( x ) = µV x (cid:48) ( x ) for some x (cid:48) follows similarly. (cid:96) domains. For this section, define a domain X = ∆ n , let r ( x ) = (cid:80) j ∈ [ n ] x j log x j be entropy,and let µ , α , η , ρ be nonnegative scalar parameters. Consider a sequence of iterates of the form x t +1 ← arg min x ∈X (cid:104) ˜ g x ( x t ) , x (cid:105) + µr ( x ) + αρ V x ( x ) + ρη V x t ( x ) . This update sequence, for the form of gradient estimator ˜ g x ( x ) = g ( x ) + b + g (cid:48) ( x ) , where g (cid:48) ( x ) is a vector with suitable sparsity assumptions depending on the point x , and b is somefixed vector, generalizes all of the settings described above used in our various relaxed proximaloracle implementations. Optimality conditions imply that the update may be rewritten as x t +1 ← Π ∆ (cid:32) exp (cid:32) ρη log x t + αρ log x − g ( x ) − b − g (cid:48) ( x t ) µ + αρ + ρη (cid:33)(cid:33) . Thus, initializing an
ApproxExpMaintainer instance with κ = 1 µηρ + αη + 1 , v = αρ log x − g ( x ) − bµ + αρ + ρη enables DenseStep to propagate the necessary changes to the iterate; we propagate changes due to g (cid:48) ( x t ) via AddSparse and the appropriate sparsity assumptions.79 domains. For this section, define a domain X = B n , let r ( x ) = (cid:107) x (cid:107) be entropy, and let µ , α , η , ρ be nonnegative scalar parameters. Consider a sequence of iterates of the form x t +1 ← arg min x ∈X (cid:104) ˜ g x ( x t ) , x (cid:105) + µr ( x ) + αρ V x ( x ) + ρη V x t ( x ) . This update sequence, for the form of gradient estimator ˜ g x ( x ) = g ( x ) + b + g (cid:48) ( x ) , where g (cid:48) ( x ) is a vector with suitable sparsity assumptions depending on the point x , and b is somefixed vector, generalizes all of the settings described above used in our various relaxed proximaloracle implementations. Optimality conditions imply that the update may be rewritten as x t +1 ← Π B n (cid:32) ρη x t + αρ x − g ( x ) − b − g (cid:48) ( x t ) µ + αρ + ρη (cid:33) . Thus, initializing an
CenteredIterateMaintainer instance with v = αρ x − g ( x ) − bµ + αρ + ρη enables AddDense , Scale , and
GetNorm to propagate the necessary changes to the iterate; wepropagate changes due to g (cid:48) ( x t ) via AddSparse and the appropriate sparsity assumptions.
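To illustrate the kind of bookkeeping these initializations enable, here is a simplified sketch of a lazily maintained iterate $x = \textup{scale}\cdot\textup{base} + \textup{coeff}\cdot v$ for a fixed dense direction $v$, supporting the Scale/AddDense/AddSparse/GetNorm pattern used above with $O(1)$ arithmetic per call. Sampling support and the numerical-stability safeguards that the paper's IterateMaintainer variants provide are omitted, and the class and method names are ours.

```python
import numpy as np

class LazyIterate:
    """Maintains x = scale * base + coeff * v for a fixed dense vector v.

    scale_by, add_dense and add_sparse touch O(1) coordinates, get is O(1),
    and the Euclidean norm is available in O(1) via cached inner products.
    Assumes scale never becomes exactly zero."""

    def __init__(self, x0, v):
        self.base = np.array(x0, dtype=float)   # mutable part
        self.v = np.array(v, dtype=float)       # fixed dense direction
        self.scale = 1.0
        self.coeff = 0.0
        # Cached inner products so norm() is O(1).
        self.bb = float(self.base @ self.base)
        self.bv = float(self.base @ self.v)
        self.vv = float(self.v @ self.v)

    def get(self, j):
        return self.scale * self.base[j] + self.coeff * self.v[j]

    def scale_by(self, gamma):                   # x <- gamma * x
        self.scale *= gamma
        self.coeff *= gamma

    def add_dense(self, gamma=1.0):              # x <- x + gamma * v
        self.coeff += gamma

    def add_sparse(self, j, delta):              # x <- x + delta * e_j
        d = delta / self.scale                   # fold into base[j]
        self.bb += 2 * d * self.base[j] + d * d
        self.bv += d * self.v[j]
        self.base[j] += d

    def norm(self):
        s, c = self.scale, self.coeff
        val = s * s * self.bb + 2 * s * c * self.bv + c * c * self.vv
        return float(np.sqrt(max(val, 0.0)))
```

With the iterate stored this way, the composite $\ell_2$-domain update above amounts to one scale_by, one add_dense, and one add_sparse call, followed by a final scale_by for the projection onto the ball.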
F Deferred proofs from Section 6
F.1 Proofs from Section 6.1
Proof of Lemma 10.
We consider the following ( µ, µ ) -strongly monotone problem, for various levelsof µ : max x ∈ B n min y ∈ ∆ m f µ ( x, y ) := y (cid:62) ˜ Ax + y (cid:62) b + µ (cid:88) i ∈ [ m ] [ y ] i log[ y ] i − µ (cid:107) x (cid:107) . We claim we can implement an ( α, ε ) -relaxed proximal oracle for this problem in time (cid:101) O (cid:16) L , co (cid:17) α . The oracle is a composite implementation of Algorithm 3 as in Algorithm 4, using the estimator ofAppendix D.3. By an application of Proposition 6, the overall complexity of solving this problemis (by choosing the optimal α , and overloading the constant L , co to be with respect to ˜ A ): (cid:101) O nnz + (cid:16) L , co (cid:17) α αµ = (cid:101) O (cid:32) nnz + √ nnz · L , co µ (cid:33) . By conducting a line search over the parameter µ via repeatedly halving, the total cost of solvingeach of these problems is dominated by the last setting, wherein µ = Θ( r ∗ / log m ) , and R/µ = (cid:101) O ( ρ ) ;here, we recall that we rescaled ˜ A so that L , co = O ( R ) . We defer details of the line search procedureto Lemma C.3 of Allen-Zhu et al. [1]. 80 roof of Theorem 3. We solve the problem (58) to duality gap (cid:15) ˆ r/ ≤ (cid:15)r ∗ , using the algorithm ofAppendix D.3 for (cid:96) - (cid:96) games. The complexity of this algorithm is (choosing α optimally) (cid:101) O nnz ( ˜ A ) + (cid:16) L , co (cid:17) α · α(cid:15) ˆ r = (cid:101) O (cid:32) nnz + ρ √ nnz · L , co (cid:15) (cid:33) , as claimed. Here, we used that ˜ A is a rescaling of A by R , and ˆ r is a constant multiplicativeapproximation of r . The approximate solution ( x ∗ (cid:15) (cid:48) , y ∗ (cid:15) (cid:48) ) obtains the requisite duality gap in expec-tation; Markov’s inequality implies that with logarithmic overhead in the runtime, we can obtain apair of points satisfying with high probability max x f ( x, y ∗ (cid:15) (cid:48) ) − min y f ( x ∗ (cid:15) (cid:48) , y ) = max x f ( x, y ∗ (cid:15) (cid:48) ) − f ( x ∗ , y ∗ ) + f ( x ∗ , y ∗ ) − min y f ( x ∗ (cid:15) (cid:48) , y ) ≤ (cid:15) (cid:48) . Because y ∗ is the best response to x ∗ , we have f ( x ∗ , y ∗ (cid:15) (cid:48) ) ≥ f ( x ∗ , y ∗ ) , which implies max x f ( x, y ∗ (cid:15) (cid:48) ) − f ( x ∗ , y ∗ ) = max x f ( x, y ∗ (cid:15) (cid:48) ) − f ( x ∗ , y ∗ (cid:15) (cid:48) ) + f ( x ∗ , y ∗ (cid:15) (cid:48) ) − f ( x ∗ , y ∗ ) ≥ . Combining yields f ( x ∗ , y ∗ ) − min y f ( x ∗ (cid:15) (cid:48) , y ) ≤ (cid:15) (cid:48) ≤ (cid:15)r ∗ , so since f ( x ∗ , y ∗ ) = r ∗ , rearranging implies min y f ( x ∗ (cid:15) (cid:48) , y ) ≥ r ∗ − (cid:15) (cid:48) ≥ (1 − (cid:15) ) r ∗ . Thus, x ∗ (cid:15) (cid:48) is an (cid:15) -approximate solution for Max-IB. F.2 Proofs from Section 6.2
Proof of Lemma 11. If $(x', y')$ is an approximately optimal solution with duality gap $\epsilon/16$ for (61), by definition
\[
\max_{y \in \Delta^m} f_{\epsilon'}(x', y) - \min_{x \in \mathbb{R}^n} f_{\epsilon'}(x, y') \le \frac{\epsilon}{16}.
\]
Therefore, the following sequence of inequalities holds:
\[
\begin{aligned}
\max_{y \in \Delta^m} f(x', y) - \min_{x \in \mathbb{R}^n} f(x, y')
&= \left(\max_{y \in \Delta^m} f(x', y) - \max_{y \in \Delta^m} f_{\epsilon'}(x', y)\right) + \left(\max_{y \in \Delta^m} f_{\epsilon'}(x', y) - \min_{x \in \mathbb{R}^n} f_{\epsilon'}(x, y')\right) \\
&\qquad + \left(\min_{x \in \mathbb{R}^n} f_{\epsilon'}(x, y') - \min_{x \in \mathbb{R}^n} f(x, y')\right) \\
&\overset{(i)}{\le} \left(\max_{y \in \Delta^m} f(x', y) - \max_{y \in \Delta^m} f_{\epsilon'}(x', y)\right) + \frac{\epsilon}{16} + \left(\min_{x \in \mathbb{R}^n} f_{\epsilon'}(x, y') - \min_{x \in \mathbb{R}^n} f(x, y')\right) \\
&\overset{(ii)}{\le} \frac{\epsilon}{32} + \frac{\epsilon}{16} + \frac{\epsilon}{32} = \frac{\epsilon}{8}.
\end{aligned}
\]
In $(i)$, we used the fact that the pair $(x', y')$ has good duality gap with respect to $f_{\epsilon'}$, and in $(ii)$ we used that, for the first summand, $f_{\epsilon'}(x', \cdot)$ approximates $f(x', \cdot)$ to an additive $\epsilon/32$, and, for the third summand, $-\epsilon' \sum_{i \in [m]} [y']_i \log [y']_i$ is bounded by $\epsilon/32$, and all other terms cancel.

F.3 Proofs from Section 6.3
Proof of Lemma 12.
At optimality for (62), it holds that
\[
\begin{cases}
y^*_{x'} = \frac{1}{\beta}\left(A x^*_{x'} - b\right), \\
x^*_{x'} = x' - \frac{1}{\beta} A^\top y^*_{x'}.
\end{cases}
\]
By substituting $y^*_{x'}$ and rearranging terms we get
\[
\left(I + \frac{1}{\beta^2} A^\top A\right)\left(x^*_{x'} - x^*\right) = x' - x^*,
\]
which in turn gives
\[
\|x^*_{x'} - x^*\|_2 = \left\|\left(I + \frac{1}{\beta^2} A^\top A\right)^{-1}(x' - x^*)\right\|_2 \le \frac{1}{1 + \mu/\beta^2}\|x' - x^*\|_2.
\]
For the last inequality we use the fact that
\[
\left\|\left(I + \frac{1}{\beta^2} A^\top A\right)^{-1}\right\|_2 = \lambda_{\min}\left(I + \frac{1}{\beta^2} A^\top A\right)^{-1} = \frac{1}{1 + \mu/\beta^2},
\]
by the definition of $\mu$ and since $I$ and $A^\top A$ commute.
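As a standalone numeric sanity check of this contraction (not part of the paper's argument; all variable names here are ours, and we take $\mu = \lambda_{\min}(A^\top A)$), the following Python snippet verifies the bound on random data, including the factor of $1/2$ obtained for the choice $\beta = \sqrt{\mu}$ used in the proof of Theorem 5 below.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
A = rng.standard_normal((m, n))
diff = rng.standard_normal(n)            # stands in for x' - x*

mu = np.linalg.eigvalsh(A.T @ A).min()   # mu = lambda_min(A^T A), positive w.h.p. here
beta = np.sqrt(mu)                       # the choice used in Theorem 5

M = np.eye(n) + (A.T @ A) / beta**2
contracted = np.linalg.norm(np.linalg.solve(M, diff))
bound = np.linalg.norm(diff) / (1.0 + mu / beta**2)   # equals ||diff||_2 / 2 here
assert contracted <= bound + 1e-9
```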
Theorem 5. Given data matrix $A \in \mathbb{R}^{m \times n}$, vector $b \in \mathbb{R}^m$, and desired accuracy $\epsilon \in (0, 1]$, assuming $A^\top A \succeq \mu I$ for $\mu > 0$, Algorithm 5 outputs an expected $\epsilon$-accurate solution $\tilde x$, i.e.
\[
\mathbb{E}\left[\|\tilde x - x^*\|_2\right] \le \epsilon,
\]
and runs in time
\[
\tilde O\left(\mathrm{nnz} + \frac{\sqrt{\mathrm{nnz}} \cdot \max\left\{\sqrt{\sum_i \|A_{i:}\|^2}, \sqrt{\sum_j \|A_{:j}\|^2}\right\}}{\sqrt{\mu}}\right).
\]

Proof.
We first prove correctness. We bound the progress from $x^{(h)}$ to $x^{(h+1)}$, for some $h \in [H]$, by
\[
\|x^{(h+1)} - x^*\|_2^2 \le 2\|x^{(h+1)} - x^*_{x^{(h)}}\|_2^2 + 2\|x^*_{x^{(h)}} - x^*\|_2^2 \le 4 V_{z^{(h+1)}}(z^*_{x^{(h)}}) + 2\|x^*_{x^{(h)}} - x^*\|_2^2. \tag{105}
\]
The first inequality used $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$, and the second used the definition of the divergence in the $\ell_2$-$\ell_2$ setup. Next, choosing a sufficiently large value of $K = \tilde O(\alpha/\beta)$, we use Proposition 6 to obtain a point $z^{(h+1)}$ satisfying
\[
V_{z^{(h+1)}}(z^*_{x^{(h)}}) \le \frac{\epsilon^2}{160} V_{z^{(h)}}(z^*_{x^{(h)}}) \le \frac{\epsilon^2}{80} V_{z^{(h)}}(z^*) + \frac{\epsilon^2}{80} V_{z^*}(z^*_{x^{(h)}}). \tag{106}
\]
Further, using Lemma 12 with $x' = x^{(h)}$ and $\beta = \sqrt{\mu}$ yields
\[
\|x^*_{x^{(h)}} - x^*\|_2 \le \frac{1}{2}\|x^{(h)} - x^*\|_2. \tag{107}
\]
Plugging these two bounds into (105), and using the form of the divergence in the $\ell_2$-$\ell_2$ setup,
\[
\begin{aligned}
\|x^{(h+1)} - x^*\|_2^2
&\overset{(106)}{\le} \frac{\epsilon^2}{20} V_{z^{(h)}}(z^*) + \frac{\epsilon^2}{20} V_{z^*}(z^*_{x^{(h)}}) + 2\|x^*_{x^{(h)}} - x^*\|_2^2 \\
&\overset{(107)}{\le} \left(\frac{\epsilon^2}{20} + \frac{\epsilon^2}{20} + \frac{1}{2}\right)\|x^{(h)} - x^*\|_2^2 + \frac{\epsilon^2}{40}\left(\|y^{(h)} - y^*\|_2^2 + \|y^*_{x^{(h)}} - y^*\|_2^2\right) \\
&\le \frac{3}{5} \cdot \|x^{(h)} - x^*\|_2^2 + \frac{\epsilon^2}{5}. \tag{108}
\end{aligned}
\]

Algorithm 5: Coordinate variance-reduced method for linear regression
Input: Matrix $A \in \mathbb{R}^{m \times n}$ with $i$th row $A_{i:}$ and $j$th column $A_{:j}$, vector $b \in \mathbb{R}^m$, accuracy $\epsilon$
Output: A point $\tilde x$ with $\mathbb{E}\|\tilde x - x^*\|_2 \le \epsilon$
$L \leftarrow \max\left\{\sqrt{\sum_i \|A_{i:}\|^2}, \sqrt{\sum_j \|A_{:j}\|^2}\right\}$, $\alpha \leftarrow L/\sqrt{\mathrm{nnz}}$, $\beta \leftarrow \sqrt{\mu}$, $\eta \leftarrow \alpha/L^2$
$T \leftarrow \lceil 1/(\eta\alpha) \rceil$, $K \leftarrow \tilde O(\alpha/\beta)$, $H \leftarrow \tilde O(1)$, $z^{(0)} = (x^{(0)}, y^{(0)}) \leftarrow (0_n, 0_m)$, $(z^x_0, z^y_0) \leftarrow (0_n, 0_m)$
for $h = 1, 2, \ldots, H$ do
  for $k = 1, \ldots, K$ do ▷ Relaxed oracle query:
    $(x_0, y_0) \leftarrow (z^x_{k-1}, z^y_{k-1})$, $(g_0^x, g_0^y) \leftarrow \left(A^\top y_0 + \beta(x_0 - x^{(h-1)}),\ -Ax_0 + b + \beta y_0\right)$
    for $t = 1, \ldots, T$ do ▷ Gradient estimation:
      Sample $i \sim p$ where $p_i = \left([y_{t-1}]_i - [y_0]_i\right)^2 / \|y_{t-1} - y_0\|_2^2$
      Sample $j \sim q$ where $q_j = \left([x_{t-1}]_j - [x_0]_j\right)^2 / \|x_{t-1} - x_0\|_2^2$
      Set $\tilde g_{t-1} = g_0 + \left(A_{i:}\frac{[y_{t-1}]_i - [y_0]_i}{p_i},\ -A_{:j}\frac{[x_{t-1}]_j - [x_0]_j}{q_j}\right)$
      ▷ Mirror descent step:
      $x_t \leftarrow \frac{1}{1 + \eta\alpha/2}\left(x_{t-1} + \frac{\eta\alpha}{2}x_0 - \eta \tilde g^x_{t-1}\right)$
      $y_t \leftarrow \Pi_{\mathcal{Y}}\left(\frac{1}{1 + \eta\alpha/2}\left(y_{t-1} + \frac{\eta\alpha}{2}y_0 - \eta \tilde g^y_{t-1}\right)\right)$ ▷ $\Pi_{\mathcal{Y}}(v) = v / \max\{1, \|v\|_2\}$
    $z_{k-1/2} \leftarrow \frac{1}{T}\sum_{t=1}^T (x_t, y_t)$
    ▷ Extragradient step:
    $z^x_k \leftarrow \frac{\alpha}{\alpha + 2\beta} z^x_{k-1} + \frac{2\beta}{\alpha + 2\beta} z^x_{k-1/2} - \frac{1}{\alpha + 2\beta}\left(A^\top z^y_{k-1/2} + \beta\left(z^x_{k-1/2} - x^{(h-1)}\right)\right)$
    $z^y_k \leftarrow \Pi_{\mathcal{Y}}\left(\frac{\alpha}{\alpha + 2\beta} z^y_{k-1} + \frac{2\beta}{\alpha + 2\beta} z^y_{k-1/2} + \frac{1}{\alpha + 2\beta}\left(A z^x_{k-1/2} - b - \beta z^y_{k-1/2}\right)\right)$
  ▷ Reshifting the oracle:
  $z^{(h)} = (x^{(h)}, y^{(h)}) \leftarrow z_K = (z^x_K, z^y_K)$
return $\tilde x \leftarrow x^{(H)}$
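As a concrete (but non-optimized) reference for the inner loop of Algorithm 5 as reconstructed above, the following NumPy sketch implements one relaxed-oracle query: $T$ stochastic mirror-descent steps with the squared-displacement sampling distributions, followed by iterate averaging. It uses dense vectors and explicit row/column access, so it does not achieve the amortized per-iteration cost of the data-structure-based implementation; the uniform-sampling fallback for the first step (when the iterate has not yet moved) and all function names are our own additions.

```python
import numpy as np

def project_ball(v):
    # Euclidean projection onto the unit ball, Pi_Y(v) = v / max{1, ||v||_2}
    return v / max(1.0, np.linalg.norm(v))

def inner_loop(A, x0, y0, g0x, g0y, eta, alpha, T, rng):
    """T stochastic mirror-descent steps of the relaxed oracle query (sketch)."""
    x, y = x0.copy(), y0.copy()
    xs, ys = np.zeros_like(x0), np.zeros_like(y0)
    for _ in range(T):
        dy, dx = y - y0, x - x0
        # sample proportionally to squared displacement; uniform before any movement
        p = dy**2 / (dy @ dy) if dy @ dy > 0 else np.full(len(y0), 1.0 / len(y0))
        q = dx**2 / (dx @ dx) if dx @ dx > 0 else np.full(len(x0), 1.0 / len(x0))
        i, j = rng.choice(len(y0), p=p), rng.choice(len(x0), p=q)
        gx = g0x + A[i, :] * dy[i] / p[i]    # unbiased estimate of A^T (y - y0)
        gy = g0y - A[:, j] * dx[j] / q[j]    # unbiased estimate of -A (x - x0)
        x = (x + 0.5 * eta * alpha * x0 - eta * gx) / (1.0 + 0.5 * eta * alpha)
        y = project_ball((y + 0.5 * eta * alpha * y0 - eta * gy) / (1.0 + 0.5 * eta * alpha))
        xs += x
        ys += y
    return xs / T, ys / T                    # averaged iterate z_{k-1/2}
```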
In the last inequality we use the conditions that $\epsilon \in (0, 1]$ and $\mathcal{Y} = \mathbb{B}^m$. Recursively applying this bound for $h \in [H]$, and for a sufficiently large value of $H = \tilde O(1)$, we have the desired
\[
\|x^{(H)} - x^*\|_2^2 \le \left(\frac{3}{5}\right)^H \|x^{(0)} - x^*\|_2^2 + \frac{3}{4}\epsilon^2 \le \epsilon^2,
\]
which yields $\mathbb{E}[\|\tilde x - x^*\|_2] \le \epsilon$ by Jensen's inequality, as the above bounds hold in expectation. To bound the runtime, recall the inner loop runs for $T = O\big((L^{\mathrm{co}}_{2,2})^2/\alpha^2\big)$ iterations, each costing constant time, and the outer loop runs for $K = \tilde O(\alpha/\beta)$ iterations, each costing $O(T + \mathrm{nnz})$. Finally, since $H = \tilde O(1)$, the overall complexity of the algorithm is
\[
\tilde O\left(\left(\mathrm{nnz} + \frac{(L^{\mathrm{co}}_{2,2})^2}{\alpha^2}\right)\frac{\alpha}{\beta}\right).
\]
Choosing $\alpha = \max\{L^{\mathrm{co}}_{2,2}/\sqrt{\mathrm{nnz}}, \beta\}$ optimally and substituting $\beta = \sqrt{\mu}$ and
\[
L^{\mathrm{co}}_{2,2} = \max\left\{\sqrt{\sum_i \|A_{i:}\|^2}, \sqrt{\sum_j \|A_{:j}\|^2}\right\},
\]
we have the desired runtime bound on Algorithm 5.

G IterateMaintainer: numerical stability and variations
G.1 Numerical stability of IterateMaintainer

We discuss the implementation of a numerically stable version of
IterateMaintainer, and the complexity of its operations, for use in our sublinear algorithms in Section 4.1 and Section C.2. We discuss this implementation for a simplex block, e.g. a simplex variable of dimension $n$, as for an $\ell_2$ geometry numerical stability is clear. The main modifications we make are as follows.
• We reinitialize the data structure whenever the field $\nu$ grows larger than some fixed polynomial in $n$, or if $n/2$ iterations have passed.
• We track the coordinates modified between restarts.
• Every time we reinitialize, we maintain the invariant that the multiplicative range of coordinates of $x$ is bounded by a polynomial in $n$, i.e. $\max_j x_j / \min_j x_j$ is bounded by some fixed polynomial in $n$. We will implement this via an explicit truncation, and argue that such an operation gives negligible additive error compared to the accuracy of the algorithm.
• We implicitly track the set of truncated coordinates at each data structure restart. We do so by explicitly tracking the set of non-truncated coordinates whenever a truncation operation happens (see the discussion below), in constant amortized time.
We now discuss the complexity and implementation of these restarts. First, note that $\nu$ can never decrease by more than a multiplicative polynomial in $n$ between restarts, because of nonnegativity of the exponential, the fact that the original range at the time of the last restart is multiplicatively bounded, and the fact that we restart every time half the coordinates have been touched. Thus, the only source of numerical instability comes from when $\nu$ grows by more than a multiplicative polynomial in $n$. Suppose this happens $\tau$ iterations after the restart. Then,
• If $\tau < n/2$, we claim we can implement the restart in $O(\tau)$, so the amortized cost per iteration is $O(1)$. To see this, for every coordinate touched in these $\tau$ iterations, we either keep it or explicitly truncate it if the coordinate is too small. For every coordinate not touched in these $\tau$ iterations, the relative contribution is at most inverse polynomial in $n$; we truncate all such coordinates. Then, we compute the normalization constant according to all non-truncated coordinates, such that the value of all truncated coordinates is set to a fixed inverse polynomial in $n$. We can implement this by implicitly keeping track of the set of truncated coordinates as well as their contribution to the normalization factor, and explicitly setting their value in the data structure when they are updated by AddSparse. Overall, this does not affect the value of the problem by more than a small multiple of $\epsilon$, by our assumptions on $L_{rc}/\epsilon$. To see that we can track the non-truncated coordinates explicitly, we note that it is a subset of the at most $\tau$ coordinates that were touched, so this can be done in constant amortized time.
• If $\tau = n/2$, we claim we can implement the restart in $O(n)$, so the amortized cost per iteration is $O(1)$. This is clear: we can do so by explicitly recomputing all coordinates, and truncating any coordinates which have become too small.
We describe how the data structure implements this through its maintained fields: for non-truncated coordinates, we do not do anything other than change the scaling factor $\nu$, and for truncated coordinates, we reset the values of $u, u'$ in that coordinate appropriately once they have been sparsely updated. Overall, this does not affect the amortized runtime of our algorithm.
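As an illustration of the $O(\tau)$ restart described above, here is a simplified Python sketch under the assumption that the simplex iterate is stored as explicit per-coordinate values together with a single global scaling factor; the threshold, function name, and return convention are ours, not the paper's notation.

```python
def truncating_restart(values, scale, touched, n, floor_value):
    """Sketch of a restart: values[j] * scale is coordinate j of the simplex iterate,
    `touched` are the coordinates modified since the last restart (|touched| = tau),
    and floor_value is a fixed inverse polynomial in n. Touched coordinates that are
    too small, and all untouched coordinates, are truncated to floor_value."""
    kept = {}
    for j in touched:                        # O(tau) work over the touched coordinates
        if values[j] * scale >= floor_value:
            kept[j] = values[j] * scale
    num_truncated = n - len(kept)            # truncated set is tracked only implicitly
    normalization = sum(kept.values()) + num_truncated * floor_value
    kept = {j: val / normalization for j, val in kept.items()}
    # truncated coordinates implicitly hold floor_value / normalization until they are
    # next updated by AddSparse, at which point their value is set explicitly
    return kept, floor_value / normalization, num_truncated
```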
G.2 WeightedIterateMaintainer

In this section, we give implementation details for a weighted generalization of
IterateMaintainer, which we will call WeightedIterateMaintainer. It is used in Section C.2, when using the sampling distribution (87). At initialization, WeightedIterateMaintainer is passed an additional parameter $w \in \mathbb{R}^n_{\ge 0}$, a nonnegative weight vector. We let
\[
\langle u, v \rangle_w := \sum_{j \in [n]} [w]_j [u]_j [v]_j, \qquad \|v\|_w := \sqrt{\langle v, v \rangle_w}.
\]
WeightedIterateMaintainer supports all the same operations as IterateMaintainer, with two differences:
• For the current iterate $x$, WeightedIterateMaintainer.Norm() returns the weighted norm $\|x\|_w$.
• For the current iterate $x$, WeightedIterateMaintainer.Sample() returns a coordinate $j$ with probability proportional to $[w]_j [x]_j^2$.
Similarly to IterateMaintainer, WeightedIterateMaintainer maintains the following fields.
• Scalars $\xi_u$, $\xi_v$, $\sigma_u$, $\sigma_v$, $\iota$, $\nu$
• Vectors $u$, $u'$, $v$, $w$
• Precomputed value $\|v\|_w^2$
We maintain the following invariants on the data structure fields at the end of every operation:
• $x = \xi_u u + \xi_v v$, the internal representation of $x$
• $s = u' + \sigma_u u + \sigma_v v$, the internal representation of the running sum $s$
• $\iota = \langle x, v \rangle_w$, the weighted inner product of the iterate with the fixed vector $v$
• $\nu = \|x\|_w$, the weighted norm of the iterate
To support sampling, our data structure also maintains a binary tree $\mathsf{dist}_x$ of depth $O(\log n)$. For the node corresponding to $S \subseteq [n]$ (where $S$ may be a singleton), we maintain
• $\sum_{j \in S} [w]_j [u]_j^2$, $\sum_{j \in S} [w]_j [u]_j [v]_j$, $\sum_{j \in S} [w]_j [v]_j^2$
We now give the implementation of the necessary operations for
WeightedIterateMaintainer, giving additional proofs of correctness when applicable.

Initialization.
• Init($x_0$, $v$, $w$). Runs in time $O(n)$.
1. $(\xi_u, \xi_v, u) \leftarrow (1, 0, x_0)$.
2. $(\sigma_u, \sigma_v, u') \leftarrow (0, 0, 0_n)$.
3. $(\iota, \nu) \leftarrow (\langle x_0, v \rangle_w, \|x_0\|_w)$.
4. Compute and store $\|v\|_w^2$.
5. Initialize $\mathsf{dist}_x$, storing the relevant sums in each internal node.

Updates.
Scale($c$) and UpdateSum() follow identically to the analysis of IterateMaintainer.
• AddSparse($j$, $c$): $[x]_j \leftarrow [x]_j + c$. Runs in time $O(\log n)$.
1. $u \leftarrow u + \frac{c}{\xi_u} e_j$.
2. $u' \leftarrow u' - \frac{c\sigma_u}{\xi_u} e_j$.
3. $\nu \leftarrow \sqrt{\nu^2 + 2c[w]_j[\xi_u u + \xi_v v]_j + c^2 [w]_j}$.
4. $\iota \leftarrow \iota + c[w]_j[v]_j$.
5. For internal nodes of $\mathsf{dist}_x$ on the path from leaf $j$ to the root, update $\sum_{j \in S}[w]_j[u]_j^2$ and $\sum_{j \in S}[w]_j[u]_j[v]_j$ appropriately.
• AddDense($c$): $x \leftarrow x + cv$. Runs in time $O(1)$.
1. $\xi_v \leftarrow \xi_v + c$.
2. $\nu \leftarrow \sqrt{\nu^2 + 2c\iota + c^2\|v\|_w^2}$.
3. $\iota \leftarrow \iota + c\|v\|_w^2$.
We demonstrate that the necessary invariants on $\iota$, $\nu$ are preserved. Regarding correctness of AddSparse, the updates to $u$ and $u'$ are identical to those in the analysis of IterateMaintainer. Next, because only $[x]_j$ changes, the updates to $\nu$ and $\iota$ are correct respectively by
\[
[w]_j \cdot \left([\xi_u u + \xi_v v]_j + c\right)^2 = [w]_j \cdot \left([\xi_u u + \xi_v v]_j^2 + 2c[\xi_u u + \xi_v v]_j + c^2\right),
\]
\[
[w]_j \cdot \left([\xi_u u + \xi_v v]_j + c\right) \cdot [v]_j = [w]_j \cdot \left([\xi_u u + \xi_v v]_j \cdot [v]_j + c[v]_j\right).
\]
Regarding correctness of
AddDense,
\[
\|x + cv\|_w^2 = \nu^2 + 2c\iota + c^2\|v\|_w^2, \qquad \langle x + cv, v \rangle_w = \iota + c\|v\|_w^2.
\]
Here, we used that the invariants $\nu = \|x\|_w$ and $\iota = \langle x, v \rangle_w$ held.

Queries. Get($j$) and GetSum($j$) follow identically to the analysis of IterateMaintainer.
• Norm(): Return $\|x\|_w$. Runs in time $O(1)$.
1. Return $\nu$.

Sampling.
To support
Sample, we must produce a coordinate $j$ with probability proportional to $[w]_j [x]_j^2$. To do so, we recursively perform the following procedure, where the recursion depth is at most $O(\log n)$, starting at the root node and setting $S = [n]$; the proof of correctness is identical to the proof in the analysis of IterateMaintainer.
• Sample().
1. Let $S_1$, $S_2$ be the subsets of coordinates corresponding to the children of the current node.
2. Using the scalars $\xi_u$, $\xi_v$, and the maintained $\sum_{j \in S_i}[w]_j[u]_j^2$, $\sum_{j \in S_i}[w]_j[u]_j[v]_j$, $\sum_{j \in S_i}[w]_j[v]_j^2$, compute $\sum_{j \in S_i}[w]_j[x]_j^2 = \sum_{j \in S_i}[w]_j[\xi_u u + \xi_v v]_j^2$ for $i \in \{1, 2\}$.
3. Sample a child $i \in \{1, 2\}$ of the current node proportional to $\sum_{j \in S_i}[w]_j[x]_j^2$ by flipping an appropriately biased coin. Set $S \leftarrow S_i$.
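To make the field and tree bookkeeping above concrete, here is a self-contained Python sketch of the weighted maintainer, restricted to AddSparse, AddDense, Scale, Norm, and Sample. It is illustrative only: the per-node sums are kept in flat arrays indexed as a complete binary tree, the running-sum fields ($u'$, $\sigma_u$, $\sigma_v$) and remaining operations are omitted, Norm reads the root rather than maintaining the scalar $\nu$, and all class and method spellings are ours rather than the paper's interface.

```python
import numpy as np

class WeightedIterateMaintainerSketch:
    """Maintains x = xi_u * u + xi_v * v with weighted norm and sampling (sketch)."""

    def __init__(self, x0, v, w):
        n = len(x0)
        self.size = 1
        while self.size < n:
            self.size *= 2
        self.xi_u, self.xi_v = 1.0, 0.0
        self.u, self.v, self.w = x0.astype(float), v.astype(float), w.astype(float)
        # per-node sums of w*u^2, w*u*v, w*v^2 over the node's leaves
        self.suu = np.zeros(2 * self.size)
        self.suv = np.zeros(2 * self.size)
        self.svv = np.zeros(2 * self.size)
        for j in range(n):
            self._set_leaf(j)
        for node in range(self.size - 1, 0, -1):
            self._pull(node)

    def _set_leaf(self, j):
        node = self.size + j
        self.suu[node] = self.w[j] * self.u[j] ** 2
        self.suv[node] = self.w[j] * self.u[j] * self.v[j]
        self.svv[node] = self.w[j] * self.v[j] ** 2

    def _pull(self, node):
        for arr in (self.suu, self.suv, self.svv):
            arr[node] = arr[2 * node] + arr[2 * node + 1]

    def _mass(self, node):
        # sum over the node's leaves of w_j * x_j^2, using x = xi_u u + xi_v v
        return (self.xi_u ** 2 * self.suu[node]
                + 2 * self.xi_u * self.xi_v * self.suv[node]
                + self.xi_v ** 2 * self.svv[node])

    def add_sparse(self, j, c):            # [x]_j <- [x]_j + c, O(log n)
        self.u[j] += c / self.xi_u
        self._set_leaf(j)
        node = (self.size + j) // 2
        while node >= 1:
            self._pull(node)
            node //= 2

    def add_dense(self, c):                # x <- x + c v, O(1)
        self.xi_v += c

    def scale(self, c):                    # x <- c x, O(1)
        self.xi_u *= c
        self.xi_v *= c

    def norm(self):                        # ||x||_w; the paper keeps nu in O(1),
        return np.sqrt(max(self._mass(1), 0.0))  # here we read it off the root

    def sample(self, rng):                 # return j with prob proportional to w_j x_j^2
        node = 1                           # assumes the total weighted mass is positive
        while node < self.size:
            left = self._mass(2 * node)
            node = 2 * node if rng.random() * self._mass(node) < left else 2 * node + 1
        return node - self.size
```

The point the sketch illustrates is that AddDense and Scale only touch the scalars $\xi_u, \xi_v$, since every stored per-node sum is expressed in terms of $u$ and $v$ alone.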
G.3 CenteredIterateMaintainer

In this section, we give implementation details for a generalization of
WeightedIterateMaintainer, which we call CenteredIterateMaintainer. It is used in Section D.3, when using the sampling distributions (98) and (99). At initialization, CenteredIterateMaintainer is passed an additional parameter $x_0 \in \mathbb{R}^n$, a reference point. CenteredIterateMaintainer supports all the same operations as WeightedIterateMaintainer, with two differences:
• For the current iterate $x$, CenteredIterateMaintainer.Sample() returns a coordinate $j$ with probability proportional to $[w]_j [x - x_0]_j^2$.
• CenteredIterateMaintainer supports querying $\|x - x_0\|_w^2$ in constant time.
Because all the other operations, fields, and invariants supported and maintained by the data structure are exactly the same as in WeightedIterateMaintainer, we only discuss the changes made to the binary tree $\mathsf{dist}_x$ in this section for brevity. In particular, to support sampling, our data structure also maintains a binary tree $\mathsf{dist}_x$ of depth $O(\log n)$. For the node corresponding to $S \subseteq [n]$ (where $S$ may be a singleton), we maintain
• $\sum_{j \in S} [w]_j [u]_j^2$, $\sum_{j \in S} [w]_j [u]_j [v]_j$, $\sum_{j \in S} [w]_j [v]_j^2$
• $\sum_{j \in S} [w]_j [x_0]_j^2$, $\sum_{j \in S} [w]_j [u]_j [x_0]_j$, $\sum_{j \in S} [w]_j [v]_j [x_0]_j$
At initialization,
CenteredIterateMaintainer creates this data structure and stores the relevant sums in each internal node. Upon modifications to $u$ due to updates of the form AddSparse($j$, $c$), CenteredIterateMaintainer propagates the changes along internal nodes of $\mathsf{dist}_x$ on the path from leaf $j$ to the root. Thus, using these maintained values and the stored values $\xi_u$, $\xi_v$, it is clear that for any appropriate subset $S$, we are able to compute the quantity
\[
\sum_{j \in S} [w]_j [\xi_u u + \xi_v v - x_0]_j^2 = \sum_{j \in S} [w]_j \left( \xi_u^2 [u]_j^2 + \xi_v^2 [v]_j^2 + 2\xi_u \xi_v [u]_j [v]_j + [x_0]_j^2 - 2\xi_u [u]_j [x_0]_j - 2\xi_v [v]_j [x_0]_j \right)
\]
in constant time, admitting the sampling oracle in time $O(\log n)$ by propagating down the tree $\mathsf{dist}_x$. This proves the desired sampling complexity. Finally, by appropriately querying the stored values in the root node, we can return $\|x - x_0\|_w^2$ in constant time.
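For concreteness, the per-node computation above amounts to the following small Python helper (a sketch building on the code in the previous subsection; the argument names abbreviating the six stored sums are ours): the centered mass of a node is a fixed quadratic form in $\xi_u$ and $\xi_v$, evaluated in $O(1)$.

```python
def centered_node_mass(xi_u, xi_v, suu, suv, svv, sxx, sux, svx):
    """Sum over a node's leaves of w_j * (xi_u*u_j + xi_v*v_j - x0_j)^2, in O(1).
    suu = sum w u^2, suv = sum w u v, svv = sum w v^2,
    sxx = sum w x0^2, sux = sum w u x0, svx = sum w v x0."""
    return (xi_u ** 2 * suu + xi_v ** 2 * svv + 2.0 * xi_u * xi_v * suv
            + sxx - 2.0 * xi_u * sux - 2.0 * xi_v * svx)
```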