[PDF] Infeasibility detection with primal-dual hybrid gradient for large-scale linear programming

Abstract

Full PDF

IInfeasibility detection with primal-dual hybrid gradient forlarge-scale linear programming

David Applegate ∗ Mateo Díaz † Haihao Lu ‡ Miles Lubin § Abstract

We study the problem of detecting infeasibility of large-scale linear programming problems usingthe primal-dual hybrid gradient method (PDHG) of Chambolle and Pock (2011). The literature onPDHG has mostly focused on settings where the problem at hand is assumed to be feasible. Whenthe problem is not feasible, the iterates of the algorithm do not converge. In this scenario, we showthat the iterates diverge at a controlled rate towards a well-deﬁned ray. The direction of this ray isknown as the inﬁmal displacement vector v . The ﬁrst contribution of our work is to prove that thisvector recovers certiﬁcates of primal and dual infeasibility whenever they exist. Based on this fact, wepropose a simple way to extract approximate infeasibility certiﬁcates from the iterates of PDHG. Westudy three diﬀerent sequences that converge to the inﬁmal displacement vector: the diﬀerence of iterates,the normalized iterates, and the normalized average. All of them are easy to compute, and thus theapproach is suitable for large-scale problems. Our second contribution is to establish tight convergencerates for these sequences. We demonstrate that the normalized iterates and the normalized averageachieve a convergence rate of O (cid:0) k (cid:1) , improving over the known rate of O (cid:16) √ k (cid:17) . This rate is general andapplies to any ﬁxed-point iteration of a nonexpansive operator. Thus, it is a result of independent interestsince it covers a broad family of algorithms, including, for example, ADMM, and can be applied settingsbeyond linear programming, such as quadratic and semideﬁnite programming. Further, in the case oflinear programming we show that, under nondegeneracy assumptions, the iterates of PDHG identify theactive set of an auxiliary feasible problem in ﬁnite time, which ensures that the diﬀerence of iteratesexhibits eventual linear convergence to the inﬁmal displacement vector. First-order methods (FOMs) have been extensively studied by the optimization community since the late2000s, following a long period where interior-point methods dominated research in continuous optimization.FOMs, which use only gradient information, are appealing for their simplicity and low computational overhead,in particular when solving large-scale optimization problems that arise in machine learning and data science.These methods have matured in many aspects (see, e.g., the recent textbook by Beck [10]) and are knownto be useful for obtaining moderately accurate solutions to convex and non-convex optimization problemsin a reasonable amount of time. Despite this progress, FOMs have made only modest inroads into linearprogramming (LP), a fundamental problem in mathematical optimization.FOMs applied to LP provide relatively simple methods whose most expensive operations are matrix-vectormultiplications with the (typically sparse) constraint matrix. Such matrix-vector products are amenable toscale eﬃciently given increasingly ubiquitous computing resources like multi-threaded CPUs, GPUs [63], ordistributed clusters [24]. In contrast, interior-point and simplex-based methods that dominate current practiceare limited in how they use available computing resources because they depend on matrix inversion. To mark ∗ Google Research ( [email protected] ) † Cornell University Center for Applied Mathematics ( [email protected] ). Work done while interning at Google Research. ‡ University of Chicago Booth School of Business and Google Research ( [email protected] ) § Google Research ( [email protected] ) a r X i v : . [ m a t h . O C ] F e b lgorithm 1: Primal-dual hybrid gradient

Data: x ∈ R d Step k : ( k ≥ )Update x k +1 ← prox ηf ( x k − ηA (cid:62) y k ) ,Update y k +1 ← prox τh (cid:0) y k + τ A (2 x k +1 − x k ) (cid:1) .this distinction, Nesterov [50] deﬁnes methods that use at most matrix-vector products as capable of handling large-scale problems and methods that use matrix inversion as handling medium-scale problems. In thecontext of LP, these deﬁnitions of scale perhaps belie the reliability and practical eﬃciency of interior-pointand simplex methods, but nevertheless the contrast in the computing requirements of the algorithms isan important one. Even though such computational aspects are outside the scope of this paper, it is thispractical potential to eﬃciently solve large-scale LP that motivates the theoretical developments in this work.While FOMs are typically studied in more general settings, the underlying assumptions and convergencerates in these settings do not necessarily hold or may not be tight for the special case of LP. Of particularrelevance to this work, theory for FOMs is often developed under the assumption that an optimal solutionexists, whereas LP solvers need to be able to detect infeasibility (i.e., when no optimal solution exists) andcompute corresponding certiﬁcates. Infeasibility detection and computation of certiﬁcates are an essentialaspect of solving LP, not only to provide feedback on modeling errors but also for algorithms that directlyexploit LP certiﬁcates like Benders decomposition and branch-and-cut [1].This work addresses the question of how to detect infeasibility in LP using the Primal-Dual HybridGradient method (PDHG). PDHG is a popular ﬁrst-order method introduced by Chambolle and Pock [15] tosolve convex-concave minimax problems , that is, problems of the form min x ∈ R n max y ∈ R m (cid:104) Ax, y (cid:105) + g ( x ) − h ( y ) (1)where g : R n → R ∪ {∞} and h : R m → R ∪ {∞} are proper lower semicontinuous convex functions and A ∈ R m × n . LP can be recast as a minimax problem through duality, and hence PDHG is applicable. Themethod consists of alternating updates between the primal and dual variables, see Algorithm 1. In particular,when instantiated for LP, these updates correspond to matrix-vector products and projections onto simplesets (such as the positive orthant). In contrast with other methods like the Alternating Direction Methodof Multipliers (ADMM), PDHG does not require projections onto linear subspaces, which involve matrixinversions by direct or indirect methods.The behavior of PDHG for feasible problems (i.e., problems that have an optimal solution) has beenstudied in depth under several regularity assumptions. In their seminal work, Chambolle and Pock [15]show that the algorithm converges at a rate of O (1 /k ) given appropriate choices for the step sizes η and τ .However, the situation for infeasible problems remains largely unstudied.While it is relatively straightforward to formulate always-feasible auxiliary problems that can be usedto detect infeasibility, for example, by penalizing violations of primal and dual constraints, this approach isunappealing for two reasons: First is the aesthetic interest of having a single algorithm that robustly handlesall possible input [32]. Second is the practical interest in eﬀectively using available computing resources, assolving such auxiliary problems would approximately double the necessary work. Instead, we aim to use one execution of PDHG and ask the following question: Do the PDHG iterates encode information about infeasibility?

We answer this question in the aﬃrmative. We show that if the primal (and/or dual) problem is infeasible, theiterates of PDHG recover primal (and/or dual) infeasibility certiﬁcates. Moreover, we completely characterizethe behavior of the iterates under diﬀerent infeasibility settings. Before diving into our main contributions,let us present an illustrative example. Recall that for a primal-dual LP pair, there exist three exhaustiveand mutually exclusive possibilities: (1) both primal and dual are feasible, (2) both primal and dual are2 a) Both feasible (b) Both infeasible (c) Unbounded dual (d) Unbounded primal

Figure 1: Four diﬀerent settings depicted in Example 1. Every subplot shows the component-wise value ofthe iterates against the iteration count. The ﬁrst and the second rows correspond to the primal and dualiterates, respectively.infeasible, and (3) one of the two problems is unbounded, and consequently, the other problem is infeasible.Small numerical experiments reveal that the behavior of PDHG is diﬀerent depending on the setting.

Example 1.

Consider the LP problem with constants α, β ∈ R :minimize x + x − αx subject to x + 2 x ≤ x + x ≤ x + x ≥ β . Figure 1 displays three choices of α and β Both feasible . Set α = 0 and β = 1 , then both primal and dual problems are feasible. In this case,both the primal and dual variables converge to a solution.2. Both infeasible . Set α = 1 and β = 2 , then both primal and dual are infeasible. We observe that bothprimal and dual iterates diverge at a rate proportional to the number of iterations.3. Unbounded dual . Set α = 0 and β = 2 , then the dual problem is unbounded and, thus, the primalproblem is infeasible and the dual is feasible. Then the dual iterates diverge, and, interestingly, theprimal iterates converge.4. Unbounded primal . Set α = 1 and β = 1 , then the primal problem is unbounded and, thus, the dualproblem is infeasible and the primal is feasible. Then the primal iterates diverge, and the dual iteratesconverge. From the experiments, we see that the iterates have a very stable asymptotic behavior. In particular,if the primal is feasible, then the dual variables converge, and analogously if the dual is feasible, then theprimal iterates converge. Similarly, whenever the primal is infeasible, the dual iterates diverge at a controlledlinear rate and vice-versa. Such behavior has not been previously observed or characterized in the literature.3 .1 Main contributions

For notational convenience, we use z = ( x, y ) as the primal-dual pair, and ¯ z k := k (cid:80) kj =1 z j as the average ofiterates. We propose to detect infeasibility using three sequences: (Diﬀerence of iterates) d k = ( z k +1 − z k ) , (2a) (Normalized iterates) z k k , (2b) (Normalized average iterates) k + 1 ¯ z k . (2c)Our proposal to detect infeasibility is as follows:Use these three sequences’ primal and dual components as candidates for dual and primalinfeasibility certiﬁcates. The algorithm should periodically check if any of these iterates satisfythe conditions that deﬁne an infeasibility certiﬁcate within numerical tolerances. If at any pointthis happens, it should conclude that the problem is (primal or dual) infeasible.The overhead cost of extracting the certiﬁcates is negligible, making it suitable for large-scale problems. Mostof the content of this work justiﬁes this strategy theoretically.Operator theory shows that all three of these sequences converge to a point v known as the inﬁmaldisplacement vector . Section 2 will give a formal deﬁnition of this and other relevant concepts. We list ourcontributions assuming, for now, the existence of such a vector v . Sublinear convergence rate to the inﬁmal displacement vector (Theorem 3). It is natural towonder how fast the three sequences (2) converge to the inﬁmal displacement vector. We study this questionthrough the lenses of operators and ﬁxed-point iteration. To the best of our knowledge, the only knownresult in this vein ensures a rate of O (cid:16) √ k (cid:17) for the diﬀerence of iterates d k [39], which is known to be tight[19]. In this same situation, we show that both the normalized iterates and the normalized average iteratesconverge at a faster rate of O (cid:0) k (cid:1) . This result generalizes to any ﬁxed-point iteration of a nonexpansiveoperator , not only the ﬁrmly nonexpansive operators studied in [39]. Speciﬁcally, it also applies to manypopular ﬁrst-order methods, including but not limited to PDHG, ADMM [56], and Mirror-prox [49], and itextends to other settings beyond LP such as quadratic convex programming and semideﬁnite programming.Furthermore, we show that this result is tight for PDHG; i.e., there exist instances with a convergence ratelower bounded by Ω (cid:0) k (cid:1) . This result suggests that current ADMM-based codes like OSQP [61] that useexclusively the diﬀerence of iterates to detect infeasibility should additionally consider the normalized iteratesand normalized average iterates.

Characterization of the iterates for infeasible problems (Theorem 4). We characterize the behaviorof PDHG for all the LP feasibility scenarios (see Table 1). In particular, we show that if the primal (or dual)iterates diverge, then the iterates diverge in the direction of a ray, where the direction of the ray recoverscertiﬁcates of dual (or primal) infeasibility. Such direction turns out to be the inﬁmal displacement vector ( v x , v y ) . This justiﬁes using the sequences (2) as infeasibility certiﬁcate candidates. Furthermore, we showthat when the primal problem is feasible, then the dual iterates, without any normalization, converge to some y (cid:63) that is closely related to v . An analogous result holds if the dual is feasible. This describes the dynamicsof PDHG for unbounded problems. The next table summarizes our ﬁndings for the four possible cases. Primal Dual Feasible InfeasibleFeasible x k , y k both converge x k diverges, y k converges Infeasible x k converges, y k diverges x k , y k both divergeTable 1: Bahavior of PDHG for solving under diﬀerent feasibility assumptions. Eventual linear convergence for nondegenerate problems (Theorem 6). In the process of char-acterizing the dynamics of PDHG, we show that the iterates ( x k , y k ) always converge to a unique ray4 ( x (cid:63) , y (cid:63) ) + λv | λ ∈ R } . We show that if the limit point ( x (cid:63) , y (cid:63) ) satisﬁes a strict complementarity condition,for a related auxiliary problem, then the iterates ( x k , y k ) ﬁx their active set after ﬁnitely many iterations. Inturn, this leads to the eventual linear convergence of the diﬀerence of iterates (2a). Formally, we show thatthere exists K ≥ such that for all suﬃciently large k ≥ K we have (cid:107) d k − v (cid:107) ≤ O (cid:0) γ k − K (cid:1) for some γ ∈ (0 , . We further show that even after the active set is ﬁxed, the normalized iterates and normalized average do notexhibit faster convergence. Thus, it is strictly better to use the diﬀerence of iterates to detect infeasibility inthis regime.

Computational experiments (Section 6). We complement our theoretical results with numericalexperiments displaying the eﬃcacy of the diﬀerent certiﬁcate candidates (2). Speciﬁcally, our experimentsshow that using all three sequences in (2) is beneﬁcial. On the one hand, if the active set’s ﬁnite timeidentiﬁcation occurs relatively quickly, then the diﬀerences of iterates (2a) exhibit faster convergence. Onthe other hand, for some problems, identifying the active set might not happen in a reasonable amount oftime. In this case, both the normalized iterates (2b) recovers approximate infeasibility certiﬁcates much moreeﬃciently than the diﬀerences.

Primal-dual hybrid gradient.

Chambolle and Pock [16] review PDHG among other methods and describeits applications in computer vision. O’Connor and Vandenberghe [51] show that PDHG is in fact a particularapplication of Douglas-Rachford Splitting (DRS) [20, 30, 27, 38].

FOMs for LP.

Lan et al. [33] and Renegar [58] develop FOMs for LP, considered as a special case ofsemideﬁnite programming, with O (cid:0) k (cid:1) convergence rates. Gilpin et al. [29] obtain a restarted FOM for LPwith a linear convergence rate. These analyses assume an optimal solution exists. Pock and Chambolle [55]apply PDHG with diagonal preconditioning to LP on a small number of test instances. They note that onsmall-scale problems, interior-point methods clearly dominate, while their method outperforms MATLAB’sLP solver on one larger LP motivated by a computer vision application. Most recently, Basu et al. [5] applyaccelerated gradient descent to smoothed LPs, obtaining solutions to industrial problems with up to variables. Infeasibility detection.

Classically, the primal simplex method for LP detects primal infeasibility whilesolving a “phase-one” auxiliary problem for an initial feasible basis and detects dual infeasibility based onconditions when computing a step size (i.e., the ratio test) [42]. Infeasibility certiﬁcates are extracted fromthe iterates of interior-point methods without substantial extra work [62]. Infeasibility detection is only theﬁrst step of diagnosing the cause of the infeasibility in an LP model [17].Most research on infeasibility detection capabilities for FOMs for convex optimization has focused onADMM or equivalently Douglas-Rachford Splitting. Eckstein and Bertsekas [23] show that when no solutionexists, then the iterates diverge. Recent practical successes motivated further research in this direction,characterizing the asymptotic behavior of the iterates under additional assumptions. For example, the lineof work [7, 8, 48] studies Douglas-Rachford applied to problems that look for a point at the intersectionof two non-intersecting convex sets. On the other hand, Raghunathan and Di Cairano [57] investigate theasymptotic dynamics of ADMM for convex quadratic problems when the matrices involved are full rank.Banjak et al. [4] show that the inﬁmal displacement vector of ADMM recovers certiﬁcates of infeasibilityfor convex quadratic problems with conic constraints. Based on this, they proposed to use the diﬀerenceof iterates to test infeasible. Complementing this work, [39] establishes a O (cid:16) √ k (cid:17) convergence rate for thediﬀerence of iterates of any algorithm that induces a ﬁrmly nonexpansive operator and introduced a schemethat utilizes multiple runs of ADMM to detect infeasibility. This type of scheme aims to handle pathologicalscenarios that do not occur in LP.O’Donoghue et al. [53] propose to apply ADMM to a homogeneous self-dual embedding of a convexconic problem . A nice byproduct of this approach is that it automatically produces infeasibility certiﬁcates. Linear objective with conic constraints.

Finite time identiﬁability.

Finite time identiﬁability has a long history in the ﬁeld of optimization.This phenomenon is ﬁrst documented for the projected gradient descent method [22, 14, 13, 12]. Soonafter it is studied for other methods, such as the Proximal Point Method [25] and Projected subgradientdescent [26], among others [2]. Identiﬁability is also exploited as tool for algorithmic design for the socalled “ UV -algorithms” [46]. Recent works [47, 36, 37] study ﬁnite time identiﬁcation for popular FOMs. Inparticular, Liang et al. [37] show that the iterates of PDHG identify the active constraints in ﬁnite time,provided the limit point is nondegenerate. All of these works assume the underlying problem is feasible. Thesigniﬁcant number of algorithms exhibiting this behavior motivated researchers to develop general theory(even beyond the realms of optimization) [64, 44, 45, 34, 21, 35]. We refer the interested reader to [35] for anelegant geometrical deﬁnition that generalizes most notions of nondegeneracy. The paper is structured as follows. Section 2 presents all the necessary background. In Section 3, we show aconvergence rate of O (1 /k ) for the normalized iterates and normalized average generated by the ﬁxed-pointiteration of a nonexpansive operator. Then, Section 4 shows a complete characterization of the behavior ofPDHG under diﬀerent infeasibility assumptions. In Section 5, we study a condition that ensures ﬁnite timeidentiﬁability of the active set. We show that under this condition, the diﬀerence of iterates exhibits eventuallinear convergence. We present numerical experiments that complement the theoretical results in Section 6.We close the paper with some conclusions and open questions in Section 7. We use the symbols N and R to denote the natural numbers and reals. Our results take place in a ﬁnitedimensional spaces R d . We denote the cone of nonnegative vectors as R d + . We endow R d with the standarddot product (cid:104) x, y (cid:105) = x (cid:62) y and norm (cid:107) x (cid:107) := (cid:104) x, x (cid:105) . We use I to denote both the identity map and the identitymatrix in R d × d . Given a matrix M ∈ R d × d we use λ ( M ) , . . . , λ d ( M ) to denote its eigenvalues. We deﬁnethe spectral radius of a matrix M as ρ ( M ) = max j ∈ [ d ] | λ j ( M ) | . Similarly, for a matrix M ∈ R m × d , with d ≤ m , we use σ ( M ) ≥ · · · ≥ σ d ( M ) for its singular values and we let σ max ( M ) and σ min ( M ) denote themaximum and minimum nonzero singular values. We deﬁne the operator norm as (cid:107) M (cid:107) = σ max ( M ) . Forany matrix M we denote its pseudo-inverse as M † . A symmetric matrix M ∈ R d × d is positive deﬁnite ifits minimum eigenvalue is positive, min j ∈ [ d ] λ j ( M ) > . Every positive deﬁnite matrix M deﬁnes an innerproduct and norm given by (cid:104) x, y (cid:105) M = x (cid:62) M y and (cid:107) x (cid:107) M = (cid:104) x, x (cid:105) M , respectively. Frequently, we will makeuse of the symbol (cid:107) · (cid:107) to denote an arbitrary matrix norm in R n . Often, (cid:107) · (cid:107) will denote a norm with respectto which an operator is (ﬁrmly) nonexpansive; see next section for a formal deﬁnition.Let C ⊆ R d be an arbitrary set. The symbol cl ( C ) denotes the closure of C . We use proj C ( · ) todenote the Euclidean projection mapping onto C ; this map is well-deﬁned provided C is closed and convex.We use ( z ) + as a shorthand for proj R d + ( z ) . We deﬁne the indicator function of C as ι C : R d → R ∪ {∞} given by ι C ( z ) = 0 if z ∈ C and ι C ( z ) = ∞ otherwise. Given a map T : R d → R d , its range is deﬁned as range( T ) = { T ( z ) | z ∈ R d } . Given two mappings T , T : R d → R d , we denote their composition as T ◦ T ,that is T ◦ T ( z ) = T ( T ( z )) . Since we study primal and dual problems, we use z = ( x, y ) ∈ R n + m as aplaceholder for primal and dual variables. We will sometimes refer to a vector v ∈ R n + m and use v x and v y to denote its primal and dual components. We use superscripts to denote iteration counts, consequently z k isthe k th iterate. We use supp( x ) := { i ∈ [ d ] | x i (cid:54) = 0 } to denote the support of the vector x .6 Preliminaries

Convex analysis [59, 11, 6]. For any closed convex function f : R d → R ∪ {∞} , we deﬁne the subdiﬀerentialset of f at ¯ x ∈ R m , denoted by ∂f (¯ x ) , as the set of all g ∈ R d such that f ( x ) ≥ f (¯ x ) + (cid:104) g, x − ¯ x (cid:105) for all x ∈ R d . This is a generalization of the gradient mapping for functions that lack diﬀerentiability. The subdiﬀerentialset characterizes the set of global minima. In fact, it is easy to show that x (cid:63) is a minimizer of f if, and onlyif, ∈ ∂f ( x (cid:63) ) . For any closed convex set C ⊆ R d , we deﬁne its normal cone at ¯ x ∈ C to be N C (¯ x ) := { g ∈ R d | (cid:104) g, x − ¯ x (cid:105) ≤ for all x ∈ C } . Analogously, N C (¯ x ) can be seen as the set of all points x such that proj C ( x ) = ¯ x . The convex subgradient isknown to follow a nice of set of calculus rules. In this work we will make use of two of them: let f : R d → R and f : R d → R ∪ {∞} be closed convex functions and C be a closed convex set, then ∂ ( f + f )( x ) = ∂f ( x ) + ∂f ( x ) and ∂ι C ( x ) = N C ( x ) for all x ∈ R d . (3)Given any convex function f , we deﬁne its proximal operator prox f : R d → R d as the unique solution of prox f (¯ x ) := argmin x f ( x ) + 12 (cid:107) x − ¯ x (cid:107) . (4)Using optimality conditions one can prove that x ∈ ( I + ∂f )( x + ) ⇐⇒ x + = prox f ( x ) . (5) Linear programming [18, 43]. LP problems can be parameterized using multiple equivalent forms. Forour theoretical results we focus on the standard form of LP:minimize c (cid:62) x subject to Ax = b ( ∈ R m ) x ≥ ∈ R n ) , ( P )where A ∈ R m × n , b ∈ R m , and c ∈ R n are given. The dual of this problem is given bymaximize b (cid:62) x subject to A (cid:62) y ≤ c ( ∈ R n ) . ( D )Although the proofs in this paper are tailored to this form, the techniques we use should extend easily to anyother form.Farkas’ Lemma states that a feasible solution of ( P ) exists if, and only if, the following set is empty { y ∈ R m | b (cid:62) y < and A (cid:62) y ≥ } . (6)We call the elements of this set certiﬁcates of primal infeasibility , as their existence guarantees that the primalproblem is infeasible. Analogously, the certiﬁcates of infeasibility for the dual problem ( D ) are { x ∈ R n | c (cid:62) x < , Ax = 0 and x ≥ } . (7) Primal-dual hybrid gradient . Chambolle and Pock [15] establish convergence to a saddle point ata rate of O (1 /k ) provided that a saddle exists and ητ (cid:107) A (cid:107) < . The primal-dual problems ( P )-( D ) canbe recast as a convex-concave saddle point problem . In particular we choose g ( x ) = c (cid:62) x + ι { x ≥ } ( x ) and More precisely, Chambolle and Pock proved this rate for the primal-dual gap of the averaged iterates. ( y ) = b (cid:62) y . In this case the proximal updates can be computed in closed form. In fact, a PDHG updatereduces to x + = proj R n + ( x − ηA (cid:62) y − ηc ) y + = y + τ A (2 x + − x ) − τ b . (8)Observe that the most complex operations in the update formula are matrix-vector products, and all otheroperations are separable by component.An update of PDHG (Algorithm 1) can be equivalently deﬁned with a diﬀerential inclusion of the form M (cid:20) x k − x k +1 y k − y k +1 (cid:21) ∈ (cid:20) ∂g ( x k +1 ) ∂h ( y k +1 ) (cid:21) + (cid:20) A (cid:62) y k +1 − Ax k +1 (cid:21) with M := (cid:20) η I n − A T − A τ I m (cid:21) , (9)this follows from (5). We will later leverage this inclusion in our proofs. Operators and the ﬁxed-point iteration.

We will ﬁnd it useful to think of iterative algorithms froman operator viewpoint. Given an arbitrary map T : R d → R d , the corresponding ﬁxed-point iteration isdeﬁned as z k +1 = T ( z k ) . (10)Most ﬁrst-order methods can be described in this form. The primal-dual hybrid gradient method can beencoded as T ( x, y ) = ( x + , y + ) where the output pair is deﬁned in (8). When looking at an algorithm fromthis perspective, we transform the problem of ﬁnding a solution of the optimization problem to that of ﬁndinga ﬁxed-point of the operator, i.e., z (cid:63) = T ( z (cid:63) ) . This idea has proven fruitful for proving optimal convergerates for a variety of algorithms [19].Here we make a minimal assumption that is suﬃcient to analyze PDHG in the infeasible case. An operator T is said to be nonexpansive if it is -Lipschitz continuous with respect to a matrix norm (cid:107) · (cid:107) , meaning thatfor any z , z ∈ R d we have (cid:107) T ( z ) − T ( z ) (cid:107) ≤ (cid:107) z − z (cid:107) . (11)Nonexpansiveness does not ensure the convergence of iterates in the feasible case. Yet a slightly strongercondition does. An operator T is ﬁrmly nonexpansive if it satisﬁes (cid:107) T ( z ) − T ( z ) (cid:107) ≤ (cid:107) z − z (cid:107) − (cid:107) ( T − I )( z ) − ( T − I )( z ) (cid:107) for all z , z ∈ R d . Note that the norm here is not necessarily the Euclidean norm. All the results concerning (ﬁrmly) nonexpan-siveness in this section and the next one are with respect to the norm in which these properties hold. Thefollowing is a beautiful geometrical result proved by Pazy that deﬁnes a pivotal object in our studies.

Lemma 1 (Lemma 4 in [54]) . Let T be a nonexpansive operator, then the set cl (range( T − I )) is convex.Consequently, there exists a unique minimum norm vector in this set: v T := argmin z ∈ cl (range( T − I )) (cid:107) z (cid:107) . (12)This vector is known as the inﬁmal displacement vector . We drop the subscript T and make thecorresponding operator clear from the context. Intuitively, v is the minimum size perturbation we shouldsubtract from T to ensure it has a ﬁxed point. Theorem 1 ([54] and [3]) . Let T be a nonexpansive operator and ( z k ) be a sequence generated by theﬁxed-point iteration (10) . Then, we have lim k →∞ z k k = v . (13) If further T is ﬁrmly nonexpansive, then lim k →∞ z k +1 − z k = v . (14)8hat is the normalized iterate converges to the inﬁmal displacement vector when T is nonexpansive and if T is ﬁrmly nonexpansive the diﬀerence of iterates also converge. One might wonder whether the the strongercondition is necessary. This turns out to be the case.The following proposition shows two things: ﬁrst, (14) is provably stronger than (13) and second, theconvergence of the iterates ensures converges of the normalized average. Recall from the previous sectionthat we use ¯ z k := k (cid:80) kj =1 z j to denote the average.The proof of this proposition is technical so it is deferred to Appendix A.1. Proposition 1.

Let ( z k ) ∞ k =1 ⊆ R d be an arbitrary sequence and let v ∈ R d be a ﬁxed vector. Then thefollowing implications hold:1. Diﬀerence convergence implies normalized iterate convergence . lim k →∞ ( z k +1 − z k ) = v = ⇒ lim k →∞ z k k = v . Normalized iterate convergence implies normalized average convergence . lim k →∞ z k k → v = ⇒ lim k →∞ z k ( k + 1) → v . Moreover, these implications cannot be reversed as there exist simple counterexamples in R . Naturally when concerned with practical algorithms one would like to have convergence rates for (13) and(14). As far as we know, the state-of-the-art result in this vein is due to Liu, Ryu, and Yin [39].

Theorem 2 ([39]) . Let T be a ﬁrmly nonexpansive operator and ( z k ) be a sequence generated by (10) . Then,for any ε > , then there exists a point z ε such that (Average iterate rate) . (cid:13)(cid:13)(cid:13)(cid:13) v − k + 1 (¯ z k − z ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:114) k + 1 (cid:107) z − z ε (cid:107) + ε , (Last iterate rate) . (cid:13)(cid:13)(cid:13)(cid:13) v − k ( z k − z ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:114) k (cid:107) z − z ε (cid:107) + ε , (Diﬀerence rate) . min j ≤ k (cid:107) v − z j +1 − z j (cid:107) ≤ (cid:114) k (cid:107) z − z ε (cid:107) + ε . Remark 1.

In the paper [39], this result is presented only for the diﬀerence of iterates, yet a simplemodiﬁcation of their argument proves the other two results.

The theorem guarantees a rate of convergence that depends on a target accuracy ε . The rate could getworst as ε → . Indeed, z ε could diverge as ε goes to zero, see Example 4 in Appendix B. We will see in thenext section that for LP it is possible to get rates that are independent of the accuracy.Since the algorithm of interest is PDHG, we might wonder whether or not its operator is ﬁrmly nonexpansive.It turns out that it is, but not with respect to the standard Euclidian norm. Proposition 2. If ητ (cid:107) A (cid:107) < , then the operator deﬁned by a PDHG iteration is ﬁrmly nonexpansive withrespect to the norm the (cid:107) · (cid:107) M with M deﬁned as in (9) .Proof. This is a known result [31], yet we include a proof for the interested reader. A Schur complementargument proves that the condition ητ (cid:107) A (cid:107) < ensures that M (cid:31) is positive deﬁnite. Then a directapplication of Proposition 4.2 in [9] proves the result.9 Sublinear convergence of nonexpansive operators

This section presents the O (1 /k ) convergence rate of the normalized iterates and the normalized averagefor nonexpansive operators. This rate applies to a broader class of operators than the previously knownresults (restated in Theorem 2) as it does not require the operator to be ﬁrmly nonexpansive. The resultingrate applies to many popular FOMs for convex optimization, including but not limited to PDHG [15], theAlternating Method of Multipliers (ADMM) or equivalently Douglas-Rachford Splitting (DRS) [56], andMirror-Prox [49].Theorem 3 presents our main result in this section. Theorem 3.

Let T be a nonexpansive operator for some norm (cid:107) · (cid:107) and deﬁne v to be the minimum normelement in cl (range( T − I )) . Then, for any ε > , there exists z ε such that the following two inequalities hold (Average iterate rate). (cid:13)(cid:13)(cid:13)(cid:13) v − k + 1) (cid:0) ¯ z k − z (cid:1)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k + 1 (cid:107) z − z ε (cid:107) + ε . (15) (Last iterate rate). (cid:13)(cid:13)(cid:13)(cid:13) v − k ( z k − z ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ k (cid:107) z − z ε (cid:107) + ε . (16) Furthermore, if range( T − I ) is closed, then there exists a ﬁnite z (cid:63) such that T ( z (cid:63) ) = z (cid:63) + v and for anysuch z (cid:63) and all k : (Average iterate rate). (cid:13)(cid:13)(cid:13)(cid:13) v − k + 1) (cid:0) ¯ z k − z (cid:1)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k + 1 (cid:107) z − z (cid:63) (cid:107) . (17) (Last iterate rate). (cid:13)(cid:13)(cid:13)(cid:13) v − k ( z k − z ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ k (cid:107) z − z (cid:63) (cid:107) . (18) Remark 2.

We comment that when range( T − I ) is not closed, Theorem 3 may not imply a O (1 /k ) sublinearconvergence rate. In fact, as ε → , the vector (cid:107) z ε (cid:107) could grow to inﬁnity, see Example 4 in Appendix B for aone dimensional example. When range( T − I ) is closed, we obtain a O (1 /k ) sublinear rate. Remark 3.

When range( T − I ) is closed, the above result together with the lower bound proved later inTheorem 6 shows that the normalized iterates and normalized average of a nonexpansive operator exhibit a Θ (cid:0) k (cid:1) convergence rate. It is faster than the diﬀerence of iterates, by noticing that the diﬀerence of iteratesconverges at rate Θ (cid:16) √ k (cid:17) (see Theorem 8 of [19]). The next Lemma is used in the proof of Theorem 3.

Lemma 2.

Suppose the assumptions of Theorem 3. Fix ε > , then there exists a point z ε , such that thefollowing two inequalities hold for all k ≥ :1. (cid:107) T k +1 ( z ε ) − T k ( z ε ) − v (cid:107) ≤ ε .2. (cid:107) ( T k ( z ε ) − z ε ) − kv (cid:107) ≤ kε. Furthermore, if range( T − I ) is closed, then there exists a point z (cid:63) such that T ( z (cid:63) ) = z (cid:63) + v . For all such z (cid:63) ,for all k ≥ :3. T k +1 ( z (cid:63) ) − T k ( z (cid:63) ) = v. . T k ( z (cid:63) ) − z (cid:63) = kv .Proof. Without loss of generality we assume that ε ≤ . Notice v ∈ cl (range( T − I )) , thus there exists z ε such that (cid:107) T ( z ε ) − z ε − v (cid:107) ≤ ε max { , (cid:107) v (cid:107) + 1) } . (19)We start by proving the ﬁrst claim. Fix an arbitrary k ≥ . We will make use of two facts in the proof.Since T is a nonnexpansive operator, an application of the triangle inequality yields (cid:107) T k ( z ε ) − T k − ( z ε ) (cid:107) − (cid:107) v (cid:107) ≤ (cid:107) T ( z ε ) − z ε (cid:107) − (cid:107) v (cid:107) ≤ (cid:107) T ( z ε ) − z ε − v (cid:107) ≤ ε (cid:107) v (cid:107) + 1) . (20)Noticing v is the nearest point to zero in W = cl (range( T − I )) with respect to the norm (cid:107)(cid:107) and the set W is convex, it follows from the optimality conditions of this problem that (cid:104) w, v (cid:105) ≥ (cid:107) v (cid:107) for all w ∈ W . (21)Armed with these two facts, we derive for any arbitrary k : (cid:107) T k ( z ε ) − T k − ( z ε ) − v (cid:107) = (cid:107) T k ( z ε ) − T k − ( z ε ) (cid:107) − (cid:104) T k ( z ε ) − T k − ( z ε ) , v (cid:105) + (cid:107) v (cid:107) ≤ (cid:107) T k ( z ε ) − T k − ( z ε ) (cid:107) − (cid:107) v (cid:107) + (cid:107) v (cid:107) = (cid:0) (cid:107) T k ( z ε ) − T k − ( z ε ) (cid:107) + (cid:107) v (cid:107) (cid:1) (cid:0) (cid:107) T k ( z ε ) − T k − ( z ε ) (cid:107) − (cid:107) v (cid:107) (cid:1) ≤ ( (cid:107) T ( z ε ) − z ε − v (cid:107) + 2 (cid:107) v (cid:107) ) ε (cid:107) v (cid:107) + 1) ≤ (cid:0) ε + 2 (cid:107) v (cid:107) (cid:1) ε (cid:107) v (cid:107) + 1) ≤ ε , where the ﬁrst inequality utilizes (21) by noticing T k ( z ε ) − T k − ( z ε ) ∈ W , the second inequality uses (20)and the triangle inequality, the third inequality is from (19), and the last inequality uses ε ≤ . This provesthe ﬁrst statement.The second claim follows by induction. The base case k = 0 holds directly. For the inductive step, assumethat the statement holds for k − . Then (cid:107) ( T k ( z ε ) − z ε ) − kv (cid:107) ≤ (cid:107) T k ( z ε ) − T k − ( z ε ) − v (cid:107) + (cid:107) T k − ( z ε ) − z ε − ( k − v (cid:107) ≤ kε , where we used the ﬁrst claim and the inductive hypothesis.Furthermore, if range( T − I ) is closed, the statements follow by taking ε = 0 in the previous proofs. Proof of Theorem 3.

Let z ε be the point given by Lemma 2. We proceed to prove the ﬁrst two statements.1. It follows from z j = T j ( z ) that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 ( z j − z ) − v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 (cid:0) T j ( z ) − z − jv (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 (cid:0) ( T j ( z ) − T j ( z ε )) + ( z ε − z ) + ( T j ( z ε ) − z ε − jv ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 (cid:0) T j ( z ) − T j ( z ε ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 2( k + 1) (cid:13)(cid:13) z − z ε (cid:13)(cid:13) + ε , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 (cid:0) T j ( z ) − T j ( z ε ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k ( k + 1) k (cid:88) j =1 (cid:13)(cid:13) T j ( z ) − T j ( z ε ) (cid:13)(cid:13) ≤ k ( k + 1) k (cid:88) j =1 (cid:13)(cid:13) z − z ε (cid:13)(cid:13) = 2( k + 1) (cid:13)(cid:13) z − z ε (cid:13)(cid:13) , where the second inequality follows since T is nonexpansive.2. Notice that (cid:13)(cid:13)(cid:13)(cid:13) k ( z k − z ) − v (cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13) k (cid:0) ( T k ( z ) − z ) − ( T k ( z ε ) − z ε ) + ( T k ( z ε ) − z ε − kv ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13) k (cid:0)(cid:0) T k ( z ) − z (cid:1) − ( T k ( z ε ) − z ε ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13) + ε ≤ k (cid:0) (cid:107) T k ( z ) − T k ( z ε ) (cid:107) + (cid:107) z − z ε (cid:107) (cid:1) + ε ≤ k (cid:107) z − z ε (cid:107) + ε , where the ﬁrst inequality uses triangle inequality and Lemma 2, and the last inequality is from non-expansiveness of T .Finally, if range( T − I ) is closed, we obtain the results following the same argument as above with ε = 0 ,using the results for closed range( T − I ) from Lemma 2.A drawback of (15) and (16) in Theorem 3, as well as the results in Theorem 2, is that the constantsaccompanying the rates depend on ε . Nonetheless, we can bypass this issue, using (17) and (18), for problemswhere range( T − I ) is closed. The next proposition guarantees that range( T − I ) is indeed closed for a broadfamily of algorithms for solving LP. Proposition 3.

Let T : R d → R d be an operator that can be decomposed as T = T k ◦ · · · ◦ T where T j iseither an aﬃne mapping or a projection onto a polyhedron. Then, range( T ) is a ﬁnite union of polyhedra.Proof. The proof follows inductively. Assume that C = range( T j ◦ · · · ◦ T ) is a ﬁnite union of polyhedra.Without loss of generality, we can assume that C is equal to a single polyhedron. Now we consider two cases: Case 1 . Assume that T j +1 is an aﬃne transformation. This is a well-known consequence of Fourier-Motzkin elimination [43]. Case 2 . Assume that T j +1 is a projection onto a polyhedron Q . First, we start with an intuitive sketchof the proof and then formalize it. In this case, diﬀerent pieces of the polyhedron C are going to beprojected to diﬀerent faces of the polyhedron Q . Each one of these pieces is a polyhedron and sincethere are only ﬁnitely many faces of Q , the projection is a ﬁnite union of polyhedra.More formally, any polytope Q deﬁnes a ﬁnite polyhedral partition of the space { P F } F ∈ ∆ where ∆ isthe collection of faces of the polyhedron Q . Each cell P F corresponds to the region of the space thatprojects onto F , that is proj Q ( P F ) = F. Deﬁne a partition of the polyhedron C as { C F } F ∈ ∆ given by C F = P F ∩ C. Within each cell P F the projection proj Q | P F is an aﬃne transform. Thus, by Case 1we have that proj Q ( C F ) is a polyhedron and thus T j +1 ( C ) = proj Q ( C ) = (cid:91) F ∈ F proj Q ( C F ) is a ﬁnite union of polyhedra. I.e., each cell P F ⊆ R d is a polyhedron and ∪ F ∈ ∆ P F = R d .

12s a result of Proposition 3 applied to the PDHG update for solving an LP problem, range( T − I ) is apolyhedron, thus closed: Corollary 1.

Let T be the PDHG operator for an LP problem, then range ( T − I ) is a ﬁnite union of closedpolyhedra.Proof. Notice that the operator T − I is composite of linear operators and projection operators, thus weobtain the results by using Proposition 3. Remark 4.

Proposition 3 shows that range ( T − I ) is closed when T is an operator that corresponds to otherﬁrst-order algorithms, such as ADMM and mirror-prox, to solve an LP problem. Further, one can extend theresult to cover convex quadratic problems with polyhedral constraints. In Figure 1, we saw low-dimensional examples of the dynamics of PDHG when solving LP problems indiﬀerent feasibility settings. Indeed, such convergence/divergence dynamics generally hold when using PDHGto solve arbitrary LP problems. In this section, we present a complete description of the behavior of PDHGfor feasible and infeasible LP problems and discuss how to recover the infeasibility certiﬁcate from the iteratesof PDHG. The next theorem compiles the full characterization. The proof of this theorem will be deferred toSection 4.3.

Theorem 4.

Consider the primal ( P ) and dual ( D ) problems. Assume that ητ (cid:107) A (cid:107) < , let T be theoperator induced by (8) , and let ( z k ) be a sequence generated by the ﬁxed-point iteration for an arbitrarystarting point z . Then, one of the following holds:1. If both primal and dual are feasible , then the iterates ( x k , y k ) converge to a primal-dual solution z (cid:63) = ( x (cid:63) , y (cid:63) ) and v = ( T − I )( z (cid:63) ) = 0 .

2. If both primal and dual are infeasible , then both primal and dual iterates diverge to inﬁnity.Moreover, the primal and dual components of the inﬁmal displacement vector v = ( v x , v y ) give certiﬁcatesof dual and primal infeasibility, respectively.3. If the primal is infeasible and the dual is feasible , then the dual iterates diverge to inﬁnity, whilethe primal iterates converge to a vector x (cid:63) . The dual-component v y is a certiﬁcate of primal infeasibility.Furthermore, there exists a vector y (cid:63) such that v = ( T − I )( x (cid:63) , y (cid:63) ) .4. If the primal is feasible and the dual is infeasible , then the same conclusions as in the previousitem hold by swapping primal with dual. To show this characterization, we establish two intermediate results: ﬁrst, the inﬁmal displacement vector v is nonzero if, and only if, either the primal or dual problems are infeasible; and second, the iterates ( x k , y k ) “converge” to a well-deﬁned ray of the form ( x (cid:63) , y (cid:63) ) + λv for λ ∈ R + . The ﬁrst result describes the asymptoticdivergent behavior of the primal (resp. dual) iterates when the dual (resp. primal) problem is infeasible. Thesecond one, ensures the asymptotic convergence of the primal (resp. dual) iterates without any normalizationwhen the dual (resp. primal) problem is feasible. These two intermediate results are proved in Section 4.1and Section 4.2, respectively. 13 .1 The inﬁmal displacement vector recovers infeasibility certiﬁcates In Section 3, we demonstrated that the diﬀerences of iterates, the normalized iterates, and the normalizedaverage for a nonexpansive operator converge to the inﬁmal displacement vector v . Here, we show that theinﬁmal displacement vector v for PDHG applied to LP recovers infeasibility certiﬁcates whenever it is nonzero.First, some simple properties of v . Lemma 3.

Consider the primal ( P ) and dual ( D ) problems. Assume that ητ (cid:107) A (cid:107) < , let T be the operatorinduced by (8) , and let v = ( v x , v y ) be the inﬁmal displacement vector of T . Then v x ≥ , Av x = 0 , and A (cid:62) v y ≥ .Proof. From Theorem 1, k ( x k , y k ) → v = ( v x , v y ) and k ( z k +1 − z k ) → . (22)Notice that PDHG for LP has the following iteration update in terms of a diﬀerential inclusion, M (cid:20) x k − x k +1 y k − y k +1 (cid:21) ∈ (cid:20) N R n + ( x k +1 ) + A (cid:62) y k +1 + c − Ax k +1 + b (cid:21) , (23)where this relation comes from (9) and (3). Dividing (23) by k and letting k → ∞ , we have from (22) that ∈ lim k ∈∞ N R n + ( x k ) + A (cid:62) k y k ⊆ − R n + + A (cid:62) v y = ⇒ A (cid:62) v y ≥ , (24)where we utilize the fact that N R n + ( x ) ⊆ − R n + for any x ∈ R n + and lim k →∞ k y k = v y ; and k ∈∞ − k Ax k = − Av x = ⇒ Av x = 0 . (25)Furthermore, note that v x ≥ since v x = lim k →∞ x k /k and x k ≥ for all k .Now we derive the main result of this section. Proposition 4.

Consider the primal ( P ) and dual ( D ) problems. Assume that ητ (cid:107) A (cid:107) < , let T be theoperator induced by (8) , and let ( z k ) k ∈ N be a sequence generated by the ﬁxed-point iteration for an arbitrarystarting point z . Then, the primal problem ( P ) is infeasible if and only if v y is a nonzero vector, and in thiscase, v y is an infeasibility certiﬁcate for the primal problem. Analogously, the dual problem ( D ) is infeasibleif and only if v x is a nonzero vector, and in this case, v x is an infeasibility certiﬁcate for the dual problem.Proof. To establish the ﬁrst implication in this result we have to prove that if v y is non-zero, then v y is aninfeasibility certiﬁcate for the primal problem, namely, A (cid:62) v y ≥ and b (cid:62) v y < , thus the primal problem is infeasible. Similarly if v x is non-zero, then v x is an infeasibility certiﬁcate for thedual problem, namely Av x = 0 , v x ≥ , and c (cid:62) v x < , thus the dual problem is infeasible. We proved all the nonstrict inequalities in Lemma 3, so it suﬃces toshow the strict ones.First, consider the case when v x (cid:54) = 0 . Let B = { i ∈ [ n ] | ( v x ) i > } and let N = { i ∈ [ n ] | ( v x ) i = 0 } , then B ∪ N = [ n ] by noticing v x ≥ (from Lemma 3). Given a vector x , let x B be the vector of entries of x with indices in B ; similarly given a matrix A , let A B be the submatrix with columns of A with indices in B .Then for any i ∈ B , we have ( v x ) i > , thus there exists some K such that ( x k /k ) i > for all k ≥ K , andfurthermore ( x k +1 ) B = ( x k ) B − ηA (cid:62) B y k − ηc B . k → ∞ and noticing lim k →∞ ( x k +1 ) B − ( x k ) B = ( v x ) B , we obtain ( v x ) B = lim k →∞ − η ( A (cid:62) B y k + c B ) . Thus it holds that c (cid:62) v x = c (cid:62) B ( v x ) B = − η (cid:107) v x (cid:107) − lim k →∞ ( y k ) (cid:62) A B ( v x ) B = − η (cid:107) v x (cid:107) < , (26)where the last equality uses A B ( v x ) B = Av x = 0 . Combining with v x ≥ and Av x = 0 proves that v x is acertiﬁcate of infeasibility whenever it is nonzero.Second, consider the case when v y (cid:54) = 0 . By taking k → ∞ , we have v y = lim k →∞ y k +1 − y k = lim k →∞ τ A (2 x k +1 − x k ) − τ b = lim k →∞ τ Ax k +2 − τ b , (27)where the second equality uses the update rule (8), and the third equality uses lim k →∞ x k +1 − x k =lim k →∞ x k +1 + v x = lim k →∞ x k +2 .Now we claim the following two facts: Fact 1.

There exists some K such that if ( A (cid:62) v y ) i > then x ki = 0 for all k ≥ K . Fact 2.

The support (nonzero components) of A (cid:62) v y satisﬁes supp( A (cid:62) v y ) ⊆ N. The ﬁrst fact is because if ( A (cid:62) v y ) i > then we have that ( A (cid:62) y k /k ) i ≥ ( A (cid:62) v y ) i / > for large enough k. Dividing (23) by k yields − k ( A (cid:62) y k ) i + 1 ηk (cid:0) x k − x k +1 − ηc (cid:1) i ∈ N R n + ( x k +1 i ) . For large enough k , the second term on the left-hand side of the inclusion will be as small as ( A (cid:62) y k /k ) i / and hence the sign of entire expression on the left-hand side will be negative. If N R + (( x k +1 ) i ) contains anegative number, then ( x k +1 ) i = 0 , which implies that ( x k +1 ) i = 0 for large enough k .The second fact is because for any entry i in the support of A (cid:62) v y , namely ( A (cid:62) v y ) i > , it follows fromthe ﬁrst part that ( x k ) i = 0 for all k large enough, thus ( v x ) i = lim k →∞ k x ki = 0 , which proves the secondfact by the deﬁnition of the set N .Returning to the proof of Proposition 4, notice that lim k →∞ v (cid:62) y Ax k = lim k →∞ (cid:88) i ∈ N ( A (cid:62) v y ) i x ki = 0 , (28)where the ﬁrst equality uses Fact 2, and the second equality uses Fact 1. Therefore, it holds that v (cid:62) y b = − τ (cid:107) v y (cid:107) + lim k →∞ v (cid:62) y Ax k +2 = − τ (cid:107) v y (cid:107) < , where the ﬁrst inequality uses (27) and the second equality is from (28). Together with (24), we know v y isan infeasibility certiﬁcate for the primal problem.Now we turn to the inverse direction. Recall that it follows from the closedness of the set range( T − I ) that there exists a pair z (cid:63) = ( x (cid:63) , y (cid:63) ) such that T ( z (cid:63) ) = z (cid:63) + v . If the dual problem is infeasible, we will showthat v x (cid:54) = 0 by contradiction. Assume v x = 0 ; then it follows from the update rule (8) that x (cid:63) = proj R n + ( x (cid:63) − η ( A (cid:62) y (cid:63) + ηc )) , thus A (cid:62) y (cid:63) + ηc ≥ by noticing x (cid:63) ≥ , which contradicts the assumption that the dual problem is infeasible.If the primal problem is infeasible, we will show that v y (cid:54) = 0 by contradiction. Suppose v y = 0 , then it followsfrom the update rule (8) that y (cid:63) = y (cid:63) + τ A (2( x (cid:63) + v x ) − x (cid:63) ) − τ b , thus Ax (cid:63) = b by noticing Av x = 0 from (25). Furthermore, we know x (cid:63) ≥ , thus the primal is a feasibleproblem, which contradicts with assumption.This concludes the proof. 15 .2 The iterates converge to a ray Combining facts from the previous sections we know that if both primal and dual problems are feasible thenthe iterates (without normalization) will converge to a solution, and when both primal and dual problems areinfeasible then the normalized iterates converge to a vector ( v x , v y ) with nonzeros on both primal and dualcomponents. Yet the techniques used to prove these results do not explain what happens when one of theproblems is feasible and the other one is infeasible. In this scenario the convergence of the primal and dualiterates happen at diﬀerent scales, one with normalization by k and the other without it. In this section, weﬁll in this gap by showing that the iterates of PDHG always converge to ray with direction v , emanatingfrom a point z (cid:63) . In turn, this allows us to connect the convergence results for the two scales. Deﬁnition 1 ( Ray ) . Given a starting point z (cid:63) ∈ R n + m and a direction v ∈ R n + m , we deﬁne their ray as [ z (cid:63) , v ] = { z (cid:63) + λv | λ ∈ R + } . With this deﬁnition at hand we can now state the main result of this section.

Theorem 5.

Consider the primal ( P ) and dual ( D ) problems. Assume that ητ (cid:107) A (cid:107) < , let T be theoperator induced by (8) , and let ( z k ) k ∈ N be a sequence generated by the ﬁxed-point iteration for an arbitrarystarting point z . Then, the iterates of PDHG converge to a ray [ z (cid:63) , v ] , in particular (cid:107) z k − z (cid:63) − kv (cid:107) → for some z (cid:63) ∈ ( T − I ) − ( v ) . To prove this result, we establish a connection between the iterates of PDHG applied to the original(possibly infeasible) problem and the iterates of PDHG applied to a feasible auxiliary LP problem. Let usstart by deﬁning this auxiliary problem. Deﬁne the index sets B = { i ∈ [ n ] | ( v x ) i > } ,N = { i ∈ [ n ] | ( v x ) i = 0 , ( A (cid:62) v y ) i = 0 } ,N = { i ∈ [ n ] | ( v x ) i = 0 , ( A (cid:62) v y ) i > } . (29)Deﬁne the operator ˜ T : R n + m → R n + m given by ˜ T ( z ) := z + with ( x + ) B = x B − ηA (cid:62) B y − ηc B − ( v x ) B ( x + ) N = proj R | N | + ( x N − ηA (cid:62) N y − ηc N ) − ( v x ) N ( x + ) N = − ( v x ) N y + = y + τ A (2 x + − x ) − τ b − v y . (30)In turn, this is a PDHG operator for the auxiliary LP problem:minimize ( c B + ( v x ) B /η ) (cid:62) x B + c (cid:62) N x N + c (cid:62) N x N subject to A B x B + A N x N + A N x N = b + v y τx N ≥ , x N = 0 . (31)Then we claim the following connection between T and ˜ T . Proposition 5.

Given an arbitrary initial solution z , there exists a large enough K ∈ N such that ˜ T k ( z K ) = T k ( z K ) − kv for all k ≥ . (32) Proof.

For any initial solution, we know that there exists some K such that it holds for any k ≥ K that ( x k ) B > and ( x k ) N = 0 . (33)16he former is because ( v x ) B > , and the latter follows from Fact 1. With some abuse of notation, we let z ← z K , so that we may study the iterates starting at z (rather than starting at z K ). From Lemma 3: v x ≥ , Av x = 0 , and A (cid:62) v y ≥ . In addition, from the converse of Fact 1, ( A (cid:62) v y ) B = 0 . We show the stated claim by induction. Denote z k = T k ( z ) and ˜ z k = ˜ T k ( z ) . First, (32) holds with k = 0 . Now suppose (32) holds for k , and consider k + 1 . Then by induction we have ˜ z k +1 = ˜ T ( z k − kv ) ,thus it holds by (30) that (˜ x k +1 ) B = ( x k ) B − k ( v x ) B − ηA (cid:62) B ( y k − kv y ) − ηc B − ( v x ) B = ( x k +1 ) B − ( k + 1)( v x ) B where the second equality utilizes the update rule of PDHG by noticing A (cid:62) B v y = 0 and ( x k +1 ) B > . For thecomponents in N we get (˜ x k +1 ) N = proj R | N | + (( x k ) N − k ( v x ) N − ηA (cid:62) N ( y k − kv y ) − ηc N ) − ( v x ) N = proj R | N | + (( x k ) N − ηA (cid:62) N y k − ηc N )= ( x k +1 ) N − ( k + 1)( v x ) N , where the second equality follows from A (cid:62) N v y = 0 and ( v x ) N = 0 , the third one utilizes ( v x ) N = 0 and theupdate rule of PDHG. Similarly, for the N block (˜ x k +1 ) N = − ( v x ) N = 0 = ( x k +1 ) N − ( k + 1)( v x ) N , where the equations follow from ( x k +1 ) N = 0 and ( v x ) N = 0 . Finally, for the dual iterates we have ˜ y k +1 = y k − kv y + τ A (2˜ x k +1 − ˜ x k ) − τ b − v y = y k + τ A (2( x k +1 − ( k + 1) v x ) − ( x k − kv x )) − τ b − ( k + 1) v y = y k + τ A (2 x k +1 − x k ) − τ b − ( k + 1) v y = y k +1 − ( k + 1) v y where the third equality utilizes Av x = 0 , and the last equality is from the update rule of PDHG.Equipped with this proposition we can now prove the theorem. Proof of Theorem 5.

Since ˜ T is a PDHG operator, it is ﬁrmly nonexpansive with respect to (cid:107) · (cid:107) M . Thus, if ˜ T has a ﬁxed point, then the iteration ˜ T k ( z K ) should converge to it. To see that ˜ T has a ﬁxed point, let z (cid:63) be a point such that ( T − I )( z (cid:63) ) = v and let K be the iteration after which ˜ T k ( T K ( z (cid:63) )) = T k + K ( z (cid:63) ) − kv ,which exists thanks to Proposition 5. We claim that T K ( z (cid:63) ) is a ﬁxed point of ˜ T . To see this, note that ˜ T ( T K ( z (cid:63) )) = T K +1 ( z (cid:63) ) − v = T K ( z (cid:63) ) , where the last equality follows from Lemma 2.Now, let z an arbitrary initial point and K be the in (32). Now that we know that the set of ﬁxed pointsof ˜ T is nonempty, we can deﬁne z (cid:63) = lim k →∞ ˜ T k ( T K ( z )) . We will prove that z (cid:63) satisﬁes ( T − I )( z (cid:63) ) = v .Due to Proposition 5, we know that z (cid:63) = ˜ T ( z (cid:63) ) = T ( z (cid:63) ) − v. Finally, using decomposition (32) we get (cid:107) z k + K − z (cid:63) − kv (cid:107) = (cid:107) ˜ T ( z k ) − z (cid:63) (cid:107) → . (34)The statement of theorem claimed this convergence where the coeﬃcient accompanying v is ( k + K ) . Wecan get around this by setting z (cid:63) ← z (cid:63) − Kv , a point that also satisﬁes ( T − I )( z (cid:63) ) = v thanks to Lemma 2.This establishes the result. 17 .3 Proof of Theorem 4 Proof.

As a direct result of Proposition 4, we know that if both primal and dual are feasible, then v = 0 and Theorem 5 ensures that PDHG converges to an optimal solution (or equivalently a ﬁxed point of T ). Ifprimal (and/or dual) is infeasible, then the dual iterate of PDHG (and/or primal) diverges to inﬁnity, andthe diverging direction recovers primal (and/or dual) infeasibility certiﬁcate.Thus, the only thing left to prove is the ﬁnal conclusion of item 3 (and analogously item 4). Assumethat the primal problem is infeasible and the dual is feasible. By Proposition 4 we know that v x = 0 . Then,Theorem 5 guarantees the existence of some z (cid:63) = ( x (cid:63) , y (cid:63) ) such that x k → x (cid:63) + kv x = x (cid:63) and ( T − I )( z (cid:63) ) = v .The proof for the case where the primal is feasible and the dual is infeasible follows from an analogousargument. This completes the proof of the theorem. In this section, we introduce a nondegeneracy condition that ensures that after a ﬁnite amount the diﬀerenceof iterates converges linearly to the inﬁmal convergence vector. To show this, we demostrate under saidcondition the iterates “identify” the support of x k , i.e., the support freezes after a ﬁnite number iterations.Finite-time identiﬁcation has a long history in the analysis of iterative algorithms for feasible problems[22, 14, 13, 12, 26, 34, 37]. Roughly speaking, these algorithms’ behaviors exhibit two phases: a ﬁrst one thatonly takes ﬁnitely many steps but suﬀers from slow sublinear convergence, and then a second one after theactive set is identiﬁed where the convergence is signiﬁcantly faster and becomes linear.For PDHG, ﬁnite-time identiﬁability is known to hold for feasible minimax problems under suitablenondegeneracy conditions [37]. In contrast, here we study this phenomenon for infeasible LP problems. Wedemonstrate that even when there is no primal (and/or dual) feasible solution, active set of the iterates withrespect to an auxiliary feasible LP problem is ﬁxed after ﬁnitely many iterations .Recall that the iterates of PDHG converge to a ray [ z (cid:63) , v ] = { z (cid:63) + kv | k ∈ N } where z (cid:63) is a solution tothe feasible LP problem given by (31). Consider the constraint set deﬁned by said auxiliary problem, that is Ax = b + v y τ , x N ≥ and x N = 0 . (35)Here, the active set is the set of inequality constraints that attain their extreme values, namely, { i ∈ N | x (cid:63)i = 0 } . Note that when the problem is feasible, N = { , . . . , n } and thus the constrained set deﬁned by theauxliary problem (35) matches that of the original problem.Now, we introduce the nondegeneracy condition for possibly infeasible problems, which generalizesthe classical identiﬁbility theory of PDHG for feasible LP problems [34, 37]. Similar variations of thenondegeneracy condition have appeared in numerous works that deal with ﬁnite time identiﬁability. Furthergeneralizations of this idea have led to conditions beyond the context of optimization, we refer the interestedreader to [35] for a perspective from diﬀerential geometry. Deﬁnition 2.

A ray [ z (cid:63) , v ] is nondegenerate if for any i ∈ N (recall N is deﬁned in (29) ), the pair ( x (cid:63)i , ( A (cid:62) y (cid:63) + c ) i ) satisﬁes strict complementarity with respect to the auxiliary problem (31) : x (cid:63)i > if, andonly if, ( A (cid:62) y (cid:63) + c ) i = 0 . Although here we chose to deﬁne nondegeneracy in terms of the extreme point z (cid:63) , this deﬁnition isindependent of the point we take in the ray [ z (cid:63) , v ] . This follows easily from the fact that ( v x ) N and ( A (cid:62) v y ) N are zero vectors. Additionally, notice that when the original problem is feasible, this deﬁnition reduces to theclassical strict complementarity of the original problem.The next lemma presents the ﬁnite time identiﬁbility of PDHG for infeasible LP: Lemma 4.

Suppose [ z (cid:63) , v ] is a nondegenerate ray. Then, every PDHG iterate sequence z k = ( x k , y k ) ,converging to the ray [ z (cid:63) , v ] , ﬁxes the active set of (35) after ﬁnitely many steps. Furthermore, this ensuresthat the support of x k is ﬁxed for all large enough k . roof. First let us prove that the active set of (35) is identiﬁed in ﬁnite time. Notice that this is equivalent tosaying that the support of x kN freezes after ﬁnitely many iterations. Let i ∈ N , due to strict complementaryit is enough to consider two cases: Case 1 . Assume that ( Ay (cid:63) + c ) i > , then complementary slackness implies ( x k ) i → . By construction,we should have η · ( A (cid:62) y k + c ) i > x ki for all k large enough. After this condition starts to hold, thePDHG update at i gives x k +1 i = (cid:0) x ki − η · ( A (cid:62) y k + c ) i (cid:1) + = 0 . Hence, we have x ki = 0 for all large k . Case 2 . Assume that x (cid:63)i > , then after ﬁnitely many iterations we have x ki > .Thus, the support of x kN is identiﬁed in ﬁnite time. Now, we argue that the same happens to the supportof x k . Assume that i ∈ B , then ( v x ) i > and consequently for all large k we have x ki /k > , as we wanted.Lastly, if i ∈ N then A (cid:62) v y > and Fact 1 guarantees that x ki = 0 for large enough k . This concludes theproof.When nondegeneracy holds, PDHG eventually identiﬁes the support of the primal iterate x k . Thissimpliﬁes the form of each iteration. Let S be the support of any x k with k large enough. The projection tothe positive orthant applied by PDHG (8) becomes a projection to the subspace { x | supp( x ) = S } . As aconsequence, one can recast each iteration (8) as an aﬃne transformation: (cid:20) x k +1 y k +1 (cid:21) = (cid:20) I − ηDA (cid:62) τ AD I − τ ηADA (cid:62) (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) Q := (cid:20) x k y k (cid:21) − (cid:20) ηDc τ ηADc + τ b (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) p := (36)where D is a diagonal matrix with ones on the indices ( i, i ) such that i ∈ S and zeros everywhere else, andmatrix Q and vector p are deﬁned in (36). Theorem 6 ( Eventual linear convergence under nondegeneracy ) . Consider the primal ( P ) and dual ( D ) problems. Assume that ητ (cid:107) A (cid:107) < , and let ( z k ) k ∈ N be a sequence generated by PDHG. Suppose that theiterates z k = ( x k , y k ) converge to a nondegenerate ray. Then, the k th power of Q converges lim k →∞ Q k = Q ∞ to a projection matrix, and there exists a ﬁnite K such that for any k ≥ , the active set of x K + k is ﬁxed.Furthermore, there exist positive constants µ, c , c , c , c > such that the following holds:1. Linear convergence of the diﬀerences . For any µ ∈ ( (cid:112) − ητ σ min ( A ) , , the diﬀerences z K + k +1 − z K + k converge at a linear rate to the inﬁmal displacement vector v , i.e., µ k (cid:107) ( Q − I ) z K + ( Q ∞ − I ) p (cid:107) ≤ (cid:107) z K + k +1 − z K + k − v (cid:107) ≤ µ k (cid:107) ( Q − I ) z K − p (cid:107) (37) for all suﬃciently large k .2. Sublinear convergence of the iterates . The normalized iterates converge to v at a Θ (cid:0) k (cid:1) rate, i.e., c k (cid:107) ( I − Q ∞ ) p (cid:107) ≤ (cid:13)(cid:13)(cid:13)(cid:13) k z K + k − v (cid:13)(cid:13)(cid:13)(cid:13) ≤ c k (cid:0) (cid:107) ( Q − I ) † ( I − Q ∞ ) p − z K (cid:107) + (cid:107) z K (cid:107) (cid:1) (38) for all suﬃciently large k .3. Sublinear convergence of the average . The normalized average converges to v at a Θ( k ) rate, i.e., c k + 1 (cid:107) ( I − Q ) p (cid:107) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 z K + j − v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c k + 1 (cid:0) (cid:107) ( Q − I ) † ( I − Q ∞ ) p − z K (cid:107) + (cid:107) z K (cid:107) (cid:1) (39) for all suﬃciently large k .

19 few remarks are in order.

Remark 5.

Although equations (38) and (39) state a bound for the normalized iterates and normalizedaverage of PDHG started from z K , this result implies the same asymptotic bounds (with sightly worseconstants) for the normalized iterates and normalized averaged started from z . Thus, the result concludesthat under nondegeneracy, the diﬀerence of iterates converges much faster than the iterates and average. Remark 6.

The proof of this result shows the same rates for Bilinear games. Recall that a Bilinear game isa minimax problem of the form min x ∈ R n max x ∈ R m c (cid:62) x + (cid:104) Ax, y (cid:105) − b (cid:62) y . For these problems, the updates of PDHG (Algorithm 1) take the form of (36) with D = I . Thus, all thearguments in the proof of this result follow with K = 0 . In particular, this shows that the upper bound derivedin Theorem 3 is tight for Bilinear minimax problems. Remark 7.

Every LP problem that has an optimal solution furthermore has at least one primal-dual solutionthat satisﬁes strict complementarity [43, Theorem 2.35]. Consequently, every problem has at least onenondegenerate ray. Thus there exists at least one initial point z , such that if PDHG is initialized at thispoint, then the iterates converge to the nondegenerate ray and thus enjoy linear convergence.Proof of Theorem 6. We start by making a few simplifying assumptions. First we assume D = I . If that isnot the case, we can consider a submatrix of A where we trim out the columns indexed by { i ∈ n | D ii = 0 } .This has no eﬀect in the end result since for all these indices x ki = 0 , and thus it does not aﬀect the nonzeroentries of x k nor the entries of y k . Furthermore, we assume that A is diagonal, otherwise we can decomposethe matrix Q as Q = (cid:20) V U (cid:21) (cid:20) I η Σ (cid:62) τ Σ I − ητ ΣΣ (cid:62) (cid:21) (cid:20) V (cid:62) U (cid:62) (cid:21) (40)where A = U Σ V (cid:62) is the SVD decomposition of A . Changing basis is equivalent to applying an orthogonalmap and thus does not alter the metric nor the algorithm (which reduces to an aﬃne transform as soon asthe active set is identiﬁed), thus we can change the basis to enforce this assumption. Notice that the columnsand rows of Q can be further permuted so that it becomes a block diagonal matrix of the form Q =  B . . . B min { n,m } I  where B i = (cid:20) − ησ i τ σ i − τ ησ i (cid:21) , (41)where σ i is a the i th singular value of A . Note that when n > m (or n < m ), we might also have blockcorresponding to an identity of size n − m (resp. m − n ), the arguments below can be easily extended tocover the identity block (as it follows from the rationale applied when σ i = 0 ) and so we assume that n = m .Thus, from now on Q has the form (41) with n = m and hence without the last identity block.Now, we can compute closed-form formulas for the three certiﬁcate candidates (2a)-(2c). Let K be thesmallest integer after which the PDHG iteration can be written as (36). By recursively expanding, we obtain z k +1+ K = Qz k + K − p = Q z k − K − Qp − p ... = Q k +1 z K − k (cid:88) i =0 Q i p . (42)20f we take the diﬀerence between to consecutive iterates, this yields z k +1+ K − z k + K = Q k ( Q − I ) z K − Q k p = Q k (( Q − I ) z K − p ) . (43)On the other hand, summing the ﬁrst k iterates (42) gives k (cid:88) j =1 z j + K = k (cid:88) j =1 Q j z K − k (cid:88) j =1 j − (cid:88) l =0 Q l p . (44)We will show that Q k converges to a matrix Q ∞ . We deﬁne the matrix Q ∞ as a block diagonal matrix Q ∞ =  B ∞ . . . B ∞ n  where B ∞ i = (cid:40) I if σ i = 00 otherwise. (45)Since each block is independent of each other, we can analyze Q k by studying B ki . A simple calculationreveals that the i th block has two eigenvalues of the form λ ± i = (1 − ητ σ i ) ± i (cid:0) ητ σ i (cid:0) − ητ σ i (cid:1)(cid:1) . Taking the norm, we ﬁnd ρ ( B i ) = | λ ± i | = (cid:112) − ητ σ i . Then, we have that the iterated product of the i thblock B ki converges to B ∞ i . To see this, consider two cases:

Case 1 . Assume that σ i > . By assumption < − ητ σ i < hence B ki → B ∞ i . This follows sincethe spectral radius ρ ( B ki ) = (1 − ητ σ i ) k/ → as k . Case 2 . Assume that σ i = 0 . Then B ki = B i = I = B ∞ i .The matrix Q ∞ turns out to be the projection onto the kernel of Q − I (that is, Q ∞ ( Q − I ) = 0 ). We use ∆ ⊆ [ n ] to denote the set of indices such that σ i > . Diﬀerences . We start by analyzing the diﬀerences (43). To prove the upper bound, we expand (cid:107) z K + k +1 − z K + k − v (cid:107) = (cid:107) Q k ((( Q − I ) z K − p ) − Q ∞ (( Q − I ) z K − p ) (cid:107) ≤ (cid:107) Q k − Q ∞ (cid:107) (cid:107) ( Q − I ) z K − p (cid:107) = max i ∈ ∆ (cid:107) B ki (cid:107) (cid:107) ( Q − I ) z K − p (cid:107) ≤ µ k (cid:107) ( Q − I ) z K − p (cid:107) , where the ﬁrst equality comes from taking lim k →∞ of (43), the ﬁrst inequality used the fact that (cid:107) W z (cid:107) ≤(cid:107) W (cid:107) (cid:107) z (cid:107) for any matrix W and vector z , and the last inequality follows since ρ ( B i ) = lim (cid:107) B ki (cid:107) k and hencefor any µ ∈ ( ρ ( B i ) , we have that (cid:107) B ki (cid:107) ≤ µ k for all large enough k .Now we turn our attention to the lower bound. Given a vector z , we deﬁne z i to be the vector with thetwo components that multiply the block B i in the matrix-vector product Qz . Using the same expansion as21bove, we get (cid:107) z K + k +1 − z K + k − v (cid:107) = (cid:107) Q k ((( Q − I ) z K − p ) − Q ∞ (( Q − I ) z K − p ) (cid:107) = (cid:88) i ∈ [ n ] (cid:107) ( B ki − B ∞ i )(( Q − I ) z K − p ) i (cid:107) = (cid:88) i ∈ ∆ (cid:107) B ki (( Q − I ) z K − p ) i (cid:107) ≥ (cid:88) i ∈ ∆ σ min ( B ki ) (cid:107) (( Q − I ) z K − p ) i (cid:107) ≥ min i ∈ ∆ σ min ( B ki ) (cid:88) i ∈ ∆ (cid:107) (( Q − I ) z K − p ) i (cid:107) = min i ∈ ∆ σ min ( B ki ) (cid:107) ( I − Q ∞ )(( Q − I ) z K − p ) (cid:107) = (cid:18) min i ∈ ∆ σ min ( B i ) (cid:19) k (cid:107) ( Q − I ) z K + ( Q ∞ − I ) p (cid:107) where the second equality and the penultimate one use the block-diagonal structure of the matrix Q todecompose the norm of the matrix-vector product into orthogonal components, and the ﬁrst inequality followssince B ki is a rank two matrix for any i ∈ ∆ . Normalized iterates . We now turn our attention to the normalized iterates. The upper bound followsalmost immediately from Theorem (3) if we consider the PDHG algorithm started at z K . To show the boundwith the theorem, it suﬃces to note that 1) in this case v = − Q ∞ p and so we might pick z (cid:63) = ( Q − I ) † ( I − Q ∞ ) p ,and 2) all the norms in ﬁnite dimensional spaces are equivalent so we can upper bound (cid:107) · (cid:107) M ≤ C (cid:107) · (cid:107) forsome constant C > . To see the ﬁrst point note that Qz (cid:63) − p − z (cid:63) = − Q ∞ p ⇐⇒ ( Q − I ) z (cid:63) = ( I − Q ∞ ) p ⇐ = z (cid:63) = ( Q − I ) † ( I − Q ∞ ) p , where ( I − Q ) † is the pseudo inverse of ( I − Q ) . Now we turn our attention to the lower bound. Just as before we analyze the dynamics of Q k by studyingthe individual blocks B ki . We will use the following identity for blocks satisfying ρ ( B ) < : k (cid:88) j =0 B j = ( I − B ) − ( I − B k +1 ) . (46)Recall that ∆ = { i | σ i > } , which corresponds with blocks satisfying ρ ( B i ) < . Additionally, recall that p i is the vector with the two components of p that multiply the block B i in the matrix-vector product Qp .Expanding we get (cid:13)(cid:13)(cid:13)(cid:13) v − k z K + k +1 (cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q ∞ p + 1 k k (cid:88) j =0 Q j p − Q k +1 z K (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:88) i ∈ [ n ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) B ∞ i p i + 1 k k (cid:88) j =0 B ji p i − k B k +1 i z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 k (cid:88) i ∈ ∆ (cid:13)(cid:13) ( I − B i ) − ( I − B k +1 i ) p i − B k +1 i z Ki (cid:13)(cid:13) + 1 k (cid:88) i/ ∈ ∆ (cid:107) B k +1 i z Ki (cid:107) ≥ k (cid:88) i ∈ ∆ ( σ max ( I − B i )) − (cid:13)(cid:13) ( I − B k +1 i ) p i − ( I − B i ) B k +1 i z Ki (cid:13)(cid:13) Q is block diagonal, and the last inequality followssince I − B i is invertible for i ∈ ∆ . Then, taking the minimum coeﬃcient we get k (cid:88) i ∈ ∆ σ max ( I − B i ) − (cid:13)(cid:13) ( I − B k +1 i ) p i − ( I − B i ) B k +1 i z K (cid:13)(cid:13) ≥ k min j ∈ ∆ (cid:8) σ max ( I − B j ) − (cid:9) (cid:88) i ∈ ∆ (cid:13)(cid:13) ( I − B k +1 i ) p i − ( I − B i ) B k +1 i z K (cid:13)(cid:13) = 1 k min j ∈ ∆ (cid:8) σ max ( I − B j ) − (cid:9) (cid:13)(cid:13) ( I − Q k +1 ) p − ( I − Q ) Q k +1 z K (cid:13)(cid:13) ≥ k min j ∈ ∆ (cid:8) σ max ( I − B j ) − (cid:9) (cid:107) ( I − Q ∞ ) p (cid:107) , where the last equality uses the fact that the matrices we handle are block diagonal, and the last line followssince (cid:107) ( I − Q k +1 ) p − ( I − Q ) Q k +1 z K (cid:107) → (cid:107) ( I − Q ∞ ) p (cid:107) , and so the inequality holds for suﬃciently large k ≥ . Normalized average . The upper bound follows from the exact same argument as the normalized iteratesby using Theorem 3.The lower bound for this case is sightly more intricate. We expand and apply (46) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v − k ( k + 1) k (cid:88) j =1 z K + j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q ∞ p − k ( k + 1) k (cid:88) j =1 j − (cid:88) l =0 Q l p + 2 k ( k + 1) k (cid:88) j =1 Q j z K (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:88) i ∈ [ n ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) B ∞ i p i − k ( k + 1) k (cid:88) j =1 j − (cid:88) l =0 B li p i + 2 k ( k + 1) k (cid:88) j =1 B j z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:88) i ∈ ∆ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 j − (cid:88) l =0 B li p i + 2 k ( k + 1) k (cid:88) j =1 B ji z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:88) i/ ∈ ∆ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 B ji z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 4 k ( k + 1) (cid:88) i ∈ ∆ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( I − B i ) −  k (cid:88) j =1 ( I − B ji ) p i + ( I − B k +1 i ) z Ki  − z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:88) i/ ∈ ∆ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 B ji z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . The second equality follows from the fact that Q and Q ∞ are block diagonal. Then, dropping the second23um and using the fact that (cid:107) ( I − B i ) − z (cid:107) ≥ σ max ( I − B i ) (cid:107) z (cid:107) for all z and i ∈ ∆ , we can lower bound k ( k + 1) (cid:88) i ∈ ∆ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( I − B i ) −  k (cid:88) j =1 ( I − B ji ) p i + B i ( I − B ki ) z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:88) i/ ∈ ∆ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k ( k + 1) k (cid:88) j =1 B ji z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ k ( k + 1) (cid:88) i ∈ ∆ σ max ( I − B i ) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k (cid:88) j =1 ( I − B ji ) p i + B i ( I − B ki ) z Ki (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ k ( k + 1) min i ∈ ∆ (cid:8) σ max ( I − B i ) − (cid:9) (cid:88) i ∈ ∆ (cid:13)(cid:13) kp i − ( I − B i ) − B i ( I − B ki ) p i + B i ( I − B ki ) z Ki (cid:13)(cid:13) ≥ k ( k + 1) min i ∈ ∆ (cid:8) σ max ( I − B i ) − (cid:9) (cid:88) i ∈ ∆ (cid:13)(cid:13) k ( I − B i ) p i − B i ( I − B ki ) p i + ( I − B i ) B i ( I − B ki ) z Ki (cid:13)(cid:13) = 4 k ( k + 1) min i ∈ ∆ (cid:8) σ max ( I − B i ) − (cid:9) (cid:13)(cid:13) k ( I − Q ) p − Q ( I − Q k ) p + ( I − Q ) Q ( I − Q k ) z K (cid:13)(cid:13) = 4( k + 1) min i ∈ ∆ (cid:8) σ max ( I − B i ) − (cid:9) (cid:13)(cid:13)(cid:13)(cid:13) ( I − Q ) p − k Q ( I − Q k ) p + 1 k ( I − Q ) Q ( I − Q k ) z K (cid:13)(cid:13)(cid:13)(cid:13) ≥ k + 1) min i ∈ ∆ (cid:8) σ max ( I − B i ) − (cid:9) (cid:107) ( I − Q ) p (cid:107) , where the last inequality follows for large enough k since (cid:13)(cid:13) − k Q ( I − Q k ) p + k ( I − Q ) Q ( I − Q k ) z K (cid:13)(cid:13) → .This completes the proof. In this section, we test numerically the proposed approach to check infeasibility using PDHG. For theexperiments, we implemented PDHG for LP problems of the formminimize c (cid:62) x subject to Ax ≥ bl ≤ x ≤ u maximize b (cid:62) y + l (cid:62) r + − u (cid:62) r − subject to c − A (cid:62) y = ry ≥ (47)where b ∈ R m , l ∈ ( R ∪ {−∞} ) n , u ∈ ( R ∪ {∞} ) n , A ∈ R m × n are given and r + = proj R + ( r ) and r − = − proj R + ( − r ) are the projections of r onto the positive and negative orthant, respectively. Wechose this form over the standard form ( P )-( D ) since it is algorithmically easier to reduce arbitrary LPproblems to it. Given that the PDHG algorithm for this formulation generates iterates ( x k , y k ) , for ourcomputations we generate r k by ﬁnding the closest point to c − A (cid:62) y k that makes the dual objective ﬁnite;i.e., r k = proj C v ( c − A (cid:62) y k ) with C v deﬁned below. All the results proved in this work also apply to thisform under suitable modiﬁcations of the statements.For our experiments we use the Netlib dataset of infeasible LP instances [28]. We use this dataset toillustrate the diﬀerent dynamics that PDHG exhibits. For all our experiments we measure statistics thatquantify how close are the candidate iterates (2) to being approximate certiﬁcates of inteasibility.Before we describe these statistics, let us deﬁne what we mean by approximate infeasibility certiﬁcates.The set of (exact) primal infeasibility certiﬁcates for (47) is given by all the vectors ( y, r ) ∈ R m + × R n satisfying b (cid:62) y + l (cid:62) r + − u (cid:62) r − > and r = − A (cid:62) y (48)while the set of (exact) dual infeasibility certiﬁcates is given by all the vectors x ∈ R n satisfying c (cid:62) x < , x ∈ C v , and Ax ≥ (49)24 a) box1 (b) woodinfe (c) ex72a (d) ex73a Figure 2: Scaled certiﬁcate error (52) for the three sequences deﬁned in (2) for four instances of the Netlibinfeasible dataset [28]. Vertical dotted lines denote the last observed active set change.where the set C v is given by C v =  x : for i ∈ [ n ] , x i  = 0 if l i , u i ∈ R , ≥ if l i ∈ R , u i = ∞ , ≤ if l i = −∞ , u i ∈ R ,  . We deﬁne an ε -approximate primal infeasibility certiﬁcate to be any point ( y, r ) ∈ R m + × R n satisfying b (cid:62) y + l (cid:62) r + − u (cid:62) r − > and ( b (cid:62) y + l (cid:62) r + − u (cid:62) r − ) − (cid:107) r + A (cid:62) y (cid:107) ∞ ≤ ε. (50)Similarly we say that a point x ∈ R m is an (cid:15) -approximate dual infeasibility certiﬁcate if it satisﬁes c (cid:62) x < , − c (cid:62) x · (cid:107) x − proj C v ( x ) (cid:107) ∞ ≤ ε, and − c (cid:62) x · (cid:107) Ax − proj R m + ( Ax ) (cid:107) ∞ ≤ ε. (51)These deﬁnitions parallel the criteria to detect infeasibility used by SCS [60], a popular open-source solver.Since all the instances in the Netlib infeasible data set are primal infeasible, we will only plot informationabout the dual components of the candidate certiﬁcates (2). To illustrate how close is each candidate tobeing a certiﬁcate we will plot (cid:107) r k + A (cid:62) y k (cid:107) ∞ b (cid:62) y k + l (cid:62) r k + − u (cid:62) r k − . (52)25 a) bgdbg1 (b) chemcom Figure 3: Scaled certiﬁcate error (52) for the three sequences deﬁned in (2) for four instances of the Netlibinfeasible dataset [28].We call this quantity the scaled certiﬁcate error . If the objective term, i.e., the denominator, is negative atany iteration then we do not plot it for that iteration. For almost all the problems we consider (52) remainspositive for almost all iterations.

Nondegeneracy in practice . Our ﬁrst batch of experiments showcases the faster convergence of thediﬀerence of iterates in practice. We found empirically that for a subset of instances in the dataset, thediﬀerence of iterates (2a) detects infeasibility faster than the other two sequences. Based on the theory, weexpect that for these instances, the diﬀerence exhibits eventual faster convergence. To test this claim, we runan experiment on four of these instances: box1 , woodinfe , ex73a and ext72a .Figure 2 displays the scaled certiﬁcate error (52) against the number of iterations for the four instances.For all of them, we can see a clear phase transition between a ﬁrst stage of slow convergence and a secondstage that displays linear convergence. This transition is unequivocally marked by the last change of theactive set of the solution (also depicted in the ﬁgure). Notice, however, that the iteration number at which theactive set is ﬁxed might be large; the point at which this happens ranges among multiple orders of magnitudein our experiments. Normalized iterates can be faster . Even if eventual identiﬁability holds, this might take a signiﬁcantnumber of iterations. In these cases it might be beneﬁcial to check infeasibility using the normalized iterated(2b) and the normalized average (2c). In this batch of experiments we run PDHG on bgdbg1 and chemcom ,the results are displayed in Figure 3. Just as before we plot the scaled certiﬁcate error against the number ofiterations.The normalized average is consistently slower at converging than the normalized iterates. This is mostlikely due to the fact that it retains a tail of initial iterates, which are far away from the inﬁmal displacementvector. For both these problems, the diﬀerence takes at least twice the number of iterations than thenormalized iterates to obtain a highly accurate certiﬁcate, i.e., ε = 10 − . This suggests that solvers maybeneﬁt from checking infeasibility with both the normalized iterates (2b) and diﬀerence of iterates (2a). In this work, we showed how to detect infeasibility of LP programs using the iterates of PDHG. Theproposed approach is simple, doesn’t require changing the logic of the algorithm, and has an almost negligiblecomputational cost. Thus, it is suitable for large-scale LP problems. The ideas developed in this work pavethe road for a number of research directions. Below we list some of these directions and superﬁcially addresstheir main diﬃculties.It would be interesting to understand how to integrate our theoretical results with enhanced versions of26DHG. For example, for practical applications it is often useful to have varying stepsizes η k and τ k thatcompensate for the diﬀerent scales that the primal and dual problems might have [41]. In this case, thereis no unique PDHG operator T applied at each iteration, rather a sequence of operators T k and even thedeﬁnition of v is not well-posed.In a similar vein, we wonder if it is possible to speed up the convergence of some of the sequences in (2)using a restart scheme. This is known to be the true in the feasible case. Tackling this question would requiredesigning a more involved method that restarts the PDHG iterates and the candidate certiﬁcates periodically.It is unclear if it is possible to have a scheme that works seamlessly for both the feasible and infeasible cases.The proofs of most of our results rely on the form of the PDHG updates. For instance, one of our keyarguments established a connection between the iterates of PDHG running on an infeasible problem and theiterates of PDHG running on an auxiliary feasible problem. It is natural to wonder if this is a phenomenonthat only holds for PDHG or if it generalizes to other algorithms and/or other problem families such asquadractic and semideﬁnite programming.Another potential research direction is the application of PDHG for solving the homogeneuous self-dualembedding problem akin to [53]. A naive implementation of this idea would require doubling the dimension ofthe primal problem. The authors of [53] showed that for ADMM, this increase in the dimension doesn’t aﬀectthe computational complexity of each iteration. At its core, their argument for this fact used the symmetrybetween primal and dual updates in ADMM; it is unclear how to extend this argument to PDHG given itsasymmetric updates. Acknowledgements

We would like to thank Brendan O’Donoghue, Oliver Hinder, and Warren Schudy for insightful conversationsduring the ﬁrst stages of this work.

References [1] Tobias Achterberg. Conﬂict analysis in mixed integer programming.

Discrete Optimization , 4(1):4 – 20,2007. Mixed Integer Programming.[2] F. Al-Khayyal and J. Kyparisis. Finite convergence of algorithms for nonlinear programs and variationalinequalities.

J. Optim. Theory Appl. , 70(2):319–332, 1991.[3] JB Bailion, Ronald E Bruck, and Simeon Reich. On the asymptotic behavior of nonexpansive mappingsand semigroups in banach spaces.

Houston Journal of Mathematics , 4(1):1–9, 1978.[4] Goran Banjac, Paul Goulart, Bartolomeo Stellato, and Stephen Boyd. Infeasibility detection in thealternating direction method of multipliers for convex optimization.

Journal of Optimization Theory andApplications , 183(2):490–519, 2019.[5] Kinjal Basu, Amol Ghoting, Rahul Mazumder, and Yao Pan. ECLIPSE: An extreme-scale linearprogram solver for web-applications. In Hal Daumé III and Aarti Singh, editors,

Proceedings of the37th International Conference on Machine Learning , volume 119 of

Proceedings of Machine LearningResearch , pages 704–714, Virtual, 13–18 Jul 2020. PMLR.[6] H. H. Bauschke and P. L. Combettes.

Convex analysis and monotone operator theory in Hilbert spaces .CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, Cham, second edition,2017. With a foreword by Hédy Attouch.[7] Heinz H Bauschke, Patrick L Combettes, and D Russell Luke. Finding best approximation pairs relativeto two closed convex sets in hilbert spaces.

Journal of Approximation theory , 127(2):178–192, 2004.278] Heinz H Bauschke and Walaa M Moursi. The Douglas–Rachford algorithm for two (not necessarilyintersecting) aﬃne subspaces.

SIAM Journal on Optimization , 26(2):968–985, 2016.[9] Heinz H Bauschke, Xianfu Wang, and Liangjin Yao. General resolvents for monotone operators:characterization and extension. arXiv preprint arXiv:0810.3905 , 2008.[10] Amir Beck.

First-Order Methods in Optimization . Society for Industrial and Applied Mathematics,Philadelphia, PA, 2017.[11] J. M. Borwein and A. S. Lewis.

Convex Analysis and Nonlinear Optimization: Theory and Examples .Springer, 2006.[12] J.V. Burke. On the identiﬁcation of active constraints. II. The nonconvex case.

SIAM J. Numer. Anal. ,27(4):1081–1103, 1990.[13] J.V. Burke and J.J. Moré. On the identiﬁcation of active constraints.

SIAM J. Numer. Anal. , 25(5):1197–1211, 1988.[14] P.H. Calamai and J.J. Moré. Projected gradient methods for linearly constrained problems.

Math. Prog. ,39(1):93–116, 1987.[15] Antonin Chambolle and Thomas Pock. A ﬁrst-order primal-dual algorithm for convex problems withapplications to imaging.

Journal of mathematical imaging and vision , 40(1):120–145, 2011.[16] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging.

ActaNumerica , 25:161–319, 2016.[17] John W. Chinneck. Computer codes for the analysis of infeasible linear programs.

The Journal of theOperational Research Society , 47(1):61–72, 1996.[18] G.B. Dantzig.

Linear Programming and Extensions . Princeton University Press, Princeton, 1963.[19] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. In

Splitting methodsin communication, imaging, science, and engineering , pages 115–163. Springer, 2016.[20] Jim Douglas and Henry H Rachford. On the numerical solution of heat conduction problems in two andthree space variables.

Transactions of the American mathematical Society , 82(2):421–439, 1956.[21] Dmitriy Drusvyatskiy and Adrian S Lewis. Optimality, identiﬁability, and sensitivity.

MathematicalProgramming , 147(1-2):467–498, 2014.[22] J.C. Dunn. On the convergence of projected gradient processes to singular critical points.

J. Optim.Theory Appl. , 55(2):203–216, 1987.[23] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas—Rachford splitting method and theproximal point algorithm for maximal monotone operators.

Mathematical Programming , 55(1-3):293–318,1992.[24] Jonathan Eckstein and Gyorgy Matyasfalvi. Eﬃcient distributed-memory parallel matrix-vector multi-plication with wide or tall unstructured sparse matrices.

CoRR , abs/1812.00904, 2018.[25] M.C. Ferris. Finite termination of the proximal point algorithm.

Math. Program. , 50(3, (Ser. A)):359–366,1991.[26] S.D. Flåm. On ﬁnite convergence and constraint identiﬁcation of subgradient projection methods.

Math.Program. , 57:427–437, 1992.[27] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problemsvia ﬁnite element approximation.

Computers & mathematics with applications , 2(1):17–40, 1976.2828] David M. Gay.

Netlib: infeasible linear programming test problems , 2013 (accessed January 5, 2021). http://netlib.org/lp/infeas/index.html .[29] Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O (ln(1 /(cid:15) )) convergencefor (cid:15) -equilibrium in two-person zero-sum games. Mathematical Programming , 133(1):279–298, Jun 2012.[30] Roland Glowinski and A Marroco. Sur l’approximation, par éléments ﬁnis d’ordre un, et la résolution,par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires.

ESAIM: MathematicalModelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique , 9(R2):41–76, 1975.[31] Bingsheng He, Yanfei You, and Xiaoming Yuan. On the convergence of primal-dual hybrid gradientalgorithm.

SIAM Journal on Imaging Sciences , 7(4):2526–2537, 2014.[32] Jourdain Lamperski, Robert M. Freund, and Michael J. Todd. An oblivious ellipsoid algorithm forsolving a system of (in)feasible linear inequalities, 2020.[33] Guanghui Lan, Zhaosong Lu, and Renato D. C. Monteiro. Primal-dual ﬁrst-order methods with O (1 /(cid:15) ) iteration-complexity for cone programming. Mathematical Programming , 126(1):1–29, Jan 2011.[34] A. S. Lewis. Active sets, nonsmoothness, and sensitivity.

SIAM J. Optim. , 13(3):702–725 (electronic)(2003), 2002.[35] Adrian S Lewis and Jingwei Liang. Partial smoothness and constant rank. arXiv preprintarXiv:1807.03134 , 2018.[36] Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Local convergence properties of Douglas–Rachfordand alternating direction method of multipliers.

Journal of Optimization Theory and Applications ,172(3):874–913, 2017.[37] Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Local linear convergence analysis of primal–dual splittingmethods.

Optimization , 67(6):821–853, 2018.[38] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators.

SIAM Journal on Numerical Analysis , 16(6):964–979, 1979.[39] Yanli Liu, Ernest K Ryu, and Wotao Yin. A new use of Douglas–Rachford splitting for identifyinginfeasible, unbounded, and pathological conic programs.

Mathematical Programming , 177(1-2):225–253,2019.[40] Yura Malitsky. The primal-dual hybrid gradient method reduces to a primal method for linearlyconstrained optimization problems, 2019.[41] Yura Malitsky and Thomas Pock. A ﬁrst-order primal-dual algorithm with linesearch.

SIAM Journal onOptimization , 28(1):411–432, 2018.[42] István Maros.

Computational Techniques of the Simplex Method . Kluwer Academic Publishers, 2003.[43] Richard Kipp Martin.

Large Scale Linear and Integer Optimization: A Uniﬁed Approach . Springer US,Boston, MA, 1999.[44] R. Miﬄin and C. Sagastizábal. Primal-dual gradient structured functions: second-order results; links toepi-derivatives and partly smooth functions.

SIAM J. Optim. , 13(4):1174–1194 (electronic), 2003.[45] R. Miﬄin and C. Sagastizábal. VU -smoothness and proximal point results for some nonconvex functions. Optim. Methods Softw. , 19(5):463–478, 2004.[46] R. Miﬄin and C. Sagastizábal. A VU -algorithm for convex minimization. Math. Program. , 104(2-3, Ser.B):583–608, 2005. 2947] Cesare Molinari, Jingwei Liang, and Jalal Fadili. Convergence rates of forward–douglas–rachford splittingmethod.

Journal of Optimization Theory and Applications , 182(2):606–639, 2019.[48] Walaa M Moursi.

The Douglas–Rachford operator in the possibly inconsistent case: static properties anddynamic behaviour . PhD thesis, University of British Columbia, 2016.[49] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities withlipschitz continuous monotone operators and smooth convex-concave saddle point problems.

SIAMJournal on Optimization , 15(1):229–251, 2004.[50] Yu. Nesterov. Subgradient methods for huge-scale optimization problems.

Mathematical Programming ,146(1):275–297, Aug 2014.[51] Daniel O’Connor and Lieven Vandenberghe. On the equivalence of the primal-dual hybrid gradientmethod and Douglas–Rachford splitting.

Mathematical Programming , 179(1):85–108, Jan 2020.[52] Brendan O’Donoghue. Operator splitting for a homogeneous embedding of the monotone linear comple-mentarity problem. arXiv preprint arXiv:2004.02177 , 2020.[53] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operatorsplitting and homogeneous self-dual embedding.

Journal of Optimization Theory and Applications ,169(3):1042–1068, 2016.[54] A Pazy. Asymptotic behavior of contractions in hilbert space.

Israel Journal of Mathematics , 9(2):235–240,1971.[55] T. Pock and A. Chambolle. Diagonal preconditioning for ﬁrst order primal-dual algorithms in convexoptimization. In , pages 1762–1769, 2011.[56] Michael JD Powell. Algorithms for nonlinear constraints that use lagrangian functions.

Mathematicalprogramming , 14(1):224–248, 1978.[57] Arvind U Raghunathan and Stefano Di Cairano. Infeasibility detection in alternating direction methodof multipliers for convex quadratic programs. In , pages5819–5824. IEEE, 2014.[58] James Renegar. Accelerated ﬁrst-order methods for hyperbolic programming.

Mathematical Programming ,173(1):1–35, Jan 2019.[59] R. T. Rockafellar.

Convex analysis . Princeton Mathematical Series, No. 28. Princeton University Press,Princeton, N.J., 1970.[60] Shu-Chung Shi. Semi-continuités génériques de multi-applications.

C.R. Acad. Sci. Paris (Série A) ,293:27–29, 1981.[61] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: an operator splitting solver forquadratic programs.

Mathematical Programming Computation , 12(4):637–672, 2020.[62] M.J. Todd. Detecting infeasibility. In Felipe Cucker, Ron DeVore, Peter Olver, and Endre Süli, editors,

Foundations of Computational Mathematics, Minneapolis 2002 , London Mathematical Society LectureNote Series, page 157–192. Cambridge University Press, 2004.[63] Yuhsiang M. Tsai, Terry Cojean, and Hartwig Anzt. Sparse linear algebra on AMD and NVIDIAGPUs – the race is on. In Ponnuswamy Sadayappan, Bradford L. Chamberlain, Guido Juckeland, andHatem Ltaief, editors,

High Performance Computing , pages 309–327, Cham, 2020. Springer InternationalPublishing.[64] S.J. Wright. Identiﬁable surfaces in constrained optimization.

SIAM J. Control Optim. , 31:1063–1079,July 1993. 30

Additional proofs

A.1 Proof of Proposition 1

Proof.

Assume that ( z k +1 − z k ) → v . Fix ε > . Our goal is to show that for all k large enough (cid:107) z k /k − v (cid:107) ≤ ε .Due to convergence, there exist K ∈ N such that for all k ≥ K we have (cid:107) z k +1 − z k − v (cid:107) ≤ ε/ . Deﬁne B := max k ≤ K (cid:107) z k +1 − z k − v (cid:107) , let K ∈ N be such that for all k ≥ K we get K B/k ≤ ε/ , and let K ∈ N be such that if k ≥ K then (cid:107) z (cid:107) /k ≤ ε/ . Then, for any k ≥ max { K , K , K } we have (cid:13)(cid:13)(cid:13)(cid:13) z k k − v (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13) k ( z k − z ) − v (cid:13)(cid:13)(cid:13)(cid:13) + 1 k (cid:107) z (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k k (cid:88) j =1 ( z j − z j − ) − v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 k (cid:107) z (cid:107)≤ k k (cid:88) j =1 (cid:13)(cid:13) ( z j − z j − ) − v (cid:13)(cid:13) + 1 k (cid:107) z (cid:107)≤ K k B + 1 k k (cid:88) j = K (cid:13)(cid:13) ( z j − z j − ) − v (cid:13)(cid:13) + 1 k (cid:107) z (cid:107)≤ ε + 13 k k (cid:88) j = K ε ≤ ε . This proves the ﬁrst statement.Now, assume that z k k → v , and ﬁx ε > . Just as before deﬁne K ∈ N to be such that for all (cid:107) z k /k − v (cid:107) ≤ ε/ , deﬁne the constant B = max k ≤ K (cid:107) z k /k − v (cid:107) , and let K be such that B ( K + 1) K / (( K + 1) K ) ≤ ε/ .Then, we have that for any k ≥ max { K , K } , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k + 1) k k (cid:88) j =1 z j − v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 2( k + 1) k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k (cid:88) j =1 z j − k ( k + 1)2 v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 2( k + 1) k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k (cid:88) j =1 ( z j − jv ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k + 1) k k (cid:88) j =1 j (cid:13)(cid:13)(cid:13)(cid:13) z j j − v (cid:13)(cid:13)(cid:13)(cid:13) ≤ ( K + 1) K ( k + 1) k B + 2( k + 1) k (cid:88) j = K j (cid:13)(cid:13)(cid:13)(cid:13) z j j − v (cid:13)(cid:13)(cid:13)(cid:13) ≤ ε ε  k + 1) k k (cid:88) j = K j  ≤ ε . Counterexamples

Example 2 ( Diﬀerences don’t converge, but normalized iterates do ) . Consider the sequence ( z k ) ⊆ R generated by applying a ◦ counterclockwise rotation repeatedly starting from z = e = (1 , . Thus, thesequence cycles as z = (0 , , z = ( − , , z = (0 , − , z = (1 , , . . . . For this example, the diﬀerences of iterates z k +1 − z k also cycle among four possibilities and do not converge.Nonetheless, since the iterates are bounded k z k → . Example 3 ( Normalized iterates diverge, but normalized averages converge ) . Consider the se-quence ( z k ) ⊆ R given by z k = ( − k k with k ∈ N . Then, it is clear that | z k | /k > √ k , and so thenormalized iterates diverge. On the other hand, notice that k + 1) ¯ z k = k (cid:88) j =1 ( − j k k + 1 , and it is easy to show that this series converges using the Leibniz Test. Example 4 ( Nonexpansive operator with divergent z (cid:15) ) . Let T : R → R given by T ( z ) = z + f ( z ) where f ( z ) = (cid:40) exp( − z ) + 1 if z > , or otherwise. (53) Since the derivative of T is bounded by , we get that T is a nonexpansive operator. Furthermore, range( T − I ) =range( f ) = (1 , , and so v = 1 . If we deﬁne z (cid:15) to be a point such that | v − ( T − I )( z (cid:15) ) | ≤ (cid:15) , we see that z (cid:15) > Ω (cid:16) log (cid:0) (cid:15) (cid:1) (cid:17) , and thus it diverges as (cid:15) → ..