Finding best approximation pairs for two intersections of closed convex sets
Heinz H. Bauschke*, Shambhavi Singh†, and Xianfu Wang‡

February 25, 2021
Abstract
The problem of finding a best approximation pair of two sets, which in turn generalizes the well known convex feasibility problem, has a long history that dates back to work by Cheney and Goldstein in 1959. In 2018, Aharoni, Censor, and Jiang revisited this problem and proposed an algorithm that can be used when the two sets are finite intersections of halfspaces. Motivated by their work, we present alternative algorithms that utilize projection and proximity operators. Numerical experiments indicate that these methods are competitive and sometimes superior to the one proposed by Aharoni et al.
Mathematics Subject Classification: Primary 65K05; Secondary 47H09, 90C25.
Keywords: Aharoni–Censor–Jiang algorithm, best approximation pair, Douglas–Rachford algorithm, dual-based proximal method, proximal distance algorithm, stochastic subgradient descent.

* Mathematics, University of British Columbia, Kelowna, B.C. V1V 1V7, Canada. E-mail: [email protected].
† Mathematics, University of British Columbia, Kelowna, B.C. V1V 1V7, Canada. E-mail: [email protected].
‡ Mathematics, University of British Columbia, Kelowna, B.C. V1V 1V7, Canada. E-mail: [email protected].

1 Introduction
Throughout this paper, we assume that Y is a finite-dimensional real Hilbert space with inner product ⟨·,·⟩ : Y × Y → ℝ and induced norm ‖·‖. Let m ∈ {1, 2, ...}, set I := {1, ..., m}, and suppose that (∀i ∈ I) A_i and B_i are nonempty closed convex subsets of Y such that

  A := ∩_{i∈I} A_i ≠ ∅ and B := ∩_{i∈I} B_i ≠ ∅.

(We assume here without loss of generality that there are as many sets A_i as B_j; otherwise, we can either "copy" sets or use the full space Y itself.) It will occasionally be convenient to work with the convention A_{m+1} = A_1, A_{m+2} = A_2, etc.; or, more formally, A_n = A_{1+rem(n−1,m)} and B_n = B_{1+rem(n−1,m)}. We also assume that the projection operators P_{A_i} and P_{B_i} are "easy" to compute while the projections P_A and P_B are "hard" and not readily available (unless m = 1). Our aim is to find a best approximation pair, i.e., to

  find (ā, b̄) ∈ A × B such that ‖ā − b̄‖ = inf_{(a,b)∈A×B} ‖a − b‖.  (1)

(Note that this problem is actually a generalization of the famous convex feasibility problem, which asks to find a point in A ∩ B provided that this intersection is nonempty — which we do not assume here!) This problem has a long history, and the first systematic study was given by Cheney and Goldstein in 1959 [8]; see also [3] and [4]. These works, however, assume that m = 1. Recently, Aharoni, Censor, and Jiang (see [2]) tackled the general case. Indeed, assuming that the sets A_i and B_i are halfspaces, they presented a new algorithm — which we call ACJ for simplicity — for solving (1) that does not require knowledge of the projectors P_A and P_B onto the corresponding polyhedra A and B.

The purpose of this paper is to provide other approaches to solving (1). We also provide the required proximity operators as well as numerical comparisons. The algorithms considered rely only on the operators P_{A_i} and P_{B_i} and some other operators that are available in closed form. The algorithms presented work for general closed convex sets, not just polyhedra. Implementable formulae for the underlying algorithmic operators are provided. Numerical experiments, similar to the one in Aharoni et al.'s paper, are also performed. Our results show that other algorithms should be seriously considered for solving (1), especially if m is small.

The remainder of the paper is organized as follows. In Section 2, we consider reformulations of (1) that are more amenable to the algorithms discussed in Section 3. These algorithms rely on computable formulae which we present in Section 4. We present a small example on which these algorithms are run and convergence is observed using different metrics in Section 5. In Section 5.1 and Section 5.2, we consider examples where the solutions are known or unknown, respectively. We also discuss the (positive) effect of pairing up constraints (see Section 5.3). We conclude the paper in Section 6 with a brief summary of our findings.

The notation employed in this paper is fairly standard and follows largely [5], to which we also refer the reader for general background material.

2 Reformulations

In the product Hilbert space

  X := Y × Y,

we define nonempty closed convex sets by

  (∀i ∈ I) C_i := A_i × B_i,

along with their intersection

  C := ∩_{i∈I} C_i = A × B.

Note that the projector onto C_i is still easy to compute; indeed, P_{C_i} : (x, y) ↦ (P_{A_i}x, P_{B_i}y). The problem (1) is thus equivalent to

  minimize h(x, y) subject to (x, y) ∈ C,  (2)

where

  h(x, y) = α‖x − y‖^p,  (3)

α > 0, and p ≥ 1. Note that h has full domain and it is convex — but h is not strictly convex because it has many minimizers: {(x, x) | x ∈ Y}. Also note that if p > 1, then h is differentiable. The problem (1) can thus also be alternatively thought of as

  minimize h(x, y) + ∑_{i∈I} ι_{C_i}(x, y),  (4)

which features a nonsmooth objective function. In the next section, we survey various algorithms that could be used to solve the problem (1) or its reformulations. We also consider the case when (4) is approximated by

  minimize h(x, y) + ∑_{i∈I} L d_{C_i}(x, y),  (5)

for some "large" constant L.
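Since all algorithms below access the constraints only through P_{A_i} and P_{B_i}, it may help to see the two basic building blocks in code. The following Python/NumPy sketch is our own illustration (it is not part of the original paper); it assumes the A_i and B_i are halfspaces, and all function names are ours.

```python
import numpy as np

def proj_halfspace(a, beta, x):
    """Project x onto the halfspace {z : <a, z> <= beta}."""
    gap = a @ x - beta
    if gap <= 0:                      # x already lies in the halfspace
        return x.copy()
    return x - (gap / (a @ a)) * a    # move along the normal vector a

def proj_Ci(proj_Ai, proj_Bi, z):
    """Projector onto C_i = A_i x B_i acts componentwise; z = (x, y)."""
    return np.stack([proj_Ai(z[0]), proj_Bi(z[1])])
```

In the remaining sketches, a point (x, y) ∈ X is stored as a NumPy array of shape (2, d).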
3 Algorithms

3.1 The Aharoni–Censor–Jiang algorithm (ACJ)

This algorithm was recently proposed by Aharoni, Censor, and Jiang in [2]. We denote their algorithm as ACJ. ACJ builds on "HLWB", which in turn is an earlier algorithm adapted to the task of finding the projection of a point onto a polyhedron (see [1]). The ACJ algorithm is an alternating version of HLWB to find a solution of (1). Here is a description. First, we fix a sequence (λ_k)_{k∈ℕ} of positive real numbers such that

  λ_k → 0, ∑_{k∈ℕ} λ_k = ∞, ∑_{k∈ℕ} |λ_k − λ_{k+m}| < ∞,  (6)

and also an increasing (not necessarily strictly though) sequence of natural numbers (n_k)_{k∈ℕ} such that n_k → ∞ and

  sup_{k∈ℕ} ∑_{k̄>k} n_{k̄} ∏_{n>n_{k̄}} (1 − λ_n) < ∞.  (7)

For instance, (6)–(7) hold when (∀k ∈ ℕ) λ_k = 1/(k+1) and n_k = ⌊√k⌋ (see [2, page 512]). Next, given our sequence of sets (A_i)_{i∈ℕ} and n ∈ ℕ, we define the operator

  Q_{A,n} : Y × Y → Y : (w, w′) ↦ w_n,  (8)

where w_0 = w′ and w_n is computed iteratively via

  (∀i ∈ {0, 1, ..., n−1}) w_{i+1} = λ_{i+1} w + (1 − λ_{i+1}) P_{A_{i+1}}(w_i).  (9)

The operator Q_{B,n}, for (B_i)_{i∈ℕ} and n ∈ ℕ, is defined analogously. Finally, we initialize (x_0, y_0) ∈ X, and iteratively update via

  (∀k ∈ ℕ) (x_{k+1}, y_{k+1}) := (Q_{B,n_k}(y_k, y′_k), y_k), if k is odd; (x_k, Q_{A,n_k}(x_k, x′_k)), if k is even,  (10)

where (x′_k, y′_k)_{k∈ℕ} is a bounded sequence in X that can either be fixed beforehand, e.g., (x′_k, y′_k) ≡ (x_0, y_0), or dynamically updated using, e.g., (x′_0, y′_0) = (y_0, x_0) and (x′_k, y′_k) = (y_{k−1}, x_{k−1}) for k ∈ {1, 2, ...}. The main result of [2] yields the convergence of (x_k, y_k)_{k∈ℕ} to a solution of (1) provided that C ≠ ∅ and each A_i and B_i is a halfspace.
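The following sketch (ours, under the array conventions introduced above) mirrors (8)–(10). Here `proj_A` and `proj_B` are assumed to be Python lists of the projectors P_{A_i} and P_{B_i}, and `lam` is a list in which lam[i] plays the role of λ_{i+1}.

```python
def hlwb(projectors, lam, n, w, w_prime):
    """HLWB operator Q of (8)-(9): anchor point w, starting point w_prime."""
    m = len(projectors)
    w_i = w_prime
    for i in range(n):
        P = projectors[i % m]                            # sets recycled cyclically
        w_i = lam[i] * w + (1 - lam[i]) * P(w_i)         # (9), lam[i] = lambda_{i+1}
    return w_i

def acj_step(k, x, y, x_p, y_p, proj_A, proj_B, lam, n_k):
    """One ACJ update (10); (x_p, y_p) is the auxiliary pair (x'_k, y'_k)."""
    if k % 2 == 1:
        return hlwb(proj_B, lam, n_k, y, y_p), y
    return x, hlwb(proj_A, lam, n_k, x, x_p)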
Remark 3.1. Note that ACJ takes into account the order of the sets while the problem (1) does not. It is a nice feature of ACJ that it works throughout in the "small" space X.

3.2 The Douglas–Rachford algorithm (DR)

This algorithm, abbreviated as DR, is motivated by the form (4). It implicitly operates in the space X^{m+1}. First, set f_0(x, y) := α‖x − y‖^p with α > 0 and p ≥ 1, as well as f_1 := ι_{C_1}, ..., f_m := ι_{C_m} and I_0 := {0} ∪ I. Second, fix a parameter 0 < λ < 2 and initialize

  z_0 := (z_{0,0}, z_{0,1}, ..., z_{0,m}) = ((x_0, y_0), (x_1, y_1), ..., (x_m, y_m)) ∈ X^{m+1}.

Given z_k = (z_{k,0}, z_{k,1}, ..., z_{k,m}) ∈ X^{m+1}, set

  z̄_k := (1/(m+1)) ∑_{i∈I_0} z_{k,i}  (11a)
  (∀i ∈ I_0) x_{k,i} := Prox_{f_i}(2z̄_k − z_{k,i})  (11b)
  (∀i ∈ I_0) z_{k+1,i} := z_{k,i} + λ(x_{k,i} − z̄_k)  (11c)

to obtain the update z_{k+1} := (z_{k+1,0}, z_{k+1,1}, ..., z_{k+1,m}). If i ∈ I, then the prox operator corresponding to f_i is simply the projector P_{C_i}. The prox operator Prox_{f_0} will be computed in closed form for p ∈ {1, 2} in Section 4.1 below. It is well known that the sequence (z̄_k)_{k∈ℕ} will converge to a solution of (4), i.e., of (1).
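Here is a hedged sketch of one DR step (11), again our own illustration: `prox_f0` is assumed to implement Prox_{f_0} (computed in Section 4.1), and `proj_C` is a list of the projectors P_{C_i}.

```python
import numpy as np

def dr_step(Z, prox_f0, proj_C, lam):
    """One iteration of (11); Z has shape (m+1, 2, d), rows are the z_{k,i}."""
    z_bar = Z.mean(axis=0)                       # (11a)
    X = np.empty_like(Z)
    X[0] = prox_f0(2 * z_bar - Z[0])             # (11b) for i = 0
    for i, P in enumerate(proj_C, start=1):
        X[i] = P(2 * z_bar - Z[i])               # (11b) for i in I
    return Z + lam * (X - z_bar)                 # (11c)
```

The quantity to monitor is the average z̄_k, since it is this sequence that converges to a solution.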
Remark 3.2. The DR approach does not care about the order of the sets presented — unlike ACJ! A downside is that it operates in the larger space X^{m+1}, which can become an issue if m is large. On the positive side, if C_1 ∩ ⋯ ∩ C_m = ∅, then (z̄_k)_{k∈ℕ} will converge to a minimizer of f_0 over the set of least-squares solutions (see [6, Corollary 6.8] for further information).

3.3 Dual-based proximal methods (DPG and FDPG)

We follow Beck [7, Section 12.4.2] but slightly modify the algorithms presented to give two additional methods for solving (1). We will work with the form given in (4), where f_0(x, y) needs to be ε-strongly convex, for some ε > 0, which precludes using α‖x − y‖^p directly. However, below we will add (ε/2)(‖x‖² + ‖y‖²) to this last function to obtain the required ε-strong convexity.

The first method considered is the Dual Proximal Gradient method, which we abbreviate as DPG. Because the algorithm requires strong convexity of the objective function, we consider, as announced,

  f_0(x, y) := (α/2)‖x − y‖² + (ε/2)(‖x‖² + ‖y‖²)

with α > 0 and ε > 0. We also set f_1 := ι_{C_1}, ..., f_m := ι_{C_m}. Second, fix a parameter L ≥ m/ε. Now initialize z_0 := (z_{0,1}, ..., z_{0,m}) = ((x_1, y_1), ..., (x_m, y_m)) ∈ X^m, and update it using

  s_k := ∑_{i∈I} z_{k,i}  (12a)
  x_k := argmax_{w∈X} [⟨w, s_k⟩ − f_0(w)]  (12b)
  (∀i ∈ I) z_{k+1,i} := z_{k,i} − (1/L)x_k + (1/L)P_{C_i}(x_k − L z_{k,i})  (12c)

to obtain z_{k+1} := (z_{k+1,1}, ..., z_{k+1,m}). Once again, the prox operators corresponding to f_i for i ∈ I are just the projectors P_{C_i}. The closed form for the argmax operator in (12b) will be calculated in Section 4.2 below. For sufficiently small ε > 0, the primal sequence (x_k)_{k∈ℕ} approximates a solution of (4) and hence of (1) provided that the relative interior of C is nonempty; see [7, page 362].

An accelerated version of DPG, known as Fast Dual Proximal Gradient or simply FDPG, applies a FISTA-type acceleration. Here is how FDPG proceeds: Starting with z_0 as before, initialize w_0 := z_0, t_0 := 1, and update via

  s′_k := ∑_{i∈I} w_{k,i}  (13a)
  u_k := argmax_{v∈X} [⟨v, s′_k⟩ − f_0(v)]  (13b)
  (∀i ∈ I) z_{k+1,i} := w_{k,i} − (1/L)u_k + (1/L)P_{C_i}(u_k − L w_{k,i})  (13c)
  t_{k+1} := (1 + √(1 + 4t_k²))/2  (13d)
  (∀i ∈ I) w_{k+1,i} := z_{k+1,i} + ((t_k − 1)/t_{k+1})(z_{k+1,i} − z_{k,i})  (13e)

to get the primal sequence of interest

  x_{k+1} := argmax_{v∈X} [⟨v, s_{k+1}⟩ − f_0(v)], where s_{k+1} := ∑_{i∈I} z_{k+1,i}.  (14)

Again, for sufficiently small ε > 0, the sequence (x_k)_{k∈ℕ} approximates a solution close to that of (4), and as a consequence of (1), provided that ri C ≠ ∅.
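A sketch of one FDPG iteration (13)–(14), with the argmax (13b) evaluated via the closed form of Proposition 4.3 below. This is our own illustration; the stacking convention and all names are assumptions.

```python
import numpy as np

def argmax_f0(s, alpha, eps):
    """Closed-form maximizer of <., s> - f_0; Proposition 4.3, formula (40)."""
    u, v = s
    c = 1.0 / ((2 * alpha + eps) * eps)
    return np.stack([c * ((alpha + eps) * u + alpha * v),
                     c * ((alpha + eps) * v + alpha * u)])

def fdpg_step(Z, W, t, proj_C, alpha, eps, L):
    """One FDPG iteration (13)-(14); Z[i], W[i] are dual variables in X."""
    u = argmax_f0(sum(W), alpha, eps)                        # (13a)-(13b)
    Z_new = [W[i] - u / L + proj_C[i](u - L * W[i]) / L      # (13c)
             for i in range(len(proj_C))]
    t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2                # (13d)
    W_new = [Z_new[i] + ((t - 1) / t_new) * (Z_new[i] - Z[i])
             for i in range(len(proj_C))]                    # (13e)
    x = argmax_f0(sum(Z_new), alpha, eps)                    # (14)
    return Z_new, W_new, t_new, x
```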
Remark 3.3. Note that although a smaller ε ensures a solution that is closer to that of the original problem (1), it also increases the lower bound for L, which in turn reduces the step size for each iteration as seen in (12c) and (13c), and so the speed of convergence is reduced as well. These algorithms are not affected by the order of the sets presented, but like DR, they operate in a "large" space (here X^m). This may become a problem when m is large.

3.4 The proximal distance algorithm (PDA)

The Proximal Distance Algorithm, or PDA for short, was first introduced by Lange and Keys [10]. It will be applied to the form given by (4). We set h(x, y) := α‖x − y‖, which is α√2-Lipschitz by Proposition 4.2, for α > 0. Also, write z = (x, y). The PDA with starting point z_0 ∈ X generates a sequence (z_k)_{k∈ℕ} via

  z_{k+1} := Prox_{ρ_k^{-1}h}((1/m) ∑_{i=1}^m P_{C_i} z_k),  (15)

where (ρ_k)_{k∈ℕ} is a sequence of positive (and "sufficiently large") parameters. Under suitable choices of the parameter sequences, the sequence (z_k)_{k∈ℕ} approximates a solution of (4). Lange and Keys recommend a geometrically increasing schedule ρ_k = min{c^k ρ_0, ρ_max} (for some constant c > 1), but other choices may yield better performance (see [10, Sections 4 and 5] and [9] for details). Keys, Zhou, and Lange also point out a Nesterov-style accelerated version of PDA (accPDA for short), which proceeds as follows:

  w_k := z_k + ((k − 1)/(k + 2))(z_k − z_{k−1}),  (16a)
  z_{k+1} := Prox_{ρ_k^{-1}h}((1/m) ∑_{i=1}^m P_{C_i} w_k).  (16b)

See [9, Algorithm 1 and Section 3] for further information.
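A sketch of (15) and (16) under the same conventions (our illustration): `prox_h(z, s)` is assumed to return Prox_{s·h}(z), which for h = α‖x − y‖ is available in closed form via (30) below with α replaced by sα.

```python
def pda_step(z, prox_h, proj_C, rho):
    """One PDA iteration (15): average of projections, then prox of h/rho."""
    avg = sum(P(z) for P in proj_C) / len(proj_C)
    return prox_h(avg, 1.0 / rho)

def acc_pda_step(z, z_prev, k, prox_h, proj_C, rho):
    """One accPDA iteration (16): Nesterov extrapolation, then a PDA step."""
    w = z + ((k - 1) / (k + 2)) * (z - z_prev)               # (16a)
    avg = sum(P(w) for P in proj_C) / len(proj_C)
    return prox_h(avg, 1.0 / rho)                            # (16b)
```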
3.5 Stochastic subgradient descent (SSD)

Set f_0(x, y) := α‖x − y‖ and (∀i ∈ I) f_i := L d_{C_i}, where L > 0. Then f_0 is α√2-Lipschitz and the other f_i are L-Lipschitz. Moreover, for i ∈ I, we have

  L sign(z − P_{C_i} z) ∈ ∂f_i(z)  (17)

by, e.g., [5, Example 16.62]. Now Stochastic Subgradient Descent, which we abbreviate as SSD (see [7, Section 8.3] for further information), applied to (5) generates a sequence via

  z_{k+1} := z_k − η_k f′_{i_k}(z_k),  (18)

where (η_k)_{k∈ℕ} is a sequence of positive parameters (typically constant or a constant divided by √(k+1)) and where f′_{i_k}(z_k) ∈ ∂f_{i_k}(z_k), with i_k uniformly sampled from I_0 := {0} ∪ I. Note that we have convenient formulae for selections of all ∂f_i — see (17) and (29) below.
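A sketch of one SSD step (18), using the subgradient selections (17) and (29); the sampling and the step size shown merely illustrate the typical η_k = c/√(k+1) schedule (our illustration, names ours).

```python
import numpy as np

rng = np.random.default_rng(0)

def sign0(v):
    """The selection sign of (19): v/||v|| if v != 0, and 0 otherwise."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else np.zeros_like(v)

def ssd_step(z, k, proj_C, alpha, L, eta0=1.0):
    """One SSD iteration (18); z = (x, y) stacked with shape (2, d)."""
    i = rng.integers(0, len(proj_C) + 1)       # i_k uniform on I_0 = {0} u I
    if i == 0:                                 # subgradient of f_0 via (29)
        s = sign0(z[0] - z[1])
        g = alpha * np.stack([s, -s])
    else:                                      # subgradient of L*d_{C_i} via (17)
        g = L * sign0(z - proj_C[i - 1](z))
    return z - (eta0 / np.sqrt(k + 1)) * g
```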
4 Formulae for the operators

In this section, we collect formulae for operators that are used later in our numerical experiments.

4.1 Prox operators of h

Denote the standard unit ball by B := {y ∈ Y | ‖y‖ ≤ 1}. It will be convenient to define the generalized signum functions on Y via

  sign(x) := x/‖x‖, if x ≠ 0; 0, if x = 0;
  Sign(x) := {x/‖x‖}, if x ≠ 0; B, if x = 0.  (19)

By [5, Example 16.32], we have

  (∀x ∈ Y) sign(x) ∈ Sign(x) = ∂‖·‖(x).  (20)
Proposition 4.1. Let α > 0. Then the prox operator of the function

  h : Y × Y → ℝ : (x, y) ↦ α‖x − y‖²  (21)

is given by

  Prox_h : (x, y) ↦ (1/(4α+1)) ((1+2α)x + 2αy, 2αx + (1+2α)y).  (22)

Moreover, ∇h : (x, y) ↦ (2α(x − y), 2α(y − x)) is 4α-Lipschitz continuous.

Proof. Set z = (x, y) ∈ Y × Y and B : Y × Y → Y : (x, y) ↦ √(2α)(x − y). Then

  h(z) = (1/2)‖Bz‖² = (1/2)⟨Bz, Bz⟩ = (1/2)⟨B*Bz, z⟩  (23)

and thus ∇h = B*B. It follows that

  Prox_h = (Id + B*B)^{-1}.  (24)

Write B in block matrix form, B = √(2α)[Id, −Id]. Then

  B* = √(2α)[Id, −Id]^⊤, B*B = 2α [[Id, −Id], [−Id, Id]],  (25)

and

  Id + B*B = [[(1+2α)Id, −2αId], [−2αId, (1+2α)Id]].  (26)

The largest eigenvalue of the very last matrix is 1 + 4α, which implies that 4α is the sharp Lipschitz constant of ∇h. Finally,

  (Id + B*B)^{-1} = (1/(4α+1)) [[(1+2α)Id, 2αId], [2αId, (1+2α)Id]]  (27)

and the result follows. ∎
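Formula (22) is easy to code and to sanity-check against the optimality condition (u, v) + ∇h(u, v) = (x, y); the following snippet (ours) does both.

```python
import numpy as np

def prox_h_sq(x, y, alpha):
    """Prox of h(x, y) = alpha*||x - y||^2, i.e., formula (22)."""
    c = 1.0 / (4 * alpha + 1)
    return (c * ((1 + 2 * alpha) * x + 2 * alpha * y),
            c * (2 * alpha * x + (1 + 2 * alpha) * y))

# sanity check: the output (u, v) must satisfy (u, v) + grad h(u, v) = (x, y)
x, y, alpha = np.array([3.0, 0.0]), np.array([-1.0, 2.0]), 5.0
u, v = prox_h_sq(x, y, alpha)
assert np.allclose(u + 2 * alpha * (u - v), x)
assert np.allclose(v + 2 * alpha * (v - u), y)
```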
Proposition 4.2. Let α > 0. The function

  h : Y × Y → ℝ : (x, y) ↦ α‖x − y‖  (28)

is α√2-Lipschitz, and a selection of ∂h is given by

  (x, y) ↦ α(sign(x − y), −sign(x − y))  (29a)
    = (α(x − y)/‖x − y‖, α(y − x)/‖y − x‖), if x ≠ y; (0, 0), if x = y  (29b)
    ∈ {α(s, −s) | s ∈ Sign(x − y)}  (29c)
    = ∂h(x, y).  (29d)

The prox operator of h is given by

  Prox_h : (x, y) ↦ (x, y) − min{1/2, α/‖x − y‖} (x − y, y − x).  (30)

Proof. Set A = [Id, −Id], z = (x, y), and f(z) = ‖Az‖. Then Az = x − y and h = αf. Furthermore, ‖Az‖² = ‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩ ≤ 2‖x‖² + 2‖y‖² = 2(‖x‖² + ‖y‖²) = 2‖z‖², which shows that f is √2-Lipschitz and hence h is α√2-Lipschitz. Moreover, for β ≥ 0,

  A^⊤ = [[Id], [−Id]], AA^⊤ = 2Id, (AA^⊤ + βId)^{-1} = (1/(2+β))Id,  (31)

and

  Id − A^⊤(AA^⊤ + βId)^{-1}A = [[Id, 0], [0, Id]] − [[Id], [−Id]] (1/(2+β)) [Id, −Id]  (32a)
    = [[Id, 0], [0, Id]] − (1/(2+β)) [[Id, −Id], [−Id, Id]]  (32b)
    = (1/(2+β)) [[(1+β)Id, Id], [Id, (1+β)Id]].  (32c)

We now discuss cases.

Case 1: ‖x − y‖ ≤ 2α. In view of (31) (with β = 0), ‖(AA^⊤)^{-1}Az‖ = ‖x − y‖/2 ≤ α. Using [7, Lemma 6.68] and (32) (with β = 0), we obtain

  Prox_h(x, y) = ((x + y)/2, (x + y)/2) = (x, y) − (1/2)(x − y, y − x).  (33)

Case 2: ‖x − y‖ > 2α. In view of (31) (with β = 0), ‖(AA^⊤)^{-1}Az‖ > α. Using again (31) (with general β ≥ 0), we see that the function

  g(β) := ‖(AA^⊤ + βId)^{-1}Az‖ − α  (34a)
    = (1/(2+β))‖x − y‖ − α.  (34b)

Because α > 0 and β ≥ 0, we obtain the equivalences: g(β) = 0 ⇔ ‖x − y‖ = α(2+β) ⇔ β = ‖x − y‖/α − 2. Set

  β* := ‖x − y‖/α − 2 > 0;  (35)

then g(β*) = 0. Moreover, 2 + β* = ‖x − y‖/α and 1 + β* = (‖x − y‖ − α)/α. Hence (32) (with β = β*) turns into

  Id − A^⊤(AA^⊤ + β*Id)^{-1}A = (1/(2+β*)) [[(1+β*)Id, Id], [Id, (1+β*)Id]]  (36a)
    = (α/‖x − y‖) [[((‖x − y‖ − α)/α)Id, Id], [Id, ((‖x − y‖ − α)/α)Id]]  (36b)
    = [[Id, 0], [0, Id]] − (α/‖x − y‖) [[Id, −Id], [−Id, Id]].  (36c)

Using [7, Lemma 6.68] and (36), we obtain (switching back to row vector notation for convenience)

  Prox_h(x, y) = (x, y) − (α/‖x − y‖)(x − y, y − x).  (37)

Finally, the formula given in (30) follows by combining (33) and (37). ∎
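Formula (30) translates into a few lines of code; note that the case distinction conveniently avoids a division by zero when x = y (our sketch).

```python
import numpy as np

def prox_h_norm(x, y, alpha):
    """Prox of h(x, y) = alpha*||x - y||, i.e., formula (30)."""
    d = np.linalg.norm(x - y)
    t = 0.5 if d <= 2 * alpha else alpha / d   # min{1/2, alpha/||x - y||}
    return x - t * (x - y), y - t * (y - x)
```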
4.2 The argmax operator needed for DPG and FDPG

Consider, for α > 0 and ε > 0, the function

  f_0(x, y) := (α/2)‖x − y‖² + (ε/2)(‖x‖² + ‖y‖²),  (38)

which is a perturbation of (α/2)‖x − y‖² that is ε-strongly convex. Given (u, v) ∈ X, the dual-based proximal methods of Section 3.3 require us to find the unique maximizer of

  (x, y) ↦ ⟨(u, v), (x, y)⟩ − f_0(x, y)  (39a)
    = ⟨u, x⟩ + ⟨v, y⟩ − (α/2)‖x − y‖² − (ε/2)(‖x‖² + ‖y‖²).  (39b)

We now derive an explicit formula for this maximizer:

Proposition 4.3. Given α > 0, ε > 0, and (u, v) ∈ X, the unique maximizer of (39) is

  (x, y) = (1/((2α + ε)ε)) ((α + ε)u + αv, (α + ε)v + αu).  (40)

Proof. Because f_0 is strongly convex, we employ standard convex calculus to find the maximizer by finding the zero of the gradient of the function in (39). That is, we need to solve

  (0, 0) = (u − αx + αy − εx, v − αy + αx − εy);  (41)

or equivalently (switching to more formal column vector notation),

  [[α + ε, −α], [−α, α + ε]] [x; y] = [u; v].  (42)

Because

  [[α + ε, −α], [−α, α + ε]]^{-1} = (1/((2α + ε)ε)) [[α + ε, α], [α, α + ε]],  (43)

we obtain the announced formula. ∎
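As a quick numerical check of (40) (our own test, not from the paper), one can compare the closed form against a direct solve of the optimality system (42):

```python
import numpy as np

alpha, eps, d = 3.0, 0.1, 4
rng = np.random.default_rng(1)
u, v = rng.standard_normal(d), rng.standard_normal(d)

# closed form (40)
c = 1.0 / ((2 * alpha + eps) * eps)
x = c * ((alpha + eps) * u + alpha * v)
y = c * ((alpha + eps) * v + alpha * u)

# direct solve of the block system (42)
M = np.block([[(alpha + eps) * np.eye(d), -alpha * np.eye(d)],
              [-alpha * np.eye(d), (alpha + eps) * np.eye(d)]])
sol = np.linalg.solve(M, np.concatenate([u, v]))
assert np.allclose(np.concatenate([x, y]), sol)
```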
5 Numerical experiments

We start by discussing metrics — in the sense of "standard of measurement", not in the sense of topology — to evaluate the quality of the iterates for the different algorithms proposed in Section 3. To make the measurement uniform over the different algorithms, we use this metric once for each prox or projection evaluation. For the same purpose, we consider A_i or B_i as unit inputs in these operators. So, for example, as DR projects onto all A_i and B_i along with a prox evaluation, we get a total of 2m + 1 operator evaluations per iteration.

5.1 When the solution is known

The convergence plots for the algorithms are straightforward when the solution is known. Assuming the solution is unique and denoted by (x̄, ȳ), we use the metric

  (x, y) ↦ ‖(x, y) − (x̄, ȳ)‖  (44)

applied to the appropriate iterates of the algorithms. The distance to the solution (44) is evaluated once for each projection or prox operator evaluation. For example, because each iteration of DR (see Section 3.2) uses 2m + 1 such evaluations, each DR step invokes 2m + 1 evaluations of (44).
Example 5.1. Consider the subsets A and B of Y = ℝ², defined by two systems of m = 4 linear inequalities in (x₁, x₂),  (45)

respectively. The corresponding problem (1) possesses a unique solution (ā, b̄) with b̄ = (4, 5),  (46)

which is also visualized in Fig. 1.

Figure 1: Visualization of the sets A (red) and B (blue) for Example 5.1 and the solution pair (black dots) for (1).

We ran the algorithms from Section 3, and also accelerated versions when available. The algorithms were run for a total of 1800 prox or projection operations per algorithm. The accelerated versions performed clearly better than the original versions in this case. Therefore, for the clarity of the exposition, we do not report the DPG and PDA results. We used the following parameters and also report to how many iterations in each algorithm this corresponded:

ACJ: 55 iterations, with λ_k = 1/(k+1), n_k = ⌊√k⌋, and (x′_k, y′_k) = (y_0, x_0) if k = 0, and (y_{k−1}, x_{k−1}) otherwise (see Section 3.1).

DR: 200 iterations, with p = 2 and α = 5 (see Section 3.2).

FDPG: 200 iterations, with L = 16 (see Section 3.3).

accPDA: 200 iterations (see Section 3.4).

SSD: 1800 iterations, with α = L = 1 and η_k = 1/√(k+1) (see Section 3.5).

Figure 2: Convergence plot for Example 5.1 using the known-solution metric (44).

All algorithms were started at a common point x_0 = y_0. The distance of the iterates from the solution, calculated using (44), results in the convergence plot shown in Fig. 2, where the grey-dotted lines mark intervals of 50 iterations of DR. From the plot, we see that accPDA appears to perform better than DR for the first 50 iterations, but then DR slowly but steadily begins to produce the most accurate solution. We note that since FDPG solves a strongly convex perturbation of the objective function, it converges to a solution that is not quite the same as that of our original problem.

Note that ACJ and SSD do not perform as well as the other algorithms in Example 5.1; however, when m becomes larger, they become much more competitive, as the following example illustrates:
Example 5.2. Let n ∈ {1, 2, ...}, and consider the subsets A and B of Y = ℝⁿ, defined by the m = n linear inequalities

  (x₁, ..., xₙ) ≥ (5, ..., 5) and (x₁, ..., xₙ) ≤ (−5, ..., −5),  (47)

respectively (understood entrywise). Clearly, the unique solution to the corresponding problem (1) is (x̄, ȳ) ∈ Y × Y, where x̄ = (5, ..., 5) and ȳ = (−5, ..., −5). This time we run the algorithms for around 10⁶ prox evaluations with the starting point x_0 = y_0 = (0, ..., 0). It turns out that in this case, PDA outperforms accPDA, so we omit the results of the latter. We only use the error metric once after every 50000 evaluations. We used the following parameters and also report to how many iterations in each algorithm this corresponded:

ACJ: 169 iterations, with λ_k = 1/(k+1), n_k = ⌊√k⌋, and (x′_k, y′_k) = (y_0, x_0) if k = 0, and (y_{k−1}, x_{k−1}) otherwise (see Section 3.1).

DR: 50000 iterations, with p = 2 and α = 5 (see Section 3.2).

FDPG: 50000 iterations (see Section 3.3).

PDA: 50000 iterations (see Section 3.4).

SSD: ≈ 10⁶ iterations, with α = L = 10 and η_k = 1/√(k+1) (see Section 3.5).

Figure 3: Convergence plot for Example 5.2 using the error metric (44).

For n = m and the error metric given by (44), we obtain the plot shown in Fig. 3. Note that ACJ and SSD operate in X = Y × Y = ℝ²ⁿ while, for instance, DR operates in the much bigger space X^{m+1} = ℝ^{2n(m+1)}! The plot makes it clear that in this situation ACJ and SSD fare much better than in Example 5.1. We note that given enough iterations, DR again trumps the other algorithms, but the initial descent is very slow. In fact, the other algorithms perform much better in the beginning than DR. This suggests an interesting topic for further research: one could consider a hybrid approach, where one uses an algorithm such as SSD, ACJ, PDA or FDPG at first, and then switches over to DR.

Figure 4: Using (48) to measure the performance of the algorithms.

5.2 When the solution is unknown

In general, one has no access to true solutions, so it becomes necessary to measure performance by a metric different from (44). We propose the measure

  D_δ(x, y) := ‖x − y‖ + δ ∑_{i∈I} d_{C_i}(x, y),  (48)

where δ > 0. Because the problem asks to find a point in A × B, feasibility is of greater importance than minimizing ‖x − y‖; thus, a smaller value of δ is desirable to stress feasibility. (One could also envision a "dynamic" metric, where δ → 0.) The resulting plot in Fig. 4, which uses D_δ with the same starting points x_0 and y_0, resembles the one seen in Fig. 2.
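The measure (48) is cheap to evaluate since it reuses the projections P_{C_i}; a short sketch (ours, same conventions as before):

```python
import numpy as np

def D_metric(x, y, proj_C, delta):
    """Performance measure (48): optimality term plus weighted infeasibility."""
    z = np.stack([x, y])
    infeas = sum(np.linalg.norm(z - P(z)) for P in proj_C)   # d_{C_i}(x, y)
    return np.linalg.norm(x - y) + delta * infeas
```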
5.3 Pairing up constraints

In some cases, it is possible to combine constraints and still be able to compute the projection onto the intersection. For instance, if all sets A_i are halfspaces, then any two sets A_i and A_j may be combined and the projection onto the intersection is explicitly available using, e.g., [5, Proposition 2.22–2.24].

Revisiting Example 5.1 in this light, we ran DR with the paired projection (PP), with choosing mostly non-adjacent halfspaces (labelled as DR+PP in Fig. 5), and also with choosing explicitly adjacent halfspaces (labelled as DR+PP adjacent in Fig. 5), along with the original version of DR (labelled as DR in Fig. 5). To compare this fairly, we count one "paired" projection (onto the intersection of two halfspaces) as being equivalent to two regular projections. The convergence plot shown in Fig. 5 illustrates that adding the paired projections significantly improves the performance, even more so when the halfspaces are adjacent. The well known and characteristic "rippling" seen in typical DR curves is heavily damped in the last case.

Figure 5: Comparing variants of DR on Example 5.1 with paired projections.
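One way to implement the paired projection is the standard case analysis for the intersection of two halfspaces. The following sketch is ours and assumes the two normals are linearly independent and the intersection is nonempty; the paper itself points to [5] for the underlying formulae.

```python
import numpy as np

def proj_two_halfspaces(a, alpha, b, beta, z):
    """Project z onto {x : <a,x> <= alpha} ∩ {x : <b,x> <= beta}."""
    ga, gb = a @ z - alpha, b @ z - beta
    if ga <= 0 and gb <= 0:                    # z is already feasible
        return z.copy()
    p1 = z - max(ga, 0) / (a @ a) * a          # projection onto the first halfspace
    if b @ p1 - beta <= 0:                     # p1 feasible => it is the answer
        return p1
    p2 = z - max(gb, 0) / (b @ b) * b          # projection onto the second halfspace
    if a @ p2 - alpha <= 0:
        return p2
    # otherwise the projection lies on both bounding hyperplanes (KKT system)
    G = np.array([[a @ a, a @ b], [a @ b, b @ b]])
    mu, nu = np.linalg.solve(G, np.array([ga, gb]))
    return z - mu * a - nu * b
```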
Using this technique on ACJ, we note that the paired-projection variants do not improve the performance significantly; see Fig. 6. On the other hand, the approach of the iterates to the true solution looks far less scattered, as can be seen in Fig. 7. In higher dimensions, further investigations are needed to determine the "best" way to pair up halfspaces as "adjacent halfspaces".

Figure 6: Comparing ACJ on Example 5.1 with paired projections.

Figure 7: Appearance of the iterates of ACJ for the paired projections.
6 Conclusion

We revisited the recent study by Aharoni, Censor, and Jiang on finding best approximation pairs of two polyhedra. The framework we proposed works for two sets that are themselves finite intersections of closed convex sets with "simple" projections. Several algorithms were proposed and the required prox operators were computed. Our numerical experiments suggested that other algorithms deserve serious consideration.
Acknowledgments
HHB and XW are supported by the Natural Sciences and Engineering Research Council of Canada.

References

[1] Y. Censor, Computational acceleration of projection algorithms for the linear best approximation problem, Linear Algebra and its Applications 416 (2006), 111–123. https://doi.org/10.1016/j.laa.2005.10.006

[2] R. Aharoni, Y. Censor, and Z. Jiang, Finding a best approximation pair of points for two polyhedra, Computational Optimization and Applications 71 (2018), 509–523. https://doi.org/10.1007/s10589-018-0021-3

[3] H.H. Bauschke and J.M. Borwein, On the convergence of von Neumann's alternating projection algorithm for two sets, Set-Valued Analysis 1 (1993), 185–212. https://doi.org/10.1007/BF01027691

[4] H.H. Bauschke and J.M. Borwein, Dykstra's alternating projection algorithm for two sets, Journal of Approximation Theory 79 (1994), 418–443. https://doi.org/10.1006/jath.1994.1136

[5] H.H. Bauschke and P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, second edition, Springer, 2017. https://doi.org/10.1007/978-3-319-48311-5

[6] H.H. Bauschke and W.M. Moursi, On the behavior of the Douglas–Rachford algorithm for minimizing a convex function subject to a linear constraint, SIAM Journal on Optimization 30 (2020), 2559–2576. https://doi.org/10.1137/19M1281538

[7] A. Beck, First-Order Methods in Optimization, SIAM, 2017. https://doi.org/10.1137/1.9781611974997

[8] W. Cheney and A.A. Goldstein, Proximity maps for convex sets, Proceedings of the AMS 10 (1959), 448–450. https://doi.org/10.2307/2032864

[9] K.L. Keys, H. Zhou, and K. Lange, Proximal distance algorithms: theory and practice, Journal of Machine Learning Research 20 (2019), 1–38. https://jmlr.csail.mit.edu/papers/v20/17-687.html

[10] K. Lange and K.L. Keys, The proximal distance algorithm, in Proceedings of the 2014 International Congress of Mathematicians, pages 96–116, Seoul: Kyung Moon, 2014. See also https://arxiv.org/abs/1507.07598