Random Function Iterations for Consistent Stochastic Feasibility
Neal Hermer∗, D. Russell Luke† and Anja Sturm‡

September 19, 2018

∗ Institut für Numerische und Angewandte Mathematik, Universität Göttingen, 37083 Göttingen, Germany. NH was supported by Deutsche Forschungsgemeinschaft Research Training Grant 2088 TP-B5. E-mail: [email protected]
† Institut für Numerische und Angewandte Mathematik, Universität Göttingen, 37083 Göttingen, Germany. DRL was supported in part by Deutsche Forschungsgemeinschaft Research Training Grant 2088 TP-B5. E-mail: [email protected]
‡ Institut für Mathematische Stochastik, Universität Göttingen, 37077 Göttingen, Germany. AS was supported in part by Deutsche Forschungsgemeinschaft Research Training Grant 2088 TP-B5. E-mail: [email protected]
Abstract
We study the convergence of iterated random functions for stochastic feasibility in the consistent case (in the sense of Butnariu and Flåm [1]) in several different settings, under decreasingly restrictive regularity assumptions of the fixed point mappings. The iterations are Markov chains and, for the purposes of this study, convergence is understood in very restrictive terms. We show that sufficient conditions for geometric (linear) convergence in expectation of stochastic projection algorithms presented in Nedić [2] are in fact necessary for geometric (linear) convergence in expectation more generally of iterated random functions.
MSC 2010: Primary 60J05, 52A22, 49J55; Secondary 49J53, 65K05.
Keywords:
Averaged mappings, nonexpansive mappings, paracontractions, stochastic feasibility, stochastic fixed point problem, iterated random functions, metric subregularity, linear regularity, linear convergence in expectation, geometric convergence of measures
1. Introduction
We are inspired by problems of the form

Find x* ∈ ⋂_{i∈I} C_i,

where I is an index set and the C_i are closed subsets of a metric space. When I is a finite set, then this is the classical deterministic feasibility problem. Randomized algorithms for solving such finite intersections have been intensely studied in recent years for their application to distributed computational schemes and machine learning. Our study here concerns a generalization where I is an arbitrary (possibly uncountable) set. This was first considered by Butnariu and Flåm
[1, 3, 4], where this is called the stochastic feasibility problem. There are many reasons why one might consider such infinite feasibility problems. At the time Butnariu and Flåm's work appeared, there was a great deal of interest in solution methods for linear integral equations. Our primary motivation is to propose a new way of modeling and analyzing errors, either numerical or measurement, as they are manifest in numerical iterative procedures.

The simplest algorithm one could imagine for solving such problems is to generate sequences (x_k)_{k∈N} by the fixed point iteration

x_{k+1} ∈ ( ∏_{i∈I} P_{C_i} ) x_k,

where P_{C_i} is the metric projection onto the set C_i and ∏_{i∈I} indicates the composition of the operators over the index set. This is known as the cyclic projections algorithm and has been extensively studied for the case when I is a finite set. But when I is infinite one immediately encounters the problem that such an algorithm never completes the first iteration!

One application where infinite feasibility problems appear very naturally is integral equations of the first kind in the separable Hilbert space L²([a,b]), as considered in [4]:

(Tx)(t) = ∫_a^b K(t,s) x(s) ds = g(t),  t ∈ [a,b],   (1)

with g ∈ L²([a,b]). The feasibility reformulation of this problem is

Find x ∈ ⋂_{t∈[a,b], a.s.} C_t := { φ ∈ L²([a,b]) | (Tφ)(t) = g(t) }.   (2)

The almost surely (a.s.) under the intersection will be clarified below. The basic idea, however, is to choose the parameter t above randomly and to choose the nearest point in the set C_t to the current guess. As such, the sequences that we generate are random processes. We return to this application in Section 4.3.

One of our main contributions is to place the previous results on stochastic projection iterations in the broader context of random function iterations, that is, iterations of randomly selected mappings with arbitrary initial distributions from which the initial points are chosen. Following [5] we characterize the sequence generated by the method of Random Function Iterations (Algorithm 1) as a Markov chain (Proposition 2.2). There are many different notions of convergence of Markov chains. For the consistent feasibility problems considered in this study (Standing Assumption 2) almost sure convergence applies, which perhaps is of limited interest in the broader context of generic Markov chains. When we move to inconsistent stochastic feasibility problems, which we formulate more generally as stochastic fixed point problems in a follow-up study, the richness of the theory of Markov processes comes much more into play. For the consistent case we are able to establish convergence results in a number of new settings, namely compact metric spaces or R^n with paracontracting mappings (see Sections 3.1 and 3.2, resp.) and separable Hilbert spaces with averaged mappings (see Section 3.3). (It is the technology of averaged mappings that opens the door to an analysis of algorithms for inconsistent stochastic feasibility problems and more generally stochastic fixed point problems.)

Our framework allows us to address a variety of Markov chains and stochastic algorithms, though to fix these ideas our main example will be stochastic sequential projection algorithms. We also achieve with this analysis more refined convergence statements about the corresponding sequences of random variables, namely almost sure strong convergence and, under certain assumptions, geometric convergence (called linear or exponential in other communities) of the measures (Theorem 3.12).
We then show necessary and sufficient conditions for geometric convergence in expectation (Theorem 3.15), in a stochastic analog to [6, Theorem 3.12 and Corollary 3.13]. These conditions are identified as a manifestation of metric subregularity of a suitable merit function at its zeros. When specialized to convex feasibility, this formulation of metric subregularity of the merit function is equivalent to the function possessing what is known as the KL property (Proposition 4.2). Finally, we identify a previously unrecognized necessary condition for geometric convergence of stochastic sequential projection algorithms, Theorem 4.3. Despite the strong assumptions of the present study, a number of unexpected behaviors can occur; these are demonstrated in concrete examples.
2. Stochastic Fixed Point Theory
2.1. Random function iterations

Consider a collection of continuous mappings T_i : G → G, i ∈ I, on a metric space (G, d), where I is an arbitrary index set. Assume that (I, ℐ) is a measurable space. Let (Ω, F, P) be a probability space. Let ξ be an I-valued random variable, i.e. ξ : Ω → I measurable (ξ^{-1}(A) ∈ F for all A ∈ ℐ). Let (ξ_n)_{n∈N} be an i.i.d. sequence with ξ_n =_d ξ (equality in distribution). Let µ be a probability measure on (G, B(G)). The stochastic selection method is given by Algorithm 1.
Algorithm 1: Random Function Iteration (RFI)
Initialization: X_0 ∼ µ
for k = 0, 1, 2, ... do
    X_{k+1} = T_{ξ_k} X_k
return (X_k)_{k∈N}

Here we mean by X_0 ∼ µ that the law (i.e. distribution) of X_0 satisfies L(X_0) := P_{X_0} := P(X_0 ∈ ·) := P ∘ X_0^{-1} = µ. The following assumptions will be employed throughout.

Standing Assumption 1.
a) X_0, ξ_0, ξ_1, ..., ξ_k are independent for every k ∈ N, where (ξ_k)_{k∈N} are i.i.d.
b) The function Φ : G × I → G, (x, i) ↦ T_i x, is measurable.

The iterates of the RFI, X_{k+1} = Φ(X_k, ξ_k) = T_{ξ_k} X_k, k ∈ N, can be considered as a time-homogeneous Markov chain with transition kernel

p(x, A) := P(Φ(x, ξ) ∈ A) = P(T_ξ x ∈ A)   (x ∈ G, A ∈ B(G)),   (3)

where Φ is called an update function.
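To make the iteration concrete, here is a minimal sketch of Algorithm 1 in G = R², assuming (purely for illustration, not from the paper) that the T_i are metric projections onto randomly selected lines through the origin — a consistent family that anticipates Example 4.8 below; all names and parameters are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_line(x, alpha):
    """Metric projection of x onto the line C_alpha = R * e_alpha."""
    e = np.array([np.cos(alpha), np.sin(alpha)])
    return np.dot(e, x) * e

def rfi(x0, beta=np.pi / 3, n_iters=200):
    """Random Function Iteration: X_{k+1} = T_{xi_k} X_k, with X_0 = x0."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        alpha = rng.uniform(0.0, beta)  # draw xi_k i.i.d., here unif[0, beta]
        x = project_line(x, alpha)      # apply the randomly selected mapping
    return x

print(rfi([1.0, -2.0]))  # tends to the feasible point C = {0} almost surely
```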
To see that p is really a transition kernel, recall that, in general, a transition kernel p : G × B(G) → [0, 1] is measurable in the first argument, i.e. p(·, A) is measurable for all A ∈ B(G) (this follows from [7, Lemma 1.26]), and is a probability measure in the second argument, i.e. p(x, ·) is a probability measure for all x ∈ G (immediate by definition). Now recall the definition of a Markov chain with general transition kernel p.

Definition 2.1.
A sequence of random variables (X_k)_{k∈N}, X_k : (Ω, F, P) → (G, B(G)), is called a Markov chain with transition kernel p if for all k ∈ N and A ∈ B(G) it holds P-a.s. that
(i) P(X_{k+1} ∈ A | X_0, X_1, ..., X_k) = P(X_{k+1} ∈ A | X_k);
(ii) P(X_{k+1} ∈ A | X_k) = p(X_k, A).

Since (G, B(G)) is a Borel space and X_k is a random variable in G, it makes sense to talk about these conditional expectations (existence of regular versions of the conditional distribution by [7, Theorem 5.3]). From the setting above, the following fact follows easily from Theorem A.9.

Proposition 2.2.
Under Standing Assumption 1, the sequence of random variables (X_k)_{k∈N} generated by Algorithm 1 is a Markov chain with transition kernel p given by (3).

Let µ ∈ P(G), where P(G) is the space of all probability measures on G. The Markov operator P acting on a measure µ is defined via

µP(A) := ∫_G p(x, A) µ(dx)   (A ∈ B(G)).

One defines the operation of the Markov operator acting on a measurable function f : G → R via

Pf(x) := ∫_G f(y) p(x, dy)   (x ∈ G).

Note that

Pf(x) = ∫_G f(y) P_{Φ(x,ξ)}(dy) = ∫_Ω f(Φ(x, ξ(ω))) P(dω) = ∫_I f(Φ(x, u)) P_ξ(du).

There are several notions of convergence that one can study for the sequence (X_k) on the metric space (G, d), or correspondingly for the laws (L(X_k)) on P(G), where L(X_k) := P_{X_k} := P(X_k ∈ ·). Denote the set of bounded and continuous functions from G to R by C_b(G). Let (ν_n) be a sequence of probability measures on G. The sequence (ν_n) converges to ν in the weak sense if ν ∈ P(G) and for all f ∈ C_b(G) it holds that ν_n f → νf as n → ∞.

A fixed point of the Markov operator P is called an invariant distribution, i.e. π ∈ P(G) is invariant if and only if πP = π. A very general notion of convergence in the context of Markov chains concerns the convergence of the probability measures ν_k := (1/k) Σ_{j=1}^k L(X_j) in the weak sense. An elementary fact from the theory of Markov chains (Theorem A.4) is that, if this sequence converges, it converges to an invariant probability measure π: for f ∈ C_b(G),

ν_k f = E[ (1/k) Σ_{j=1}^k f(X_j) ] → πf,  as k → ∞.

The notion of convergence we consider here is much stronger; we consider almost sure convergence of the sequence (X_k) to a random variable X: X_k → X a.s. as k → ∞. Clearly, almost sure convergence of the sequence implies the more general notion above. This is common in studies of stochastic algorithms in optimization, though it does not require the full power of the theory of general Markov processes. For consistent feasibility (defined below), however, this is all that is needed for our present purposes. In a follow-up study we will need to consider more general notions of convergence of this Markov chain.
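As a quick illustration, the action of the Markov operator on a function, Pf(x) = ∫_I f(Φ(x,u)) P_ξ(du), can be estimated by plain Monte Carlo sampling of ξ. The update Φ and the test function below are hypothetical stand-ins chosen by us, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def markov_operator(f, x, phi, sample_xi, n=20_000):
    """Monte Carlo estimate of P f(x) = E[f(Phi(x, xi))]."""
    return np.mean([f(phi(x, sample_xi())) for _ in range(n)])

# toy update: project a scalar onto the random interval [u - 1, u + 1]
phi = lambda x, u: float(np.clip(x, u - 1.0, u + 1.0))
sample_xi = lambda: rng.uniform(-0.5, 0.5)
print(markov_operator(abs, 3.0, phi, sample_xi))  # approximates P|.|(3.0)
```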
2.2. Consistent Stochastic Feasibility Problem

The stochastic feasibility problem is to find a point

x* ∈ C := { x ∈ G | P(x ∈ Fix T_ξ) = 1 },   (4)

where the fixed point set of the operator T_i is denoted by

Fix T_i = { x ∈ G | x = T_i x }.

We assume throughout that not only is Fix T_i nonempty for P_ξ-almost all i ∈ I, but more restrictively:

Standing Assumption 2 (consistent feasibility problem).
The set C is nonempty.

Note that, due to continuity of T_i, it follows that Fix T_i is a closed set. This specializes immediately to the stochastic feasibility problem formulated by Butnariu and Flåm [1], where Fix T_ξ = C_ξ. In order to make sense of the specialization to stochastic set feasibility, we need the event {x ∈ Fix T_ξ} to be an element of F for any x ∈ G.

Remark 2.3:
Since {x} ∈ B(G) and the function Φ_ξ : G × Ω → G, (x, ω) ↦ Φ_ξ(x, ω) := (Φ ∘ (Id, ξ))(x, ω) = T_{ξ(ω)} x, is measurable as a composition of two measurable functions, we find

{x ∈ Fix T_ξ} = { ω ∈ Ω | x ∈ Fix T_{ξ(ω)} } = { ω ∈ Ω | T_{ξ(ω)} x = x } = { ω ∈ Ω | (x, ω) ∈ Φ_ξ^{-1}({x}) } ∈ F,

since slices of sets in the product σ-field are measurable with respect to the single σ-fields (see Lemma A.1).

Denote in the following, for A ⊂ Ω, C(A) := ⋂_{ω∈A} Fix T_{ξ(ω)}.

Lemma 2.4 (equivalence of stochastic and deterministic feasibility problems). Under the standing assumptions, and if G is complete and separable, there exists a P-nullset N ⊂ Ω such that

C = C(Ω \ N) = ⋂_{ω∈Ω\N} Fix T_{ξ(ω)}.

Furthermore, C ⊂ G is closed.

Proof. For the direction "⊃", note that P(Ω \ N) = 1 for any P-nullset N ⊂ Ω, so for x ∈ C(Ω \ N) it holds that P(x ∈ Fix T_ξ) = 1, i.e. x ∈ C.

Consider now the direction "⊂". Let Q be a dense and countable subset of C (this exists by Theorem A.2). Since for each q ∈ Q, P(q ∈ Fix T_ξ) = 1, there is N_q ⊂ Ω with P(N_q) = 0 and q ∈ C(Ω \ N_q). Set N = ⋃_{q∈Q} N_q; then P(N) = 0 and q ∈ C(Ω \ N) for all q ∈ Q. Now let c ∈ C, so there exists (q_n)_{n∈N} ⊂ Q with q_n → c as n → ∞. Since, for all i ∈ I, Fix T_i is closed by continuity of T_i, we get c = lim_{n→∞} q_n ∈ C(Ω \ N). The set C(Ω \ N) is defined as an intersection of closed sets and hence is closed itself.

Remark 2.5 (interpretation): Lemma 2.4 shows that the feasible set C in the separable case can be written as an intersection of a selection of sets Fix T_{ξ(ω)}, as in the deterministic formulation of the fixed point problem, but where ω ∈ Ω \ N for a nullset N ⊂ Ω. In fact C(Ω) is in general a proper subset of C = C(Ω \ N), or can even be empty. But note that, even though the construction of C in Lemma 2.4 appears to depend on the random variable ξ, in fact C only depends on the distribution P_ξ by definition. Furthermore, in the context of more general Markov chains, we have

p(c, {c}) = P(T_ξ c ∈ {c}) = P(Ω \ N) = 1   (c ∈ C).

Hence

δ_c P(A) = p(c, A) = 1_A(c) = δ_c(A)   (A ∈ B(G)).

In other words, the delta function δ_c for c ∈ C is an invariant measure for P.

Corollary 2.6 (P_ξ-nullset, separable space). Under the assumptions of Lemma 2.4, there exists a P-nullset N with C = C(Ω \ N) such that ξ(N) := {ξ(ω) | ω ∈ N} is a P_ξ-nullset, where we denote P_ξ = P(ξ ∈ ·), and it satisfies

C = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

Proof.
We will construct a P-nullset N for which ξ(Ω \ N) = ξ(Ω) \ ξ(N), where ξ(N) is a P_ξ-nullset; in that case it immediately follows that

⋂_{ω∈Ω\N} Fix T_{ξ(ω)} = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

Let A_x := { i ∈ I | T_i x = x } for x ∈ G. Then, analogously to Remark 2.3,

A_x = { i ∈ I | (x, i) ∈ Φ^{-1}({x}) } ∈ ℐ,

and so is A := ⋂_{c∈C} A_c = ⋂_{q∈Q} A_q as a countable intersection of measurable sets (Q ⊂ C dense and countable, see the proof of Lemma 2.4). Let Ñ be the P-nullset from Lemma 2.4, i.e. C = C(Ω \ Ñ); note that, due to

C = ⋂_{ω∈Ω\Ñ} Fix T_{ξ(ω)} = ⋂_{i∈ξ(Ω\Ñ)} Fix T_i ≠ ∅,

it holds that ξ(Ω \ Ñ) ⊂ A_c ≠ ∅ for all c ∈ C. Set N := Ω \ ξ^{-1}(A). Then from Ω \ Ñ ⊂ ξ^{-1}(A) it follows that N ⊂ Ñ is a P-nullset and

P_ξ(A) = P(ξ^{-1}(A)) ≥ P(Ω \ Ñ) = 1,

i.e. P_ξ(ξ(N)) = 1 − P_ξ(A) = 0. By the definition of A we have, for ω ∈ ξ^{-1}(A), that any c ∈ C satisfies c ∈ Fix T_{ξ(ω)}, so it follows that C ⊂ C(ξ^{-1}(A)). Due to C(Ω \ N) ⊂ C(Ω \ Ñ) it holds that

C = ⋂_{ω∈ξ^{-1}(A)} Fix T_{ξ(ω)} = ⋂_{i∈ξ(ξ^{-1}(A))} Fix T_i = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

Note that from N = Ω \ ξ^{-1}(A) it follows that ξ(N) = ξ(Ω) \ ξ(ξ^{-1}(A)).

If ξ is not surjective, then ξ(Ω) ≠ I. In that case, there is a P_ξ-nullset ξ(N) of indices in I that are not needed to characterize the fixed point set, and these indices can be removed from the index set I. Note also that, in general, the P-nullsets occurring in Lemma 2.4 and Corollary 2.6 are different. If there is N ⊂ Ω with C = C(Ω \ N), then it need not be the case that C = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

In the context of the iterates X_k of Algorithm 1, in many of the results below we construct the set N in Lemma 2.4 as follows:

N = ⋃_k N_k, where N_k := Ω \ { ω ∈ Ω | T_{ξ_k(ω)} c = c ∀ c ∈ C }.   (5)

From Lemma 2.4 we have that N_k is a set of measure zero, hence so is N.

3. Convergence Analysis

We achieve convergence of iterated random functions for consistent stochastic feasibility in several different settings under different assumptions on the metric spaces and the mappings T_i (i ∈ I). The main properties of the mappings we consider are:

• quasi-nonexpansive mappings, i.e.

d(T_i x, y) ≤ d(x, y)   (∀ x ∉ Fix T_i)(∀ y ∈ Fix T_i);   (6)

• paracontractions, i.e. T_i is continuous and

d(T_i x, y) < d(x, y)   (∀ x ∉ Fix T_i)(∀ y ∈ Fix T_i);   (7)

• nonexpansive mappings, i.e.

d(T_i x, T_i y) ≤ d(x, y)   (∀ x, y ∈ G);   (8)
• averaged mappings on a normed linear space H, i.e. mappings T : H → H for which there exists an α ∈ (0, 1) such that

‖Tx − Ty‖² + ((1−α)/α) ‖(x − Tx) − (y − Ty)‖² ≤ ‖x − y‖²   (∀ x, y ∈ H).   (9)

Note that for a quasi-nonexpansive mapping T : G → G the condition x ∈ Fix T implies that d(Tx, y) = d(x, y) for all y ∈ G. The set of quasi-nonexpansive mappings contains the paracontractions and the nonexpansive mappings. The set of projectors onto convex sets, or more generally the set of averaged mappings on a Hilbert space H, is contained in both the set of nonexpansive mappings and the set of paracontractions [8, Remark 4.24 and 4.26]. For an example of a paracontraction that is not averaged see Example 3.1 and Appendix B. Averaged mappings were first used in the work of Mann, Krasnoselski, Edelstein, Gurin, Polyak and Raik, who wrote seminal papers in the analysis of (firmly) nonexpansive and averaged mappings [9–12], although the terminology "averaged" wasn't coined until sometime later [13].
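The averagedness inequality (9) can be checked numerically for a concrete mapping. The sketch below (our own illustration) does this for the projector onto a closed ball, which is 1/2-averaged (firmly nonexpansive, cf. Section 4.1):

```python
import numpy as np

rng = np.random.default_rng(2)

def project_ball(x, radius=1.0):
    """Projection onto the closed Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

alpha = 0.5  # projectors onto convex sets are 1/2-averaged
for _ in range(10_000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    Tx, Ty = project_ball(x), project_ball(y)
    lhs = (np.linalg.norm(Tx - Ty) ** 2
           + (1 - alpha) / alpha * np.linalg.norm((x - Tx) - (y - Ty)) ** 2)
    assert lhs <= np.linalg.norm(x - y) ** 2 + 1e-10  # inequality (9) holds
```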
Example 3.1. Let f : R → R_+ be continuous with f(0) = 0 and |f(x)| < |x| for all x ∈ R \ {0}; then f is paracontractive. This includes also convex functions, e.g. Huber functions, which are not averaged in general (see Appendix B). For other examples on R^n see also Appendix B.

3.1. Compact metric spaces

In this section we establish convergence of the RFI on a compact metric space. The next example illustrates why nonexpansivity alone does not suffice to guarantee convergence to the intersection set C.

Example 3.2 (nonexpansive mappings, negative result). For nonexpansive mappings in general, one cannot expect that the support of every invariant measure is contained in the feasible set C. Consider a rotation in positive direction in R²,

A = ( cos(φ)  −sin(φ)
      sin(φ)   cos(φ) ),  φ ∈ (0, 2π),

and set ξ ≡ 1 and I = {1}, T_1 = A. Then C = {0} and, since ‖A‖ = 1, A is nonexpansive, but ‖Ax‖ = ‖x‖ for all x ∈ R². So the (deterministic) iteration X_{k+1} = AX_k will not converge to 0 whenever X_0 ∼ δ_x, x ≠ 0.
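The failure in Example 3.2 is easy to see numerically: the rotation preserves the norm, so the iterates circle the feasible point forever. A minimal check (our own illustration):

```python
import numpy as np

phi = 0.7  # any rotation angle in (0, 2*pi)
A = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])

x = np.array([1.0, 0.0])
for k in range(5):
    x = A @ x
    print(k + 1, np.linalg.norm(x))  # norm stays 1.0: no convergence to C = {0}
```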
A sufficient requirement on the mappings T_i to ensure convergence of Algorithm 1 is paracontractiveness. The next lemma is the main ingredient for proving a.s. convergence of (X_k) to a random point in C. The support of a probability measure ν ∈ P(G) is the smallest closed set S ⊂ G for which ν(G \ S) = 0 (see also Theorem A.3 for equivalent representations); we then write S = supp ν.

Lemma 3.3 (invariant measures for paracontractions). Under the standing assumptions, and if T_i (i ∈ I) is paracontracting on a compact metric space, then the set of invariant measures for P is {π ∈ P(G) | supp π ⊂ C}.

Proof. It is clear that π ∈ P(G) with supp π ⊂ C is invariant, since p(x, {x}) = P(T_ξ x ∈ {x}) = P(x ∈ Fix T_ξ) = 1 for all x ∈ C, and hence πP(A) = ∫_C p(x, A) π(dx) = π(A) for all A ∈ B(G).

The other implication is not so immediate. Suppose supp π \ C ≠ ∅ for some invariant measure π of P. Then, due to compactness of supp π (as it is closed in G), we can find s ∈ supp π maximizing the continuous function dist(·, C) over supp π. So d_max = dist(s, C) > 0. We show that the probability mass around s will be attracted to the feasible set C, implying that the invariant measure loses mass around s in every step, which yields a contradiction.

Define the set of points lying more than d_max − ε away from C:

K(ε) := { x ∈ G | dist(x, C) > d_max − ε },  ε ∈ (0, d_max).

This set is measurable, i.e. K(ε) ∈ B(G), because it is open. Let M(ε) be the event in F where T_ξ s is at least ε closer to C than s, i.e.

M(ε) := { ω ∈ Ω | dist(T_{ξ(ω)} s, C) ≤ d_max − ε }.

There are two possibilities: either there is an ε ∈ (0, d_max) with P(M(ε)) > 0, or no such ε exists. In the latter case we have dist(T_ξ s, C) = d_max = dist(s, C) a.s. by paracontractiveness of T_i. By compactness of C there exists c ∈ C such that 0 < d_max = d(s, c). Hence the probability of the set of ω ∈ Ω such that s ∉ Fix T_{ξ(ω)} is positive, and so is the probability that dist(T_{ξ(ω)} s, C) ≤ d(T_{ξ(ω)} s, c) < d(s, c) — a contradiction. So it must hold that there is an ε ∈ (0, d_max) with P(M(ε)) > 0.
In view of continuity of the mappings T_i around s, i ∈ I, define

A_n := { ω ∈ M(ε) | d(T_{ξ(ω)} x, T_{ξ(ω)} s) ≤ ε/2 ∀ x ∈ B(s, 1/n) }   (n ∈ N).

It holds that A_n ⊂ A_{n+1} and P(⋃_n A_n) = P(M(ε)). So in particular there is an m ∈ N, m > 2/ε, with P(A_m) > 0.
For all x ∈ B(s, 1/m) and all ω ∈ A_m we have

dist(T_{ξ(ω)} x, C) ≤ d(T_{ξ(ω)} x, T_{ξ(ω)} s) + dist(T_{ξ(ω)} s, C) ≤ d_max − ε/2,

which means T_{ξ(ω)} x ∈ G \ K(ε/2). Hence, in particular, we conclude that

p(x, K(ε/2)) < 1   ∀ x ∈ B(s, 1/m).

Since p(x, K(ε/2)) = 0 for x ∈ G with dist(x, C) ≤ d_max − ε/2, due to paracontractiveness, it holds by invariance of π that

π(K(ε/2)) = ∫_G p(x, K(ε/2)) π(dx) = ∫_{K(ε/2)} p(x, K(ε/2)) π(dx).
It follows, then, that

π(K(ε/2)) = ∫_{K(ε/2)} p(x, K(ε/2)) π(dx)
          = ∫_{B(s,1/m)} p(x, K(ε/2)) π(dx) + ∫_{K(ε/2)\B(s,1/m)} p(x, K(ε/2)) π(dx)
          < π(B(s, 1/m)) + π(K(ε/2) \ B(s, 1/m)) = π(K(ε/2)),

which is a contradiction (note that B(s, 1/m) ⊂ K(ε/2) since 1/m < ε/2, and π(B(s, 1/m)) > 0 since s ∈ supp π). So the assumption that supp π \ C ≠ ∅ is false, i.e. supp π ⊂ C as claimed.

Theorem 3.4 (Theorem 4.22 in [14]). Under Standing Assumption 1, if T_i (i ∈ I) is continuous, then the Markov operator P is Feller, i.e. P : C_b(G) → C_b(G).

Proof. By continuity of T_i, i ∈ I, the update function Φ is continuous in the first argument. It follows for f ∈ C_b(G) and x_n → x as n → ∞, by Lebesgue's Dominated Convergence Theorem, that

Pf(x_n) = ∫_I f(Φ(x_n, u)) P_ξ(du) → ∫_I f(Φ(x, u)) P_ξ(du) = Pf(x).

Note that Pf is bounded whenever f is a bounded function.

Theorem 3.5 (almost sure convergence for a compact metric space). Under the standing assumptions, let T_i be paracontractive, i ∈ I, and let (G, d) be a compact metric space. Then the sequence (X_k) of random variables generated by Algorithm 1 converges almost surely to a random variable X_µ ∈ C depending on the initial distribution µ.

Proof. Since P is Feller and G compact, Theorem A.4 implies that any subsequence of (ν_n), where ν_n = (1/n) Σ_{i=1}^n L(X_i), has a convergent subsequence, and cluster points are invariant measures for P. Let (ν_{n_k}) be a convergent subsequence with limit π. So for the bounded and continuous function dist(·, C) it holds that ν_{n_k} dist(·, C) → π dist(·, C) = 0 as k → ∞, by weak convergence of the probability measures and the fact that, by Lemma 3.3, supp π ⊂ C.

Due to quasi-nonexpansiveness and Lemma 2.4 (a compact metric space is separable), we have a.s. (for all ω ∉ N, with N given by (5)) that d(X_{k+1}, c) ≤ d(X_k, c) for all c ∈ C and k ∈ N, which implies dist(X_{k+1}, C) ≤ dist(X_k, C) for all k ∈ N a.s. It therefore follows that

E[dist(X_{n_k}, C)] ≤ (1/n_k) Σ_{i=1}^{n_k} E[dist(X_i, C)] = ν_{n_k} dist(·, C) → 0  as k → ∞.

This yields E[dist(X_k, C)] → 0 as k → ∞. Now since (dist(X_k, C))_k is a.s. nonincreasing, it must be that dist(X_k, C) → 0 a.s., so for any cluster point x_ω of (X_k(ω))_k we have x_ω ∈ C. This, together with a.s. monotonicity of (d(X_k, c))_k for all c ∈ C, implies that d(X_k(ω), x_ω) → 0 for any cluster point x_ω of (X_k(ω))_k, which implies the uniqueness of x_ω. In other words, (X_k) converges almost surely to a random variable X_µ with X_µ(ω) = x_ω ∈ C, ω ∉ N, as claimed.

The results for compact metric spaces can be applied, with minor adjustments, to finite dimensional vector spaces.
3.2. Convergence in R^n

In the following let (G, d) = (V, ‖·‖) be a finite dimensional normed vector space over R. This means in particular that V is also complete, every closed and bounded set is compact (Heine–Borel property), and all norms on V are equivalent. So actually, since all n-dimensional vector spaces are isomorphic, it is enough to study convergence in R^n equipped with the Euclidean norm ‖·‖.

The following result for R^n is a straightforward application of Theorem 3.5.

Theorem 3.6 (almost sure convergence in R^n). Under the standing assumptions, let T_i : R^n → R^n be paracontractive, i ∈ I. Then the sequence (X_k) of random variables generated by Algorithm 1 converges almost surely to a random variable X_µ ∈ C depending on the initial distribution µ.

Proof. First, suppose µ = δ_x for x ∈ R^n. Let N be given by (5). The quasi-nonexpansiveness property gives us ‖X_{k+1} − c‖ ≤ ‖X_k − c‖ for all c ∈ C a.s. (i.e. if ω ∉ N). Letting c ∈ C with dist(x, C) = ‖x − c‖, this implies X_k ∈ B(c, ‖x − c‖), where B(s, ε) ⊂ R^n is the closed ball around s ∈ R^n with radius ε. The assertion X_k → X_{δ_x} a.s. then follows from Theorem 3.5. Denote the corresponding invariant measure by π_x := L(X_{δ_x}).

Suppose now that µ ∈ P(R^n) is arbitrary. For f ∈ C_b(R^n) one has p^k(x, f) ≤ ‖f‖_∞ for all k ∈ N and x ∈ R^n. Note that p^k(x, f) = δ_x P^k f, and from the above argument δ_x P^k → π_x in the weak sense as k → ∞. Hence by Lebesgue's Dominated Convergence Theorem we get

µP^k f = ∫_{R^n} p^k(x, f) µ(dx) → ∫_{R^n} π_x f µ(dx) =: µπ_x f =: π_µ f,  as k → ∞.

We conclude that L(X_k) = µP^k → π_µ weakly. The measure π_µ = µπ_x is an invariant probability measure for P, since π_x is an invariant probability measure for P.

Choosing f = min{dist(·, C), M} ∈ C_b(R^n) with M > 0 yields L(X_k)f → π_µ f = 0. Since f(X_{k+1}) ≤ f(X_k) a.s. and L(X_k)f = E[f(X_k)] → 0,
it holds that f(X_k) → 0 a.s., hence dist(X_k, C) → 0 a.s. So for any a.s. convergent subsequence (X_{n_k}(ω))_k with limit x_ω it holds that x_ω ∈ C. Moreover, since (‖X_k − x_ω‖)_k is monotone, actually X_k(ω) → x_ω. We conclude that X_k → X_µ a.s., where X_µ(ω) := lim_k X_k(ω) = x_ω ∈ C, ω ∉ N.
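For a concrete instance of Theorem 3.6, the sketch below iterates two paracontractions on R² of the type described in Example 3.1, each shrinking one coordinate toward 0, so that C = Fix T_0 ∩ Fix T_1 = {(0, 0)}. The construction is our own illustration, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def T(i, x):
    """Paracontraction shrinking coordinate i toward 0; Fix T_i = {x : x_i = 0}."""
    y = x.copy()
    y[i] = y[i] / (1.0 + abs(y[i]))
    return y

x = np.array([5.0, -3.0])
for _ in range(5000):
    x = T(rng.integers(2), x)  # xi_k uniform on I = {0, 1}
print(x)  # converges a.s. to the feasible point (0, 0)
```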
3.3. Separable Hilbert spaces

In this section (G, d) is a Hilbert space (H, ⟨·,·⟩). Under the standing assumptions the following extended-valued function

R(x) := E[‖x − T_ξ x‖²] = ∫_Ω ‖x − T_{ξ(ω)} x‖² P(dω) = ∫_I ‖x − T_u x‖² P_ξ(du)

is measurable from H to [0, ∞]. Following [5] we use this function to characterize convergence of the consistent fixed point problem under the weaker assumption that the mappings T_ξ are averaged (see Eq. (9)).

Lemma 3.7 (properties of R and C for quasi-nonexpansive mappings). In addition to the standing assumptions, suppose that T_i (i ∈ I) is quasi-nonexpansive and continuous. Then
(i) C = R^{-1}(0);
(ii) R is finite everywhere;
(iii) R is continuous;
(iv) C is convex and closed.

Proof. (i) We have x ∈ C ⇔ x ∈ Fix T_ξ a.s. ⇔ x = T_ξ x a.s. ⇔ R(x) = 0.

(ii) Fix x ∈ C; then x = T_ξ x a.s. Using quasi-nonexpansivity we get, a.s., that

‖y − T_ξ y‖ ≤ ‖y − x‖ + ‖x − T_ξ y‖ ≤ 2‖x − y‖   ∀ y ∈ H,   (10)

and hence

‖y − T_ξ y‖² ≤ 4‖y − x‖²   ∀ y ∈ H.   (11)

From (11) it follows that R(y) ≤ 4‖y − x‖² < ∞ for all y ∈ H.

(iii) Let x, x_n ∈ H, n ∈ N, with x_n → x as n → ∞. Define the functions f_n(ω) = ‖x_n − T_{ξ(ω)} x_n‖² on Ω (n ∈ N). Then, by continuity of T_{ξ(ω)} for fixed ω ∈ Ω, one has f_n → f := ‖x − T_ξ x‖² pointwise on Ω. Define the constant function g(ω) = 8ε² + 8‖x − c‖² for some c ∈ C and some ε > 0.
By (10) we have that ‖y − T_ξ y‖ ≤ 2‖y − c‖ for all y ∈ H. For y ∈ B(x, ε) this yields ‖y − T_ξ y‖ ≤ 2ε + 2‖x − c‖, so that f_n ≤ (2ε + 2‖x − c‖)² ≤ 8ε² + 8‖x − c‖². We conclude that g is P-integrable and f_n ≤ g for all n ∈ N with x_n ∈ B(x, ε). Finally, application of Lebesgue's Dominated Convergence Theorem yields R(x_n) = E[f_n] → E[f] = R(x) as n → ∞.

(iv) This follows from [8, Proposition 4.13, Proposition 4.14]. Note that for any α ∈ R, a, b ∈ H, we have [8, Corollary 2.14]

‖αa + (1−α)b‖² = α‖a‖² + (1−α)‖b‖² − α(1−α)‖a − b‖².

Let z = λx + (1−λ)y with x, y ∈ R^{-1}(0) = C, λ ∈ [0, 1]. It follows from T_ξ x = x and T_ξ y = y a.s. that a.s.

‖T_ξ z − z‖² = ‖λ(T_ξ z − x) + (1−λ)(T_ξ z − y)‖²
            = λ‖T_ξ z − x‖² + (1−λ)‖T_ξ z − y‖² − λ(1−λ)‖x − y‖²
            ≤ λ‖z − x‖² + (1−λ)‖z − y‖² − λ(1−λ)‖x − y‖²
            = ‖λ(z − x) + (1−λ)(z − y)‖² = 0.

So R(z) = 0, i.e. z ∈ R^{-1}(0). Closedness of R^{-1}(0) follows by continuity of R.

In the next theorem we need to compute conditional expectations of nonnegative real-valued random variables, which are non-integrable in general (for example, if the random variable X_0 with distribution µ does not have a finite expectation, E[‖X_0‖] = +∞). But for these random variables the classical results on integrable random variables are still applicable (see Theorem A.8); also the disintegration theorem is still valid (see Theorem A.9).

The stage is now set to show convergence for the corresponding Markov chain. The next several results concern weak convergence of sequences of random variables with respect to the Hilbert space, namely, x_n ⇀ x if ⟨x_n, y⟩ → ⟨x, y⟩ for all y ∈ H.

Theorem 3.8 (weak cluster points belong to feasible set for averaged mappings). Under the standing assumptions, let T_i be α_i-averaged with α_i ≤ α < 1 for all i ∈ I. Then weak cluster points (in the sense of Hilbert spaces) of the sequence (X_k)_{k∈N} of random variables in H generated by Algorithm 1 are a.s. contained in C.

Proof. Fix c ∈ C. Since T_ξ is averaged we have for all k ∈ N that

‖X_{k+1} − c‖² ≤ ‖X_k − c‖² − ((1−α)/α)‖X_{k+1} − X_k‖²   (12)

everywhere but on a P-nullset N_c, which may depend on c. Let F_k = σ(X_0, ξ_0, ..., ξ_{k−1}) be the σ-algebra generated by all iterations of the algorithm up to the k-th, and apply Lemma A.10. We get that Σ_{k∈N} R(X_k) < ∞ a.s., where from Theorem A.9 it follows that E[‖X_{k+1} − X_k‖² | F_k] = R(X_k). Hence there is Ñ ⊂ Ω with P(Ñ) = 0 and R(X_k(ω)) → 0 as k → ∞ for ω ∈ Ω \ (N_c ∪ Ñ).

By nonexpansiveness of T_ξ we find, for any x, x_n ∈ H,

‖x − T_ξ x‖² = ‖x_n − T_ξ x‖² − ‖x − x_n‖² + 2⟨x − T_ξ x, x − x_n⟩
            = ‖x_n − T_ξ x_n‖² + ‖T_ξ x − T_ξ x_n‖² + 2⟨x_n − T_ξ x_n, T_ξ x_n − T_ξ x⟩ − ‖x − x_n‖² + 2⟨x − T_ξ x, x − x_n⟩
            ≤ ‖x_n − T_ξ x_n‖² + 2⟨x_n − T_ξ x_n, T_ξ x_n − T_ξ x⟩ + 2⟨x − T_ξ x, x − x_n⟩
            ≤ ‖x_n − T_ξ x_n‖² + 2‖x_n − T_ξ x_n‖‖x_n − x‖ + 2⟨x − T_ξ x, x − x_n⟩.

Taking expectations and using Jensen's inequality yields

R(x) ≤ R(x_n) + 2√(R(x_n)) ‖x_n − x‖ + 2 E[⟨x − T_ξ x, x − x_n⟩].   (13)

Now assume that the sequence (x_n) is weakly convergent to x ∈ H, i.e. x_n ⇀ x.
Then the functions f_n = ⟨x − T_ξ x, x − x_n⟩, n ∈ N, on Ω satisfy f_n → 0 pointwise, and the P-integrable function g(ω) := ‖x − T_{ξ(ω)} x‖ · sup_n ‖x − x_n‖ gives us |f_n| ≤ g for all n ∈ N; hence by Lebesgue's Dominated Convergence Theorem E[⟨x − T_ξ x, x − x_n⟩] → 0 as n → ∞.

So for ω ∈ Ω \ (N_c ∪ Ñ) there is a weakly convergent subsequence of the bounded sequence (X_k(ω))_{k∈N}, denoted x_n := X_{k_n}(ω) ⇀ x_ω =: x as n → ∞. As shown above, this subsequence satisfies R(x_n) → 0 as n → ∞. We conclude with (13) that R(x) = 0, i.e. x ∈ C, and hence any weak cluster point of the sequence (X_k(ω))_k is contained in C.

In the case of separable Hilbert spaces, we are able to show Fejér monotonicity of the sequence (X_k) a.s., so the classical theory of convergence analysis from [8] can be applied in this case. An analogous statement for nonseparable Hilbert spaces remains open since we do not have the representation Lemma 2.4 at hand.

Theorem 3.9 (almost sure weak convergence under separability). Under the same assumptions as in Theorem 3.8, assume additionally that H is a separable Hilbert space. Then the sequence (X_k) is a.s. weakly convergent (in the sense of Hilbert spaces) to a random variable X_µ ∈ C, depending on the initial distribution µ. Furthermore, P_C X_k → X_µ strongly a.s. as k → ∞.

Proof. Instead of a nullset N_c, which may depend on c ∈ C, as in the proof of Theorem 3.8, separability gives, with help of Lemma 2.4, that there is a nullset N such that on Ω \ N Eq. (12) is satisfied for all c ∈ C. This implies a.s. Fejér monotonicity of (X_k) with respect to C. Since from Theorem 3.8 it follows that weak cluster points of (X_k) are contained in C a.s., we can now apply the theory developed in [8] for Fejér monotone sequences: from [8, Theorem 5.5] (a Fejér monotone sequence w.r.t. C that has all its weak cluster points in C is weakly convergent to a point in C) it follows that X_k ⇀ X_µ ∈ C a.s.

For strong convergence of (P_C X_k) a.s. we apply [8, Proposition 5.7]. From [8, Corollary 5.8] we get from X_k ⇀ X_µ a.s. that P_C X_k → X_µ strongly a.s. as k → ∞.

Example 3.10 (convergence to projection for affine subspaces). Let H be separable and C_i be an affine subspace, i ∈ I, where I is an arbitrary index set. Let T_i = P_i be the projector onto C_i. Under the standing assumptions it holds that lim_k X_k = X_µ = P_C X_0 for X_0 ∼ µ and any µ ∈ P(H).

We show that P_C X_{k+1} = P_C X_k for any k ∈ N. This allows us to conclude that P_C X_k = P_C X_0 for any k ∈ N, and thus P_C X_0 is the only possible weak cluster point of (X_k) by Theorem 3.9. Using the characterization [15, Theorem 4.1] (if K ⊂ H is nonempty, closed and convex and u ∈ K, then ⟨x − u, k − u⟩ ≤ 0 for all k ∈ K iff u = P_K x) of a projection, we find with help of [15, Theorem 4.9] (for a subspace S it holds that ⟨x − P_S x, s⟩ = 0 for all s ∈ S) that for c ∈ C it holds that

⟨X_{k+1} − P_C X_k, c − P_C X_k⟩ = ⟨P_{ξ_k} X_k − X_k, c − P_C X_k⟩ + ⟨X_k − P_C X_k, c − P_C X_k⟩ ≤ 0,

where the first term vanishes and the second is nonpositive. Hence by [15, Theorem 4.1] we have that P_C X_{k+1} = P_C X_k.

3.4. Geometric convergence

We will assume in this section that H is a separable Hilbert space and T_i is α_i-averaged, i ∈ I. We will furthermore assume that α_i ≤ α for some α < 1.
As with the deterministic case, geometric convergence of the algorithm can be analyzed by introducing a condition on the set of fixed points. In the context of set feasibility with finitely many sets, the condition is equivalent to linear regularity of the sets [2, Assumption 2]: there exists κ > 0 such that

dist²(x, C) ≤ κ R(x)   ∀ x ∈ H.   (14)

In the more general context of fixed point mappings, this property is more appropriately called global metric subregularity of R at all points in C for 0 [16]; in particular, there exists a κ > 0 such that

dist²(x, R^{-1}(0)) ≤ κ R(x)   ∀ x ∈ H.

Here C = R^{-1}(0), so the above is just another way of writing (14). The smallest constant satisfying this inequality will be called the regularity constant; it is given by

sup_{x∈H\C} dist²(x, C) / R(x).
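The regularity constant can be probed numerically. The sketch below (our own illustration) estimates R by Monte Carlo for the family of lines through the origin treated in Example 4.8 below, and compares the supremum of dist²(x, C)/R(x) over unit vectors with the closed form 2β/(β − sin β) derived there:

```python
import numpy as np

rng = np.random.default_rng(4)
beta = np.pi / 2

def R(x, n=200_000):
    """Monte Carlo estimate of R(x) = E||x - P_alpha x||^2, alpha ~ unif[0, beta]."""
    a = rng.uniform(0.0, beta, size=n)
    normals = np.stack([np.sin(a), -np.cos(a)], axis=1)  # unit normals of the lines
    return np.mean((normals @ x) ** 2)

# sup of dist^2(x, C)/R(x) over unit vectors x (here C = {0}, so dist(x, C) = 1)
ratios = [1.0 / R(np.array([np.cos(t), np.sin(t)])) for t in np.linspace(0, np.pi, 60)]
print(max(ratios), 2 * beta / (beta - np.sin(beta)))     # the two values agree
```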
Theorem 3.11. In addition to the standing assumptions, suppose the regularity condition in Eq. (14) is satisfied and T_i is α_i-averaged, i ∈ I, with α_i ≤ α for some α < 1. Then the RFI converges geometrically in expectation to the fixed point set, i.e. for any initial distribution,

E[dist(X_{k+1}, C)] ≤ √(1 − κ^{-1}(1−α)/α) · E[dist(X_k, C)]   ∀ k ∈ N.   (15)

Proof.
Revisiting (12) in the proof of Theorem 3.8 gives us, for ω ∈ Ω \ N (N given by (5)) and x = P_C X_k(ω),

dist²(X_{k+1}(ω), C) ≤ ‖X_{k+1}(ω) − x‖² ≤ dist²(X_k(ω), C) − ((1−α)/α)‖X_{k+1}(ω) − X_k(ω)‖².

With help of Jensen's inequality and concavity of x
↦ √x on [0, ∞), we get that

E[dist(X_{k+1}, C) | F_k] ≤ E[ √( dist²(X_k, C) − ((1−α)/α)‖T_{ξ_k} X_k − X_k‖² ) | F_k ]
  ≤ √( dist²(X_k, C) − ((1−α)/α) E[‖T_{ξ_k} X_k − X_k‖² | F_k] )
  = √( dist²(X_k, C) − ((1−α)/α) R(X_k) )
  ≤ √(1 − κ^{-1}(1−α)/α) · dist(X_k, C).

Note that it could be that E[dist(X_k, C)] = ∞ for all k ∈ N, depending on the initial distribution µ.

The next theorem concerns the Wasserstein distance of two probability measures. For two measures ν_1, ν_2 ∈ P(G) this is given by

W(ν_1, ν_2) = inf_{Y_1 ∼ ν_1, Y_2 ∼ ν_2} E[‖Y_1 − Y_2‖].

Theorem 3.12 (strong convergence and geometric convergence of measures). Under the standing assumptions, suppose the regularity condition in Eq. (14) is satisfied and T_i is α_i-averaged, i ∈ I, with α_i ≤ α for some α < 1. Then X_k → X strongly a.s. as k → ∞, and the Wasserstein distances W(L(X_k), L(X)) also converge geometrically: there is r ∈ (0, 1) such that

W(L(X_k), L(X)) ≤ 2 r^k W(L(X_0), L(X)).

Proof.
See also [8, Theorem 5.12]. One has a.s. that

‖X_k − X_{k+m}‖ ≤ ‖X_k − P_C X_k‖ + ‖P_C X_k − X_{k+m}‖ ≤ 2 dist(X_k, C) ≤ 2√(κ R(X_k)).

We used here that T_ξ is nonexpansive and satisfies T_ξ c = c for any c ∈ C a.s.; hence

‖P_C X_k − X_{k+m}‖ = ‖T_{ξ_{k+m−1}} ⋯ T_{ξ_k} P_C X_k − X_{k+m}‖ ≤ dist(X_k, C).

This gives us that (X_k) is a Cauchy sequence a.s., since R(X_k) → 0 a.s.; its limit X is contained in C, since its weak limit needs to coincide with the strong limit. Letting m → ∞, one arrives at ‖X_k − X‖ ≤ 2 dist(X_k, C). Taking the expectation yields E[‖X_k − X‖] ≤ 2 E[dist(X_k, C)]. Hence, using Theorem 3.11 gives us

E[‖X_k − X‖] ≤ 2 r^k E[dist(X_0, C)]   with r = √(1 − κ^{-1}(1−α)/α),

and using the fact that E[dist(X_0, C)] ≤ W(L(X_0), L(X)), we have, by the definition of the Wasserstein distance,

W(L(X_k), L(X)) ≤ 2 r^k W(L(X_0), L(X)).

Note that it could be that W(L(X_0), L(X)) = ∞, depending on the initial distribution µ.

Remark 3.13 (ε-fixed points): In order to ensure that, with probability greater than 1 − β, the k-th iterate is in an ε-neighborhood of the feasible set C, it is sufficient that k ≥ ln( βε/√(κR(x)) ) / ln(c), where c = √(1 − κ^{-1}(1−α)/α) and X_0 ∼ δ_x. To see this, note that, by Markov's inequality,

P(X_k ∈ C + εB(0,1)) ≥ P(dist(X_k, C) < ε) = 1 − P(dist(X_k, C) ≥ ε) ≥ 1 − E[dist(X_k, C)]/ε ≥ 1 − c^k dist(x, C)/ε ≥ 1 − c^k √(κR(x))/ε.

Remark 3.14: As seen in Example 3.10, the probability P(X_k ∈ C) can increase to 1 as k → ∞, but this is not necessarily the case, as we will see in Examples 4.7 and 4.9. There, one finds that P(X_k ∈ C) = P(X_0 ∈ C) for all k ∈ N; the same holds in Example 4.8.
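The contraction of E[dist(X_k, C)] in Theorem 3.11 can be observed directly. The following sketch (our own illustration) runs many independent RFI paths for the lines-through-the-origin family and prints the successive ratios E[dist(X_{k+1}, C)]/E[dist(X_k, C)], which stay bounded away from 1:

```python
import numpy as np

rng = np.random.default_rng(5)
beta, n_paths, n_steps = np.pi / 2, 5000, 12

X = np.tile([1.0, 1.0], (n_paths, 1))              # all paths start at x = (1, 1)
dists = [np.mean(np.linalg.norm(X, axis=1))]       # dist(x, C) = ||x||, since C = {0}
for _ in range(n_steps):
    a = rng.uniform(0.0, beta, size=n_paths)       # one xi_k per path
    e = np.stack([np.cos(a), np.sin(a)], axis=1)
    X = e * np.sum(e * X, axis=1, keepdims=True)   # project each path onto its line
    dists.append(np.mean(np.linalg.norm(X, axis=1)))
print(np.array(dists[1:]) / np.array(dists[:-1]))  # empirical geometric rate
```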
Theorem 3.15 (necessary and sufficient conditions for geometric convergence). Under the standing assumptions, let T_i be α_i-averaged, i ∈ I, with α_i ≤ α for some α < 1. The regularity condition in Eq. (14) is satisfied if and only if there exists r ∈ [0, 1) such that

E[dist(T_ξ x, C)] ≤ r dist(x, C)   ∀ x ∈ H.   (16)

Furthermore, condition Eq. (14) is necessary and sufficient for geometric convergence in expectation of Algorithm 1 to the fixed point set C as in Eq. (15), with a uniform constant for all initial probability measures.

Proof. Eq. (14) implies Eq. (15), which in turn implies Eq. (16) (with X_0 ∼ δ_x), by Theorem 3.11 with r = √(1 − κ^{-1}(1−α)/α). The other implication follows the same proof pattern as [6, Theorem 3.11]. We note that, by Theorem A.9, if X_0 ∼ δ_x for x ∈ H, then E[‖X_1 − X_0‖² | ξ] = ‖T_ξ x − x‖²; hence by Hölder's inequality

E[‖X_1 − X_0‖] ≤ √(R(x)).

Furthermore we can estimate

‖X_1 − X_0‖ = ‖X_1 − P_C X_0 + P_C X_0 − X_0‖ ≥ dist(X_0, C) − dist(X_1, C).

Taking the expectation above, the assumption that E[dist(X_1, C)] ≤ r E[dist(X_0, C)] yields

R(x) ≥ (1 − r)² dist²(x, C)   (∀ x ∈ H),

i.e. the constant κ in Eq. (14) is finite with κ ≤ (1 − r)^{-2} < ∞. So Eq. (16) implies Eq. (14).

For the last implication of the theorem, note that, in case Eq. (15) is satisfied with the same constant r ∈ (0, 1) for all Dirac measures δ_x with x ∈ H, then Eq. (16) also holds (letting X_0 ∼ δ_x), and hence by the above equivalence Eq. (14) is satisfied. This completes the proof.

Remark 3.16:
Conventional analytical strategies invoke strong convexity in order to achieve geometric convergence. Our analysis makes no such assumption on the sets C_i. Theorem 3.15 shows that geometric convergence is a by-product, mainly, of the regularity of the set of fixed points. The results of [6] indicate that one could formulate a necessary regularity condition for sublinear convergence, which also might be useful for stochastic algorithms.
4. Applications
We specialize the framework above to several well-known settings: consistent convex feasibility, linear operator equations, and in particular Hilbert–Schmidt operators (i.e. linear integral equations).

4.1. Feasibility and stochastic projections
There are many algorithms for solving convex feasibility problems. We focus on the (conceptually) simplest of these, namely stochastic projections. In the context of Algorithm 1, T_i = P_i is a projector onto a nonempty closed and convex set C_i ⊂ H, i ∈ I, with H a Hilbert space. Note that projectors are 1/2-averaged operators [8, Proposition 4.8] (also referred to as firmly nonexpansive operators), so α_i = 1/2 for all i ∈ I; we then can choose the upper bound α = 1/2 as well. Also note that Fix P_i = C_i, i ∈ I.

As a first assertion we give an equivalent characterization of the regularity property in Eq. (14) using just properties of R. This characterization, known as the Kurdyka–Łojasiewicz (KL) property, eliminates the term with the distance to the usually unknown fixed point set C, but one needs to be able to compute the first derivative of the function R. For convex sets this is unproblematic since R is the expectation of the squared distances to the convex sets C_i; see Lemma A.11.

Definition 4.1 (KL property). A convex, continuously differentiable function f : H → R with inf_x f(x) = 0 and S := argmin f ≠ ∅ is said to have the global KL property if there exists a concave, continuously differentiable function φ : R_+ → R_+ with φ(0) = 0 and φ' > 0 such that

φ'(f(x)) ‖∇f(x)‖ ≥ 1   ∀ x ∈ H \ S.

The following theorem is a direct consequence of [17].
Proposition 4.2 (equivalent characterization of Eq. (14)). Under the standing assumptions, let T_i = P_i be projectors onto nonempty, closed and convex sets, i ∈ I. Then the regularity condition in Eq. (14) is satisfied with κ > 0 if and only if

R(x) ≤ (κ/4) ‖∇R(x)‖²   ∀ x ∈ H,

i.e. R has the global KL property.

Proof. Apply [17, Corollary 6] with φ(s) := √(κs) and f = R, and note that R is convex and differentiable (see Lemma A.11).

Theorem 4.3 (uniform bounds). Under the standing assumptions, suppose the regularity condition in Eq. (14) is satisfied, that H is separable, and that T_i = P_i are projectors onto nonempty, closed and convex sets, i ∈ I. Then the probability of any infeasible point landing in a randomly selected set is uniformly bounded away from 1, i.e. P(x ∈ C_ξ) ≤ r < 1 for all x ∈ H \ C.

Proof. It holds surely, for all x ∈ H, that

dist(P_ξ x, C) ≥ dist(x, C) − dist(x, C_ξ).

This, together with the expectation

E[dist(x, C_ξ)] = E[dist(x, C_ξ) 1_{x∉C_ξ}] ≤ E[dist(x, C) 1_{x∉C_ξ}] = dist(x, C)(1 − P(x ∈ C_ξ)),

yields, for X_0 ∼ δ_x,

E[dist(X_1, C)] ≥ P(x ∈ C_ξ) dist(x, C).

Hence by Theorem 3.15,

1 > r := sup_{x∈H\C} E[dist(X_1, C)]/dist(x, C) ≥ sup_{x∈H\C} P(x ∈ C_ξ).

Theorem 4.4 (finite vs. infinite convergence). Under the standing assumptions, let H be separable and let T_i = P_i be projectors (i ∈ I). Then one of the following holds:
(i) P(X_1 ∈ C) = 1 and P(X_n ∈ C) = 1 for all n ∈ N;
(ii) P(X_1 ∈ C) < 1 and P(X_n ∈ C) < 1 for all n ∈ N.

Proof. (i) If P(X_1 ∈ C) = 1, then X_k = X_1 a.s. for all k ≥ 1.

(ii) Since ∫ p(x, C) µ(dx) = P(X_1 ∈ C) < 1, there exists x ∈ supp µ \ C with p(x, C) < 1,
where µ is the initial distribution. Since p(x, H \ C) > 0,
there exists y ∈ supp p(x, ·) \ C. Then by Theorem A.3 this implies that p(x, B(y, ε)) > 0 for all ε > 0. We claim that

(∀ ε > 0)(∀ z ∈ B(y, ε))   p(z, B(y, 2ε)) ≥ p(x, B(y, ε)) > 0.   (17)

To see this, note that, for ω ∈ M(ε) := { ω ∈ Ω | P_{ξ(ω)} x ∈ B(y, ε) }, we have

‖P_{ξ(ω)} z − y‖ ≤ ‖P_{ξ(ω)} z − P_{ξ(ω)} y‖ + ‖P_{ξ(ω)} y − y‖ ≤ ‖z − y‖ + ‖P_{ξ(ω)} y − y‖ ≤ ‖z − y‖ + ‖P_{ξ(ω)} x − y‖ ≤ 2ε.

Here we have used nonexpansiveness of P_ξ and the definition of a projection. Now (17) follows from the identity P(M(ε)) = p(x, B(y, ε)). Similarly,

(∀ ε > 0)(∀ w ∈ B(x, ε))   p(w, B(y, 2ε)) ≥ p(x, B(y, ε)) > 0.   (18)

To see this, note that for ω ∈ M(ε) we have

‖P_{ξ(ω)} w − y‖ ≤ ‖P_{ξ(ω)} w − P_{ξ(ω)} x‖ + ‖P_{ξ(ω)} x − y‖ ≤ ‖w − x‖ + ε ≤ 2ε.

Now, fix ε > 0 such that B(y, 2ε) ∩ C = ∅ and B(x, 2ε) ∩ C = ∅. We get, for any w ∈ H and n ∈ N,

p^{n+1}(w, B(y, 2ε)) ≥ ∫_{B(y,ε)} p(z, B(y, 2ε)) p^n(w, dz) ≥ p(x, B(y, ε)) p^n(w, B(y, ε)).

So iteratively, denoting ε_n := 2^{-n} ε, we arrive at

p^{n+1}(w, B(y, 2ε)) ≥ [ ∏_{i=1}^{n} p(x, B(y, ε_i)) ] p(w, B(y, ε_n)).

The last probability can be estimated, for w ∈ B(x, ε_{n+1}), by (18) through

p(w, B(y, ε_n)) ≥ p(x, B(y, ε_{n+1})).

Hence p^{n+1}(w, B(y, 2ε)) is locally uniformly bounded from below for w ∈ B(x, ε_{n+1}). That implies

P(X_{n+1} ∈ H \ C) = ∫_H p^{n+1}(w, H \ C) µ(dw) ≥ ∫_{B(x, ε_{n+1})} p^{n+1}(w, B(y, 2ε)) µ(dw) ≥ [ ∏_{i=1}^{n+1} p(x, B(y, ε_i)) ] µ(B(x, ε_{n+1})) > 0,

i.e. P(X_n ∈ C) < 1 for all n ∈ N, as claimed.

Remark 4.5:
Theorem 4.4 can be interpreted as a lower bound on the complexity of the RFI, analogous to the deterministic case [6, Theorem 5.2], where the alternating projection algorithm converges either after one iteration or after infinitely many. Alternatively, the stopping or hitting time of a process is defined as

T := inf { n | X_n ∈ C }.

In this context, Theorem 4.4 says that either P(T = 1) = 1 or P(T = n) < 1 for all n ∈ N. Note that it could happen that P(T = ∞) = 1, in which case P(T = n) = 0 for all n ∈ N.

Example 4.6 (finite and infinite convergence). With just two sets, the deterministic alternating projections algorithm can converge in finitely many steps. But when the projections onto the respective sets are randomly selected, convergence might only come after infinitely many steps. For example, let C_1 = R_+ × R and C_2 = R × R_+, and P(ξ = 1) = 0.5 = P(ξ = 2), so that C = C_1 ∩ C_2 = R_+ × R_+. Set µ = δ_x, where x = (−1, −1). Then P(X_1 ∈ C) = 0, or more generally P(X_n ∈ C) = 1 − 0.5^n − 0.5^n < 1 for all n ∈ N. Now let P(ξ = 1) = 1 and P(ξ = 2) = 0; then C = C_1, and for µ as above, P(X_1 ∈ C) = 1 and so P(X_n ∈ C) = 1. (A numerical check of these probabilities is sketched below.)
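The probabilities in Example 4.6 are easy to confirm by simulation; the following sketch (our own illustration) estimates P(X_n ∈ C) and compares it with 1 − 0.5^n − 0.5^n:

```python
import numpy as np

rng = np.random.default_rng(6)
n_paths = 200_000

X = np.tile([-1.0, -1.0], (n_paths, 1))
rows = np.arange(n_paths)
for n in range(1, 7):
    i = rng.integers(2, size=n_paths)         # pick C_1 or C_2 for each path
    X[rows, i] = np.maximum(X[rows, i], 0.0)  # project onto the chosen half-plane
    print(n, np.mean((X >= 0).all(axis=1)), 1 - 2 * 0.5 ** n)
```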
Example 4.7 (no uniform geometric convergence). In this example we show a sublinear convergence rate for infinitely many overlapping intervals. This is in contrast to the convergence properties of finitely many intervals with nonempty interior, where one would expect a geometric rate.

Let ξ ∼ unif[ε − 1, 1 − ε] for some ε ∈ [0, 1). Define the nonempty and closed intervals C_r = [r − 1, r + 1], r ∈ R. (See Figure 1.) The projector onto these intervals is given by

P_r x = r + 1 if x ≥ r + 1,  P_r x = r − 1 if x ≤ r − 1,  and P_r x = x if r − 1 ≤ x ≤ r + 1.

The density ρ_ε of ξ is

ρ_ε(y) = (1/(2(1 − ε))) 1_{[ε−1, 1−ε]}(y).

One can compute

R_ε(x) = E[|P_ξ x − x|²] = ∫_R |P_r x − x|² ρ_ε(r) dr = (1/(2(1 − ε))) ∫_{ε−1}^{1−ε} |P_r x − x|² dr,

which for ε ≤ |x| ≤ 2 − ε evaluates to (|x| − ε)³/(6(1 − ε)).

Now, let us examine regularity properties. For the case ε ∈ [0, 1), the problem is a consistent feasibility problem with C = [−ε, ε]. While the regularity condition in Eq. (14) is trivially satisfied for |x| ≤ ε, for ε ≤ |x| ≤ 2 − ε we find R_ε(x) = (|x| − ε)³/(6(1 − ε)) and dist²(x, C) = (|x| − ε)². So the regularity property in Eq. (14) is not satisfied for any κ > 0, since dist²(x, C)/R_ε(x) = 6(1 − ε)/(|x| − ε) → ∞ as |x| ↓ ε; consequently, by Theorem 3.15, there is no r ∈ [0, 1) with E[dist(X_{k+1}, C)] ≤ r E[dist(X_k, C)] for all X_0 ∼ δ_x, x ∈ H.
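The blow-up of dist²(x, C)/R_ε(x) near the boundary of C in Example 4.7 can be checked numerically; the sketch below (our own illustration) uses the closed form of P_r and estimates R_ε by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(7)

def R_eps(x, eps, n=400_000):
    """Monte Carlo estimate of R(x) = E|P_xi x - x|^2, xi ~ unif[eps-1, 1-eps]."""
    r = rng.uniform(eps - 1.0, 1.0 - eps, size=n)
    return np.mean((np.clip(x, r - 1.0, r + 1.0) - x) ** 2)

eps = 0.1                          # feasible set C = [-eps, eps]
for x in [0.11, 0.2, 0.5]:
    ratio = (abs(x) - eps) ** 2 / R_eps(x, eps)
    print(x, ratio)                # grows without bound as x approaches eps
```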
Example 4.8 (uniform geometric convergence). We provide here a concrete example where geometric convergence of the RFI is achieved. This is somewhat surprising since the angle between the sets can become arbitrarily small. In the deterministic setting, this results in arbitrarily slow convergence of the algorithm. This provides some intuition for why stochastic algorithms can outperform deterministic variants.

Let C_α := R e_α with e_α = (cos(α), sin(α))ᵀ, α ∈ [0, π), and let ξ ∼ unif[0, β], where β ∈ (0, π). (See Figure 2.)

We have C = {0}, so dist(x, C) = ‖x‖, and the density of ξ is ρ_β(α) = (1/β) 1_{[0,β]}(α). The projector onto the linear subspace C_α of R² is given by

P_α(x) = x − ⟨ (sin(α), −cos(α))ᵀ, x ⟩ (sin(α), −cos(α))ᵀ.

We find then

R_β(x) = (1/β) ∫_0^β ‖P_α x − x‖² dα = (1/β) ∫_0^β (x_1 sin(α) − x_2 cos(α))² dα
       = (1/β) [ x_1² (β/2 − sin(β)cos(β)/2) + x_2² (β/2 + sin(β)cos(β)/2) − x_1 x_2 sin²(β) ].

Writing x = λe_α with λ ≥ 0, so that dist(x, C) = λ and R_β(x) = λ² R_β(e_α), and employing trigonometric calculation rules, we find the regularity constant in Eq. (14) to be

κ = sup_{x∈R²\C} dist²(x, C)/R_β(x) = sup_{α∈[0,π)} 4β / (2β − sin(2β − 2α) − sin(2α)) = 2β/(β − sin(β)),

where the last supremum is attained at α = β/2. So from Theorem 3.11 we get uniform geometric convergence.

Example 4.9 (disks on a circle). This example illustrates Theorem 4.3. Let C_α := B(ρe_α, 1) ⊂ R², where 0 < ρ < 1, e_α = (cos(α), sin(α))ᵀ, α ∈ [0, 2π), and let ξ ∼ unif[0, 2π]. (See Figures 3 and 4.)
The intersection is given by C = B(0, 1 − ρ). We show next that sets with this configuration do not satisfy (14). To see this we show that there is a sequence (x_n)_n ⊂ R² with P(x_n ∈ C_ξ) → 1 as n → ∞. By Theorem 4.3 we conclude that (14) cannot hold. Indeed, let x = x(λ) = λ(1, 0)ᵀ with λ ≥ 1 − ρ; then

P(x ∈ C_ξ) = (1/2π) ∫_0^{2π} 1_{‖x − ρe_α‖ ≤ 1} dα = (1/2π) ∫_0^{2π} 1_{λ² + ρ² − 2λρ cos(α) ≤ 1} dα = (1/2π) ∫_{−β}^{β} dα = β/π,

where β = β(λ) = cos^{-1}( (λ² + ρ² − 1)/(2λρ) ), if λ ≤ 1 + ρ. We have β(λ) → π as λ ↓ 1 − ρ, so P(x(λ) ∈ C_ξ) → 1 as λ ↓ 1 − ρ.

In contrast to the case where ρ ∈ (0, 1), the cases ρ = 0 and ρ = 1 do satisfy (14). This will not be shown here.

4.2. RFI for two families of mappings

The set feasibility examples above lead very naturally to the more general context of mappings T_i : G → G, i ∈ I, and S_j : G → G, j ∈ J, on a metric space (G, d), where
I, J are arbitrary index sets. Here we envision the scenario where

C_T := { x ∈ G | P(x ∈ Fix T_ξ) = 1 } and C_S := { x ∈ G | P(x ∈ Fix S_ζ) = 1 }

are distinctly different sets, possibly nonintersecting. Let (Ω, F, P) be a probability space and let ξ : Ω → I, ζ : Ω → J be two random variables. Let (ξ_n)_{n∈N} be an i.i.d. sequence with ξ_n =_d ξ, and (ζ_n)_{n∈N} i.i.d. with ζ_n =_d ζ. The two sequences are assumed to be independent of each other. Let µ be a probability measure on (G, B(G)). Consider the stochastic selection method for two families of mappings, Algorithm 2.
Algorithm 2: RFI for two families of mappings
Initialization: X_0 ∼ µ
for k = 0, 1, 2, ... do
    X_{k+1} = S_{ζ_k} T_{ξ_k} X_k
return (X_k)_{k∈N}

Note that this structure of two families of mappings is a special case of Algorithm 1: just set T̃_{(i,j)} = S_j T_i, where (i, j) ∈ Ĩ = I × J and ξ̃ = (ξ, ζ). Also the Markov chain property is still satisfied; the transition kernel takes the form p(x, A) = P(S_ζ T_ξ x ∈ A) for x ∈ G and A ∈ B(G). An advantage of this formulation is that properties of the two families {S_j}_{j∈J} and {T_i}_{i∈I} can be analyzed more specifically, and independently. As long as the mapping T̃ satisfies the properties needed for convergence of the RFI, then convergence of the RFI for two families of mappings follows. At the very least, we need

C := { x ∈ G | P(x ∈ Fix T̃_{ξ̃}) = 1 } ≠ ∅.

From this it is easy to see that for convergence the set C_T could be empty, but the set C_S must be nonempty.

Example 4.10 (consistent feasibility). Revisit Example 4.6. We had C_1 = R_+ × R and C_2 = R × R_+. Set I = {1} and J = {2}; then the algorithm is the deterministic alternating projections method. One has P(X_1 ∈ C) = 1 for all initial distributions.

Example 4.11 (inconsistent stochastic feasibility). In this example we show that the framework established here is not exclusively limited to consistent feasibility. Consider the (trivially convex, nonempty and closed) set S := {(0, 2)} together with the collection of balls in Example 4.9, C_α := B(ρe_α, 1) ⊂ R², where 0 ≤ ρ < 1, e_α = (cos(α), sin(α))ᵀ, α ∈ [0, 2π), and let ξ ∼ unif[0, 2π]. The intersection of the disks is given by C_T = B(0, 1 − ρ), where T_α := P_{C_α} is the metric projection onto C_α. Although S ∩ C_α = ∅ for all α ∈ [0, 2π), still the fixed point set for the mapping in Algorithm 2 (where S_ζ ≡ P_S) is C = {(0, 2)}, and this is found after one iteration for any initial probability distribution µ, where X_0 ∼ µ.

This is indeed a special example, but it points to the richness of inconsistent stochastic feasibility, which will be studied in greater depth in a follow-up paper.

4.3. Linear operator equations

There are several applications of the RFI to the feasibility problem [1], [4]. We want to focus first on linear operator equations in the separable Hilbert space H = L²([a,b]). Let T : H → H be a bounded linear operator; we want to find x ∈ H such that

T x = g,

for a given g ∈ H. Clearly this is possible only if g ∈ R(T), the range of T. The idea in [4] to solve T x = g is to consider the family of evaluation mappings φ_t : H → R, t ∈ [a,b], which are given by

φ_t(x) := (T x)(t).

Define C_t := { x ∈ H | φ_t(x) = g(t) }, t ∈ [a,b]. Consider the probability space (Ω, F, P) = ([a,b], B([a,b]), λ/(b−a)), where λ is the Lebesgue measure. Let ξ : (Ω, F, P) → ([a,b], B([a,b])) be a random variable with P_ξ = P = λ/(b−a). Then for g ∈ R(T) we have that T x = g if and only if x ∈ C := { y ∈ H | P(y ∈ C_ξ) = 1 }. So the linear operator equation becomes a stochastic feasibility problem.

In order to be able to compute projections onto the sets C_t, t ∈ [a,b], we need the evaluation functionals φ_t to be continuous, i.e. ‖φ_t‖ < ∞ for almost all t ∈ [a,b]. By the Riesz representation theorem there exists a unique u_t ∈ H with φ_t(x) = ⟨u_t, x⟩ for all x ∈ H and almost all t ∈ [a,b]. We conclude that the projection onto C_t takes the form

P_t x = x + ((g(t) − (Tx)(t)) / ‖u_t‖²) u_t,   x ∈ L²([a,b]).
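A discretized version of this projection method can be sketched as a randomized Kaczmarz-type iteration; the kernel, grid, and step details below are our own assumptions for illustration, not the paper's prescription:

```python
import numpy as np

rng = np.random.default_rng(8)
m = 200                                        # grid on [a, b] = [0, 1]
s = np.linspace(0.0, 1.0, m)
w = 1.0 / m                                    # quadrature weight for L^2 inner products

K = np.exp(-np.abs(s[:, None] - s[None, :]))   # hypothetical kernel K(t, s)
x_true = np.sin(2 * np.pi * s)
g = w * K @ x_true                             # consistent right-hand side g = T x_true

x = np.zeros(m)
for _ in range(20_000):
    t = rng.integers(m)                        # xi_k ~ unif[a, b], discretized
    u = K[t]                                   # Riesz representer u_t = K(t, .)
    x += (g[t] - w * u @ x) / (w * u @ u) * u  # P_t x = x + (g(t)-(Tx)(t)) u_t/||u_t||^2
print(np.max(np.abs(w * K @ x - g)))           # residual of T x = g is driven toward 0
```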
Example 4.12 (linear integral equations). Concretely, consider an integral equation of the first kind in the separable Hilbert space L²([a,b]),

(Tx)(t) = ∫_a^b K(t,s) x(s) ds = g(t),  t ∈ [a,b],

with g ∈ L²([a,b]). For K ∈ L²([a,b] × [a,b]), T is a continuous linear compact operator [18, Theorem 8.15] (a Hilbert–Schmidt operator). For the Riesz representation of the evaluation functionals we have that φ_t(x) = (Tx)(t) = ⟨u_t, x⟩, t ∈ [a,b], with u_t = K(t, ·), and hence ‖φ_t‖ ≤ ‖K(t, ·)‖ < ∞.

Example 4.13 (differentiation). Let K(t,s) = 1_{[a,t]}(s) = u_t(s), i.e. (Tx)(t) = ∫_a^t x(s) ds, and suppose g ∈ C¹([a,b]); then T x = g if and only if x = g′ almost surely and g(a) = 0. The projectors take the form

P_t x = x + ((g(t) − ∫_a^t x(s) ds) / (t − a)) 1_{[a,t]}.

A. Appendix
Lemma A.1 (slices of product σ-field, see Proposition 3.3.2 in [19]). Let (Ω_i, F_i), i = 1, 2, be two measurable spaces and M ∈ F_1 ⊗ F_2. Then for ω_1 ∈ Ω_1 it holds that M_{ω_1} := { ω_2 ∈ Ω_2 | (ω_1, ω_2) ∈ M } ∈ F_2.

Theorem A.2 (dense sets in separable metric space). Let (G, d) be a Polish space (complete separable metric space). Then for any A ⊂ G there is a countable subset {a_n}_{n∈N} ⊂ A that is dense in A, and if A is closed then even A = cl{a_n} (cl U denotes the closure of the set U ⊂ G w.r.t. the metric d).

Proof. Since G is separable there exists a dense and countable subset {u_n}_{n∈N} ⊂ G with G = cl{u_n}_n. By denseness of {u_n} ⊂ G, for any x ∈ G and any ε > 0
For $\varepsilon > 0$ choose $a_n^{\varepsilon} \in B(u_n, \varepsilon) \cap A$, $n \in \mathbb{N}$, whenever the intersection is nonempty. The set $\tilde{A} := \{a_n^{1/m}\}_{n,m\in\mathbb{N}} \subset A$ is nonempty and countable as a countable union of countable sets. For any $a \in A$ and any $\varepsilon > 0$ there exist $n, m$ with $1/m < \varepsilon/2$ and $d(a, u_n) < 1/m$, hence
$$d(a, a_n^{1/m}) \leq d(a, u_n) + d(u_n, a_n^{1/m}) < \frac{2}{m} < \varepsilon,$$
i.e. $\tilde{A} \subset A$ is dense in $A$. Thus $A \subset \operatorname{cl}\tilde{A}$, and if $A$ is closed, then also $\operatorname{cl}\tilde{A} \subset A$. □

Theorem A.3 (support of a measure). Let $(G, d)$ be a Polish space and $\mathcal{B}(G)$ its Borel σ-algebra. Let $\pi$ be a measure on $(G, \mathcal{B}(G))$ and define its support via
$$\operatorname{supp} \pi = \{ x \in G \mid \pi(B(x, \varepsilon)) > 0 \ \forall\, \varepsilon > 0 \}.$$
Then:
1. $\operatorname{supp} \pi \neq \emptyset$ if $\pi \neq 0$.
2. $\operatorname{supp} \pi$ is closed.
3. $\pi(A) = \pi(A \cap \operatorname{supp} \pi)$ for all $A \in \mathcal{B}(G)$, i.e. $\pi((\operatorname{supp} \pi)^c) = 0$.
4. For any closed set $S \subset G$ with $\pi(A \cap S) = \pi(A)$ for all $A \in \mathcal{B}(G)$ it holds that $\operatorname{supp} \pi \subset S$.

Proof.
1. If $\pi(G) > 0$, then by separability one can find, for any $\varepsilon > 0$, a countable cover of $G$ by balls of radius $\varepsilon$, at least one of which must have nonzero measure, because $0 < \pi(G) \leq \sum_n \pi(B(x_n, \varepsilon))$. Consider such a ball $B(x_1, \varepsilon_1)$ with $\pi(B(x_1, \varepsilon_1)) > 0$. Choosing $\varepsilon_{n+1} < \varepsilon_n/2$ iteratively, one obtains a sequence $\varepsilon_n \to 0$ and balls of positive measure with $B(x_{n+1}, \varepsilon_{n+1}) \subset B(x_n, \varepsilon_n)$, so that $x_n \to x$ and $x \in \operatorname{supp}\pi$.

2. Let $(x_n)_{n\in\mathbb{N}} \subset \operatorname{supp}\pi$ with $x_n \to x$ as $n \to \infty$. Let $\varepsilon > 0$ and choose $N$ such that $d(x_n, x) < \varepsilon$ for all $n \geq N$. Then $x_n \in B(x, \varepsilon)$ and there exists $\tilde{\varepsilon} > 0$ with $B(x_n, \tilde{\varepsilon}) \subset B(x, \varepsilon)$, so we get
$$\pi(B(x, \varepsilon)) \geq \pi(B(x_n, \tilde{\varepsilon})) > 0,$$
i.e. $x \in \operatorname{supp}\pi$.

3. Write $S = (\operatorname{supp}\pi)^c$. Choose $\{x_n\}_{n\in\mathbb{N}} \subset S$ dense. By openness of $S$ there exist $\varepsilon_n > 0$ with $B(x_n, \varepsilon_n) \subset S$, hence $S = \bigcup_{n\in\mathbb{N}} B(x_n, \varepsilon_n)$ and
$$\pi(S) \leq \sum_{n\in\mathbb{N}} \pi(B(x_n, \varepsilon_n)) = 0.$$
(It holds that $\pi(B(x_n, \varepsilon_n)) = 0$, because otherwise one could find, for any small enough $\varepsilon > 0$, a countable cover of $B(x_n, \varepsilon_n)$ by balls of radius $\varepsilon$, at least one of which must have nonzero measure. Since this holds for all $\varepsilon$, arguing as in 1. yields a point of $\operatorname{supp}\pi$ in $B(x_n, \varepsilon_n)$, in contradiction to $B(x_n, \varepsilon_n) \subset S$.)

4. Let $x \in \operatorname{supp}\pi$. Then $\pi(B(x, \varepsilon) \cap S) = \pi(B(x, \varepsilon)) > 0$ for all $\varepsilon > 0$, i.e. $B(x, \varepsilon) \cap S \neq \emptyset$ for all $\varepsilon > 0$. Let $x_n \in B(x, \varepsilon_n) \cap S$, where $\varepsilon_n \to 0$ as $n \to \infty$. Then $x_n \to x$, and by closedness of $S$, $x \in S$. □

Theorem A.4 (construction of an invariant measure). Let $(\mu P^n)_{n\in\mathbb{N}}$ be a tight family of probability measures on a Polish space $(G, d)$, i.e. for any $\varepsilon > 0$ there exists a compact set $K_\varepsilon \subset G$ with $(\mu P^n)(G \setminus K_\varepsilon) < \varepsilon$ for all $n \in \mathbb{N}$, where $\mu \in \mathscr{P}(G)$ and $P$ is the Markov operator for a given transition kernel $p$, which is assumed to be Feller. Then any cluster point of $(\nu_n)$, where $\nu_n = \frac{1}{n} \sum_{i=1}^{n} \mu P^i$, is an invariant measure for $P$.

Proof. This is basically [20, Theorem 1.10]. The tightness of $(\mu P^n)$ implies tightness of $(\nu_n)$, and therefore there exists a weakly converging subsequence $(\nu_{n_k})$ with limit $\pi \in \mathscr{P}(G)$ by Prokhorov's Theorem. By the Feller property of $P$, for any continuous and bounded $f: G \to \mathbb{R}$ the function $P f$ is also continuous and bounded, and hence
$$|(\pi P) f - \pi f| = |\pi(P f) - \pi f| = \lim_k |\nu_{n_k}(P f) - \nu_{n_k} f| = \lim_k \frac{1}{n_k} \left| \mu P^{n_k + 1} f - \mu P f \right| \leq \lim_k \frac{2 \|f\|_\infty}{n_k} = 0. \qquad \square$$
(A small numerical illustration of this construction is given after Lemma A.6 below.)

Theorem A.5 (conditional expectation: basics, see Theorem 5.1 in [7]). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X$ a real-valued random variable with $\mathbb{E}|X| < \infty$ (integrable). Let $\mathcal{F}_0 \subset \mathcal{F}$ be a sub-σ-algebra. Then there exists an a.s. unique $\mathcal{F}_0$-measurable random variable $Z := \mathbb{E}[X \mid \mathcal{F}_0]$ with $\mathbb{E}(Z \mathbb{1}_A) = \mathbb{E}(X \mathbb{1}_A)$ for all $A \in \mathcal{F}_0$.
Let $Y, X_n$ be integrable random variables. Further properties are:
1. $\mathbb{E}(\mathbb{E}[X \mid \mathcal{F}_0]) = \mathbb{E} X$.
2. If $X$ is $\mathcal{F}_0$-measurable, then $\mathbb{E}[X \mid \mathcal{F}_0] = X$ a.s.
3. If $X$ is independent of $\mathcal{F}_0$, then $\mathbb{E}[X \mid \mathcal{F}_0] = \mathbb{E} X$ a.s.
4. $\mathbb{E}[a X + b Y \mid \mathcal{F}_0] = a\, \mathbb{E}[X \mid \mathcal{F}_0] + b\, \mathbb{E}[Y \mid \mathcal{F}_0]$ a.s. for all $a, b \in \mathbb{R}$.
5. If $X \leq Y$, then $\mathbb{E}[X \mid \mathcal{F}_0] \leq \mathbb{E}[Y \mid \mathcal{F}_0]$ a.s.
6. If $0 \leq X_n \uparrow X$, then $\mathbb{E}[X_n \mid \mathcal{F}_0] \uparrow \mathbb{E}[X \mid \mathcal{F}_0]$ a.s.
7. If $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \mathcal{F}$ are σ-algebras, then $\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{F}_2] \mid \mathcal{F}_1\big] = \mathbb{E}[X \mid \mathcal{F}_1]$.

Lemma A.6 (Satz 17.11 in [21]). Let $(\Omega, \mathcal{F})$ be a measurable space and $\mu$ a σ-finite measure. Let $f: \Omega \to [0, \infty]$ and set $\nu = f \cdot \mu$ (i.e. $\nu(A) = \int_A f \,\mathrm{d}\mu$ for $A \in \mathcal{F}$). Then $f$ is $\mu$-a.s. unique. Furthermore, $\nu$ is σ-finite (i.e. there exist $(\Omega_n)_{n\in\mathbb{N}} \subset \mathcal{F}$ with $\nu(\Omega_n) < \infty$ and $\Omega_n \uparrow \Omega$) if and only if $f$ is real-valued $\mu$-a.s.
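As announced after the proof of Theorem A.4, the following sketch illustrates the Cesàro-average construction on a finite state space, where $\mu P^n$ and $\nu_n$ are plain vector-matrix products, tightness is automatic, and the Feller property is trivial. The three-state kernel below is an arbitrary assumption made purely for illustration.

```python
import numpy as np

# A Markov transition matrix on three states (rows sum to 1); any
# irreducible example will do -- this particular kernel is made up.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])     # initial distribution

# Cesaro averages nu_n = (1/n) * sum_{i=1}^n mu P^i, as in Theorem A.4
nu = np.zeros(3)
muPi = mu.copy()
for i in range(1, 10001):
    muPi = muPi @ P                # mu P^i
    nu += (muPi - nu) / i          # running average

print(nu)          # cluster point of (nu_n)
print(nu @ P)      # invariance: nu P = nu (up to numerical error)
```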
Remark A.7: A nonnegative real-valued random variable $X$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ induces a σ-finite measure $\nu = X \cdot \mathbb{P}$.

Theorem A.8 (conditional expectation: extension). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X \geq 0$ a real-valued random variable. Let $\mathcal{F}_0 \subset \mathcal{F}$ be a sub-σ-algebra. Then there exists an a.s. unique nonnegative real-valued random variable $Z := \mathbb{E}[X \mid \mathcal{F}_0]$ on $(\Omega, \mathcal{F}_0)$ with $\mathbb{E}(Z \mathbb{1}_A) = \mathbb{E}(X \mathbb{1}_A)$ for all $A \in \mathcal{F}_0$.
If additionally $Y$ and $(X_n)$ are nonnegative and real-valued, then all items 1.–7. in Theorem A.5 are satisfied for these.

Proof. From Remark A.7 follows the existence of disjoint sets $\Omega_n \in \mathcal{F}$ with $\bigcup_n \Omega_n = \Omega$ and the property that $\int_{\Omega_n} X \,\mathrm{d}\mathbb{P} < \infty$. One has that a.s.
$$\mathbb{1}_{\Omega_n}\, \mathbb{E}[X \mid \mathcal{F}_0 \cap \Omega_n] = \mathbb{E}[X \mathbb{1}_{\Omega_n} \mid \mathcal{F}_0 \cap \Omega_n] = \mathbb{E}[X \mathbb{1}_{\Omega_n} \mid \mathcal{F}_0].$$
Define $Z := \sum_n \mathbb{E}[X \mathbb{1}_{\Omega_n} \mid \mathcal{F}_0]$; then $Z = \mathbb{E}[X \mid \mathcal{F}_0]$. Items 1.–7. now follow from Theorem A.5 on $\Omega_n$ and the Monotone Convergence Theorem. □

Theorem A.9 (disintegration). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ be measurable spaces and $\mathcal{F}_0 \subset \mathcal{F}$ a sub-σ-algebra. Let $X_1 \in \Omega_1$ have a regular version $\mu_1$ of $\mathbb{P}(X_1 \in \cdot \mid \mathcal{F}_0)$ and let $X_2 \in \Omega_2$ be $\mathcal{F}_0$-measurable. Let furthermore $f: \Omega_1 \times \Omega_2 \to \mathbb{R}_+$ be measurable. Then
$$\mathbb{E}[f(X_1, X_2) \mid \mathcal{F}_0] = \int f(x_1, X_2)\, \mu_1(\mathrm{d} x_1) \quad \text{a.s.}$$

Lemma A.10.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $(X_k)_{k\in\mathbb{N}}$, $(U_k)_{k\in\mathbb{N}}$ be sequences of nonnegative real-valued random variables with $X_k$ $\mathcal{F}_k$-measurable, where $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \dots \subset \mathcal{F}$ are σ-algebras. Suppose for all $k \in \mathbb{N}$
$$X_{k+1} \leq X_k - U_k \quad \text{a.s.}$$
Define $V_k := \mathbb{E}[U_k \mid \mathcal{F}_k]$ for $k \in \mathbb{N}$. Then $X_k \to X$ a.s. and $\sum_k U_k < \infty$, $\sum_k V_k < \infty$ a.s.

Proof. This is a special instance of the more general supermartingale convergence theorem in [22]. □
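As a sanity check on Lemma A.10, the following sketch simulates a recursion of the required form. The specific choice $U_k = W_k X_k$ with $W_k \sim \operatorname{unif}[0, 1/2]$ independent is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy recursion satisfying the hypotheses of Lemma A.10:
# X_{k+1} = X_k - U_k with U_k = W_k * X_k, W_k ~ unif[0, 1/2] independent,
# so X_k, U_k >= 0, X_k is F_k-measurable, and V_k = E[U_k | F_k] = X_k / 4.
x = 1.0
sum_u, sum_v = 0.0, 0.0
for k in range(1000):
    sum_v += x / 4.0                # V_k = E[U_k | F_k]
    u = rng.uniform(0.0, 0.5) * x   # U_k
    x, sum_u = x - u, sum_u + u

print(x, sum_u, sum_v)   # X_k converges; both partial sums remain bounded
```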
Lemma A.11 (further properties of $R$). Under the standing assumptions, suppose that $T_i = P_i$ are projectors onto nonempty, closed and convex sets $C_i$, $i \in I$. Then the following hold:
1. $R$ is convex.
2. $R$ is continuously differentiable with $\nabla R(x) = 2\big(x - \mathbb{E}[P_\xi x]\big)$ for all $x \in H$.
3. $\nabla R$ is globally Lipschitz continuous with constant not larger than $4$.
4. $C = \{\nabla R = 0\}$.

Proof.
1. The function $x \mapsto \operatorname{dist}(x, C_i)$ is convex for all $i \in I$, since $C_i = \operatorname{Fix} P_i$ is convex, nonempty and closed. On $[0, \infty)$ the function $t \mapsto t^2$ is increasing and convex, so $x \mapsto \operatorname{dist}^2(x, C_i)$ is convex, $i \in I$. The convexity of $R$ follows by linearity of the expectation.

2. We need to show that
$$\lim_{0 \neq \|y\| \to 0} \frac{\left| R(x+y) - R(x) - 2\,\mathbb{E}[\langle x - P_\xi x, y \rangle] \right|}{\|y\|} = 0.$$
Let $(y_n) \subset B(0, \varepsilon) \subset H$ with $y_n \to 0$. Define a sequence of functions on $\Omega$ via
$$f_n = \frac{\left| \operatorname{dist}^2(x + y_n, C_\xi) - \operatorname{dist}^2(x, C_\xi) - 2\langle x - P_\xi x, y_n \rangle \right|}{\|y_n\|}.$$
Then for fixed $\omega \in \Omega$ we have $f_n(\omega) \to 0$ as $n \to \infty$, since the function $x \mapsto \operatorname{dist}^2(x, C_i)$ is Fréchet differentiable for all $i \in I$ [8, Corollary 12.30]. Furthermore, we find for any $n \in \mathbb{N}$ that
$$f_n = \frac{\left| \|x + y_n - P_\xi(x+y_n)\|^2 - \|x - P_\xi x\|^2 - 2\langle x - P_\xi x, y_n \rangle \right|}{\|y_n\|} = \frac{\left| \|y_n - P_\xi(x+y_n) + P_\xi x\|^2 + 2\langle x - P_\xi x, P_\xi x - P_\xi(x+y_n) \rangle \right|}{\|y_n\|} \leq \frac{\|y_n\|^2 + \|P_\xi x - P_\xi(x+y_n)\|^2 + 2\left|\langle y_n + x - P_\xi x,\, P_\xi x - P_\xi(x+y_n) \rangle\right|}{\|y_n\|} \leq \frac{4\|y_n\|^2 + 2\operatorname{dist}(x, C_\xi)\|y_n\|}{\|y_n\|} \leq 4\varepsilon + 2\operatorname{dist}(x, C_\xi) =: g,$$
where, in the second inequality, we used the nonexpansivity of the projectors $P_i$, $i \in I$, and the Cauchy--Schwarz inequality. In particular, with Hölder's inequality it follows that $\mathbb{E}[g] \leq 4\varepsilon + 2\sqrt{R(x)}$, i.e. $g$ is integrable, and hence Lebesgue's Dominated Convergence Theorem yields $\mathbb{E}[f_n] \to 0$, which gives us Fréchet differentiability of $R$ with derivative $\nabla R(x) = 2\,\mathbb{E}[x - P_\xi x]$. Continuity of $\nabla R$ follows from
$$\|\nabla R(x+y) - \nabla R(x)\| = 2\,\|\mathbb{E}[y - P_\xi(x+y) + P_\xi x]\| \leq 2\,\mathbb{E}\big[\|y\| + \|P_\xi(x+y) - P_\xi x\|\big] \leq 4\|y\|,$$
where we used the nonexpansivity of the projectors $P_i$, $i \in I$, in the second inequality.

3. For any $x, y \in H$ it holds that $\|\nabla R(x) - \nabla R(y)\| \leq 2\big(\|x - y\| + \|\mathbb{E}[P_\xi x - P_\xi y]\|\big)$. Applying Jensen's inequality and nonexpansivity, we arrive at the desired result.

4. Clearly, if $x \in C$, then $x = P_\xi x$ a.s. and so $x = \mathbb{E}[P_\xi x]$, i.e. $\nabla R(x) = 0$ by 2. Conversely, if $\nabla R(x) = 0$, then by convexity $R(x) - R(y) \leq \langle \nabla R(x), x - y \rangle = 0$ for all $y \in H$. Since $C \neq \emptyset$ there is $y \in H$ with $R(y) = 0$, so also $R(x) = 0$, i.e. $x \in C$. □
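Item 2 of Lemma A.11 can be checked numerically. The sketch below estimates $R(x) = \mathbb{E}[\operatorname{dist}^2(x, C_\xi)]$ and $\nabla R(x) = 2\,\mathbb{E}[x - P_\xi x]$ by Monte Carlo for the family of disks from Example 4.11 and compares the gradient against a central finite-difference quotient; the sample size, the offset $\rho$ and the test point are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.5

def proj(x, alpha):
    """Projection onto C_alpha = B(rho * e_alpha, 1)."""
    c = rho * np.array([np.cos(alpha), np.sin(alpha)])
    d = x - c
    nd = np.linalg.norm(d)
    return x if nd <= 1.0 else c + d / nd

alphas = rng.uniform(0.0, 2.0 * np.pi, size=20000)  # common random numbers

def R(x):
    """Monte Carlo estimate of E[dist^2(x, C_xi)]."""
    return np.mean([np.sum((x - proj(x, a))**2) for a in alphas])

def gradR(x):
    """Monte Carlo estimate of 2 E[x - P_xi x]."""
    return 2.0 * np.mean([x - proj(x, a) for a in alphas], axis=0)

x = np.array([3.0, -1.0])
h = 1e-5
fd = np.array([(R(x + h*e) - R(x - h*e)) / (2*h) for e in np.eye(2)])
print(gradR(x), fd)   # the two estimates should approximately agree
```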
B. Paracontractions

Paracontractions include the set of averaged operators, but averaged mappings possess more useful regularity properties: for example, a composition of averaged operators is again averaged, whereas for nonaveraged paracontractions this is not clear in general. An example of a nonaveraged paracontraction on $\mathbb{R}$ is the Huber function with parameter $\alpha > 0$,
$$f_\alpha(x) := \begin{cases} \dfrac{x^2}{2\alpha}, & |x| \leq \alpha, \\[4pt] |x| - \dfrac{\alpha}{2}, & |x| > \alpha, \end{cases} \qquad x \in \mathbb{R}.$$
We have that $f_\alpha$ is nonexpansive and paracontractive, but not averaged: for $x = -2\alpha$ and $y = -\alpha$ one has $f_\alpha(x) = \frac{3\alpha}{2}$ and $f_\alpha(y) = \frac{\alpha}{2}$. Consequently $|f_\alpha(x) - f_\alpha(y)| = \alpha = |x - y|$, but
$$|x - f_\alpha(x) - (y - f_\alpha(y))| = 2\alpha \neq 0,$$
whereas for an averaged mapping the equality $|f(x) - f(y)| = |x - y|$ forces $x - f(x) = y - f(y)$. (A numerical check of this example is sketched after Definition B.1 below.)

In general metric spaces with nonlinear structure, averaged mappings are not defined, or at least demand a different definition, but the paracontraction framework still applies there and provides a useful description of mappings for which the RFI converges to a common fixed point. Paracontractions were used in Section 3.1 to guarantee Fejér monotonicity, yielding convergence; averagedness in this context would be too strong an assumption (and is actually not defined). In the following we provide an example of a class of paracontracting operators in $\mathbb{R}^n$ that are not in general averaged: resolvents of quasiconvex functions.

Definition B.1. A function $f: \mathbb{R}^n \to \mathbb{R}$ is called quasiconvex if the sublevel sets $\{x \in \mathbb{R}^n \mid f(x) \leq \alpha\}$ are convex for all $\alpha \in \mathbb{R}$. Equivalently, $f$ satisfies
$$f(\lambda x + (1 - \lambda) y) \leq \max\{f(x), f(y)\} \qquad \forall\, x, y \in \mathbb{R}^n,\ \forall\, \lambda \in [0, 1].$$
The proximity operator of a function $f: \mathbb{R}^n \to \mathbb{R}$ is given by the set-valued mapping
$$\operatorname{prox}_f(x) := \operatorname*{argmin}_{y \in \mathbb{R}^n} \left\{ f(y) + \frac{1}{2} \|x - y\|^2 \right\}, \qquad x \in \mathbb{R}^n.$$
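As promised above, here is a quick numerical check of the Huber-type example; the choice $\alpha = 1$ and the grid of test points are arbitrary.

```python
import numpy as np

alpha = 1.0

def f(x):
    """The Huber-type map: nonexpansive paracontraction, not averaged."""
    return np.where(np.abs(x) <= alpha, x**2 / (2*alpha), np.abs(x) - alpha/2)

x, y = -2*alpha, -alpha
print(abs(f(x) - f(y)), abs(x - y))     # both equal alpha ...
print(abs((x - f(x)) - (y - f(y))))     # ... yet Id - f separates them: 2*alpha

# Paracontractivity toward Fix f = {0}: |f(x)| < |x| for all x != 0.
grid = np.linspace(-5.0, 5.0, 1001)
grid = grid[grid != 0.0]
print(np.all(np.abs(f(grid)) < np.abs(grid)))   # True
```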
Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice differentiable and quasiconvex, satisfying $S := \operatorname{argmin} f \neq \emptyset$ and $\nabla f \neq 0$ on $\mathbb{R}^n \setminus S$; suppose furthermore that $\operatorname{Id} + \operatorname{Hess} f(x)$ is positive definite for all $x \in \mathbb{R}^n$. Then $\operatorname{prox}_f$ is paracontracting.

Proof. Denote $A := \operatorname{Id} + \nabla f$. Let $x, y \in \mathbb{R}^n$ with $f(x) \geq f(y)$; then
$$\|A(x) - y\|^2 = \|x - y\|^2 + \|\nabla f(x)\|^2 + 2\langle \nabla f(x), x - y \rangle \geq \|x - y\|^2,$$
where we used that, as shown in [24], a quasiconvex and differentiable function satisfies
$$f(x) \geq f(y) \implies \langle \nabla f(x), x - y \rangle \geq 0$$
for any $x, y \in \mathbb{R}^n$. Note that if $x \notin S$ then $\nabla f(x) \neq 0$ by assumption, and hence for $y \in \mathbb{R}^n$ with $f(y) \leq f(x)$ it holds that
$$\|A(x) - y\| > \|x - y\|. \tag{19}$$
Moreover, the function $g(y) := f(y) + \frac{1}{2}\|x - y\|^2$ for fixed $x \in \mathbb{R}^n$ is bounded from below (since $\inf_x f(x) > -\infty$ by assumption) and coercive. From positive definiteness of $\operatorname{Id} + \operatorname{Hess} f$ we have that $g$ is also twice continuously differentiable and strictly convex, hence it possesses a unique minimizer $\bar{x}$ that satisfies
$$x = \nabla f(\bar{x}) + \bar{x} = A(\bar{x});$$
it follows that $A(\mathbb{R}^n) = \mathbb{R}^n$, i.e. $A$ is surjective. Furthermore, $A$ is injective, since from uniqueness of the minimizer and sufficiency of the first-order optimality criterion for $\bar{x}$ to be a minimizer ($g$ is convex) it follows that, if $A(\bar{x}) = A(\bar{y})$, then $\bar{x} = \bar{y}$ is the minimizer of $g$; in particular $A(\bar{x}) = x \iff \operatorname{prox}_f(x) = \bar{x}$. To show that $\operatorname{prox}_f$ is also continuous, fix $x \in \mathbb{R}^n$ and let $y \in B(x, \varepsilon)$. We can find a $z \in B(x, \varepsilon)$ with $f(z) \leq f(y)$ for all $y \in B(x, \varepsilon)$ by continuity of $f$, so we get with (19) that
$$\left\| \operatorname{prox}_f(x) - \operatorname{prox}_f(y) \right\| \leq \left\| \operatorname{prox}_f(x) - z \right\| + \left\| \operatorname{prox}_f(y) - z \right\| < \|x - z\| + \|y - z\| < 3\varepsilon.$$
In particular, letting $y = \bar{x} \in S$ in (19), we have that
$$\left\| \operatorname{prox}_f(x) - \bar{x} \right\| < \|x - \bar{x}\| \qquad \forall\, x \in \mathbb{R}^n \setminus S,$$
where $S = \operatorname{argmin} f = \operatorname{Fix} \operatorname{prox}_f$. □

Example B.3 (non-averaged resolvent of a quasiconvex function). The function $f(x) := 1 - \exp\left(-\|x\|^2\right)$ for $x \in \mathbb{R}^n$ satisfies all the conditions in Lemma B.2. Its proximity operator has the derivative $(\operatorname{prox}_f)'(A(x)) = (A'(x))^{-1}$, where $A(x) = \left(1 + 2\exp\left(-\|x\|^2\right)\right) x$, i.e.
$$A'(x) = \left(1 + 2\exp\left(-\|x\|^2\right)\right) \operatorname{Id} - 4\exp\left(-\|x\|^2\right) x x^T.$$
Since $\left\| (\operatorname{prox}_f)'(A(x)) \right\| \geq \|y\| / \|A'(x) y\|$ for any $y \in \mathbb{R}^n \setminus \{0\}$, we have with $x = e_1 = (1, 0, \dots, 0)^T = y$ that $\left\| (\operatorname{prox}_f)'(A(e_1)) \right\| > 1$,
which is in contradiction to the nonexpansiveness of averaged mappings, whose derivative, where it exists, is bounded in norm by $1$.

Paracontractions also occur in nonconvex feasibility problems, both consistent and inconsistent. As long as the fixed point set of the averaged projections operator consists of isolated points and the projectors are single-valued in a neighborhood of this fixed point, [25, Theorem 3.2] shows that these operators are paracontractions, whenever all assumptions of the theorem are met. A statement on paracontractiveness in the case that the fixed point set of the averaged projections operator does not consist of isolated points is, however, not possible in general. Furthermore, nonconvex forward-backward operators appearing in structured optimization of nonconvex objective functions also exhibit the paracontraction property, see [25, Proposition 3.9]; these are not averaged in general, though the assumption that the fixed points are isolated is again used.
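Example B.3 is easy to verify numerically. The sketch below evaluates $\operatorname{prox}_f$ by Newton's method on the optimality condition $A(\bar{x}) = x$ (a solver choice made here for illustration; the example does not prescribe one) and estimates the local expansion of $\operatorname{prox}_f$ near $A(e_1)$ by finite differences; step sizes and iteration counts are arbitrary.

```python
import numpy as np

def A(x):
    """A = Id + grad f for f(x) = 1 - exp(-||x||^2)."""
    return (1.0 + 2.0 * np.exp(-x @ x)) * x

def A_jac(x):
    """Jacobian A'(x) = (1 + 2 e^{-||x||^2}) Id - 4 e^{-||x||^2} x x^T."""
    e = np.exp(-x @ x)
    return (1.0 + 2.0 * e) * np.eye(x.size) - 4.0 * e * np.outer(x, x)

def prox_f(z, iters=50):
    """Solve A(xbar) = z by Newton's method; then prox_f(z) = xbar."""
    x = z.copy()
    for _ in range(iters):
        x = x - np.linalg.solve(A_jac(x), A(x) - z)
    return x

e1 = np.array([1.0, 0.0])
z = A(e1)
# Norm of (A'(e1))^{-1} applied to e1: 1 / (1 - 2/e), roughly 3.78 > 1
print(np.linalg.norm(np.linalg.solve(A_jac(e1), e1)))
# Finite-difference expansion of prox_f at z in direction e1
h = 1e-6
print(np.linalg.norm(prox_f(z + h*e1) - prox_f(z - h*e1)) / (2*h))
```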
References

[1] D. Butnariu and S. D. Flåm, "Strong convergence of expected-projection methods in Hilbert spaces," Numerical Functional Analysis and Optimization, vol. 16, no. 5-6, pp. 601–636, 1995.
[2] A. Nedić, "Random algorithms for convex minimization problems," Mathematical Programming, vol. 129, no. 2, pp. 225–253, 2011.
[3] S. D. Flåm, "Successive averages of firmly nonexpansive mappings," Mathematics of Operations Research, vol. 20, no. 2, pp. 497–512, 1995.
[4] D. Butnariu, "The expected-projection method: Its behavior and applications to linear operator equations and convex optimization," Journal of Applied Analysis, vol. 1, no. 1, pp. 93–108, 1995.
[5] A. Nedić, "Random projection algorithms for convex set intersection problems," pp. 7655–7660, 2010.
[6] D. R. Luke, M. Teboulle, and N. H. Thao, "Necessary conditions for linear convergence of Picard iterations and application to alternating projections," arXiv, 2017.
[7] O. Kallenberg, Foundations of Modern Probability. Probability and Its Applications, New York: Springer, 1997.
[8] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Berlin: Springer, 2011.
[9] W. R. Mann, "Mean value methods in iterations," Proc. Amer. Math. Soc., vol. 4, pp. 506–510, 1953.
[10] M. A. Krasnoselski, "Two remarks on the method of successive approximations," Uspekhi Mat. Nauk (N.S.), vol. 10, no. 1(63), pp. 123–127, 1955. (Russian).
[11] M. Edelstein, "A remark on a theorem of M. A. Krasnoselski," Amer. Math. Monthly, vol. 73, pp. 509–510, May 1966.
[12] L. Gubin, B. Polyak, and E. Raik, "The method of projections for finding the common point of convex sets," USSR Comput. Math. and Math. Phys., vol. 7, no. 6, pp. 1–24, 1967.
[13] J. B. Baillon, R. E. Bruck, and S. Reich, "On the asymptotic behavior of nonexpansive mappings and semigroups in Banach spaces," Houston J. Math., vol. 4, no. 1, pp. 1–9, 1978.
[14] M. Hairer, "Ergodic properties of Markov processes," Lecture Notes in Mathematics, vol. 1881, pp. 1–39, 2006.
[15] F. Deutsch, Best Approximation in Inner Product Spaces. New York: Springer, 2001.
[16] A. Y. Kruger, D. R. Luke, and N. H. Thao, "Set regularities and feasibility problems," Mathematical Programming, vol. 168, no. 1-2, pp. 279–311, 2018.
[17] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter, "From error bounds to the complexity of first-order descent methods for convex functions," Mathematical Programming, vol. 165, pp. 471–507, Oct 2017.
[18] H. Alt, Lineare Funktionalanalysis: Eine anwendungsorientierte Einführung (in German). Springer Lehrbuch, Berlin: Springer, 2002.
[19] V. Bogachev, Measure Theory, vol. I. Springer Berlin Heidelberg, 2007.
[20] M. Hairer, "Convergence of Markov processes," Lecture notes, University of Warwick, 2016.
[21] H. Bauer, Maß- und Integrationstheorie (in German). De Gruyter Lehrbuch, Berlin: W. de Gruyter, 1992.
[22] H. Robbins and D. Siegmund, "A convergence theorem for non negative almost supermartingales and some applications," in Optimizing Methods in Statistics (J. S. Rustagi, ed.), pp. 233–257, Academic Press, 1971.
[23] H. H. Bauschke and J. M. Borwein, "On projection algorithms for solving convex feasibility problems," SIAM Rev., vol. 38, no. 3, pp. 367–426, 1996.
[24] K. J. Arrow and A. C. Enthoven, "Quasi-concave programming," Econometrica, vol. 29, pp. 779–800, 1961.
[25] D. R. Luke, N. H. Thao, and M. K. Tam, "Quantitative convergence analysis of iterated expansive, set-valued mappings,"