Random Function Iterations for Consistent Stochastic Feasibility
Neal Hermer∗, D. Russell Luke† and Anja Sturm‡

September 19, 2018

∗ Institut für Numerische und Angewandte Mathematik, Universität Göttingen, 37083 Göttingen, Germany. NH was supported by Deutsche Forschungsgemeinschaft Research Training Grant 2088 TP-B5. E-mail: [email protected]
† Institut für Numerische und Angewandte Mathematik, Universität Göttingen, 37083 Göttingen, Germany. DRL was supported in part by Deutsche Forschungsgemeinschaft Research Training Grant 2088 TP-B5. E-mail: [email protected]
‡ Institut für Mathematische Stochastik, Universität Göttingen, 37077 Göttingen, Germany. AS was supported in part by Deutsche Forschungsgemeinschaft Research Training Grant 2088 TP-B5. E-mail: [email protected]
Abstract
We study the convergence of iterated random functions for stochastic feasibility in the consistent case (in the sense of Butnariu and Flåm [1]) in several different settings, under decreasingly restrictive regularity assumptions of the fixed point mappings. The iterations are Markov chains and, for the purposes of this study, convergence is understood in very restrictive terms. We show that sufficient conditions for geometric (linear) convergence in expectation of stochastic projection algorithms presented in Nedić [2] are in fact necessary for geometric (linear) convergence in expectation more generally of iterated random functions.
MSC 2010: Primary 60J05, 52A22, 49J55; Secondary 49J53, 65K05.
Keywords:
Averaged mappings, nonexpansive mappings, paracontractions, stochastic feasibility, stochastic fixed point problem, iterated random functions, metric subregularity, linear regularity, linear convergence in expectation, geometric convergence of measures
1. Introduction
We are inspired by problems of the form

Find x* ∈ ⋂_{i∈I} C_i,

where I is an index set and the C_i are closed subsets of a metric space. When I is a finite set, then this is the classical deterministic feasibility problem. Randomized algorithms for solving such finite intersections have been intensely studied in recent years for their application to distributed computational schemes and machine learning. Our study here concerns a generalization where I is an arbitrary (possibly uncountable) set. This was first considered by Butnariu and Flåm
[1, 3, 4], where this is called the stochastic feasibility problem. There are many reasons why one might consider such infinite feasibility problems. At the time Butnariu and Flåm's work appeared, there was a great deal of interest in solution methods for linear integral equations. Our primary motivation is to propose a new way of modeling and analyzing errors, either numerical or measurement, as they are manifest in numerical iterative procedures.

The simplest algorithm one could imagine for solving such problems is to generate sequences (x_k)_{k∈N} by the fixed point iteration

x_{k+1} ∈ ( ∏_{i∈I} P_{C_i} ) x_k,

where P_{C_i} is the metric projection onto the set C_i and ∏_{i∈I} indicates the composition of the operators over the index set. This is known as the cyclic projections algorithm and has been extensively studied for the case when I is a finite set. But when I is infinite one immediately encounters the problem that such an algorithm never completes the first iteration!

One application where infinite feasibility problems appear very naturally is integral equations of the first kind in the separable Hilbert space L²([a,b]), as considered in [4]:

(Tx)(t) = ∫_a^b K(t,s) x(s) ds = g(t),  t ∈ [a,b],   (1)

with g ∈ L²([a,b]). The feasibility reformulation of this problem is

Find x ∈ ⋂_{t∈[a,b], a.s.} C_t := { φ ∈ L²([a,b]) | (Tφ)(t) = g(t) }.   (2)

The almost surely (a.s.) under the intersection will be clarified below. The basic idea, however, is to choose the parameter t above randomly and to choose the nearest point in the set C_t to the current guess. As such, the sequences that we generate are random processes. We return to this application in Section 4.3.

One of our main contributions is to place the previous results on stochastic projection iterations in the broader context of random function iterations, that is, iterations of randomly selected mappings with arbitrary initial distributions from which the initial points are chosen. Following [5] we characterize the sequence generated by the method of Random Function Iterations (Algorithm 1) as a Markov chain (Proposition 2.2). There are many different notions of convergence of Markov chains. For the consistent feasibility problems considered in this study (Standing Assumption 2) almost sure convergence applies, which perhaps is of limited interest in the broader context of generic Markov chains. When we move to inconsistent stochastic feasibility problems, which we formulate more generally as stochastic fixed point problems in a follow-up study, the richness of the theory of Markov processes comes much more into play. For the consistent case we are able to establish convergence results in a number of new settings, namely compact metric spaces or R^n with paracontracting mappings (see Sections 3.1 and 3.2, resp.) and separable Hilbert spaces with averaged mappings (see Section 3.3). (It is the technology of averaged mappings that opens the door to an analysis of algorithms for inconsistent stochastic feasibility problems and more generally stochastic fixed point problems.)

Our framework allows us to address a variety of Markov chains and stochastic algorithms, though to fix these ideas our main example will be stochastic sequential projection algorithms. We also achieve with this analysis more refined convergence statements about the corresponding sequences of random variables, namely almost sure strong convergence and, under certain assumptions, geometric convergence (called linear or exponential in other communities) of the measures (Theorem 3.12).
We then show necessary and sufficient conditions for geometric convergence in expectation (Theorem 3.15), in a stochastic analog to [6, Theorem 3.12 and Corollary 3.13]. These conditions are identified as a manifestation of metric subregularity of a suitable merit function at its zeros. When specialized to convex feasibility, this formulation of metric subregularity of the merit function is equivalent to the function possessing what is known as the KL property (Proposition 4.2). Finally, we identify a previously unrecognized necessary condition for geometric convergence of stochastic sequential projection algorithms, Theorem 4.3. Despite the strong assumptions of the present study, a number of unexpected behaviors can occur; these are demonstrated in concrete examples.
2. Stochastic Fixed Point Theory
2.1. Random function iterations

Consider a collection of continuous mappings T_i : G → G, i ∈ I, on a metric space (G, d), where I is an arbitrary index set. Assume that (I, ℐ) is a measurable space. Let (Ω, F, P) be a probability space. Let ξ be an I-valued random variable, i.e. ξ : Ω → I measurable (ξ^{-1}(A) ∈ F for all A ∈ ℐ). Let (ξ_n)_{n∈N} be an i.i.d. sequence with ξ_n =_d ξ (equality in distribution). Let µ be a probability measure on (G, B(G)). The stochastic selection method is given by Algorithm 1.
Algorithm 1: Random Function Iteration (RFI)
Initialization: X_0 ∼ µ
for k = 0, 1, 2, ... do
    X_{k+1} = T_{ξ_k} X_k
return (X_k)_{k∈N}

Here we mean by X_0 ∼ µ that the law (i.e. distribution) of X_0 satisfies L(X_0) := P_{X_0} := P(X_0 ∈ ·) := P ∘ X_0^{-1} = µ. The following assumptions will be employed throughout.

Standing Assumption 1.
a) X_0, ξ_0, ξ_1, ..., ξ_k are independent for every k ∈ N, where (ξ_k)_{k∈N} are i.i.d.
b) The function Φ : G × I → G, (x, i) ↦ T_i x, is measurable.

The iterates of the RFI, X_{k+1} = Φ(X_k, ξ_k) = T_{ξ_k} X_k, k ∈ N, can be considered as a time-homogeneous Markov chain with transition kernel

p(x, A) := P(Φ(x, ξ) ∈ A) = P(T_ξ x ∈ A)   (x ∈ G, A ∈ B(G)),   (3)

where Φ is called an update function.
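To make the iteration concrete, here is a minimal sketch of Algorithm 1 in G = R², assuming (purely for illustration, not from the paper) that the T_i are metric projections onto randomly selected lines through the origin — a consistent family that anticipates Example 4.8 below; all names and parameters are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_line(x, alpha):
    """Metric projection of x onto the line C_alpha = R * e_alpha."""
    e = np.array([np.cos(alpha), np.sin(alpha)])
    return np.dot(e, x) * e

def rfi(x0, beta=np.pi / 3, n_iters=200):
    """Random Function Iteration: X_{k+1} = T_{xi_k} X_k, with X_0 = x0."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        alpha = rng.uniform(0.0, beta)  # draw xi_k i.i.d., here unif[0, beta]
        x = project_line(x, alpha)      # apply the randomly selected mapping
    return x

print(rfi([1.0, -2.0]))  # tends to the feasible point C = {0} almost surely
```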
To see that p is really a transition kernel, recall that, in general, a transition kernel p : G × B(G) → [0, 1] is measurable in the first argument, i.e. p(·, A) is measurable for all A ∈ B(G) (this follows from [7, Lemma 1.26]), and is a probability measure in the second argument, i.e. p(x, ·) is a probability measure for all x ∈ G (immediate by definition). Now recall the definition of a Markov chain with general transition kernel p.

Definition 2.1.
A sequence of random variables (X_k)_{k∈N}, X_k : (Ω, F, P) → (G, B(G)), is called a Markov chain with transition kernel p if for all k ∈ N and A ∈ B(G) it holds P-a.s. that
(i) P(X_{k+1} ∈ A | X_0, X_1, ..., X_k) = P(X_{k+1} ∈ A | X_k);
(ii) P(X_{k+1} ∈ A | X_k) = p(X_k, A).

Since (G, B(G)) is a Borel space and X_k is a random variable in G, it makes sense to talk about these conditional expectations (existence of regular versions of the conditional distribution by [7, Theorem 5.3]). From the setting above, the following fact follows easily from Theorem A.9.

Proposition 2.2.
Under Standing Assumption 1, the sequence of random variables (X_k)_{k∈N} generated by Algorithm 1 is a Markov chain with transition kernel p given by (3).

Let µ ∈ P(G), where P(G) is the space of all probability measures on G. The Markov operator P acting on a measure µ is defined via

µP(A) := ∫_G p(x, A) µ(dx)   (A ∈ B(G)).

One defines the operation of the Markov operator acting on a measurable function f : G → R via

Pf(x) := ∫_G f(y) p(x, dy)   (x ∈ G).

Note that

Pf(x) = ∫_G f(y) P_{Φ(x,ξ)}(dy) = ∫_Ω f(Φ(x, ξ(ω))) P(dω) = ∫_I f(Φ(x, u)) P_ξ(du).

There are several notions of convergence that one can study for the sequence (X_k) on the metric space (G, d), or correspondingly for the laws (L(X_k)) on P(G), where L(X_k) := P_{X_k} := P(X_k ∈ ·). Denote the set of bounded and continuous functions from G to R by C_b(G). Let (ν_n) be a sequence of probability measures on G. The sequence (ν_n) converges to ν in the weak sense if ν ∈ P(G) and for all f ∈ C_b(G) it holds that ν_n f → νf as n → ∞.

A fixed point of the Markov operator P is called an invariant distribution, i.e. π ∈ P(G) is invariant if and only if πP = π. A very general notion of convergence in the context of Markov chains concerns the convergence of the probability measures ν_k := (1/k) Σ_{j=1}^k L(X_j) in the weak sense. An elementary fact from the theory of Markov chains (Theorem A.4) is that, if this sequence converges, it converges to an invariant probability measure π: for f ∈ C_b(G),

ν_k f = E[ (1/k) Σ_{j=1}^k f(X_j) ] → πf,  as k → ∞.

The notion of convergence we consider here is much stronger; we consider almost sure convergence of the sequence (X_k) to a random variable X: X_k → X a.s. as k → ∞. Clearly, almost sure convergence of the sequence implies the more general notion above. This is common in studies of stochastic algorithms in optimization, though it does not require the full power of the theory of general Markov processes. For consistent feasibility (defined below), however, this is all that is needed for our present purposes. In a follow-up study we will need to consider more general notions of convergence of this Markov chain.
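As a quick illustration, the action of the Markov operator on a function, Pf(x) = ∫_I f(Φ(x,u)) P_ξ(du), can be estimated by plain Monte Carlo sampling of ξ. The update Φ and the test function below are hypothetical stand-ins chosen by us, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def markov_operator(f, x, phi, sample_xi, n=20_000):
    """Monte Carlo estimate of P f(x) = E[f(Phi(x, xi))]."""
    return np.mean([f(phi(x, sample_xi())) for _ in range(n)])

# toy update: project a scalar onto the random interval [u - 1, u + 1]
phi = lambda x, u: float(np.clip(x, u - 1.0, u + 1.0))
sample_xi = lambda: rng.uniform(-0.5, 0.5)
print(markov_operator(abs, 3.0, phi, sample_xi))  # approximates P|.|(3.0)
```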
2.2. Consistent Stochastic Feasibility Problem

The stochastic feasibility problem is to find a point

x* ∈ C := { x ∈ G | P(x ∈ Fix T_ξ) = 1 },   (4)

where the fixed point set of the operator T_i is denoted by

Fix T_i = { x ∈ G | x = T_i x }.

We assume throughout that not only is Fix T_i nonempty for P_ξ-almost all i ∈ I, but more restrictively:

Standing Assumption 2 (consistent feasibility problem).
The set C is nonempty.

Note that, due to continuity of T_i, it follows that Fix T_i is a closed set. This specializes immediately to the stochastic feasibility problem formulated by Butnariu and Flåm [1], where Fix T_ξ = C_ξ. In order to make sense of the specialization to stochastic set feasibility, we need the event {x ∈ Fix T_ξ} to be an element of F for any x ∈ G.

Remark 2.3:
Since {x} ∈ B(G) and the function Φ_ξ : G × Ω → G, (x, ω) ↦ Φ_ξ(x, ω) := (Φ ∘ (Id, ξ))(x, ω) = T_{ξ(ω)} x, is measurable as a composition of two measurable functions, we find

{x ∈ Fix T_ξ} = { ω ∈ Ω | x ∈ Fix T_{ξ(ω)} } = { ω ∈ Ω | T_{ξ(ω)} x = x } = { ω ∈ Ω | (x, ω) ∈ Φ_ξ^{-1}({x}) } ∈ F,

since slices of sets in the product σ-field are measurable with respect to the single σ-fields (see Lemma A.1).

Denote in the following, for A ⊂ Ω, C(A) := ⋂_{ω∈A} Fix T_{ξ(ω)}.

Lemma 2.4 (equivalence of stochastic and deterministic feasibility problems). Under the standing assumptions, and if G is complete and separable, there exists a P-nullset N ⊂ Ω such that

C = C(Ω \ N) = ⋂_{ω∈Ω\N} Fix T_{ξ(ω)}.

Furthermore, C ⊂ G is closed.

Proof. For the direction "⊃", note that P(Ω \ N) = 1 for any P-nullset N ⊂ Ω, so for x ∈ C(Ω \ N) it holds that P(x ∈ Fix T_ξ) = 1, i.e. x ∈ C.

Consider now the direction "⊂". Let Q be a dense and countable subset of C (this exists by Theorem A.2). Since for each q ∈ Q, P(q ∈ Fix T_ξ) = 1, there is N_q ⊂ Ω with P(N_q) = 0 and q ∈ C(Ω \ N_q). Set N = ⋃_{q∈Q} N_q; then P(N) = 0 and q ∈ C(Ω \ N) for all q ∈ Q. Now let c ∈ C, so there exists (q_n)_{n∈N} ⊂ Q with q_n → c as n → ∞. Since, for all i ∈ I, Fix T_i is closed by continuity of T_i, we get c = lim_{n→∞} q_n ∈ C(Ω \ N). The set C(Ω \ N) is defined as an intersection of closed sets and hence is closed itself.

Remark 2.5 (interpretation): Lemma 2.4 shows that the feasible set C in the separable case can be written as an intersection of a selection of sets Fix T_{ξ(ω)}, as in the deterministic formulation of the fixed point problem, but where ω ∈ Ω \ N for a nullset N ⊂ Ω. In fact C(Ω) is in general a proper subset of C = C(Ω \ N), or can even be empty. But note that, even though the construction of C in Lemma 2.4 appears to depend on the random variable ξ, in fact C only depends on the distribution P_ξ by definition. Furthermore, in the context of more general Markov chains, we have

p(c, {c}) = P(T_ξ c ∈ {c}) = P(Ω \ N) = 1   (c ∈ C).

Hence

δ_c P(A) = p(c, A) = 1_A(c) = δ_c(A)   (A ∈ B(G)).

In other words, the delta function δ_c for c ∈ C is an invariant measure for P.

Corollary 2.6 (P_ξ-nullset, separable space). Under the assumptions of Lemma 2.4, there exists a P-nullset N with C = C(Ω \ N) such that ξ(N) := {ξ(ω) | ω ∈ N} is a P_ξ-nullset, where we denote P_ξ = P(ξ ∈ ·), and it satisfies

C = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

Proof.
We will construct a P-nullset N for which ξ(Ω \ N) = ξ(Ω) \ ξ(N), where ξ(N) is a P_ξ-nullset; in that case it immediately follows that

⋂_{ω∈Ω\N} Fix T_{ξ(ω)} = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

Let A_x := { i ∈ I | T_i x = x } for x ∈ G. Then, analogously to Remark 2.3,

A_x = { i ∈ I | (x, i) ∈ Φ^{-1}({x}) } ∈ ℐ,

and so is A := ⋂_{c∈C} A_c = ⋂_{q∈Q} A_q as a countable intersection of measurable sets (Q ⊂ C dense and countable, see the proof of Lemma 2.4). Let Ñ be the P-nullset from Lemma 2.4, i.e. C = C(Ω \ Ñ); note that, due to

C = ⋂_{ω∈Ω\Ñ} Fix T_{ξ(ω)} = ⋂_{i∈ξ(Ω\Ñ)} Fix T_i ≠ ∅,

it holds that ξ(Ω \ Ñ) ⊂ A_c ≠ ∅ for all c ∈ C. Set N := Ω \ ξ^{-1}(A). Then from Ω \ Ñ ⊂ ξ^{-1}(A) it follows that N ⊂ Ñ is a P-nullset and

P_ξ(A) = P(ξ^{-1}(A)) ≥ P(Ω \ Ñ) = 1,

i.e. P_ξ(ξ(N)) = 1 − P_ξ(A) = 0. By the definition of A we have, for ω ∈ ξ^{-1}(A), that any c ∈ C satisfies c ∈ Fix T_{ξ(ω)}, so it follows that C ⊂ C(ξ^{-1}(A)). Due to C(Ω \ N) ⊂ C(Ω \ Ñ) it holds that

C = ⋂_{ω∈ξ^{-1}(A)} Fix T_{ξ(ω)} = ⋂_{i∈ξ(ξ^{-1}(A))} Fix T_i = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

Note that from N = Ω \ ξ^{-1}(A) it follows that ξ(N) = ξ(Ω) \ ξ(ξ^{-1}(A)).

If ξ is not surjective, then ξ(Ω) ≠ I. In that case, there is a P_ξ-nullset ξ(N) of indices in I that are not needed to characterize the fixed point set, and these indices can be removed from the index set I. Note also that, in general, the P-nullsets occurring in Lemma 2.4 and Corollary 2.6 are different. If there is N ⊂ Ω with C = C(Ω \ N), then it need not be the case that C = ⋂_{i∈ξ(Ω)\ξ(N)} Fix T_i.

In the context of the iterates X_k of Algorithm 1, in many of the results below we construct the set N in Lemma 2.4 as follows:

N = ⋃_k N_k, where N_k := Ω \ { ω ∈ Ω | T_{ξ_k(ω)} c = c ∀ c ∈ C }.   (5)

From Lemma 2.4 we have that N_k is a set of measure zero, hence so is N.

3. Convergence Analysis

We achieve convergence of iterated random functions for consistent stochastic feasibility in several different settings under different assumptions on the metric spaces and the mappings T_i (i ∈ I). The main properties of the mappings we consider are:

• quasi-nonexpansive mappings, i.e.

d(T_i x, y) ≤ d(x, y)   (∀ x ∉ Fix T_i)(∀ y ∈ Fix T_i);   (6)

• paracontractions, i.e. T_i is continuous and

d(T_i x, y) < d(x, y)   (∀ x ∉ Fix T_i)(∀ y ∈ Fix T_i);   (7)

• nonexpansive mappings, i.e.

d(T_i x, T_i y) ≤ d(x, y)   (∀ x, y ∈ G);   (8)
• averaged mappings on a normed linear space H, i.e. mappings T : H → H for which there exists an α ∈ (0, 1) such that

‖Tx − Ty‖² + ((1−α)/α) ‖(x − Tx) − (y − Ty)‖² ≤ ‖x − y‖²   (∀ x, y ∈ H).   (9)

Note that for a quasi-nonexpansive mapping T : G → G the condition x ∈ Fix T implies that d(Tx, y) = d(x, y) for all y ∈ G. The set of quasi-nonexpansive mappings contains the paracontractions and the nonexpansive mappings. The set of projectors onto convex sets, or more generally the set of averaged mappings on a Hilbert space H, is contained in both the set of nonexpansive mappings and the set of paracontractions [8, Remark 4.24 and 4.26]. For an example of a paracontraction that is not averaged see Example 3.1 and Appendix B. Averaged mappings were first used in the work of Mann, Krasnoselski, Edelstein, Gurin, Polyak and Raik, who wrote seminal papers in the analysis of (firmly) nonexpansive and averaged mappings [9–12], although the terminology "averaged" wasn't coined until sometime later [13].
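The averagedness inequality (9) can be checked numerically for a concrete mapping. The sketch below (our own illustration) does this for the projector onto a closed ball, which is 1/2-averaged (firmly nonexpansive, cf. Section 4.1):

```python
import numpy as np

rng = np.random.default_rng(2)

def project_ball(x, radius=1.0):
    """Projection onto the closed Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

alpha = 0.5  # projectors onto convex sets are 1/2-averaged
for _ in range(10_000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    Tx, Ty = project_ball(x), project_ball(y)
    lhs = (np.linalg.norm(Tx - Ty) ** 2
           + (1 - alpha) / alpha * np.linalg.norm((x - Tx) - (y - Ty)) ** 2)
    assert lhs <= np.linalg.norm(x - y) ** 2 + 1e-10  # inequality (9) holds
```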
Example 3.1. Let f : R → R_+ be continuous with f(0) = 0 and |f(x)| < |x| for all x ∈ R \ {0}; then f is paracontractive. This includes also convex functions, e.g. Huber functions, which are not averaged in general (see Appendix B). For other examples on R^n see also Appendix B.

3.1. Compact metric spaces

In this section we establish convergence of the RFI on a compact metric space. The next example illustrates why nonexpansivity alone does not suffice to guarantee convergence to the intersection set C.

Example 3.2 (nonexpansive mappings, negative result). For nonexpansive mappings in general, one cannot expect that the support of every invariant measure is contained in the feasible set C. Consider a rotation in positive direction in R²,

A = ( cos(φ)  −sin(φ)
      sin(φ)   cos(φ) ),  φ ∈ (0, 2π),

and set ξ ≡ 1 and I = {1}, T_1 = A. Then C = {0} and, since ‖A‖ = 1, A is nonexpansive, but ‖Ax‖ = ‖x‖ for all x ∈ R². So the (deterministic) iteration X_{k+1} = AX_k will not converge to 0 whenever X_0 ∼ δ_x, x ≠ 0.
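The failure in Example 3.2 is easy to see numerically: the rotation preserves the norm, so the iterates circle the feasible point forever. A minimal check (our own illustration):

```python
import numpy as np

phi = 0.7  # any rotation angle in (0, 2*pi)
A = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])

x = np.array([1.0, 0.0])
for k in range(5):
    x = A @ x
    print(k + 1, np.linalg.norm(x))  # norm stays 1.0: no convergence to C = {0}
```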
A sufficient requirement on the mappings T_i to ensure convergence of Algorithm 1 is paracontractiveness. The next lemma is the main ingredient for proving a.s. convergence of (X_k) to a random point in C. The support of a probability measure ν ∈ P(G) is the smallest closed set S ⊂ G for which ν(G \ S) = 0 (see also Theorem A.3 for equivalent representations); we then write S = supp ν.

Lemma 3.3 (invariant measures for paracontractions). Under the standing assumptions, and if T_i (i ∈ I) is paracontracting on a compact metric space, then the set of invariant measures for P is {π ∈ P(G) | supp π ⊂ C}.

Proof. It is clear that π ∈ P(G) with supp π ⊂ C is invariant, since p(x, {x}) = P(T_ξ x ∈ {x}) = P(x ∈ Fix T_ξ) = 1 for all x ∈ C, and hence πP(A) = ∫_C p(x, A) π(dx) = π(A) for all A ∈ B(G).

The other implication is not so immediate. Suppose supp π \ C ≠ ∅ for some invariant measure π of P. Then, due to compactness of supp π (as it is closed in G), we can find s ∈ supp π maximizing the continuous function dist(·, C) over supp π. So d_max = dist(s, C) > 0. We show that the probability mass around s will be attracted to the feasible set C, implying that the invariant measure loses mass around s in every step, which yields a contradiction.

Define the set of points lying more than d_max − ε away from C:

K(ε) := { x ∈ G | dist(x, C) > d_max − ε },  ε ∈ (0, d_max).

This set is measurable, i.e. K(ε) ∈ B(G), because it is open. Let M(ε) be the event in F where T_ξ s is at least ε closer to C than s, i.e.

M(ε) := { ω ∈ Ω | dist(T_{ξ(ω)} s, C) ≤ d_max − ε }.

There are two possibilities: either there is an ε ∈ (0, d_max) with P(M(ε)) > 0, or no such ε exists. In the latter case we have dist(T_ξ s, C) = d_max = dist(s, C) a.s. by paracontractiveness of T_i. By compactness of C there exists c ∈ C such that 0 < d_max = d(s, c). Hence the probability of the set of ω ∈ Ω such that s ∉ Fix T_{ξ(ω)} is positive, and so is the probability that dist(T_{ξ(ω)} s, C) ≤ d(T_{ξ(ω)} s, c) < d(s, c) — a contradiction. So it must hold that there is an ε ∈ (0, d_max) with P(M(ε)) > 0.
In view of continuity of the mappings T_i around s, i ∈ I, define

A_n := { ω ∈ M(ε) | d(T_{ξ(ω)} x, T_{ξ(ω)} s) ≤ ε/2 ∀ x ∈ B(s, 1/n) }   (n ∈ N).

It holds that A_n ⊂ A_{n+1} and P(⋃_n A_n) = P(M(ε)). So in particular there is an m ∈ N, m > 2/ε, with P(A_m) > 0.
For all x ∈ B(s, 1/m) and all ω ∈ A_m we have

dist(T_{ξ(ω)} x, C) ≤ d(T_{ξ(ω)} x, T_{ξ(ω)} s) + dist(T_{ξ(ω)} s, C) ≤ d_max − ε/2,

which means T_{ξ(ω)} x ∈ G \ K(ε/2). Hence, in particular, we conclude that

p(x, K(ε/2)) < 1   ∀ x ∈ B(s, 1/m).

Since p(x, K(ε/2)) = 0 for x ∈ G with dist(x, C) ≤ d_max − ε/2, due to paracontractiveness, it holds by invariance of π that

π(K(ε/2)) = ∫_G p(x, K(ε/2)) π(dx) = ∫_{K(ε/2)} p(x, K(ε/2)) π(dx).
It follows, then, that

π(K(ε/2)) = ∫_{K(ε/2)} p(x, K(ε/2)) π(dx)
          = ∫_{B(s,1/m)} p(x, K(ε/2)) π(dx) + ∫_{K(ε/2)\B(s,1/m)} p(x, K(ε/2)) π(dx)
          < π(B(s, 1/m)) + π(K(ε/2) \ B(s, 1/m)) = π(K(ε/2)),

which is a contradiction (note that B(s, 1/m) ⊂ K(ε/2) since 1/m < ε/2, and π(B(s, 1/m)) > 0 since s ∈ supp π). So the assumption that supp π \ C ≠ ∅ is false, i.e. supp π ⊂ C as claimed.

Theorem 3.4 (Theorem 4.22 in [14]). Under Standing Assumption 1, if T_i (i ∈ I) is continuous, then the Markov operator P is Feller, i.e. P : C_b(G) → C_b(G).

Proof. By continuity of T_i, i ∈ I, the update function Φ is continuous in the first argument. It follows for f ∈ C_b(G) and x_n → x as n → ∞, by Lebesgue's Dominated Convergence Theorem, that

Pf(x_n) = ∫_I f(Φ(x_n, u)) P_ξ(du) → ∫_I f(Φ(x, u)) P_ξ(du) = Pf(x).

Note that Pf is bounded whenever f is a bounded function.

Theorem 3.5 (almost sure convergence for a compact metric space). Under the standing assumptions, let T_i be paracontractive, i ∈ I, and let (G, d) be a compact metric space. Then the sequence (X_k) of random variables generated by Algorithm 1 converges almost surely to a random variable X_µ ∈ C depending on the initial distribution µ.

Proof. Since P is Feller and G compact, Theorem A.4 implies that any subsequence of (ν_n), where ν_n = (1/n) Σ_{i=1}^n L(X_i), has a convergent subsequence, and cluster points are invariant measures for P. Let (ν_{n_k}) be a convergent subsequence with limit π. So for the bounded and continuous function dist(·, C) it holds that ν_{n_k} dist(·, C) → π dist(·, C) = 0 as k → ∞, by weak convergence of the probability measures and the fact that, by Lemma 3.3, supp π ⊂ C.

Due to quasi-nonexpansiveness and Lemma 2.4 (a compact metric space is separable), we have a.s. (for all ω ∉ N, with N given by (5)) that d(X_{k+1}, c) ≤ d(X_k, c) for all c ∈ C and k ∈ N, which implies dist(X_{k+1}, C) ≤ dist(X_k, C) for all k ∈ N a.s. It therefore follows that

E[dist(X_{n_k}, C)] ≤ (1/n_k) Σ_{i=1}^{n_k} E[dist(X_i, C)] = ν_{n_k} dist(·, C) → 0  as k → ∞.

This yields E[dist(X_k, C)] → 0 as k → ∞. Now since (dist(X_k, C))_k is a.s. nonincreasing, it must be that dist(X_k, C) → 0 a.s., so for any cluster point x_ω of (X_k(ω))_k we have x_ω ∈ C. This, together with a.s. monotonicity of (d(X_k, c))_k for all c ∈ C, implies that d(X_k(ω), x_ω) → 0 for any cluster point x_ω of (X_k(ω))_k, which implies the uniqueness of x_ω. In other words, (X_k) converges almost surely to a random variable X_µ with X_µ(ω) = x_ω ∈ C, ω ∉ N, as claimed.

The results for compact metric spaces can be applied, with minor adjustments, to finite dimensional vector spaces.
3.2. Convergence in R^n

In the following let (G, d) = (V, ‖·‖) be a finite dimensional normed vector space over R. This means in particular that V is also complete, every closed and bounded set is compact (Heine–Borel property), and all norms on V are equivalent. So actually, since all n-dimensional vector spaces are isomorphic, it is enough to study convergence in R^n equipped with the Euclidean norm ‖·‖.

The following result for R^n is a straightforward application of Theorem 3.5.

Theorem 3.6 (almost sure convergence in R^n). Under the standing assumptions, let T_i : R^n → R^n be paracontractive, i ∈ I. Then the sequence (X_k) of random variables generated by Algorithm 1 converges almost surely to a random variable X_µ ∈ C depending on the initial distribution µ.

Proof. First, suppose µ = δ_x for x ∈ R^n. Let N be given by (5). The quasi-nonexpansiveness property gives us ‖X_{k+1} − c‖ ≤ ‖X_k − c‖ for all c ∈ C a.s. (i.e. if ω ∉ N). Letting c ∈ C with dist(x, C) = ‖x − c‖, this implies X_k ∈ B(c, ‖x − c‖), where B(s, ε) ⊂ R^n is the closed ball around s ∈ R^n with radius ε. The assertion X_k → X_{δ_x} a.s. then follows from Theorem 3.5. Denote the corresponding invariant measure by π_x := L(X_{δ_x}).

Suppose now that µ ∈ P(R^n) is arbitrary. For f ∈ C_b(R^n) one has p^k(x, f) ≤ ‖f‖_∞ for all k ∈ N and x ∈ R^n. Note that p^k(x, f) = δ_x P^k f, and from the above argument δ_x P^k → π_x in the weak sense as k → ∞. Hence by Lebesgue's Dominated Convergence Theorem we get

µP^k f = ∫_{R^n} p^k(x, f) µ(dx) → ∫_{R^n} π_x f µ(dx) =: µπ_x f =: π_µ f,  as k → ∞.

We conclude that L(X_k) = µP^k → π_µ weakly. The measure π_µ = µπ_x is an invariant probability measure for P, since π_x is an invariant probability measure for P.

Choosing f = min{dist(·, C), M} ∈ C_b(R^n) with M > 0 yields L(X_k)f → π_µ f = 0. Since f(X_{k+1}) ≤ f(X_k) a.s. and L(X_k)f = E[f(X_k)] → 0,
it holds that f(X_k) → 0 a.s., hence dist(X_k, C) → 0 a.s. So for any a.s. convergent subsequence (X_{n_k}(ω))_k with limit x_ω it holds that x_ω ∈ C. Moreover, since (‖X_k − x_ω‖)_k is monotone, actually X_k(ω) → x_ω. We conclude that X_k → X_µ a.s., where X_µ(ω) := lim_k X_k(ω) = x_ω ∈ C, ω ∉ N.
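For a concrete instance of Theorem 3.6, the sketch below iterates two paracontractions on R² of the type described in Example 3.1, each shrinking one coordinate toward 0, so that C = Fix T_0 ∩ Fix T_1 = {(0, 0)}. The construction is our own illustration, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def T(i, x):
    """Paracontraction shrinking coordinate i toward 0; Fix T_i = {x : x_i = 0}."""
    y = x.copy()
    y[i] = y[i] / (1.0 + abs(y[i]))
    return y

x = np.array([5.0, -3.0])
for _ in range(5000):
    x = T(rng.integers(2), x)  # xi_k uniform on I = {0, 1}
print(x)  # converges a.s. to the feasible point (0, 0)
```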
3.3. Separable Hilbert spaces

In this section (G, d) is a Hilbert space (H, ⟨·,·⟩). Under the standing assumptions the following extended-valued function

R(x) := E[‖x − T_ξ x‖²] = ∫_Ω ‖x − T_{ξ(ω)} x‖² P(dω) = ∫_I ‖x − T_u x‖² P_ξ(du)

is measurable from H to [0, ∞]. Following [5] we use this function to characterize convergence of the consistent fixed point problem under the weaker assumption that the mappings T_ξ are averaged (see Eq. (9)).

Lemma 3.7 (properties of R and C for quasi-nonexpansive mappings). In addition to the standing assumptions, suppose that T_i (i ∈ I) is quasi-nonexpansive and continuous. Then
(i) C = R^{-1}(0);
(ii) R is finite everywhere;
(iii) R is continuous;
(iv) C is convex and closed.

Proof. (i) We have x ∈ C ⇔ x ∈ Fix T_ξ a.s. ⇔ x = T_ξ x a.s. ⇔ R(x) = 0.

(ii) Fix x ∈ C; then x = T_ξ x a.s. Using quasi-nonexpansivity we get, a.s., that

‖y − T_ξ y‖ ≤ ‖y − x‖ + ‖x − T_ξ y‖ ≤ 2‖x − y‖   ∀ y ∈ H,   (10)

and hence

‖y − T_ξ y‖² ≤ 4‖y − x‖²   ∀ y ∈ H.   (11)

From (11) it follows that R(y) ≤ 4‖y − x‖² < ∞ for all y ∈ H.

(iii) Let x, x_n ∈ H, n ∈ N, with x_n → x as n → ∞. Define the functions f_n(ω) = ‖x_n − T_{ξ(ω)} x_n‖² on Ω (n ∈ N). Then, by continuity of T_{ξ(ω)} for fixed ω ∈ Ω, one has f_n → f := ‖x − T_ξ x‖² pointwise on Ω. Define the constant function g(ω) = 8ε² + 8‖x − c‖² for some c ∈ C and some ε > 0.
By (10) we have that ‖y − T_ξ y‖ ≤ 2‖y − c‖ for all y ∈ H. For y ∈ B(x, ε) this yields ‖y − T_ξ y‖ ≤ 2ε + 2‖x − c‖, so that f_n ≤ (2ε + 2‖x − c‖)² ≤ 8ε² + 8‖x − c‖². We conclude that g is P-integrable and f_n ≤ g for all n ∈ N with x_n ∈ B(x, ε). Finally, application of Lebesgue's Dominated Convergence Theorem yields R(x_n) = E[f_n] → E[f] = R(x) as n → ∞.

(iv) This follows from [8, Proposition 4.13, Proposition 4.14]. Note that for any α ∈ R, a, b ∈ H, we have [8, Corollary 2.14]

‖αa + (1−α)b‖² = α‖a‖² + (1−α)‖b‖² − α(1−α)‖a − b‖².

Let z = λx + (1−λ)y with x, y ∈ R^{-1}(0) = C, λ ∈ [0, 1]. It follows from T_ξ x = x and T_ξ y = y a.s. that a.s.

‖T_ξ z − z‖² = ‖λ(T_ξ z − x) + (1−λ)(T_ξ z − y)‖²
            = λ‖T_ξ z − x‖² + (1−λ)‖T_ξ z − y‖² − λ(1−λ)‖x − y‖²
            ≤ λ‖z − x‖² + (1−λ)‖z − y‖² − λ(1−λ)‖x − y‖²
            = ‖λ(z − x) + (1−λ)(z − y)‖² = 0.

So R(z) = 0, i.e. z ∈ R^{-1}(0). Closedness of R^{-1}(0) follows by continuity of R.

In the next theorem we need to compute conditional expectations of nonnegative real-valued random variables, which are non-integrable in general (for example, if the random variable X_0 with distribution µ does not have a finite expectation, E[‖X_0‖] = +∞). But for these random variables the classical results on integrable random variables are still applicable (see Theorem A.8); also the disintegration theorem is still valid (see Theorem A.9).

The stage is now set to show convergence for the corresponding Markov chain. The next several results concern weak convergence of sequences of random variables with respect to the Hilbert space, namely, x_n ⇀ x if ⟨x_n, y⟩ → ⟨x, y⟩ for all y ∈ H.

Theorem 3.8 (weak cluster points belong to feasible set for averaged mappings). Under the standing assumptions, let T_i be α_i-averaged with α_i ≤ α < 1 for all i ∈ I. Then weak cluster points (in the sense of Hilbert spaces) of the sequence (X_k)_{k∈N} of random variables in H generated by Algorithm 1 are a.s. contained in C.

Proof. Fix c ∈ C. Since T_ξ is averaged we have for all k ∈ N that

‖X_{k+1} − c‖² ≤ ‖X_k − c‖² − ((1−α)/α)‖X_{k+1} − X_k‖²   (12)

everywhere but on a P-nullset N_c, which may depend on c. Let F_k = σ(X_0, ξ_0, ..., ξ_{k−1}) be the σ-algebra generated by all iterations of the algorithm up to the k-th, and apply Lemma A.10. We get that Σ_{k∈N} R(X_k) < ∞ a.s., where from Theorem A.9 it follows that E[‖X_{k+1} − X_k‖² | F_k] = R(X_k). Hence there is Ñ ⊂ Ω with P(Ñ) = 0 and R(X_k(ω)) → 0 as k → ∞ for ω ∈ Ω \ (N_c ∪ Ñ).

By nonexpansiveness of T_ξ we find, for any x, x_n ∈ H,

‖x − T_ξ x‖² = ‖x_n − T_ξ x‖² − ‖x − x_n‖² + 2⟨x − T_ξ x, x − x_n⟩
            = ‖x_n − T_ξ x_n‖² + ‖T_ξ x − T_ξ x_n‖² + 2⟨x_n − T_ξ x_n, T_ξ x_n − T_ξ x⟩ − ‖x − x_n‖² + 2⟨x − T_ξ x, x − x_n⟩
            ≤ ‖x_n − T_ξ x_n‖² + 2⟨x_n − T_ξ x_n, T_ξ x_n − T_ξ x⟩ + 2⟨x − T_ξ x, x − x_n⟩
            ≤ ‖x_n − T_ξ x_n‖² + 2‖x_n − T_ξ x_n‖‖x_n − x‖ + 2⟨x − T_ξ x, x − x_n⟩.

Taking expectations and using Jensen's inequality yields

R(x) ≤ R(x_n) + 2√(R(x_n)) ‖x_n − x‖ + 2 E[⟨x − T_ξ x, x − x_n⟩].   (13)

Now assume that the sequence (x_n) is weakly convergent to x ∈ H, i.e. x_n ⇀ x.
Then the functions f_n = ⟨x − T_ξ x, x − x_n⟩, n ∈ N, on Ω satisfy f_n → 0 pointwise, and the P-integrable function g(ω) := ‖x − T_{ξ(ω)} x‖ · sup_n ‖x − x_n‖ gives us |f_n| ≤ g for all n ∈ N; hence by Lebesgue's Dominated Convergence Theorem E[⟨x − T_ξ x, x − x_n⟩] → 0 as n → ∞.

So for ω ∈ Ω \ (N_c ∪ Ñ) there is a weakly convergent subsequence of the bounded sequence (X_k(ω))_{k∈N}, denoted x_n := X_{k_n}(ω) ⇀ x_ω =: x as n → ∞. As shown above, this subsequence satisfies R(x_n) → 0 as n → ∞. We conclude with (13) that R(x) = 0, i.e. x ∈ C, and hence any weak cluster point of the sequence (X_k(ω))_k is contained in C.

In the case of separable Hilbert spaces, we are able to show Fejér monotonicity of the sequence (X_k) a.s., so the classical theory of convergence analysis from [8] can be applied in this case. An analogous statement for nonseparable Hilbert spaces remains open since we do not have the representation Lemma 2.4 at hand.

Theorem 3.9 (almost sure weak convergence under separability). Under the same assumptions as in Theorem 3.8, assume additionally that H is a separable Hilbert space. Then the sequence (X_k) is a.s. weakly convergent (in the sense of Hilbert spaces) to a random variable X_µ ∈ C, depending on the initial distribution µ. Furthermore, P_C X_k → X_µ strongly a.s. as k → ∞.

Proof. Instead of a nullset N_c, which may depend on c ∈ C, as in the proof of Theorem 3.8, separability gives, with help of Lemma 2.4, that there is a nullset N such that on Ω \ N Eq. (12) is satisfied for all c ∈ C. This implies a.s. Fejér monotonicity of (X_k) with respect to C. Since from Theorem 3.8 it follows that weak cluster points of (X_k) are contained in C a.s., we can now apply the theory developed in [8] for Fejér monotone sequences: from [8, Theorem 5.5] (a Fejér monotone sequence w.r.t. C that has all its weak cluster points in C is weakly convergent to a point in C) it follows that X_k ⇀ X_µ ∈ C a.s.

For strong convergence of (P_C X_k) a.s. we apply [8, Proposition 5.7]. From [8, Corollary 5.8] we get from X_k ⇀ X_µ a.s. that P_C X_k → X_µ strongly a.s. as k → ∞.

Example 3.10 (convergence to projection for affine subspaces). Let H be separable and C_i be an affine subspace, i ∈ I, where I is an arbitrary index set. Let T_i = P_i be the projector onto C_i. Under the standing assumptions it holds that lim_k X_k = X_µ = P_C X_0 for X_0 ∼ µ and any µ ∈ P(H).

We show that P_C X_{k+1} = P_C X_k for any k ∈ N. This allows us to conclude that P_C X_k = P_C X_0 for any k ∈ N, and thus P_C X_0 is the only possible weak cluster point of (X_k) by Theorem 3.9. Using the characterization [15, Theorem 4.1] (if K ⊂ H is nonempty, closed and convex and u ∈ K, then ⟨x − u, k − u⟩ ≤ 0 for all k ∈ K iff u = P_K x) of a projection, we find with help of [15, Theorem 4.9] (for a subspace S it holds that ⟨x − P_S x, s⟩ = 0 for all s ∈ S) that for c ∈ C it holds that

⟨X_{k+1} − P_C X_k, c − P_C X_k⟩ = ⟨P_{ξ_k} X_k − X_k, c − P_C X_k⟩ + ⟨X_k − P_C X_k, c − P_C X_k⟩ ≤ 0,

where the first term vanishes and the second is nonpositive. Hence by [15, Theorem 4.1] we have that P_C X_{k+1} = P_C X_k.

3.4. Geometric convergence

We will assume in this section that H is a separable Hilbert space and T_i is α_i-averaged, i ∈ I. We will furthermore assume that α_i ≤ α for some α < 1.
As with the deterministic case, geometric convergence of the algorithm can be analyzed by introducing a condition on the set of fixed points. In the context of set feasibility with finitely many sets, the condition is equivalent to linear regularity of the sets [2, Assumption 2]: there exists κ > 0 such that

dist²(x, C) ≤ κ R(x)   ∀ x ∈ H.   (14)

In the more general context of fixed point mappings, this property is more appropriately called global metric subregularity of R at all points in C for 0 [16]; in particular, there exists a κ > 0 such that

dist²(x, R^{-1}(0)) ≤ κ R(x)   ∀ x ∈ H.

Here C = R^{-1}(0), so the above is just another way of writing (14). The smallest constant satisfying this inequality will be called the regularity constant; it is given by

sup_{x∈H\C} dist²(x, C) / R(x).
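The regularity constant can be probed numerically. The sketch below (our own illustration) estimates R by Monte Carlo for the family of lines through the origin treated in Example 4.8 below, and compares the supremum of dist²(x, C)/R(x) over unit vectors with the closed form 2β/(β − sin β) derived there:

```python
import numpy as np

rng = np.random.default_rng(4)
beta = np.pi / 2

def R(x, n=200_000):
    """Monte Carlo estimate of R(x) = E||x - P_alpha x||^2, alpha ~ unif[0, beta]."""
    a = rng.uniform(0.0, beta, size=n)
    normals = np.stack([np.sin(a), -np.cos(a)], axis=1)  # unit normals of the lines
    return np.mean((normals @ x) ** 2)

# sup of dist^2(x, C)/R(x) over unit vectors x (here C = {0}, so dist(x, C) = 1)
ratios = [1.0 / R(np.array([np.cos(t), np.sin(t)])) for t in np.linspace(0, np.pi, 60)]
print(max(ratios), 2 * beta / (beta - np.sin(beta)))     # the two values agree
```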
Theorem 3.11. In addition to the standing assumptions, suppose the regularity condition in Eq. (14) is satisfied and T_i is α_i-averaged, i ∈ I, with α_i ≤ α for some α < 1. Then the RFI converges geometrically in expectation to the fixed point set, i.e. for any initial distribution,

E[dist(X_{k+1}, C)] ≤ √(1 − κ^{-1}(1−α)/α) · E[dist(X_k, C)]   ∀ k ∈ N.   (15)

Proof.
Revisiting (12) in the proof of Theorem 3.8 gives us, for ω ∈ Ω \ N (N given by (5)) and x = P_C X_k(ω),

dist²(X_{k+1}(ω), C) ≤ ‖X_{k+1}(ω) − x‖² ≤ dist²(X_k(ω), C) − ((1−α)/α)‖X_{k+1}(ω) − X_k(ω)‖².

With help of Jensen's inequality and concavity of x
↦ √x on [0, ∞), we get that

E[dist(X_{k+1}, C) | F_k] ≤ E[ √( dist²(X_k, C) − ((1−α)/α)‖T_{ξ_k} X_k − X_k‖² ) | F_k ]
  ≤ √( dist²(X_k, C) − ((1−α)/α) E[‖T_{ξ_k} X_k − X_k‖² | F_k] )
  = √( dist²(X_k, C) − ((1−α)/α) R(X_k) )
  ≤ √(1 − κ^{-1}(1−α)/α) · dist(X_k, C).

Note that it could be that E[dist(X_k, C)] = ∞ for all k ∈ N, depending on the initial distribution µ.

The next theorem concerns the Wasserstein distance of two probability measures. For two measures ν_1, ν_2 ∈ P(G) this is given by

W(ν_1, ν_2) = inf_{Y_1 ∼ ν_1, Y_2 ∼ ν_2} E[‖Y_1 − Y_2‖].

Theorem 3.12 (strong convergence and geometric convergence of measures). Under the standing assumptions, suppose the regularity condition in Eq. (14) is satisfied and T_i is α_i-averaged, i ∈ I, with α_i ≤ α for some α < 1. Then X_k → X strongly a.s. as k → ∞, and the Wasserstein distances W(L(X_k), L(X)) also converge geometrically: there is r ∈ (0, 1) such that

W(L(X_k), L(X)) ≤ 2 r^k W(L(X_0), L(X)).

Proof.
See also [8, Theorem 5.12]. One has a.s. that

‖X_k − X_{k+m}‖ ≤ ‖X_k − P_C X_k‖ + ‖P_C X_k − X_{k+m}‖ ≤ 2 dist(X_k, C) ≤ 2√(κ R(X_k)).

We used here that T_ξ is nonexpansive and satisfies T_ξ c = c for any c ∈ C a.s.; hence

‖P_C X_k − X_{k+m}‖ = ‖T_{ξ_{k+m−1}} ⋯ T_{ξ_k} P_C X_k − X_{k+m}‖ ≤ dist(X_k, C).

This gives us that (X_k) is a Cauchy sequence a.s., since R(X_k) → 0 a.s.; its limit X is contained in C, since its weak limit needs to coincide with the strong limit. Letting m → ∞, one arrives at ‖X_k − X‖ ≤ 2 dist(X_k, C). Taking the expectation yields E[‖X_k − X‖] ≤ 2 E[dist(X_k, C)]. Hence, using Theorem 3.11 gives us

E[‖X_k − X‖] ≤ 2 r^k E[dist(X_0, C)]   with r = √(1 − κ^{-1}(1−α)/α),

and using the fact that E[dist(X_0, C)] ≤ W(L(X_0), L(X)), we have, by the definition of the Wasserstein distance,

W(L(X_k), L(X)) ≤ 2 r^k W(L(X_0), L(X)).

Note that it could be that W(L(X_0), L(X)) = ∞, depending on the initial distribution µ.

Remark 3.13 (ε-fixed points): In order to ensure that, with probability greater than 1 − β, the k-th iterate is in an ε-neighborhood of the feasible set C, it is sufficient that k ≥ ln( βε/√(κR(x)) ) / ln(c), where c = √(1 − κ^{-1}(1−α)/α) and X_0 ∼ δ_x. To see this, note that, by Markov's inequality,

P(X_k ∈ C + εB(0,1)) ≥ P(dist(X_k, C) < ε) = 1 − P(dist(X_k, C) ≥ ε) ≥ 1 − E[dist(X_k, C)]/ε ≥ 1 − c^k dist(x, C)/ε ≥ 1 − c^k √(κR(x))/ε.

Remark 3.14: As seen in Example 3.10, the probability P(X_k ∈ C) can increase to 1 as k → ∞, but this is not necessarily the case, as we will see in Examples 4.7 and 4.9. There, one finds that P(X_k ∈ C) = P(X_0 ∈ C) for all k ∈ N; the same holds in Example 4.8.
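The contraction of E[dist(X_k, C)] in Theorem 3.11 can be observed directly. The following sketch (our own illustration) runs many independent RFI paths for the lines-through-the-origin family and prints the successive ratios E[dist(X_{k+1}, C)]/E[dist(X_k, C)], which stay bounded away from 1:

```python
import numpy as np

rng = np.random.default_rng(5)
beta, n_paths, n_steps = np.pi / 2, 5000, 12

X = np.tile([1.0, 1.0], (n_paths, 1))              # all paths start at x = (1, 1)
dists = [np.mean(np.linalg.norm(X, axis=1))]       # dist(x, C) = ||x||, since C = {0}
for _ in range(n_steps):
    a = rng.uniform(0.0, beta, size=n_paths)       # one xi_k per path
    e = np.stack([np.cos(a), np.sin(a)], axis=1)
    X = e * np.sum(e * X, axis=1, keepdims=True)   # project each path onto its line
    dists.append(np.mean(np.linalg.norm(X, axis=1)))
print(np.array(dists[1:]) / np.array(dists[:-1]))  # empirical geometric rate
```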
Theorem 3.15 (necessary and sufficient conditions for geometric convergence). Under the standing assumptions, let T_i be α_i-averaged, i ∈ I, with α_i ≤ α for some α < 1. The regularity condition in Eq. (14) is satisfied if and only if there exists r ∈ [0, 1) such that

E[dist(T_ξ x, C)] ≤ r dist(x, C)   ∀ x ∈ H.   (16)

Furthermore, condition Eq. (14) is necessary and sufficient for geometric convergence in expectation of Algorithm 1 to the fixed point set C as in Eq. (15), with a uniform constant for all initial probability measures.

Proof. Eq. (14) implies Eq. (15), which in turn implies Eq. (16) (with X_0 ∼ δ_x), by Theorem 3.11 with r = √(1 − κ^{-1}(1−α)/α). The other implication follows the same proof pattern as [6, Theorem 3.11]. We note that, by Theorem A.9, if X_0 ∼ δ_x for x ∈ H, then E[‖X_1 − X_0‖² | ξ] = ‖T_ξ x − x‖²; hence by Hölder's inequality

E[‖X_1 − X_0‖] ≤ √(R(x)).

Furthermore we can estimate

‖X_1 − X_0‖ = ‖X_1 − P_C X_0 + P_C X_0 − X_0‖ ≥ dist(X_0, C) − dist(X_1, C).

Taking the expectation above, the assumption that E[dist(X_1, C)] ≤ r E[dist(X_0, C)] yields

R(x) ≥ (1 − r)² dist²(x, C)   (∀ x ∈ H),

i.e. the constant κ in Eq. (14) is finite with κ ≤ (1 − r)^{-2} < ∞. So Eq. (16) implies Eq. (14).

For the last implication of the theorem, note that, in case Eq. (15) is satisfied with the same constant r ∈ (0, 1) for all Dirac measures δ_x with x ∈ H, then Eq. (16) also holds (letting X_0 ∼ δ_x), and hence by the above equivalence Eq. (14) is satisfied. This completes the proof.

Remark 3.16:
Conventional analytical strategies invoke strong convexity in order to achieve geometric convergence. Our analysis makes no such assumption on the sets C_i. Theorem 3.15 shows that geometric convergence is a by-product, mainly, of the regularity of the set of fixed points. The results of [6] indicate that one could formulate a necessary regularity condition for sublinear convergence, which also might be useful for stochastic algorithms.
4. Applications
We specialize the framework above to several well-known settings: consistent convex feasibility, linear operator equations, and in particular Hilbert–Schmidt operators (i.e. linear integral equations).

4.1. Feasibility and stochastic projections
There are many algorithms for solving convex feasibility problems. We focus on the (conceptually) simplest of these, namely stochastic projections. In the context of Algorithm 1, T_i = P_i is a projector onto a nonempty closed and convex set C_i ⊂ H, i ∈ I, with H a Hilbert space. Note that projectors are 1/2-averaged operators [8, Proposition 4.8] (also referred to as firmly nonexpansive operators), so α_i = 1/2 for all i ∈ I; we then can choose the upper bound α = 1/2 as well. Also note that Fix P_i = C_i, i ∈ I.

As a first assertion we give an equivalent characterization of the regularity property in Eq. (14) using just properties of R. This characterization, known as the Kurdyka–Łojasiewicz (KL) property, eliminates the term with the distance to the usually unknown fixed point set C, but one needs to be able to compute the first derivative of the function R. For convex sets this is unproblematic since R is the expectation of the squared distances to the convex sets C_i; see Lemma A.11.

Definition 4.1 (KL property). A convex, continuously differentiable function f : H → R with inf_x f(x) = 0 and S := argmin f ≠ ∅ is said to have the global KL property if there exists a concave, continuously differentiable function φ : R_+ → R_+ with φ(0) = 0 and φ' > 0 such that

φ'(f(x)) ‖∇f(x)‖ ≥ 1   ∀ x ∈ H \ S.

The following theorem is a direct consequence of [17].
Proposition 4.2 (equivalent characterization of Eq. (14)). Under the standing assumptions, let T_i = P_i be projectors onto nonempty, closed and convex sets, i ∈ I. Then the regularity condition in Eq. (14) is satisfied with κ > 0 if and only if

R(x) ≤ (κ/4) ‖∇R(x)‖²   ∀ x ∈ H,

i.e. R has the global KL property.

Proof. Apply [17, Corollary 6] with φ(s) := √(κs) and f = R, and note that R is convex and differentiable (see Lemma A.11).

Theorem 4.3 (uniform bounds). Under the standing assumptions, suppose the regularity condition in Eq. (14) is satisfied, that H is separable, and that T_i = P_i are projectors onto nonempty, closed and convex sets, i ∈ I. Then the probability of any infeasible point landing in a randomly selected set is uniformly bounded away from 1, i.e. P(x ∈ C_ξ) ≤ r < 1 for all x ∈ H \ C.

Proof. It holds surely, for all x ∈ H, that

dist(P_ξ x, C) ≥ dist(x, C) − dist(x, C_ξ).

This, together with the expectation

E[dist(x, C_ξ)] = E[dist(x, C_ξ) 1_{x∉C_ξ}] ≤ E[dist(x, C) 1_{x∉C_ξ}] = dist(x, C)(1 − P(x ∈ C_ξ)),

yields, for X_0 ∼ δ_x,

E[dist(X_1, C)] ≥ P(x ∈ C_ξ) dist(x, C).

Hence by Theorem 3.15,

1 > r := sup_{x∈H\C} E[dist(X_1, C)]/dist(x, C) ≥ sup_{x∈H\C} P(x ∈ C_ξ).

Theorem 4.4 (finite vs. infinite convergence). Under the standing assumptions, let H be separable and let T_i = P_i be projectors (i ∈ I). Then one of the following holds:
(i) P(X_1 ∈ C) = 1 and P(X_n ∈ C) = 1 for all n ∈ N;
(ii) P(X_1 ∈ C) < 1 and P(X_n ∈ C) < 1 for all n ∈ N.

Proof. (i) If P(X_1 ∈ C) = 1, then X_k = X_1 a.s. for all k ≥ 1.

(ii) Since ∫ p(x, C) µ(dx) = P(X_1 ∈ C) < 1, there exists x ∈ supp µ \ C with p(x, C) < 1,
where µ is the initial distribution. Since p(x, H \ C) > 0,
there exists y ∈ supp p(x, ·) \ C. Then by Theorem A.3 this implies that p(x, B(y, ε)) > 0 for all ε > 0. We claim that

(∀ ε > 0)(∀ z ∈ B(y, ε))   p(z, B(y, 2ε)) ≥ p(x, B(y, ε)) > 0.   (17)

To see this, note that, for ω ∈ M(ε) := { ω ∈ Ω | P_{ξ(ω)} x ∈ B(y, ε) }, we have

‖P_{ξ(ω)} z − y‖ ≤ ‖P_{ξ(ω)} z − P_{ξ(ω)} y‖ + ‖P_{ξ(ω)} y − y‖ ≤ ‖z − y‖ + ‖P_{ξ(ω)} y − y‖ ≤ ‖z − y‖ + ‖P_{ξ(ω)} x − y‖ ≤ 2ε.

Here we have used nonexpansiveness of P_ξ and the definition of a projection. Now (17) follows from the identity P(M(ε)) = p(x, B(y, ε)). Similarly,

(∀ ε > 0)(∀ w ∈ B(x, ε))   p(w, B(y, 2ε)) ≥ p(x, B(y, ε)) > 0.   (18)

To see this, note that for ω ∈ M(ε) we have

‖P_{ξ(ω)} w − y‖ ≤ ‖P_{ξ(ω)} w − P_{ξ(ω)} x‖ + ‖P_{ξ(ω)} x − y‖ ≤ ‖w − x‖ + ε ≤ 2ε.

Now, fix ε > 0 such that B(y, 2ε) ∩ C = ∅ and B(x, 2ε) ∩ C = ∅. We get, for any w ∈ H and n ∈ N,

p^{n+1}(w, B(y, 2ε)) ≥ ∫_{B(y,ε)} p(z, B(y, 2ε)) p^n(w, dz) ≥ p(x, B(y, ε)) p^n(w, B(y, ε)).

So iteratively, denoting ε_n := 2^{-n} ε, we arrive at

p^{n+1}(w, B(y, 2ε)) ≥ [ ∏_{i=1}^{n} p(x, B(y, ε_i)) ] p(w, B(y, ε_n)).

The last probability can be estimated, for w ∈ B(x, ε_{n+1}), by (18) through

p(w, B(y, ε_n)) ≥ p(x, B(y, ε_{n+1})).

Hence p^{n+1}(w, B(y, 2ε)) is locally uniformly bounded from below for w ∈ B(x, ε_{n+1}). That implies

P(X_{n+1} ∈ H \ C) = ∫_H p^{n+1}(w, H \ C) µ(dw) ≥ ∫_{B(x, ε_{n+1})} p^{n+1}(w, B(y, 2ε)) µ(dw) ≥ [ ∏_{i=1}^{n+1} p(x, B(y, ε_i)) ] µ(B(x, ε_{n+1})) > 0,

i.e. P(X_n ∈ C) < 1 for all n ∈ N, as claimed.

Remark 4.5:
Theorem 4.4 can be interpreted as a lower bound on the complexity of the RFI, analogous to the deterministic case [6, Theorem 5.2], where the alternating projection algorithm converges either after one iteration or after infinitely many. Alternatively, the stopping or hitting time of a process is defined as

T := inf { n | X_n ∈ C }.

In this context, Theorem 4.4 says that either P(T = 1) = 1 or P(T = n) < 1 for all n ∈ N. Note that it could happen that P(T = ∞) = 1, in which case P(T = n) = 0 for all n ∈ N.

Example 4.6 (finite and infinite convergence). With just two sets, the deterministic alternating projections algorithm can converge in finitely many steps. But when the projections onto the respective sets are randomly selected, convergence might only come after infinitely many steps. For example, let C_1 = R_+ × R and C_2 = R × R_+, and P(ξ = 1) = 0.5 = P(ξ = 2), so that C = C_1 ∩ C_2 = R_+ × R_+. Set µ = δ_x, where x = (−1, −1). Then P(X_1 ∈ C) = 0, or more generally P(X_n ∈ C) = 1 − 0.5^n − 0.5^n < 1 for all n ∈ N. Now let P(ξ = 1) = 1 and P(ξ = 2) = 0; then C = C_1, and for µ as above, P(X_1 ∈ C) = 1 and so P(X_n ∈ C) = 1. (A numerical check of these probabilities is sketched below.)
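The probabilities in Example 4.6 are easy to confirm by simulation; the following sketch (our own illustration) estimates P(X_n ∈ C) and compares it with 1 − 0.5^n − 0.5^n:

```python
import numpy as np

rng = np.random.default_rng(6)
n_paths = 200_000

X = np.tile([-1.0, -1.0], (n_paths, 1))
rows = np.arange(n_paths)
for n in range(1, 7):
    i = rng.integers(2, size=n_paths)         # pick C_1 or C_2 for each path
    X[rows, i] = np.maximum(X[rows, i], 0.0)  # project onto the chosen half-plane
    print(n, np.mean((X >= 0).all(axis=1)), 1 - 2 * 0.5 ** n)
```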
Example 4.7 (no uniform geometric convergence). In this example we show a sublinear convergence rate for infinitely many overlapping intervals. This is in contrast to the convergence properties of finitely many intervals with nonempty interior, where one would expect a geometric rate.

Let ξ ∼ unif[ε − 1, 1 − ε] for some ε ∈ [0, 1). Define the nonempty and closed intervals C_r = [r − 1, r + 1], r ∈ R. (See Figure 1.) The projector onto these intervals is given by

P_r x = r + 1 if x ≥ r + 1,  P_r x = r − 1 if x ≤ r − 1,  and P_r x = x if r − 1 ≤ x ≤ r + 1.

The density ρ_ε of ξ is

ρ_ε(y) = (1/(2(1 − ε))) 1_{[ε−1, 1−ε]}(y).

One can compute

R_ε(x) = E[|P_ξ x − x|²] = ∫_R |P_r x − x|² ρ_ε(r) dr = (1/(2(1 − ε))) ∫_{ε−1}^{1−ε} |P_r x − x|² dr,

which for ε ≤ |x| ≤ 2 − ε evaluates to (|x| − ε)³/(6(1 − ε)).

Now, let us examine regularity properties. For the case ε ∈ [0, 1), the problem is a consistent feasibility problem with C = [−ε, ε]. While the regularity condition in Eq. (14) is trivially satisfied for |x| ≤ ε, for ε ≤ |x| ≤ 2 − ε we find R_ε(x) = (|x| − ε)³/(6(1 − ε)) and dist²(x, C) = (|x| − ε)². So the regularity property in Eq. (14) is not satisfied for any κ > 0, since dist²(x, C)/R_ε(x) = 6(1 − ε)/(|x| − ε) → ∞ as |x| ↓ ε; consequently, by Theorem 3.15, there is no r ∈ [0, 1) with E[dist(X_{k+1}, C)] ≤ r E[dist(X_k, C)] for all X_0 ∼ δ_x, x ∈ H.
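The blow-up of dist²(x, C)/R_ε(x) near the boundary of C in Example 4.7 can be checked numerically; the sketch below (our own illustration) uses the closed form of P_r and estimates R_ε by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(7)

def R_eps(x, eps, n=400_000):
    """Monte Carlo estimate of R(x) = E|P_xi x - x|^2, xi ~ unif[eps-1, 1-eps]."""
    r = rng.uniform(eps - 1.0, 1.0 - eps, size=n)
    return np.mean((np.clip(x, r - 1.0, r + 1.0) - x) ** 2)

eps = 0.1                          # feasible set C = [-eps, eps]
for x in [0.11, 0.2, 0.5]:
    ratio = (abs(x) - eps) ** 2 / R_eps(x, eps)
    print(x, ratio)                # grows without bound as x approaches eps
```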
Example 4.8 (uniform geometric convergence). We provide here a concrete example where geometric convergence of the RFI is achieved. This is somewhat surprising since the angle between the sets can become arbitrarily small. In the deterministic setting, this results in arbitrarily slow convergence of the algorithm. This provides some intuition for why stochastic algorithms can outperform deterministic variants.

Let C_α := R e_α with e_α = (cos(α), sin(α))ᵀ, α ∈ [0, π), and let ξ ∼ unif[0, β], where β ∈ (0, π). (See Figure 2.)

We have C = {0}, so dist(x, C) = ‖x‖, and the density of ξ is ρ_β(α) = (1/β) 1_{[0,β]}(α). The projector onto the linear subspace C_α of R² is given by

P_α(x) = x − ⟨ (sin(α), −cos(α))ᵀ, x ⟩ (sin(α), −cos(α))ᵀ.

We find then

R_β(x) = (1/β) ∫_0^β ‖P_α x − x‖² dα = (1/β) ∫_0^β (x_1 sin(α) − x_2 cos(α))² dα
       = (1/β) [ x_1² (β/2 − sin(β)cos(β)/2) + x_2² (β/2 + sin(β)cos(β)/2) − x_1 x_2 sin²(β) ].

Writing x = λe_α with λ ≥ 0, so that dist(x, C) = λ and R_β(x) = λ² R_β(e_α), and employing trigonometric calculation rules, we find the regularity constant in Eq. (14) to be

κ = sup_{x∈R²\C} dist²(x, C)/R_β(x) = sup_{α∈[0,π)} 4β / (2β − sin(2β − 2α) − sin(2α)) = 2β/(β − sin(β)),

where the last supremum is attained at α = β/2. So from Theorem 3.11 we get uniform geometric convergence.

Example 4.9 (disks on a circle). This example illustrates Theorem 4.3. Let C_α := B(ρe_α, 1) ⊂ R², where 0 < ρ < 1, e_α = (cos(α), sin(α))ᵀ, α ∈ [0, 2π), and let ξ ∼ unif[0, 2π]. (See Figures 3 and 4.)
The intersection is given by C = B(0, 1 − ρ). We show next that sets with this configuration do not satisfy (14). To see this we show that there is a sequence (x_n)_n ⊂ R² with P(x_n ∈ C_ξ) → 1 as n → ∞. By Theorem 4.3 we conclude that (14) cannot hold. Indeed, let x = x(λ) = λ(1, 0)ᵀ with λ ≥ 1 − ρ; then

P(x ∈ C_ξ) = (1/2π) ∫_0^{2π} 1_{‖x − ρe_α‖ ≤ 1} dα = (1/2π) ∫_0^{2π} 1_{λ² + ρ² − 2λρ cos(α) ≤ 1} dα = (1/2π) ∫_{−β}^{β} dα = β/π,

where β = β(λ) = cos^{-1}( (λ² + ρ² − 1)/(2λρ) ), if λ ≤ 1 + ρ. We have β(λ) → π as λ ↓ 1 − ρ, so P(x(λ) ∈ C_ξ) → 1 as λ ↓ 1 − ρ.

In contrast to the case where ρ ∈ (0, 1), the cases ρ = 0 and ρ = 1 do satisfy (14). This will not be shown here.

4.2. RFI for two families of mappings

The set feasibility examples above lead very naturally to the more general context of mappings T_i : G → G, i ∈ I, and S_j : G → G, j ∈ J, on a metric space (G, d), where
I, J are arbitrary index sets. Here we envision the scenario where

C_T := { x ∈ G | P(x ∈ Fix T_ξ) = 1 } and C_S := { x ∈ G | P(x ∈ Fix S_ζ) = 1 }

are distinctly different sets, possibly nonintersecting. Let (Ω, F, P) be a probability space and let ξ : Ω → I, ζ : Ω → J be two random variables. Let (ξ_n)_{n∈N} be an i.i.d. sequence with ξ_n =_d ξ, and (ζ_n)_{n∈N} i.i.d. with ζ_n =_d ζ. The two sequences are assumed to be independent of each other. Let µ be a probability measure on (G, B(G)). Consider the stochastic selection method for two families of mappings, Algorithm 2.
Algorithm 2: RFI for two families of mappings
Initialization: X_0 ∼ µ
for k = 0, 1, 2, ... do
    X_{k+1} = S_{ζ_k} T_{ξ_k} X_k
return (X_k)_{k∈N}

Note that this structure of two families of mappings is a special case of Algorithm 1: just set T̃_{(i,j)} = S_j T_i, where (i, j) ∈ Ĩ = I × J and ξ̃ = (ξ, ζ). Also the Markov chain property is still satisfied; the transition kernel takes the form p(x, A) = P(S_ζ T_ξ x ∈ A) for x ∈ G and A ∈ B(G). An advantage of this formulation is that properties of the two families {S_j}_{j∈J} and {T_i}_{i∈I} can be analyzed more specifically, and independently. As long as the mapping T̃ satisfies the properties needed for convergence of the RFI, then convergence of the RFI for two families of mappings follows. At the very least, we need

C := { x ∈ G | P(x ∈ Fix T̃_{ξ̃}) = 1 } ≠ ∅.

From this it is easy to see that for convergence the set C_T could be empty, but the set C_S must be nonempty.

Example 4.10 (consistent feasibility). Revisit Example 4.6. We had C_1 = R_+ × R and C_2 = R × R_+. Set I = {1} and J = {2}; then the algorithm is the deterministic alternating projections method. One has P(X_1 ∈ C) = 1 for all initial distributions.

Example 4.11 (inconsistent stochastic feasibility). In this example we show that the framework established here is not exclusively limited to consistent feasibility. Consider the (trivially convex, nonempty and closed) set S := {(0, 2)} together with the collection of balls in Example 4.9, C_α := B(ρe_α, 1) ⊂ R², where 0 ≤ ρ < 1, e_α = (cos(α), sin(α))ᵀ, α ∈ [0, 2π), and let ξ ∼ unif[0, 2π]. The intersection of the disks is given by C_T = B(0, 1 − ρ), where T_α := P_{C_α} is the metric projection onto C_α. Although S ∩ C_α = ∅ for all α ∈ [0, 2π), still the fixed point set for the mapping in Algorithm 2 (where S_ζ ≡ P_S) is C = {(0, 2)}, and this is found after one iteration for any initial probability distribution µ, where X_0 ∼ µ.

This is indeed a special example, but it points to the richness of inconsistent stochastic feasibility, which will be studied in greater depth in a follow-up paper.

4.3. Linear operator equations

There are several applications of the RFI to the feasibility problem [1], [4]. We want to focus first on linear operator equations in the separable Hilbert space H = L²([a,b]). Let T : H → H be a bounded linear operator; we want to find x ∈ H such that

T x = g,

for a given g ∈ H. Clearly this is possible only if g ∈ R(T), the range of T. The idea in [4] to solve T x = g is to consider the family of evaluation mappings φ_t : H → R, t ∈ [a,b], which are given by

φ_t(x) := (T x)(t).

Define C_t := { x ∈ H | φ_t(x) = g(t) }, t ∈ [a,b]. Consider the probability space (Ω, F, P) = ([a,b], B([a,b]), λ/(b−a)), where λ is the Lebesgue measure. Let ξ : (Ω, F, P) → ([a,b], B([a,b])) be a random variable with P_ξ = P = λ/(b−a). Then for g ∈ R(T) we have that T x = g if and only if x ∈ C := { y ∈ H | P(y ∈ C_ξ) = 1 }. So the linear operator equation becomes a stochastic feasibility problem.

In order to be able to compute projections onto the sets C_t, t ∈ [a,b], we need the evaluation functionals φ_t to be continuous, i.e. ‖φ_t‖ < ∞ for almost all t ∈ [a,b]. By the Riesz representation theorem there exists a unique u_t ∈ H with φ_t(x) = ⟨u_t, x⟩ for all x ∈ H and almost all t ∈ [a,b]. We conclude that the projection onto C_t takes the form

P_t x = x + ((g(t) − (Tx)(t)) / ‖u_t‖²) u_t,   x ∈ L²([a,b]).
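A discretized version of this projection method can be sketched as a randomized Kaczmarz-type iteration; the kernel, grid, and step details below are our own assumptions for illustration, not the paper's prescription:

```python
import numpy as np

rng = np.random.default_rng(8)
m = 200                                        # grid on [a, b] = [0, 1]
s = np.linspace(0.0, 1.0, m)
w = 1.0 / m                                    # quadrature weight for L^2 inner products

K = np.exp(-np.abs(s[:, None] - s[None, :]))   # hypothetical kernel K(t, s)
x_true = np.sin(2 * np.pi * s)
g = w * K @ x_true                             # consistent right-hand side g = T x_true

x = np.zeros(m)
for _ in range(20_000):
    t = rng.integers(m)                        # xi_k ~ unif[a, b], discretized
    u = K[t]                                   # Riesz representer u_t = K(t, .)
    x += (g[t] - w * u @ x) / (w * u @ u) * u  # P_t x = x + (g(t)-(Tx)(t)) u_t/||u_t||^2
print(np.max(np.abs(w * K @ x - g)))           # residual of T x = g is driven toward 0
```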
Example 4.12 (linear integral equations). Concretely, consider an integral equation of the first kind in the separable Hilbert space L²([a,b]),

(Tx)(t) = ∫_a^b K(t,s) x(s) ds = g(t),  t ∈ [a,b],

with g ∈ L²([a,b]). For K ∈ L²([a,b] × [a,b]), T is a continuous linear compact operator [18, Theorem 8.15] (a Hilbert–Schmidt operator). For the Riesz representation of the evaluation functionals we have that φ_t(x) = (Tx)(t) = ⟨u_t, x⟩, t ∈ [a,b], with u_t = K(t, ·), and hence ‖φ_t‖ ≤ ‖K(t, ·)‖ < ∞.

Example 4.13 (differentiation). Let K(t,s) = 1_{[a,t]}(s) = u_t(s), i.e. (Tx)(t) = ∫_a^t x(s) ds, and suppose g ∈ C¹([a,b]); then T x = g if and only if x = g′ almost surely and g(a) = 0. The projectors take the form

P_t x = x + ((g(t) − ∫_a^t x(s) ds) / (t − a)) 1_{[a,t]}.

A. Appendix
Lemma A.1 (slices of product σ-field, see Proposition 3.3.2 in [19]). Let (Ω_i, F_i), i = 1, 2, be two measurable spaces and M ∈ F_1 ⊗ F_2. Then for ω_1 ∈ Ω_1 it holds that M_{ω_1} := { ω_2 ∈ Ω_2 | (ω_1, ω_2) ∈ M } ∈ F_2.

Theorem A.2 (dense sets in separable metric space). Let (G, d) be a Polish space (complete separable metric space). Then for any A ⊂ G there is a countable subset {a_n}_{n∈N} ⊂ A that is dense in A, and if A is closed then even A = cl{a_n} (cl U denotes the closure of the set U ⊂ G w.r.t. the metric d).

Proof. Since G is separable there exists a dense and countable subset {u_n}_{n∈N} ⊂ G with G = cl{u_n}_n. By denseness of {u_n} ⊂ G, for any x ∈ G and any ε > 0
For $\varepsilon > 0$ choose $a_n^{\varepsilon} \in B(u_n, \varepsilon) \cap A$, $n \in \mathbb{N}$, whenever the intersection is nonempty. The set $\tilde{A} := \{a_n^{1/m}\}_{n,m\in\mathbb{N}} \subset A$ is nonempty and countable as a countable union of countable sets. For any $a \in A$ and any $\varepsilon > 0$ there exist $n, m$ with $1/m < \varepsilon/2$ and $d(a, u_n) < 1/m$, hence
$$d(a, a_n^{1/m}) \leq d(a, u_n) + d(u_n, a_n^{1/m}) < \frac{2}{m} < \varepsilon,$$
i.e. $\tilde{A} \subset A$ is dense in $A$. Thus $A \subset \operatorname{cl}\tilde{A}$, and if $A$ is closed, then also $\operatorname{cl}\tilde{A} \subset A$. □

Theorem A.3 (support of a measure). Let $(G, d)$ be a Polish space and $\mathcal{B}(G)$ its Borel σ-algebra. Let $\pi$ be a measure on $(G, \mathcal{B}(G))$ and define its support via
$$\operatorname{supp} \pi = \{ x \in G \mid \pi(B(x, \varepsilon)) > 0 \ \forall\, \varepsilon > 0 \}.$$
Then:
1. $\operatorname{supp} \pi \neq \emptyset$ if $\pi \neq 0$.
2. $\operatorname{supp} \pi$ is closed.
3. $\pi(A) = \pi(A \cap \operatorname{supp} \pi)$ for all $A \in \mathcal{B}(G)$, i.e. $\pi((\operatorname{supp} \pi)^c) = 0$.
4. For any closed set $S \subset G$ with $\pi(A \cap S) = \pi(A)$ for all $A \in \mathcal{B}(G)$ it holds that $\operatorname{supp} \pi \subset S$.

Proof.
1. If $\pi(G) > 0$, then by separability one can find, for any $\varepsilon > 0$, a countable cover of $G$ by balls of radius $\varepsilon$, at least one of which must have nonzero measure, because $0 < \pi(G) \leq \sum_n \pi(B(x_n, \varepsilon))$. Consider such a ball $B(x_1, \varepsilon_1)$ with $\pi(B(x_1, \varepsilon_1)) > 0$. Choosing $\varepsilon_{n+1} < \varepsilon_n/2$ iteratively, one obtains a sequence $\varepsilon_n \to 0$ and balls of positive measure with $B(x_{n+1}, \varepsilon_{n+1}) \subset B(x_n, \varepsilon_n)$, so that $x_n \to x$ and $x \in \operatorname{supp}\pi$.

2. Let $(x_n)_{n\in\mathbb{N}} \subset \operatorname{supp}\pi$ with $x_n \to x$ as $n \to \infty$. Let $\varepsilon > 0$ and choose $N$ such that $d(x_n, x) < \varepsilon$ for all $n \geq N$. Then $x_n \in B(x, \varepsilon)$ and there exists $\tilde{\varepsilon} > 0$ with $B(x_n, \tilde{\varepsilon}) \subset B(x, \varepsilon)$, so we get
$$\pi(B(x, \varepsilon)) \geq \pi(B(x_n, \tilde{\varepsilon})) > 0,$$
i.e. $x \in \operatorname{supp}\pi$.

3. Write $S = (\operatorname{supp}\pi)^c$. Choose $\{x_n\}_{n\in\mathbb{N}} \subset S$ dense. By openness of $S$ there exist $\varepsilon_n > 0$ with $B(x_n, \varepsilon_n) \subset S$, hence $S = \bigcup_{n\in\mathbb{N}} B(x_n, \varepsilon_n)$ and
$$\pi(S) \leq \sum_{n\in\mathbb{N}} \pi(B(x_n, \varepsilon_n)) = 0.$$
(It holds that $\pi(B(x_n, \varepsilon_n)) = 0$, because otherwise one could find, for any small enough $\varepsilon > 0$, a countable cover of $B(x_n, \varepsilon_n)$ by balls of radius $\varepsilon$, at least one of which must have nonzero measure. Since this holds for all $\varepsilon$, arguing as in 1. yields a point of $\operatorname{supp}\pi$ in $B(x_n, \varepsilon_n)$, in contradiction to $B(x_n, \varepsilon_n) \subset S$.)

4. Let $x \in \operatorname{supp}\pi$. Then $\pi(B(x, \varepsilon) \cap S) = \pi(B(x, \varepsilon)) > 0$ for all $\varepsilon > 0$, i.e. $B(x, \varepsilon) \cap S \neq \emptyset$ for all $\varepsilon > 0$. Let $x_n \in B(x, \varepsilon_n) \cap S$, where $\varepsilon_n \to 0$ as $n \to \infty$. Then $x_n \to x$, and by closedness of $S$, $x \in S$. □

Theorem A.4 (construction of an invariant measure). Let $(\mu P^n)_{n\in\mathbb{N}}$ be a tight family of probability measures on a Polish space $(G, d)$, i.e. for any $\varepsilon > 0$ there exists a compact set $K_\varepsilon \subset G$ with $(\mu P^n)(G \setminus K_\varepsilon) < \varepsilon$ for all $n \in \mathbb{N}$, where $\mu \in \mathscr{P}(G)$ and $P$ is the Markov operator for a given transition kernel $p$, which is assumed to be Feller. Then any cluster point of $(\nu_n)$, where $\nu_n = \frac{1}{n} \sum_{i=1}^{n} \mu P^i$, is an invariant measure for $P$.

Proof. This is basically [20, Theorem 1.10]. The tightness of $(\mu P^n)$ implies tightness of $(\nu_n)$, and therefore there exists a weakly converging subsequence $(\nu_{n_k})$ with limit $\pi \in \mathscr{P}(G)$ by Prokhorov's Theorem. By the Feller property of $P$, for any continuous and bounded $f: G \to \mathbb{R}$ the function $P f$ is also continuous and bounded, and hence
$$|(\pi P) f - \pi f| = |\pi(P f) - \pi f| = \lim_k |\nu_{n_k}(P f) - \nu_{n_k} f| = \lim_k \frac{1}{n_k} \left| \mu P^{n_k + 1} f - \mu P f \right| \leq \lim_k \frac{2 \|f\|_\infty}{n_k} = 0. \qquad \square$$
(A small numerical illustration of this construction is given after Lemma A.6 below.)

Theorem A.5 (conditional expectation: basics, see Theorem 5.1 in [7]). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X$ a real-valued random variable with $\mathbb{E}|X| < \infty$ (integrable). Let $\mathcal{F}_0 \subset \mathcal{F}$ be a sub-σ-algebra. Then there exists an a.s. unique $\mathcal{F}_0$-measurable random variable $Z := \mathbb{E}[X \mid \mathcal{F}_0]$ with $\mathbb{E}(Z \mathbb{1}_A) = \mathbb{E}(X \mathbb{1}_A)$ for all $A \in \mathcal{F}_0$.
Let $Y, X_n$ be integrable random variables. Further properties are:
1. $\mathbb{E}(\mathbb{E}[X \mid \mathcal{F}_0]) = \mathbb{E} X$.
2. If $X$ is $\mathcal{F}_0$-measurable, then $\mathbb{E}[X \mid \mathcal{F}_0] = X$ a.s.
3. If $X$ is independent of $\mathcal{F}_0$, then $\mathbb{E}[X \mid \mathcal{F}_0] = \mathbb{E} X$ a.s.
4. $\mathbb{E}[a X + b Y \mid \mathcal{F}_0] = a\, \mathbb{E}[X \mid \mathcal{F}_0] + b\, \mathbb{E}[Y \mid \mathcal{F}_0]$ a.s. for all $a, b \in \mathbb{R}$.
5. If $X \leq Y$, then $\mathbb{E}[X \mid \mathcal{F}_0] \leq \mathbb{E}[Y \mid \mathcal{F}_0]$ a.s.
6. If $0 \leq X_n \uparrow X$, then $\mathbb{E}[X_n \mid \mathcal{F}_0] \uparrow \mathbb{E}[X \mid \mathcal{F}_0]$ a.s.
7. If $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \mathcal{F}$ are σ-algebras, then $\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{F}_2] \mid \mathcal{F}_1\big] = \mathbb{E}[X \mid \mathcal{F}_1]$.

Lemma A.6 (Satz 17.11 in [21]). Let $(\Omega, \mathcal{F})$ be a measurable space and $\mu$ a σ-finite measure. Let $f: \Omega \to [0, \infty]$ and set $\nu = f \cdot \mu$ (i.e. $\nu(A) = \int_A f \,\mathrm{d}\mu$ for $A \in \mathcal{F}$). Then $f$ is $\mu$-a.s. unique. Furthermore, $\nu$ is σ-finite (i.e. there exist $(\Omega_n)_{n\in\mathbb{N}} \subset \mathcal{F}$ with $\nu(\Omega_n) < \infty$ and $\Omega_n \uparrow \Omega$) if and only if $f$ is real-valued $\mu$-a.s.
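As announced after the proof of Theorem A.4, the following sketch illustrates the Cesàro-average construction on a finite state space, where $\mu P^n$ and $\nu_n$ are plain vector-matrix products, tightness is automatic, and the Feller property is trivial. The three-state kernel below is an arbitrary assumption made purely for illustration.

```python
import numpy as np

# A Markov transition matrix on three states (rows sum to 1); any
# irreducible example will do -- this particular kernel is made up.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])     # initial distribution

# Cesaro averages nu_n = (1/n) * sum_{i=1}^n mu P^i, as in Theorem A.4
nu = np.zeros(3)
muPi = mu.copy()
for i in range(1, 10001):
    muPi = muPi @ P                # mu P^i
    nu += (muPi - nu) / i          # running average

print(nu)          # cluster point of (nu_n)
print(nu @ P)      # invariance: nu P = nu (up to numerical error)
```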
Remark A.7: A nonnegative real-valued random variable $X$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ induces a σ-finite measure $\nu = X \cdot \mathbb{P}$.

Theorem A.8 (conditional expectation: extension). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X \geq 0$ a real-valued random variable. Let $\mathcal{F}_0 \subset \mathcal{F}$ be a sub-σ-algebra. Then there exists an a.s. unique nonnegative real-valued random variable $Z := \mathbb{E}[X \mid \mathcal{F}_0]$ on $(\Omega, \mathcal{F}_0)$ with $\mathbb{E}(Z \mathbb{1}_A) = \mathbb{E}(X \mathbb{1}_A)$ for all $A \in \mathcal{F}_0$.
If additionally $Y$ and $(X_n)$ are nonnegative and real-valued, then all items 1.–7. in Theorem A.5 are satisfied for these.

Proof. From Remark A.7 follows the existence of disjoint sets $\Omega_n \in \mathcal{F}$ with $\bigcup_n \Omega_n = \Omega$ and the property that $\int_{\Omega_n} X \,\mathrm{d}\mathbb{P} < \infty$. One has that a.s.
$$\mathbb{1}_{\Omega_n}\, \mathbb{E}[X \mid \mathcal{F}_0 \cap \Omega_n] = \mathbb{E}[X \mathbb{1}_{\Omega_n} \mid \mathcal{F}_0 \cap \Omega_n] = \mathbb{E}[X \mathbb{1}_{\Omega_n} \mid \mathcal{F}_0].$$
Define $Z := \sum_n \mathbb{E}[X \mathbb{1}_{\Omega_n} \mid \mathcal{F}_0]$; then $Z = \mathbb{E}[X \mid \mathcal{F}_0]$. Items 1.–7. now follow from Theorem A.5 on $\Omega_n$ and the Monotone Convergence Theorem. □

Theorem A.9 (disintegration). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ be measurable spaces and $\mathcal{F}_0 \subset \mathcal{F}$ a sub-σ-algebra. Let $X_1 \in \Omega_1$ have a regular version $\mu_1$ of $\mathbb{P}(X_1 \in \cdot \mid \mathcal{F}_0)$ and let $X_2 \in \Omega_2$ be $\mathcal{F}_0$-measurable. Let furthermore $f: \Omega_1 \times \Omega_2 \to \mathbb{R}_+$ be measurable. Then
$$\mathbb{E}[f(X_1, X_2) \mid \mathcal{F}_0] = \int f(x_1, X_2)\, \mu_1(\mathrm{d} x_1) \quad \text{a.s.}$$

Lemma A.10.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $(X_k)_{k\in\mathbb{N}}$, $(U_k)_{k\in\mathbb{N}}$ be sequences of nonnegative real-valued random variables with $X_k$ $\mathcal{F}_k$-measurable, where $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \dots \subset \mathcal{F}$ are σ-algebras. Suppose for all $k \in \mathbb{N}$
$$X_{k+1} \leq X_k - U_k \quad \text{a.s.}$$
Define $V_k := \mathbb{E}[U_k \mid \mathcal{F}_k]$ for $k \in \mathbb{N}$. Then $X_k \to X$ a.s. and $\sum_k U_k < \infty$, $\sum_k V_k < \infty$ a.s.

Proof. This is a special instance of the more general supermartingale convergence theorem in [22]. □
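As a sanity check on Lemma A.10, the following sketch simulates a recursion of the required form. The specific choice $U_k = W_k X_k$ with $W_k \sim \operatorname{unif}[0, 1/2]$ independent is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy recursion satisfying the hypotheses of Lemma A.10:
# X_{k+1} = X_k - U_k with U_k = W_k * X_k, W_k ~ unif[0, 1/2] independent,
# so X_k, U_k >= 0, X_k is F_k-measurable, and V_k = E[U_k | F_k] = X_k / 4.
x = 1.0
sum_u, sum_v = 0.0, 0.0
for k in range(1000):
    sum_v += x / 4.0                # V_k = E[U_k | F_k]
    u = rng.uniform(0.0, 0.5) * x   # U_k
    x, sum_u = x - u, sum_u + u

print(x, sum_u, sum_v)   # X_k converges; both partial sums remain bounded
```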
Lemma A.11 (further properties of $R$). Under the standing assumptions, suppose that $T_i = P_i$ are projectors onto nonempty, closed and convex sets $C_i$, $i \in I$. Then the following hold:
1. $R$ is convex.
2. $R$ is continuously differentiable with $\nabla R(x) = 2\big(x - \mathbb{E}[P_\xi x]\big)$ for all $x \in H$.
3. $\nabla R$ is globally Lipschitz continuous with constant not larger than $4$.
4. $C = \{\nabla R = 0\}$.

Proof.
1. The function $x \mapsto \operatorname{dist}(x, C_i)$ is convex for all $i \in I$, since $C_i = \operatorname{Fix} P_i$ is convex, nonempty and closed. On $[0, \infty)$ the function $t \mapsto t^2$ is increasing and convex, so $x \mapsto \operatorname{dist}^2(x, C_i)$ is convex, $i \in I$. The convexity of $R$ follows by linearity of the expectation.

2. We need to show that
$$\lim_{0 \neq \|y\| \to 0} \frac{\left| R(x+y) - R(x) - 2\,\mathbb{E}[\langle x - P_\xi x, y \rangle] \right|}{\|y\|} = 0.$$
Let $(y_n) \subset B(0, \varepsilon) \subset H$ with $y_n \to 0$. Define a sequence of functions on $\Omega$ via
$$f_n = \frac{\left| \operatorname{dist}^2(x + y_n, C_\xi) - \operatorname{dist}^2(x, C_\xi) - 2\langle x - P_\xi x, y_n \rangle \right|}{\|y_n\|}.$$
Then for fixed $\omega \in \Omega$ we have $f_n(\omega) \to 0$ as $n \to \infty$, since the function $x \mapsto \operatorname{dist}^2(x, C_i)$ is Fréchet differentiable for all $i \in I$ [8, Corollary 12.30]. Furthermore, we find for any $n \in \mathbb{N}$ that
$$f_n = \frac{\left| \|x + y_n - P_\xi(x+y_n)\|^2 - \|x - P_\xi x\|^2 - 2\langle x - P_\xi x, y_n \rangle \right|}{\|y_n\|} = \frac{\left| \|y_n - P_\xi(x+y_n) + P_\xi x\|^2 + 2\langle x - P_\xi x, P_\xi x - P_\xi(x+y_n) \rangle \right|}{\|y_n\|} \leq \frac{\|y_n\|^2 + \|P_\xi x - P_\xi(x+y_n)\|^2 + 2\left|\langle y_n + x - P_\xi x,\, P_\xi x - P_\xi(x+y_n) \rangle\right|}{\|y_n\|} \leq \frac{4\|y_n\|^2 + 2\operatorname{dist}(x, C_\xi)\|y_n\|}{\|y_n\|} \leq 4\varepsilon + 2\operatorname{dist}(x, C_\xi) =: g,$$
where, in the second inequality, we used the nonexpansivity of the projectors $P_i$, $i \in I$, and the Cauchy--Schwarz inequality. In particular, with Hölder's inequality it follows that $\mathbb{E}[g] \leq 4\varepsilon + 2\sqrt{R(x)}$, i.e. $g$ is integrable, and hence Lebesgue's Dominated Convergence Theorem yields $\mathbb{E}[f_n] \to 0$, which gives us Fréchet differentiability of $R$ with derivative $\nabla R(x) = 2\,\mathbb{E}[x - P_\xi x]$. Continuity of $\nabla R$ follows from
$$\|\nabla R(x+y) - \nabla R(x)\| = 2\,\|\mathbb{E}[y - P_\xi(x+y) + P_\xi x]\| \leq 2\,\mathbb{E}\big[\|y\| + \|P_\xi(x+y) - P_\xi x\|\big] \leq 4\|y\|,$$
where we used the nonexpansivity of the projectors $P_i$, $i \in I$, in the second inequality.

3. For any $x, y \in H$ it holds that $\|\nabla R(x) - \nabla R(y)\| \leq 2\big(\|x - y\| + \|\mathbb{E}[P_\xi x - P_\xi y]\|\big)$. Applying Jensen's inequality and nonexpansivity, we arrive at the desired result.

4. Clearly, if $x \in C$, then $x = P_\xi x$ a.s. and so $x = \mathbb{E}[P_\xi x]$, i.e. $\nabla R(x) = 0$ by 2. Conversely, if $\nabla R(x) = 0$, then by convexity $R(x) - R(y) \leq \langle \nabla R(x), x - y \rangle = 0$ for all $y \in H$. Since $C \neq \emptyset$ there is $y \in H$ with $R(y) = 0$, so also $R(x) = 0$, i.e. $x \in C$. □
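Item 2 of Lemma A.11 can be checked numerically. The sketch below estimates $R(x) = \mathbb{E}[\operatorname{dist}^2(x, C_\xi)]$ and $\nabla R(x) = 2\,\mathbb{E}[x - P_\xi x]$ by Monte Carlo for the family of disks from Example 4.11 and compares the gradient against a central finite-difference quotient; the sample size, the offset $\rho$ and the test point are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.5

def proj(x, alpha):
    """Projection onto C_alpha = B(rho * e_alpha, 1)."""
    c = rho * np.array([np.cos(alpha), np.sin(alpha)])
    d = x - c
    nd = np.linalg.norm(d)
    return x if nd <= 1.0 else c + d / nd

alphas = rng.uniform(0.0, 2.0 * np.pi, size=20000)  # common random numbers

def R(x):
    """Monte Carlo estimate of E[dist^2(x, C_xi)]."""
    return np.mean([np.sum((x - proj(x, a))**2) for a in alphas])

def gradR(x):
    """Monte Carlo estimate of 2 E[x - P_xi x]."""
    return 2.0 * np.mean([x - proj(x, a) for a in alphas], axis=0)

x = np.array([3.0, -1.0])
h = 1e-5
fd = np.array([(R(x + h*e) - R(x - h*e)) / (2*h) for e in np.eye(2)])
print(gradR(x), fd)   # the two estimates should approximately agree
```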
B. Paracontractions

Paracontractions include the set of averaged operators, but averaged mappings possess more useful regularity properties: for example, a composition of averaged operators is again averaged, whereas for nonaveraged paracontractions this is not clear in general. An example of a nonaveraged paracontraction on $\mathbb{R}$ is the Huber function with parameter $\alpha > 0$,
$$f_\alpha(x) := \begin{cases} \dfrac{x^2}{2\alpha}, & |x| \leq \alpha, \\[4pt] |x| - \dfrac{\alpha}{2}, & |x| > \alpha, \end{cases} \qquad x \in \mathbb{R}.$$
We have that $f_\alpha$ is nonexpansive and paracontractive, but not averaged: for $x = -2\alpha$ and $y = -\alpha$ one has $f_\alpha(x) = \frac{3\alpha}{2}$ and $f_\alpha(y) = \frac{\alpha}{2}$. Consequently $|f_\alpha(x) - f_\alpha(y)| = \alpha = |x - y|$, but
$$|x - f_\alpha(x) - (y - f_\alpha(y))| = 2\alpha \neq 0,$$
whereas for an averaged mapping the equality $|f(x) - f(y)| = |x - y|$ forces $x - f(x) = y - f(y)$. (A numerical check of this example is sketched after Definition B.1 below.)

In general metric spaces with nonlinear structure, averaged mappings are not defined, or at least demand a different definition, but the paracontraction framework still applies there and provides a useful description of mappings for which the RFI converges to a common fixed point. Paracontractions were used in Section 3.1 to guarantee Fejér monotonicity, yielding convergence; averagedness in this context would be too strong an assumption (and is actually not defined). In the following we provide an example of a class of paracontracting operators in $\mathbb{R}^n$ that are not in general averaged: resolvents of quasiconvex functions.

Definition B.1. A function $f: \mathbb{R}^n \to \mathbb{R}$ is called quasiconvex if the sublevel sets $\{x \in \mathbb{R}^n \mid f(x) \leq \alpha\}$ are convex for all $\alpha \in \mathbb{R}$. Equivalently, $f$ satisfies
$$f(\lambda x + (1 - \lambda) y) \leq \max\{f(x), f(y)\} \qquad \forall\, x, y \in \mathbb{R}^n,\ \forall\, \lambda \in [0, 1].$$
The proximity operator of a function $f: \mathbb{R}^n \to \mathbb{R}$ is given by the set-valued mapping
$$\operatorname{prox}_f(x) := \operatorname*{argmin}_{y \in \mathbb{R}^n} \left\{ f(y) + \frac{1}{2} \|x - y\|^2 \right\}, \qquad x \in \mathbb{R}^n.$$
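As promised above, here is a quick numerical check of the Huber-type example; the choice $\alpha = 1$ and the grid of test points are arbitrary.

```python
import numpy as np

alpha = 1.0

def f(x):
    """The Huber-type map: nonexpansive paracontraction, not averaged."""
    return np.where(np.abs(x) <= alpha, x**2 / (2*alpha), np.abs(x) - alpha/2)

x, y = -2*alpha, -alpha
print(abs(f(x) - f(y)), abs(x - y))     # both equal alpha ...
print(abs((x - f(x)) - (y - f(y))))     # ... yet Id - f separates them: 2*alpha

# Paracontractivity toward Fix f = {0}: |f(x)| < |x| for all x != 0.
grid = np.linspace(-5.0, 5.0, 1001)
grid = grid[grid != 0.0]
print(np.all(np.abs(f(grid)) < np.abs(grid)))   # True
```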
Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice differentiable and quasiconvex, satisfying $S := \operatorname{argmin} f \neq \emptyset$ and $\nabla f \neq 0$ on $\mathbb{R}^n \setminus S$; suppose furthermore that $\operatorname{Id} + \operatorname{Hess} f(x)$ is positive definite for all $x \in \mathbb{R}^n$. Then $\operatorname{prox}_f$ is paracontracting.

Proof. Denote $A := \operatorname{Id} + \nabla f$. Let $x, y \in \mathbb{R}^n$ with $f(x) \geq f(y)$; then
$$\|A(x) - y\|^2 = \|x - y\|^2 + \|\nabla f(x)\|^2 + 2\langle \nabla f(x), x - y \rangle \geq \|x - y\|^2,$$
where we used that, as shown in [24], a quasiconvex and differentiable function satisfies
$$f(x) \geq f(y) \implies \langle \nabla f(x), x - y \rangle \geq 0$$
for any $x, y \in \mathbb{R}^n$. Note that if $x \notin S$ then $\nabla f(x) \neq 0$ by assumption, and hence for $y \in \mathbb{R}^n$ with $f(y) \leq f(x)$ it holds that
$$\|A(x) - y\| > \|x - y\|. \tag{19}$$
Moreover, the function $g(y) := f(y) + \frac{1}{2}\|x - y\|^2$ for fixed $x \in \mathbb{R}^n$ is bounded from below (since $\inf_x f(x) > -\infty$ by assumption) and coercive. From positive definiteness of $\operatorname{Id} + \operatorname{Hess} f$ we have that $g$ is also twice continuously differentiable and strictly convex, hence it possesses a unique minimizer $\bar{x}$ that satisfies
$$x = \nabla f(\bar{x}) + \bar{x} = A(\bar{x});$$
it follows that $A(\mathbb{R}^n) = \mathbb{R}^n$, i.e. $A$ is surjective. Furthermore, $A$ is injective, since from uniqueness of the minimizer and sufficiency of the first-order optimality criterion for $\bar{x}$ to be a minimizer ($g$ is convex) it follows that, if $A(\bar{x}) = A(\bar{y})$, then $\bar{x} = \bar{y}$ is the minimizer of $g$; in particular $A(\bar{x}) = x \iff \operatorname{prox}_f(x) = \bar{x}$. To show that $\operatorname{prox}_f$ is also continuous, fix $x \in \mathbb{R}^n$ and let $y \in B(x, \varepsilon)$. We can find a $z \in B(x, \varepsilon)$ with $f(z) \leq f(y)$ for all $y \in B(x, \varepsilon)$ by continuity of $f$, so we get with (19) that
$$\left\| \operatorname{prox}_f(x) - \operatorname{prox}_f(y) \right\| \leq \left\| \operatorname{prox}_f(x) - z \right\| + \left\| \operatorname{prox}_f(y) - z \right\| < \|x - z\| + \|y - z\| < 3\varepsilon.$$
In particular, letting $y = \bar{x} \in S$ in (19), we have that
$$\left\| \operatorname{prox}_f(x) - \bar{x} \right\| < \|x - \bar{x}\| \qquad \forall\, x \in \mathbb{R}^n \setminus S,$$
where $S = \operatorname{argmin} f = \operatorname{Fix} \operatorname{prox}_f$. □

Example B.3 (non-averaged resolvent of a quasiconvex function). The function $f(x) := 1 - \exp\left(-\|x\|^2\right)$ for $x \in \mathbb{R}^n$ satisfies all the conditions in Lemma B.2. Its proximity operator has the derivative $(\operatorname{prox}_f)'(A(x)) = (A'(x))^{-1}$, where $A(x) = \left(1 + 2\exp\left(-\|x\|^2\right)\right) x$, i.e.
$$A'(x) = \left(1 + 2\exp\left(-\|x\|^2\right)\right) \operatorname{Id} - 4\exp\left(-\|x\|^2\right) x x^T.$$
Since $\left\| (\operatorname{prox}_f)'(A(x)) \right\| \geq \|y\| / \|A'(x) y\|$ for any $y \in \mathbb{R}^n \setminus \{0\}$, we have with $x = e_1 = (1, 0, \dots, 0)^T = y$ that $\left\| (\operatorname{prox}_f)'(A(e_1)) \right\| > 1$,
which is in contradiction to the nonexpansiveness of averaged mappings, whose derivative, where it exists, is bounded in norm by $1$.

Paracontractions also occur in nonconvex feasibility problems, both consistent and inconsistent. As long as the fixed point set of the averaged projections operator consists of isolated points and the projectors are single-valued in a neighborhood of this fixed point, [25, Theorem 3.2] shows that these operators are paracontractions, whenever all assumptions of the theorem are met. A statement on paracontractiveness in the case that the fixed point set of the averaged projections operator does not consist of isolated points is, however, not possible in general. Furthermore, nonconvex forward-backward operators appearing in structured optimization of nonconvex objective functions also exhibit the paracontraction property, see [25, Proposition 3.9]; these are not averaged in general, though the assumption that the fixed points are isolated is again used.
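Example B.3 is easy to verify numerically. The sketch below evaluates $\operatorname{prox}_f$ by Newton's method on the optimality condition $A(\bar{x}) = x$ (a solver choice made here for illustration; the example does not prescribe one) and estimates the local expansion of $\operatorname{prox}_f$ near $A(e_1)$ by finite differences; step sizes and iteration counts are arbitrary.

```python
import numpy as np

def A(x):
    """A = Id + grad f for f(x) = 1 - exp(-||x||^2)."""
    return (1.0 + 2.0 * np.exp(-x @ x)) * x

def A_jac(x):
    """Jacobian A'(x) = (1 + 2 e^{-||x||^2}) Id - 4 e^{-||x||^2} x x^T."""
    e = np.exp(-x @ x)
    return (1.0 + 2.0 * e) * np.eye(x.size) - 4.0 * e * np.outer(x, x)

def prox_f(z, iters=50):
    """Solve A(xbar) = z by Newton's method; then prox_f(z) = xbar."""
    x = z.copy()
    for _ in range(iters):
        x = x - np.linalg.solve(A_jac(x), A(x) - z)
    return x

e1 = np.array([1.0, 0.0])
z = A(e1)
# Norm of (A'(e1))^{-1} applied to e1: 1 / (1 - 2/e), roughly 3.78 > 1
print(np.linalg.norm(np.linalg.solve(A_jac(e1), e1)))
# Finite-difference expansion of prox_f at z in direction e1
h = 1e-6
print(np.linalg.norm(prox_f(z + h*e1) - prox_f(z - h*e1)) / (2*h))
```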
References

[1] D. Butnariu and S. D. Flåm, "Strong convergence of expected-projection methods in Hilbert spaces," Numerical Functional Analysis and Optimization, vol. 16, no. 5-6, pp. 601–636, 1995.
[2] A. Nedić, "Random algorithms for convex minimization problems," Mathematical Programming, vol. 129, no. 2, pp. 225–253, 2011.
[3] S. D. Flåm, "Successive averages of firmly nonexpansive mappings," Mathematics of Operations Research, vol. 20, no. 2, pp. 497–512, 1995.
[4] D. Butnariu, "The expected-projection method: Its behavior and applications to linear operator equations and convex optimization," Journal of Applied Analysis, vol. 1, no. 1, pp. 93–108, 1995.
[5] A. Nedić, "Random projection algorithms for convex set intersection problems," pp. 7655–7660, 2010.
[6] D. R. Luke, M. Teboulle, and N. H. Thao, "Necessary conditions for linear convergence of Picard iterations and application to alternating projections," arXiv, 2017.
[7] O. Kallenberg, Foundations of Modern Probability. Probability and Its Applications, New York: Springer, 1997.
[8] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Berlin: Springer, 2011.
[9] W. R. Mann, "Mean value methods in iterations," Proc. Amer. Math. Soc., vol. 4, pp. 506–510, 1953.
[10] M. A. Krasnoselski, "Two remarks on the method of successive approximations," Uspekhi Mat. Nauk (N.S.), vol. 10, no. 1(63), pp. 123–127, 1955. (Russian).
[11] M. Edelstein, "A remark on a theorem of M. A. Krasnoselski," Amer. Math. Monthly, vol. 73, pp. 509–510, May 1966.
[12] L. Gubin, B. Polyak, and E. Raik, "The method of projections for finding the common point of convex sets," USSR Comput. Math. and Math. Phys., vol. 7, no. 6, pp. 1–24, 1967.
[13] J. B. Baillon, R. E. Bruck, and S. Reich, "On the asymptotic behavior of nonexpansive mappings and semigroups in Banach spaces," Houston J. Math., vol. 4, no. 1, pp. 1–9, 1978.
[14] M. Hairer, "Ergodic properties of Markov processes," Lecture Notes in Mathematics, vol. 1881, pp. 1–39, 2006.
[15] F. Deutsch, Best Approximation in Inner Product Spaces. New York: Springer, 2001.
[16] A. Y. Kruger, D. R. Luke, and N. H. Thao, "Set regularities and feasibility problems," Mathematical Programming, vol. 168, no. 1-2, pp. 279–311, 2018.
[17] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter, "From error bounds to the complexity of first-order descent methods for convex functions," Mathematical Programming, vol. 165, pp. 471–507, Oct 2017.
[18] H. Alt, Lineare Funktionalanalysis: Eine anwendungsorientierte Einführung (in German). Springer Lehrbuch, Berlin: Springer, 2002.
[19] V. Bogachev, Measure Theory, vol. I. Springer Berlin Heidelberg, 2007.
[20] M. Hairer, "Convergence of Markov processes," Lecture notes, University of Warwick, 2016.
[21] H. Bauer, Maß- und Integrationstheorie (in German). De Gruyter Lehrbuch, Berlin: W. de Gruyter, 1992.
[22] H. Robbins and D. Siegmund, "A convergence theorem for non negative almost supermartingales and some applications," in Optimizing Methods in Statistics (J. S. Rustagi, ed.), pp. 233–257, Academic Press, 1971.
[23] H. H. Bauschke and J. M. Borwein, "On projection algorithms for solving convex feasibility problems," SIAM Rev., vol. 38, no. 3, pp. 367–426, 1996.
[24] K. J. Arrow and A. C. Enthoven, "Quasi-concave programming," Econometrica, vol. 29, pp. 779–800, 1961.
[25] D. R. Luke, N. H. Thao, and M. K. Tam, "Quantitative convergence analysis of iterated expansive, set-valued mappings,"