Examples of pathological dynamics of the subgradient method for Lipschitz path-differentiable functions

Abstract

We show that the vanishing stepsize subgradient method -- widely adopted for machine learning applications -- can display rather messy behavior even in the presence of favorable assumptions. We establish that convergence of bounded subgradient sequences may fail even with a Whitney stratifiable objective function satisfying the Kurdyka-Lojasiewicz inequality. Moreover, when the objective function is path-differentiable we show that various properties all may fail to occur: criticality of the limit points, convergence of the sequence, convergence in values, codimension one of the accumulation set, equality of the accumulation and essential accumulation sets, connectedness of the essential accumulation set, spontaneous slowdown, oscillation compensation, and oscillation perpendicularity to the accumulation set.


Rodolfo Ríos-Zertuche

July 24, 2020


In our previous work [5], we investigated the vanishing-step subgradient method applied to a nonsmooth, nonconvex objective function $f$ in the hope of finding

$$\operatorname*{arg\,min}_{x \in \mathbb{R}^n} f(x).$$

This paper is intended as a companion to [5]: it presents two examples showing that the results obtained there are sharp in several senses. We also aim to provide insight into the types of dynamics that the subgradient algorithm presents in the asymptotic limit, and we evaluate some of the ideas that are believed to show promise towards a proof of convergence of the algorithm, such as the Kurdyka–Łojasiewicz inequality. We refer the reader to [5] for some discussion of the historical background.

We shall now give some definitions that will allow us to discuss our results.

For a locally Lipschitz function $f\colon \mathbb{R}^n \to \mathbb{R}$, we denote by $\partial^c f(x)$ the Clarke subdifferential of $f$ at $x \in \mathbb{R}^n$, that is, the convex envelope of the set of vectors $v \in \mathbb{R}^n$ such that there is a sequence $\{y_i\}_i \subset \mathbb{R}^n$ such that $f$ is differentiable at $y_i$, $y_i \to x$ and $\nabla f(y_i) \to v$.

Definition 1 (Small-step subgradient method). Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz function, and let $\{\varepsilon_i\}_i$ be a sequence of positive step sizes such that

$$\sum_{i=0}^{\infty} \varepsilon_i = +\infty \quad\text{and}\quad \lim_{i \to +\infty} \varepsilon_i = 0.$$

Given $x_0 \in \mathbb{R}^n$, consider the recursion, for $i \ge 0$,

$$x_{i+1} = x_i - \varepsilon_i v_i, \qquad v_i \in \partial^c f(x_i).$$

Here, $v_i$ is chosen freely among $\partial^c f(x_i)$. The sequence $\{x_i\}_{i \in \mathbb{N}}$ is called a subgradient sequence.

Since the dynamics of the subgradient method in the case of $f$ locally Lipschitz had been shown [6, 7] to be too unwieldy, in [5] we instead discuss the dynamics of the subgradient method for $f$ path-differentiable.

Definition 2 (Path-differentiable functions). A locally Lipschitz function $f\colon \mathbb{R}^n \to \mathbb{R}$ is path-differentiable if for each Lipschitz curve $\gamma\colon \mathbb{R} \to \mathbb{R}^n$, for almost every $t \in \mathbb{R}$, the composition $f \circ \gamma$ is differentiable at $t$ and the derivative is given by

$$(f \circ \gamma)'(t) = v \cdot \gamma'(t)$$

for all $v \in \partial^c f(\gamma(t))$.

Definition 3 (Weak Sard condition). We will say that $f$ satisfies the weak Sard condition if it is constant on each connected component of its critical set $\operatorname{crit} f = \{x \in \mathbb{R}^n : 0 \in \partial^c f(x)\}$.

Recall that the accumulation set $\operatorname{acc}\{x_i\}_i$ of the sequence $\{x_i\}_i$ is the set of points $x \in \mathbb{R}^n$ such that, for every neighborhood $U$ of $x$, the intersection $U \cap \{x_i\}_i$ is an infinite set. Its elements are known as limit points.

Definition 4 (Essential accumulation set). Given sequences $\{x_i\}_i \subset \mathbb{R}^n$ and $\{\varepsilon_i\}_i \subset \mathbb{R}_{\ge 0}$, the essential accumulation set $\operatorname{ess\,acc}\{x_i\}_i$ is the set of points $x \in \mathbb{R}^n$ such that, for every neighborhood $U$ of $x$,

$$\limsup_{N \to +\infty} \frac{\sum_{0 \le i \le N,\; x_i \in U} \varepsilon_i}{\sum_{0 \le i \le N} \varepsilon_i} > 0. \tag{1}$$

Definition 5 (Whitney stratifiable functions). Let $X$ be a nonempty subset of $\mathbb{R}^m$ and $0 < p \le +\infty$. A $C^p$ stratification $\mathcal{X} = \{X_i\}_{i \in I}$ of $X$ is a locally finite partition $X = \bigsqcup_i X_i$ of $X$ into connected submanifolds $X_i$ of $\mathbb{R}^m$ of class $C^p$ such that for each $i \neq j$

$$\overline{X_i} \cap X_j \neq \emptyset \implies X_j \subset \overline{X_i} \setminus X_i.$$

A $C^p$ stratification $\mathcal{X}$ of $X$ satisfies Whitney's condition A if, for each $x \in \overline{X_i} \cap X_j$, $i \neq j$, and for each sequence $\{x_k\}_k \subset X_i$ with $x_k \to x$ as $k \to +\infty$, and such that the sequence of tangent spaces $\{T_{x_k} X_i\}_k$ converges (in the usual metric topology of the Grassmannian) to a subspace $V \subset T_x \mathbb{R}^m$, we have that $T_x X_j \subset V$. A $C^p$ stratification is Whitney if it satisfies Whitney's condition A.

With the same notations as above, a function $f\colon \mathbb{R}^n \to \mathbb{R}^k$ is Whitney $C^p$-stratifiable if there exists a Whitney $C^p$ stratification of its graph as a subset of $\mathbb{R}^{n+k}$.

Summary of the results.

Let
• $n > 0$,
• $f\colon \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz, path-differentiable function,
• the sequence $\{\varepsilon_i\}_i \subset \mathbb{R}_{>0}$ of step sizes satisfy $\lim_{i \to +\infty} \varepsilon_i = 0$, and
• $\{x_i\}_i$ be a bounded subgradient sequence with stepsizes $\{\varepsilon_i\}_i$.

Note on Definition 2: in other parts of the literature (see e.g. [4]), this definition is given with absolutely continuous curves. This is equivalent because such curves can be reparameterized (for example, by arclength) to obtain Lipschitz curves, without affecting their role in the definition.
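As a minimal illustration of Definition 1 and of the ratio in (1), the following Python sketch runs the recursion on a hypothetical objective that is not one of the functions constructed in this paper, namely $f(x) = |x_1| + \dots + |x_n|$, with the assumed step sizes $\varepsilon_i = 1/(i+1)$. The choice $v_i = \operatorname{sign}(x_i)$ (componentwise) is just one admissible element of the Clarke subdifferential, and the names `subgradient_sequence` and `essential_weight` are ours.

```python
import math

def sign(t):
    # One admissible element of the Clarke subdifferential of |t|:
    # at t = 0 any value in [-1, 1] is allowed; we pick 0.
    return (t > 0) - (t < 0)

def subgradient_sequence(x0, steps):
    """Iterate x_{i+1} = x_i - eps_i * v_i for f(x) = |x_1| + ... + |x_n|."""
    xs, eps = [tuple(x0)], []
    for i in range(steps):
        e = 1.0 / (i + 1)              # eps_i -> 0 and sum of eps_i diverges
        x = xs[-1]
        xs.append(tuple(c - e * sign(c) for c in x))
        eps.append(e)
    return xs, eps

def essential_weight(xs, eps, center, radius):
    """Finite-N version of the ratio in (1) for the ball B_radius(center)."""
    inside = sum(e for x, e in zip(xs, eps) if math.dist(x, center) < radius)
    return inside / sum(eps)

xs, eps = subgradient_sequence((0.3, -0.7), 2000)
print(xs[-1])                                        # last iterate, close to the minimizer 0
print(essential_weight(xs, eps, (0.0, 0.0), 0.1))    # share of step mass near 0
print(essential_weight(xs, eps, (5.0, 5.0), 0.1))    # a ball the sequence never visits
```

In this convex toy case the sequence converges, so the accumulation set and the essential accumulation set coincide; the examples of Sections 2 and 3 are built precisely so that this simple picture breaks down.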

Q1. Does the sequence $\{x_i\}_i$ converge in general?

While it is tempting to hope for the sequence to converge, since we have proven [5, Theorems 6(i), 7(i), 7(ii)] that the sequence slows down indefinitely, in Section 2 we give an example in which the sequence forever accumulates around a circle and never converges. The function we construct satisfies the weak Sard condition, so even with that assumption there is no hope for the convergence of $\{x_i\}_i$. The function also satisfies the Kurdyka–Łojasiewicz inequality; see Q9.

In contrast, it can be proven [3] that if $f$ satisfies the weak Sard condition and the Kurdyka–Łojasiewicz inequality, then the flow lines $x\colon \mathbb{R} \to \mathbb{R}^n$ of the continuous-time subgradient flow, which satisfy

$$-\dot{x}(t) \in \partial^c f(x(t)),$$

always converge. Thus the example in Section 2 shows that the convergence of the continuous-time process may not guarantee the convergence of the discrete subgradient sequence.

Q2. Do the values $\{f(x_i)\}_i$ converge for a general path-differentiable function $f$?

Although this convergence can be proved when $f$ satisfies the weak Sard condition [5, Theorem 7(v)], the example in Section 3 shows that the convergence of the values $f(x_i)$ fails in general. In fact, in that example we have $f(\operatorname{acc}\{x_i\}_i) = [0, 1] = f(\operatorname{ess\,acc}\{x_i\}_i)$.

Q3. Must $\operatorname{acc}\{x_i\}_i$ be a subset of $\operatorname{crit} f$ in general?

The example in Section 3 shows that in general the set $\operatorname{acc}\{x_i\}_i \setminus \operatorname{ess\,acc}\{x_i\}_i$ may not intersect $\operatorname{crit} f$. This contrasts with the results that $\operatorname{ess\,acc}\{x_i\}_i$ is always contained in $\operatorname{crit} f$ [5, Theorem 6(iii)], and that $\operatorname{acc}\{x_i\}_i$ is contained in $\operatorname{crit} f$ if $f$ satisfies the weak Sard condition [5, Theorem 7(iv)].

Q4. Do we always have $\operatorname{ess\,acc}\{x_i\}_i = \operatorname{acc}\{x_i\}_i$?

No, in the example in Section 3 we have a situation in which the set $\operatorname{ess\,acc}\{x_i\}_i$ is strictly smaller than $\operatorname{acc}\{x_i\}_i$. We do not know the answer to this question with more stringent assumptions, such as $f$ satisfying the weak Sard condition.

Q5. Can the essential accumulation set $\operatorname{ess\,acc}\{x_i\}_i$ be disconnected?

Yes. Although for simplicity we do not construct an example here, the example in Section 3 can be easily modified (by taking several copies of $\Gamma$ and joining them with curves having roles similar to the one played by $J$) to produce a situation in which $\operatorname{ess\,acc}\{x_i\}_i$ is disconnected. This contrasts with the fact that $\operatorname{acc}\{x_i\}_i$ is always connected because

$$\operatorname{dist}(x_i, x_{i+1}) \le \varepsilon_i \operatorname{Lip}(f) \to 0 \quad\text{as } i \to +\infty,$$

where $\operatorname{Lip}(f)$ is the Lipschitz constant for $f$ in a compact set that contains $\{x_i\}_i$.

Q6. A certain spontaneous slowdown phenomenon is proved in [5, Theorem 6(i)] for the fragments of the subgradient sequence as (roughly speaking) it traverses the piece of $\operatorname{acc}\{x_i\}_i$ starting at a point $x$ and ending at another point $y$, such that $x, y \in \operatorname{acc}\{x_i\}_i$ verify $f(x) \le f(y)$ (see the precise statement below). Is there any hope of proving, for general $f$, that this phenomenon always occurs uniformly throughout the accumulation set, regardless of the restriction $f(x) \le f(y)$?
No. Precisely, the result of [5, Theorem 6(i)] reads as follows. Let $x$ and $y$ be two distinct points in $\operatorname{acc}\{x_i\}_i$ satisfying $f(x) \le f(y)$, and take subsequences $\{x_{i_k}\}_k$ and $\{x_{i'_k}\}_k$ such that $x_{i_k} \to x$, $x_{i'_k} \to y$ as $k \to +\infty$, and $i'_k > i_k$ for all $k$. Then

$$\lim_{k \to +\infty} \sum_{p = i_k}^{i'_k} \varepsilon_p = +\infty.$$

This is verified independently of the subsequences taken.

On the other hand, the endpoints $x$ and $y$ of the curve $J$ in the example in Section 3 are contained in $\operatorname{acc}\{x_i\}_i$, satisfy $f(x) > f(y)$, and we can take subsequences $\{x_{i_k}\}_k$ and $\{x_{i'_k}\}_k$ converging to $x$ and $y$, respectively, and with $i'_k > i_k$, for which we additionally have

$$\sup_k \sum_{p = i_k}^{i'_k} \varepsilon_p < +\infty.$$

Q7.

Does the oscillation compensation phenomenon described in [5, Theorem 6(ii)] occur on the entire accumulation set in general?

While we are able to prove an oscillation compensation result [5, Theorem 7(iii)] that holds throughout $\operatorname{acc}\{x_i\}_i$ under the assumption that $f$ satisfies the weak Sard condition, the example in Section 3 shows that in general, in the absence of the weak Sard condition, there need not be any oscillation compensation on $\operatorname{acc}\{x_i\}_i \setminus \operatorname{ess\,acc}\{x_i\}_i$, which in the example corresponds to the curve $J$. For a precise statement, please refer to C7 in Section 3.

Q8. Can the perpendicularity of the oscillations of $\{x_i\}_i$ verified around $\operatorname{ess\,acc}\{x_i\}_i$ [5, Remark 9] be proved on the entire accumulation set?

No, as is shown in the example of Section 3, this may fail on $\operatorname{acc}\{x_i\}_i \setminus \operatorname{ess\,acc}\{x_i\}_i$ for general $f$. The perpendicularity can, however, be proved to happen on $\operatorname{ess\,acc}\{x_i\}_i$ or, if $f$ satisfies the weak Sard condition, on all of $\operatorname{acc}\{x_i\}_i$; see [5, Remark 9].

Q9. Would it be possible to prove the convergence of $\{x_i\}_i$ if $f$ is Whitney stratifiable (cf. Definition 5) and satisfies a Kurdyka–Łojasiewicz inequality?

No; more assumptions are necessary. The objective function $f$ in the example in Section 2 is Whitney $C^\infty$ stratifiable and satisfies a Kurdyka–Łojasiewicz inequality of the form

$$\|\nabla f(x)\| \ge \tfrac{1}{2} \quad\text{for all } x \notin \operatorname{crit} f,$$

but we also construct a bounded subgradient sequence that fails to converge. However, in the case of $f$ smooth, the Kurdyka–Łojasiewicz inequality does suffice to prove convergence of the subgradient method [1].

Q10. Recall that the Hausdorff dimension of a set $X$ is

$$\dim X = \inf\{d \in \mathbb{R} : H^d(X) = 0\},$$

where $H^d(X)$ is the $d$-dimensional Hausdorff outer measure,

$$H^d(X) := \lim_{r \to 0} \inf\Big\{\sum_i r_i^d : \text{there is a cover of } X \text{ by balls of radii } 0 < r_i < r\Big\}.$$

Must the Hausdorff dimension of the accumulation set of $\{x_i\}_i$ satisfy $\dim \operatorname{acc}\{x_i\}_i \le n - 1$?

No, the example in Section 3 gives a function $f\colon \mathbb{R}^2 \to \mathbb{R}$ and a subgradient sequence $\{x_i\}_i$ such that the Hausdorff dimension satisfies

$$1 < \dim \operatorname{acc}\{x_i\}_i = \dim \operatorname{ess\,acc}\{x_i\}_i \le \frac{\log 4}{\log 3} \approx 1.26, \tag{2}$$

and actually depends on a parameter $\alpha$ that can be tweaked to produce any value of the Hausdorff dimension in this range; see Lemma 11. Although the function $f$ in that example does not satisfy the weak Sard condition, the example can be easily modified (by changing the value of $f$ on $\Gamma \cup J$ to a constant) to satisfy also this condition and still have the dimension attain any value in the range (2).

This contrasts with the result [5, Remark 10] that, if $f$ is Whitney $C^n$ stratifiable, then

$$\dim \operatorname{acc}\{x_i\}_i \le n - 1.$$

Q11.

Can the set of limit closed measures of the interpolant curve be infinite?

Yes. This is the case in the situation of the example in Section 2 (and also in the example of Section 3, but for simplicity we will not prove it in that case). Please refer to Section 2.2 for the full definitions and an explanation.

Q12.

Would the answer to any of the previous questions Q1–Q11 be different if one enforced that the sequence be contained in the (full measure) set of differentiability points of the function $f$?

No, all our claims are based on constructive existence proofs of subgradient sequences $\{x_i\}_i$ such that each point $x_i$ is contained in a ball in which the objective function $f$ is $C^\infty$.

Notation.

Given two sets $A$ and $B$, denote by $B^c$ the complement of $B$ and by $A \setminus B = A \cap B^c$. Let $n$ be a positive integer, and let $\mathbb{R}^n$ denote $n$-dimensional Euclidean space. For two vectors $u = (u_1, \dots, u_n)$ and $v = (v_1, \dots, v_n)$ in $\mathbb{R}^n$, we let $u \cdot v = \sum_{i=1}^n u_i v_i$ and $\|u\| = \sqrt{u \cdot u}$. We will denote the gradient of $f$ at $x$ by $\nabla f(x)$. We denote by $\log_b a = \log a / \log b$ the logarithm of $a$ in base $b$. We denote the unit circle by $S^1$, and the open ball of radius $r$ centered at $x$ by $B_r(x)$. A number with a subindex $b$ is in base $b$; for example, $0.12_9 = 1/9 + 2/81$. For a Lipschitz function $g\colon \mathbb{R}^n \to \mathbb{R}^m$, we denote by

$$\operatorname{Lip}(g) = \sup_{x \neq y \in \mathbb{R}^n} \frac{\|g(x) - g(y)\|}{\|x - y\|}.$$

We construct a path-differentiable function $f\colon \mathbb{R}^2 \to \mathbb{R}$ and a subgradient sequence $\{x_i\}_i$ that does not converge and instead accumulates around a circle. The function $f$ additionally has the property that it is Whitney $C^\infty$ stratifiable and satisfies a Kurdyka–Łojasiewicz inequality. The construction is given in Section 2.1 and the main properties are collected in Proposition 6.

In the context of the theory developed in [5, §4], in Section 2.2 we describe the limiting closed measures associated with the sequence $\{x_i\}_i$ in the example of Section 2.

For $i \ge 2$, let (see Figure 1)

$$x_i = \Big[1 + \frac{(-1)^i}{i}\Big](\cos \vartheta_i, \sin \vartheta_i) \quad\text{with}\quad \vartheta_i = \sum_{k=2}^{i} \frac{1}{k \log k}$$

and

$$\varepsilon_i = \|x_{i+1} - x_i\|, \qquad v_i = -\frac{x_{i+1} - x_i}{\|x_{i+1} - x_i\|},$$

so that $x_{i+1} = x_i - \varepsilon_i v_i$. Note that $\varepsilon_i$ satisfies, for large $i$,

$$\frac{2}{i+1} < \varepsilon_i < \frac{1}{i} + \frac{1}{i+1} + \frac{1}{i \log i} < \frac{3}{i},$$

so that $\varepsilon_i \to 0$, $\sum_i \varepsilon_i = +\infty$, and $\sum_i \varepsilon_i^2 < +\infty$.

We want to obtain a function $f$ that is very close to the function $\phi$ given by the distance to the circle, $\phi(x) = |1 - \|x\||$, yet satisfies

$$\nabla f(x) = v_i \quad\text{for all } x \in B_{1/2^i}(x_i). \tag{3}$$

Let $\psi\colon \mathbb{R}^2 \to [0, 1]$ be a $C^\infty$ function with radial symmetry (i.e. $\psi(x) = \psi(y)$ for $\|x\| = \|y\|$), such that $\psi(x) = 1$ for $x \in B_1(0)$, $\psi(x) = 0$ for $\|x\| \ge 2$, and $\psi$ decreases monotonically on rays emanating from the origin. Let

$$\psi_i(x) = \psi(2^i (x - x_i)),$$

so that $\psi_i$ equals 1 on $B_{1/2^i}(x_i)$ and vanishes outside $B_{1/2^{i-1}}(x_i)$. Note that the supports of the functions $\psi_i$ are pairwise disjoint.

Define

$$V_i(x) = (x - x_i) \cdot v_i + \frac{1}{i}.$$

Proposition 6.

Let $i_0 \ge 2$ and

$$f(x) = \Big(1 - \sum_{i=i_0}^{\infty} \psi_i(x)\Big)\phi(x) + \sum_{i=i_0}^{\infty} \psi_i(x) V_i(x). \tag{4}$$

Then we have:

i. The function $f$ is $C^\infty$ on $\mathbb{R}^2 \setminus S^1$.
ii. The function $f$ satisfies (3), so that $\{x_i\}_i$ is a subgradient sequence with stepsizes $\{\varepsilon_i\}_i$.
iii. Let $p$ be a point in the unit circle; then $\partial^c f(p) = \{ap : a \in [-1, 1]\} = \partial^c \phi(p)$.
iv. The critical set of $f$ is $\operatorname{crit} f = S^1 \cup \{0\}$.
v. The function $f$ is Lipschitz path-differentiable.
vi. The function $f$ is Whitney $C^\infty$ stratifiable.
vii. If $i_0$ is large enough, $f$ satisfies a Kurdyka–Łojasiewicz inequality of the form $\|\nabla f(x)\| > 1/2$ for $x \in \mathbb{R}^2 \setminus \operatorname{crit} f$.

To prove the proposition we need

Lemma 7. For $i$ large enough we have the estimates

$$\Big\| v_i - (-1)^i \frac{x_i}{\|x_i\|} \Big\| \le \frac{6}{\log i} \tag{5}$$

and, if $\operatorname{dist}(x_i, y) \le 2^{1-i}$,

$$\Big\| \frac{x_i}{\|x_i\|} - \frac{y}{\|y\|} \Big\| \le 3 \operatorname{dist}(x_i, y). \tag{6}$$

Proof.

To show (5), ﬁrst observe that, in the deﬁnition of x i , the jump in the direction tangential tothe circle has magnitude ϑ i − ϑ i − = 1 /i log i , while the jump in the direction normal to the circle hasmagnitude i + i +1 . It follows that 12 i log i (cid:54) ( x i +1 − x i ) · x ⊥ i (cid:107) x i (cid:107) (cid:54) i log i , i + 1 (cid:16) − i (cid:17) (cid:54) i + 1 (cid:115) − i (cid:54) ( x i +1 − x i ) · x i (cid:107) x i (cid:107) (cid:54) (cid:107) x i +1 − x i (cid:107) , where ( a, b ) ⊥ = ( − b, a ) and we have used the Cauchy–Schwarz inequality. Since 1 /i (cid:54) ε i = (cid:107) x i +1 − x i (cid:107) (cid:54) /i , together with v i = − ( x i +1 − x i ) /ε i and the estimates above, we also have12 log i (cid:54) (cid:12)(cid:12)(cid:12)(cid:12) v i · x ⊥ i (cid:107) x i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) (cid:54) i , (7)and ii + 1 (cid:18) − i (cid:19) (cid:54) (cid:12)(cid:12)(cid:12)(cid:12) v i · x i (cid:107) x i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) (cid:54) . (8)The estimate (5) follows from (7) and (8): (cid:13)(cid:13)(cid:13)(cid:13) v i − ( − i x i (cid:107) x i (cid:107) (cid:13)(cid:13)(cid:13)(cid:13) = (cid:115)(cid:18) v i · x i (cid:107) x i (cid:107) − (cid:19) + (cid:18) v i · x ⊥ i (cid:107) x i (cid:107) (cid:19) (cid:54) (cid:115)(cid:18) ii + 1 (cid:18) i + 1 (cid:19) − (cid:19) + (cid:18) i (cid:19) (cid:54) i + 2 i (cid:54) i . 
To prove (6), let $w = y - x_i$, so that $\|w\| = \operatorname{dist}(x_i, y)$, and observe that

$$1 - \frac{1}{i} \le \|x_i\| \le 1 + \frac{1}{i} \quad\text{and}\quad \|x_i + w\| = \|y\| \ge 1 - \frac{2}{i},$$

which means that, for $i$ large, we have

$$\Big\| \frac{x_i}{\|x_i\|} - \frac{y}{\|y\|} \Big\| = \Big\| \frac{x_i}{\|x_i\|} - \frac{x_i + w}{\|x_i + w\|} \Big\| = \frac{\big\| x_i (\|x_i + w\| - \|x_i\|) - w \|x_i\| \big\|}{\|x_i\| \, \|x_i + w\|} \le \frac{2 \|x_i\| \, \|w\|}{\|x_i\| \, \|x_i + w\|} \le \frac{2}{1 - 2/i} \|w\| \le 3 \|w\|.$$

Proof of Proposition 6.

Item (i) becomes evident once we realize that the sum (4) reduces to $f(x) = (1 - \psi_i(x))\phi(x) + \psi_i(x) V_i(x)$ for $x$ in $B_{1/2^{i-1}}(x_i)$ and to $f(x) = \phi(x)$ elsewhere, since $\psi_i$, $V_i$ and $\phi$ are $C^\infty$ on $\mathbb{R}^2 \setminus (S^1 \cup \{0\})$.

To prove item (ii), note that, for $x \in B_{1/2^i}(x_i)$, we have $f(x) = V_i(x)$ and $\nabla f(x) = \nabla V_i(x) = v_i$, so that $x_{i+1} - x_i = -\varepsilon_i v_i = -\varepsilon_i \nabla f(x_i)$.

In order to prove item (iii), let $p \in S^1$. Let us first show that, as $y \in \mathbb{R}^2$ with $\|y\| < 1$ tends to $p$, $\nabla f(y) \to -p$. If $y \notin \bigcup_i B_{1/2^{i-1}}(x_i)$ is near $p$, then

$$\|\nabla f(y) + p\| = \|\nabla \phi(y) + p\| = \Big\| -\frac{y}{\|y\|} + p \Big\|,$$

which clearly tends to 0 as $y \to p$. If $y \in B_{1/2^{i-1}}(x_i)$ (and, since $\|y\| < 1$, $i$ is odd), then we have, by a Taylor expansion, $\nabla \phi(x_i) = -x_i / \|x_i\|$, the Cauchy–Schwarz inequality, and (5), for $i$ large,

$$|V_i(y) - \phi(y)| \le \Big| (y - x_i) \cdot v_i + \frac{1}{i} - \phi(x_i) - \nabla \phi(x_i) \cdot (y - x_i) \Big| + 2 \|y - x_i\|^2 = \Big| (y - x_i) \cdot \Big(v_i + \frac{x_i}{\|x_i\|}\Big) + \frac{1}{i} - \frac{1}{i} \Big| + 2 \|y - x_i\|^2 \le \|y - x_i\| \, \Big\| v_i + \frac{x_i}{\|x_i\|} \Big\| + 2 \Big(\frac{1}{2^{i-1}}\Big)^2 \le \frac{12}{2^{i-1} \log i}$$

and, since also $\nabla \phi(y) = -y / \|y\|$, $\nabla V_i(y) = v_i$, $\operatorname{Lip}(\nabla \psi_i) = 2^i \operatorname{Lip}(\nabla \psi)$, $|\psi_i(y)| \le 1$, the triangle inequality, the estimates from Lemma 7, and $y \in B_{1/2^{i-1}}(x_i)$,

$$\Big\| \nabla f(y) + \frac{y}{\|y\|} \Big\| = \Big\| \nabla\big[(1 - \psi_i(y))\phi(y) + \psi_i(y) V_i(y)\big] + \frac{y}{\|y\|} \Big\| = \Big\| \nabla \psi_i(y)\big(V_i(y) - \phi(y)\big) + \psi_i(y)\Big(\nabla V_i(y) + \frac{y}{\|y\|}\Big) \Big\|$$

$$\le \operatorname{Lip}(\nabla \psi_i) \, |V_i(y) - \phi(y)| + \Big\| v_i + \frac{y}{\|y\|} \Big\| \le 2^i \operatorname{Lip}(\nabla \psi) \, |V_i(y) - \phi(y)| + \Big\| v_i + \frac{x_i}{\|x_i\|} \Big\| + \Big\| \frac{x_i}{\|x_i\|} - \frac{y}{\|y\|} \Big\|$$

$$\le 2^i \operatorname{Lip}(\nabla \psi) \frac{12}{2^{i-1} \log i} + \frac{6}{\log i} + \frac{3}{2^{i-1}} = \frac{24 \operatorname{Lip}(\nabla \psi) + 6}{\log i} + \frac{3}{2^{i-1}} \to 0$$

as $i \to +\infty$. It follows from the triangle inequality that

$$\|\nabla f(y) + p\| \le \Big\| \nabla f(y) + \frac{y}{\|y\|} \Big\| + \Big\| p - \frac{y}{\|y\|} \Big\|,$$

so that, as $y \to p$ with $\|y\| < 1$, we have $\nabla f(y) \to -p$. A similar argument yields that, as $y \to p$ with $\|y\| > 1$, we have $\nabla f(y) \to p$, which proves item (iii).

To prove item (v), note that, by items (i) and (iii), if a Lipschitz curve $\gamma$ satisfies either $\gamma(t) \in S^1$ and $\gamma'(t)$ tangent to $S^1$, or $\gamma(t) \in \mathbb{R}^2 \setminus S^1$, then indeed we have $(f \circ \gamma)'(t) = v \cdot \gamma'(t)$ for all $v \in \partial^c f(\gamma(t))$. On the other hand, the set of points $t$ in the domain of $\gamma$ such that $\gamma(t) \in S^1$ but $\gamma'(t)$ is not tangent to $S^1$ is at most countable (these points $t$ can be covered by disjoint open sets) and hence has measure zero; see also the proof of [8, Theorem 5.3]. It follows that the chain rule condition for path differentiability is satisfied for almost all $t$. Since this is true for all curves $\gamma$, $f$ is path-differentiable.

Item (vi) is clear in view of items (i) and (iii).

It follows from item (iii) that $S^1 \subseteq \operatorname{crit} f$. Recall that $f = \phi$ in a neighborhood of 0 and $0 \in \operatorname{crit} \phi$, so $0 \in \operatorname{crit} f$. If $x \notin \bigcup_i B_{1/2^{i-1}}(x_i)$, then $\|\nabla f(x)\| = \|\nabla \phi(x)\| = 1$ and $\nabla f(x)$ is the only element of $\partial^c f(x)$, so $x \notin \operatorname{crit} f$. If $x \in B_{1/2^{i-1}}(x_i)$, then, taking $i_0$ large enough, we can ensure that, for $i \ge i_0$, we have, by the triangle inequality and the estimates above,

$$\|\nabla f(x)\| \ge \Big\| \frac{x}{\|x\|} \Big\| - \Big\| \nabla f(x) \pm \frac{x}{\|x\|} \Big\| > \frac{1}{2},$$

with the sign $+$ if $\|x\| < 1$ and $-$ if $\|x\| > 1$. This settles items (iv) and (vii).
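The sequence of Section 2.1 is explicit enough to be checked numerically. The following Python sketch (ours, not part of the construction; the helper name `circle_sequence` and the cutoffs `200_000` and `1_000` are arbitrary choices) generates the points $x_i$, recomputes $\varepsilon_i = \|x_{i+1} - x_i\|$, and inspects the step-size bounds, the distance to the unit circle, and the slow growth of the angle $\vartheta_i$.

```python
import math

def circle_sequence(n_max):
    """Points x_i = [1 + (-1)^i / i] (cos theta_i, sin theta_i), 2 <= i <= n_max."""
    points, angles, theta = [], [], 0.0
    for i in range(2, n_max + 1):
        theta += 1.0 / (i * math.log(i))      # theta_i = sum_{k=2}^{i} 1/(k log k)
        radius = 1.0 + (-1) ** i / i          # alternates outside/inside the circle
        points.append((radius * math.cos(theta), radius * math.sin(theta)))
        angles.append(theta)
    return points, angles

points, angles = circle_sequence(200_000)

# Step sizes eps_i = ||x_{i+1} - x_i||; list index j corresponds to i = j + 2.
eps = [math.dist(p, q) for p, q in zip(points, points[1:])]

# Check 2/(i+1) < eps_i < 3/i in the "large i" regime (here i >= 1002).
print(all(2 / (j + 3) < eps[j] < 3 / (j + 2) for j in range(1_000, len(eps))))

# The iterates approach the unit circle (their distance to it is 1/i) ...
print(max(abs(math.hypot(*p) - 1) for p in points[-100:]))

# ... while the angle keeps growing, roughly like log log i, so x_i has no limit.
print(angles[998], angles[-1])                # theta_1000 and theta_200000
```

The angle grows extremely slowly, which is consistent with the slowdown results of [5]; it is the divergence of $\sum_k 1/(k \log k)$ that prevents convergence of the sequence.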

Here we recall some of the theory of [5, Section 4], and we show that in the example constructed in Section 2.1, the set of limiting measures is uncountable. We also compute those measures explicitly.

The interpolating curve and its associated closed measures.

Given a measure $\xi$ on $X$ and a measurable map $g\colon X \to Y$, the pushforward $g_* \xi$ is defined to be the measure on $Y$ such that, for $A \subset Y$ measurable, $g_* \xi(A) = \xi(g^{-1}(A))$.

Recall that the support $\operatorname{supp} \mu$ of a positive Radon measure $\mu$ on $\mathbb{R}^n$ is the set of points $x \in \mathbb{R}^n$ such that $\mu(U) > 0$ for every neighborhood $U$ of $x$. It is a closed set.

Definition 8. A compactly-supported, positive, Radon measure $\mu$ on $\mathbb{R}^n \times \mathbb{R}^n$ is closed if, for all functions $f \in C^\infty(\mathbb{R}^n)$,

$$\int_{\mathbb{R}^n \times \mathbb{R}^n} \nabla f(x) \cdot v \, d\mu(x, v) = 0.$$

Let $\pi\colon \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ be the projection $\pi(x, v) = x$. To a measure $\mu$ in $\mathbb{R}^n \times \mathbb{R}^n$ we can associate its projected measure $\pi_* \mu$. We have $\operatorname{supp} \pi_* \mu = \pi(\operatorname{supp} \mu) \subseteq \mathbb{R}^n$.

Let $\gamma\colon \mathbb{R}_{\ge 0} \to \mathbb{R}^n$ be the curve linearly interpolating the sequence $\{x_i\}_i$ with $\gamma(t_i) = x_i$ for $t_i = \sum_{j=0}^{i-1} \varepsilon_j$ and $\gamma'(t) = -v_i$ for $t_i < t < t_{i+1}$.

For a bounded set $B \subset \mathbb{R}_{\ge 0}$, we define a measure on $\mathbb{R}^n \times \mathbb{R}^n$ by

$$\mu_{\gamma|_B} = \frac{1}{|B|} (\gamma, \gamma')_* \operatorname{Leb}_B,$$

where $|B| = \int_B dt$ is the length of $B$, and $\operatorname{Leb}_B$ is the Lebesgue measure on $B$ (so that $\operatorname{Leb}_B(A) = |A|$ for $A \subseteq B$ measurable). If $\varphi\colon \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is measurable, then

$$\int_{\mathbb{R}^n \times \mathbb{R}^n} \varphi \, d\mu_{\gamma|_B} = \frac{1}{|B|} \int_B \varphi(\gamma(t), \gamma'(t)) \, dt.$$

Lemma 9 ([5, Lemmas 20 and 21]). In the weak* topology, the set of limit points of the sequence $\{\mu_{\gamma|_{[0,N]}}\}_N$ is nonempty, and its elements are closed probability measures. Also,

$$\bigcup_{\mu \in \operatorname{acc}\{\mu_{\gamma|_{[0,N]}}\}_N} \pi(\operatorname{supp} \mu) = \operatorname{ess\,acc}\{x_i\}_i.$$

A measure $\mu$ on $\mathbb{R}^n \times \mathbb{R}^n$ can be fiberwise disintegrated as

$$\mu = \int_{\mathbb{R}^n} \mu_x \, d\pi_* \mu(x),$$

where $\mu_x$ is a probability on $\mathbb{R}^n$ for each $x \in \mathbb{R}^n$. We define the centroid field $\bar{v}_x$ of $\mu$ by

$$\bar{v}_x = \int_{\mathbb{R}^n} v \, d\mu_x(v).$$

An important intermediate result of [5] is

Theorem 10 (Subgradient-like closed measures are trivial [5, Theorem 23]). Assume that $f\colon \mathbb{R}^n \to \mathbb{R}$ is a path-differentiable function. Let $\mu$ be a closed measure on $\mathbb{R}^n \times \mathbb{R}^n$, and assume that every $(x, v) \in \operatorname{supp} \mu$ satisfies $-v \in \partial^c f(x)$. Then the centroid field $\bar{v}_x$ of $\mu$ vanishes for $\pi_* \mu$-almost every $x$.

Analysis of the example.

Let $\gamma$ be the interpolating curve of the sequence $\{x_i\}_i$, as defined in Section 2.1. In this example, the set of limit points of the sequence $\{\mu_{\gamma|_{[0,N]}}\}_N$ consists of all measures on $T\mathbb{R}^2$ given by

$$\mu_{\theta_0} = \int_{\theta_0 - 2\pi}^{\theta_0} \frac{\delta_{(r(\theta), r(\theta))} + \delta_{(r(\theta), -r(\theta))}}{2} \, \frac{e^{\theta - \theta_0}}{1 - e^{-2\pi}} \, d\theta, \qquad \theta_0 \in \mathbb{R}, \tag{9}$$

where $r(\theta) = (\cos \theta, \sin \theta)$ and $\delta_{(u,v)}$ denotes the Dirac delta in $\mathbb{R}^2 \times \mathbb{R}^2$ concentrated at $(u, v)$. This is the measure that captures the dynamics occurring whenever $x_N$ has angle $\vartheta_N$ close to $\theta_0$. Of course, we have $\mu_{\theta_0} = \mu_{\theta_1}$ if $\theta_0 - \theta_1$ is an integer multiple of $2\pi$, as well as $(R_\xi)_* \mu_{\theta_0} = \mu_{\xi + \theta_0}$ for $R_\xi$ the rotation by angle $\xi$.

Before proving (9), we remark that, in accordance with Theorem 10, we have, for $x \in S^1$,

$$(\mu_{\theta_0})_x = \frac{\delta_{(x,x)} + \delta_{(x,-x)}}{2}, \qquad \bar{v}_x = \int_{\mathbb{R}^2} v \, d\mu_x(v) = \frac{x - x}{2} = 0.$$

Also the conclusion of Lemma 9 is verified: we have

$$\operatorname{ess\,acc}\{x_i\}_i = \pi(\operatorname{supp} \mu_{\theta_0}) = S^1,$$

and each $\mu_{\theta_0}$ is a closed probability measure.

Let us see how to arrive at (9). From the construction, it is clear that these measures must have the form

$$\int_{\theta_0 - 2\pi}^{\theta_0} \frac{\delta_{(r(\theta), r(\theta))} + \delta_{(r(\theta), -r(\theta))}}{2} \, \varrho(\theta) \, d\theta$$

for some density $\varrho$ on $\mathbb{R}$; the sum of Dirac deltas in (9) can be deduced from the fact that the vectors $v_i$ asymptotically approach $y$ and $-y$ as $x_i \to y \in S^1$ (with a subsequence), together with $\gamma'(t) = -v_i$ for $t_i < t < t_{i+1}$.

Let us compute the density $\varrho$. Let $I \subset \mathbb{R}$ be an interval of length $0 < \alpha = |I| \le 2\pi$. Considering $I$ as an arc in the circle, we will write $\beta \in I \bmod 2\pi$ if $\beta \in \mathbb{R}$ and there is some $k \in \mathbb{Z}$ such that $\beta + 2\pi k \in I$. Let $m_0 = \min\{i : \vartheta_i \in I \bmod 2\pi\}$. Writing $P \approx Q$ if $P/Q \to 1$ as $N \to +\infty$, if $m < n$ are two integers such that $\alpha = \vartheta_n - \vartheta_m = \sum_{k=m+1}^{n} \frac{1}{k \log k}$, then

$$\alpha \approx \int_m^n \frac{dx}{x \log x} = \log \log n - \log \log m;$$

thus $n \approx m^{e^\alpha}$. In other words, the intervals $J \subset \mathbb{N}$ such that $\vartheta_i \in I \bmod 2\pi$ if $i \in J$ are approximately

$$[m_0, m_0^{e^{\alpha}}],\ [m_0^{e^{2\pi}}, m_0^{e^{\alpha + 2\pi}}],\ [m_0^{e^{4\pi}}, m_0^{e^{\alpha + 4\pi}}],\ \dots,\ [m_0^{e^{2k\pi}}, m_0^{e^{\alpha + 2k\pi}}],\ \dots$$

Letting $k_N \in \mathbb{N}$ be such that $N = m_0^{e^{\alpha + 2\pi k_N}}$, we compute

$$\frac{\sum_{\vartheta_i \in I,\ i \le N} \varepsilon_i}{\sum_{i=2}^{N} \varepsilon_i} \approx \frac{\sum_{\vartheta_i \in I,\ i \le N} 2/i}{\sum_{i=2}^{N} 2/i} \approx \frac{1}{\log N} \sum_{k=0}^{k_N} \int_{m_0^{e^{2k\pi}}}^{m_0^{e^{\alpha + 2k\pi}}} \frac{dx}{x} = \frac{1}{\log N} \sum_{k=0}^{k_N} (e^\alpha - 1) e^{2k\pi} \log m_0$$

$$= (e^\alpha - 1) \frac{\log m_0}{\log N} \, \frac{e^{2\pi(k_N + 1)} - 1}{e^{2\pi} - 1} = (e^\alpha - 1) \frac{\log m_0}{\log N} \, \frac{e^{2\pi - \alpha} \log N / \log m_0 - 1}{e^{2\pi} - 1} \to \frac{1 - e^{-\alpha}}{1 - e^{-2\pi}} =: p(\alpha)$$

as $N \to +\infty$. To compute $\varrho$, we apply that to an interval $I$ of the form $[\theta, \theta_0] = [\theta_0 - \alpha, \theta_0]$ and we take the derivative

$$\varrho(\theta) = -\frac{d\,p(\theta_0 - \theta)}{d\theta} = -\frac{d}{d\theta} \frac{1 - e^{-(\theta_0 - \theta)}}{1 - e^{-2\pi}} = \frac{e^{\theta - \theta_0}}{1 - e^{-2\pi}}, \qquad \theta \in [\theta_0 - 2\pi, \theta_0).$$

In the spirit of Whitney's counterexample [11] to the Morse–Sard theorem, we construct a function $f\colon \mathbb{R}^2 \to \mathbb{R}$ and a bounded subgradient sequence $\{x_i\}_i$ satisfying:

C1. $f$ is path-differentiable.

C2. $f(\operatorname{crit} f) \supset f(\operatorname{ess\,acc}\{x_i\}_i) = f(\operatorname{acc}\{x_i\}_i) = [0, 1]$.

C3. $\operatorname{acc}\{x_i\}_i$ is not contained in $\operatorname{crit} f$, and $\operatorname{ess\,acc}\{x_i\}_i \neq \operatorname{acc}\{x_i\}_i$.

C4. $\{x_i\}_i$ and $\{f(x_i)\}_i$ do not converge.

C5. The Hausdorff dimensions of $\operatorname{ess\,acc}\{x_i\}_i$ and $\operatorname{acc}\{x_i\}_i$ are greater than 1 and satisfy (2).

C6. There are points $x$ and $y$ in $\operatorname{acc}\{x_i\}_i$ such that we can take subsequences $\{x_{i_k}\}_k$ and $\{x_{i'_k}\}_k$ converging to $x$ and $y$, respectively, with $i_k < i'_k < i_{k+1}$ for all $k$ and

$$\sup_k \sum_{p = i_k}^{i'_k} \varepsilon_p < +\infty.$$

C7. There is no oscillation compensation on $\operatorname{acc}\{x_i\}_i \setminus \operatorname{ess\,acc}\{x_i\}_i$. This means, precisely, that there is a continuous function $Q\colon \mathbb{R}^n \to [0, 1]$ such that

$$\liminf_{N \to +\infty} \left\| \frac{\sum_{i=0}^{N} \varepsilon_i v_i Q(x_i)}{\sum_{i=0}^{N} \varepsilon_i Q(x_i)} \right\| > 0. \tag{10}$$

Crucially, since we strive to show that the dynamics on $\operatorname{acc}\{x_i\}_i \setminus \operatorname{ess\,acc}\{x_i\}_i$ may be very different from the one displayed on $\operatorname{ess\,acc}\{x_i\}_i$, we are not requiring the condition from [5, Theorem 6(ii)], namely, the existence of a sequence $\{N_j\}_j$ with

$$\liminf_{j \to +\infty} \frac{\sum_{i=1}^{N_j} \varepsilon_i Q(x_i)}{\sum_{i=0}^{N_j} \varepsilon_i} > 0,$$

which would force the focus to be on the dynamics around $\operatorname{ess\,acc}\{x_i\}_i$.

C8. The oscillations near $\operatorname{acc}\{x_i\}_i \setminus \operatorname{ess\,acc}\{x_i\}_i$ are not asymptotically perpendicular to $\operatorname{acc}\{x_i\}_i$.

Figure 2: The first three steps $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ in the construction of the set $\Gamma$.

Outline.

To construct the function $f$, we will first define a fractal curve $\Gamma$ and $f$ on it, aiming to have $\Gamma \subset \operatorname{crit} f$ and $f(\Gamma) = [0, 1]$. We will then add a curve $J$ such that $\Gamma \cup J$ is a closed loop and $J$ only intersects $\operatorname{crit} f$ at its endpoints. We will construct an auxiliary path-differentiable function $h$ coinciding with $f$ on the curve $\Gamma$, and in Lemma 12 we will prove some properties of $h$. We will next construct a series of loops $\{T_i\}_i$ that will help us define the sequence $\{x_i\}_i$, which we carefully specify so that it is almost a subgradient sequence of $h$. The dynamics of $\{x_i\}_i$ around $\Gamma$ will mimic that of the sequence in the example of Section 2, and near $J$ it will instead move relatively fast. To obtain $f$, we modify $h$ slightly in a way that ensures that $\{x_i\}_i$ is a subgradient sequence. In Proposition 13 we show that $f$ has certain properties, which we will finally link, in our concluding remarks, to claims C1–C8 above.

The reader will find this example easier to follow after having looked at the construction of Section 2.1. The role of the function $\phi$ in that construction is taken by the function $h$ in the one presented below.

A fractal curve.

Pick $1/4 < \alpha \leqslant$ . We begin by constructing a set $\Gamma \subset \mathbb{R}^2$ recursively as illustrated in Figure 2. For the first step, we pick four disjoint squares of side $\alpha$ inside the unit square, and we let $\Gamma_1$ be the closed set consisting of the five disjoint paths joining the left and bottom sides of the unit square with those four squares successively, as in the figure. In each of the following inductive steps, we rescale the set $\Gamma_i$ we had for the previous step and we place new copies inside each of the four squares, perhaps rotated by an angle $\pi/2$, so that the paths making up $\Gamma_1$ connect with those of each rescaled copy of $\Gamma_i$. The set $\Gamma_{i+1}$ is then the union of $\Gamma_1$ with the four rescaled and appropriately rotated copies of $\Gamma_i$. This defines an increasing sequence of sets $(\Gamma_i)_{i\in\mathbb{N}}$, and we let $\Gamma = \overline{\bigcup_i \Gamma_i}$.

We proceed to parameterize $\Gamma$ with a continuous curve $\varphi : [0,1] \to \mathbb{R}^2$. To do this, we will imitate the procedure in the construction of the Cantor staircase. Thus, we first divide $[0,1]$ into nine contiguous intervals of equal length, namely, the nine intervals (we write in base 9)
$$[0, 0.1),\ [0.1, 0.2),\ \dots,\ [0.7, 0.8),\ [0.8, 1].$$
We define the map $\varphi$ on each of the five odd-numbered intervals
$$[0, 0.1),\ [0.2, 0.3),\ [0.4, 0.5),\ [0.6, 0.7),\ [0.8, 1]$$
to map the corresponding interval to one of the five paths making up $\Gamma_1$ (in Figure 2, these are the five blue curves in the left-hand diagram). Then iteratively, at step $i$, we divide each of the remaining intervals into nine equal subintervals, and we map the odd-numbered subintervals into the pieces of $\Gamma_i \setminus \Gamma_{i-1}$. Thus for example the interval $[0.3, 0.4)$ gets divided into
$$[0.30, 0.31),\ [0.31, 0.32),\ \dots,\ [0.37, 0.38),\ [0.38, 0.4),$$
and the images of the intervals $[0.30, 0.31)$ and $[0.38, 0.4)$ will touch the images of the intervals $[0.2, 0.3)$ and $[0.4, 0.5)$, but the intervals
$$[0.32, 0.33),\ [0.34, 0.35),\ [0.36, 0.37)$$
will not touch the image of the curve defined in the previous step; refer to the middle diagram in Figure 2. The map $\varphi$ is the unique continuous extension of the thus-defined function.

The resulting curve $\varphi$ has infinite arc length. Indeed, at each construction step of the $\Gamma_i$, the paths in $\Gamma_i \setminus \Gamma_{i-1}$ are contained in $4^{i-1}$ squares, each of them contributing an increase of at least $2\alpha^i$ in the total length. This results in a global increase of at least $(4\alpha)^i/2 > 1/2$ at the $i$-th step.

Let $p : [0,1] \to \mathbb{N} \cup \{+\infty\}$ be the function that assigns to a number $t$ the position of the first appearance of an even digit after the decimal point in its base 9 expansion, so that for example $p(0.2) = 1$ and $p(0.1352) = 4$. Thus if $t \in [0,1]$ and $p(t) < +\infty$, then $\varphi(t) \in \Gamma_{p(t)} \setminus \Gamma_{p(t)-1}$, and if $p(t) = +\infty$ then $\varphi(t)$ is a point in the Cantor set $\Gamma \setminus \bigcup_i \Gamma_i$ at the intersection of all the squares used in the construction.

Lemma 11.

The Hausdorff dimension of $\Gamma$ is $\log_\alpha \tfrac{1}{4}$.

Proof. The definition of Hausdorff dimension was recalled in Q10 in Section 1.

Let r >

0. As explained above, the length of Γ i \ Γ i − is at least (4 α ) i /

2. Thus, a lower boundon the number of balls of radius r necessary to cover Γ i \ Γ i − is (4 α ) i / r − i such that α i > r , i.e., i < log α r . We have, for d > H d (Γ) (cid:62) H d ( (cid:83) i Γ i ) (cid:62) lim inf r → α r − (cid:88) i =1 r d (cid:18) (4 α ) i r − (cid:19) = lim inf r → r d − (4 α ) log α r − α − − (log α r − r d = lim inf r → r d − (4 α ) log α r α −

1= lim inf r → / α − d − α (4 α )) log r )= lim inf r → / α − (cid:0) d − log α (cid:1) log r ) . H d (Γ) = 0 it is necessary that d > log α because this lim inf must vanish andlog r → −∞ . This translates to dim Γ (cid:54) log α .Let us prove the opposite inequality. For r > i \ Γ i − with A (4 α ) i /r balls of radius r for i such that α i (cid:62) r ; here A > Aα i is an upper bound for the contribution of thepaths in each of the 4 i − squares added. Since Γ \ (cid:83) i Γ i is also the intersection of the squares in theconstruction above, we know that it can be covered by 4 k balls of radius 2 α k , and these balls will coverthe remaining part of (cid:83) i Γ i . Hence we have, with r = 2 α k and its consequence log α r = k + log α H d (Γ) (cid:54) H d (Γ \ (cid:83) i Γ i ) + H d ( (cid:83) i Γ i ) (cid:54) lim inf k → + ∞ k r d + lim inf k → + ∞ log α r (cid:88) i =1 r d A (4 α ) i r = lim inf k → + ∞ k (cid:0) α k (cid:1) d + lim inf k → + ∞ k +log α (cid:88) i =1 (2 α k ) d A (4 α ) i α k (cid:54) lim inf k → + ∞ e k (log 4+ d log α )+ d log 2 + lim inf k → + ∞ A (2 α k ) d − (4 α ) k +log α − α − , which vanishes unless log 4 + d log α >

0, that is, unless $d < \log_\alpha \tfrac14$. This gives $\dim \Gamma \geqslant \log_\alpha \tfrac14$.

Defining $f$ on $\Gamma \cup J$. We define $f$ on $\Gamma$ imitating the construction of the Cantor staircase as follows. For a point $q \in \Gamma_i$, we let $t = \varphi^{-1}(q)$, and we express $t$ in base 9, so that the first $k = p(t) - 1$ digits $a_1, a_2, \dots, a_k$ in the base 9 expansion $t = (0.a_1 a_2 a_3 \dots)_9$ are odd. We then let, for $1 \leqslant i \leqslant k$, $b_i = (a_i - 1)/2$, and $f(q) = (0.b_1 b_2 \dots b_k)_4$ in base 4. The values so assigned for $f$ are illustrated in Figure 3. The reader will convince herself that with this definition, $f$ is constant on each path-connected component of $\bigcup_i \Gamma_i$ and can be uniquely extended to a continuous function on all of $\Gamma$.

Figure 3: Values of $f$ on the set $\Gamma$ . All numbers are in base 4.

We remark that the function $f \circ \varphi : [0,1] \to [0,1]$ is not absolutely continuous. Indeed, $f \circ \varphi$ is constant on each interval on which $p$ is constant, so its derivative $(f \circ \varphi)'$ vanishes almost everywhere on $[0,1]$, yet $f \circ \varphi$ is not constant; this would contradict the fundamental theorem of calculus, which is valid for absolutely continuous functions.

Let $J \subset \mathbb{R}^2 \setminus ((0,1) \times (0,1))$ be a smooth curve joining the two endpoints of $\Gamma$. We define $f$ on $J$ to smoothly and strictly monotonically take the values between 0 and 1, keeping $f$ continuous.

Lipschitz continuity of $f$ on $\Gamma \cup J$. Let $j : (0,1) \times (0,1) \to \mathbb{N} \cup \{+\infty\}$ be, for each pair of points $s$ and $t$ in $(0,1)$, the position of the first digit in the base 9 expansions of $s$ and $t$ that differs; thus for example $j(0.13, 0.15) = 2$. Since we have $|s - t| > 9^{-j(s,t)-1}$, as $|s - t| \to 0$ we have $j(s,t) \to +\infty$. Note also that if $j(s,t) > 0$, then $\varphi(s)$ and $\varphi(t)$ must be contained in the same square of side $\alpha^{j(s,t)-1}$. Thus, for some $A, B > 0$, $|\varphi(s) - \varphi(t)| \geqslant A\alpha^{j(s,t)}$ and $|f \circ \varphi(s) - f \circ \varphi(t)| \leqslant B\,4^{-j(s,t)}$. Thus if $x, y \in \Gamma$, letting $s$ and $t$ be such that $\varphi(s) = x$ and $\varphi(t) = y$, we have
$$\|x - y\| = |\varphi(s) - \varphi(t)| \geqslant A\alpha^{j(s,t)}$$
or, equivalently,
$$-j(s,t) \leqslant -\log_\alpha \frac{\|x - y\|}{A}.$$
Also,
$$|f(x) - f(y)| = |f \circ \varphi(s) - f \circ \varphi(t)| \leqslant B\,4^{-j(s,t)} \leqslant B\,4^{-\log_\alpha \frac{\|x-y\|}{A}} = B A^{-\log_\alpha \frac14}\, \|x - y\|^{\log_\alpha \frac14} \leqslant B A^{-\log_\alpha \frac14}\, \|x - y\|,$$
because $\alpha > 1/4$ implies $\log_\alpha \tfrac14 > 1$. This, together with the smoothness of $f$ on $J$, implies that $f$ on $\Gamma \cup J$ is Lipschitz. Let $\operatorname{Lip}(f|_{\Gamma \cup J})$ be the Lipschitz constant of $f$ on $\Gamma \cup J$.

The auxiliary function $h$. Let $C$ be a connected component of $J \cup \Gamma_i$ for some $i$, without its endpoints. As such, $C$ is a smooth, non-self-intersecting curve, diffeomorphic to an open interval. As is well known (see for example [10, p. 109]) there exists a tubular neighborhood $W_C$ around $C$, by which we mean specifically:
• there is an open set $W_C \subset \mathbb{R}^2$ that contains $C$,
• there is an open set $U \subset \mathbb{R}^2$ of the form $(a, b) \times (-c, c)$ for some $a, b, c \in \mathbb{R}$, $c > 0$, and
• there is a smooth, bijective function $\phi_C : U \to W_C$ such that
  – the map $x \mapsto \phi_C(x, 0)$ is a parameterization of $C$ by arclength,
  – the map $y \mapsto \phi_C(x, y)$ is a parameterization, by arclength, of the segment perpendicular to $C$ and passing through $\phi_C(x, 0)$.

We will refer to $\phi_C$ as the chart of $W_C$, and to the number $c > 0$ as the thickness of $W_C$.

The statement of existence of the tubular neighborhoods $W_C$ is obvious if we choose all $\Gamma_i$ and $J$ to be composed of straight line segments and circle arcs, so readers unfamiliar with the general case may assume that this is the case.

Lemma 12.

There is a function $h : \mathbb{R}^2 \to \mathbb{R}$ such that

i. $h$ is locally Lipschitz and path-differentiable.

ii. $h$ coincides with $f$ on $\Gamma \cup J$.

iii. $h$ is $C^1$ on $\mathbb{R}^2 \setminus (\Gamma \cup J)$.

iv. On a tubular neighborhood $W_C$ of each connected component $C$ of $J \cup \Gamma_i$, $h$ is defined by
$$h(\phi_C(x, y)) = f(\phi_C(x, 0)) + 2L|y|, \tag{11}$$
where $\phi_C$ is the chart of $W_C$. Hence $h$ is piecewise $C^\infty$ in $W_C$, with the singular locus of $h$ within $W_C$ coinciding exactly with $C$.

v. Let $L = \operatorname{Lip}(f|_{\Gamma \cup J})$. If $p \in \Gamma_i$ for some $i$, and if $n$ is a unit vector normal to $\Gamma_i$ at $p$, then
$$\partial^c h(p) = \{\lambda n : -2L \leqslant \lambda \leqslant 2L\}, \qquad p \in \textstyle\bigcup_i \Gamma_i.$$
More precisely, the gradients of $h$ on each side of $\Gamma_i$ at $p$ are asymptotically equal to $2Ln$ and $-2Ln$, respectively, pointing away from $\Gamma_i$.
Similarly, if now $p \in J$, $n$ is a unit vector normal to $J$ at $p$, and $t$ is the unit vector tangent to $J$ at $p$ that points in the clockwise direction (for the loop $\Gamma \cup J$), and if $a > 0$ is the magnitude of the derivative of $f|_J$ at $p$, then
$$\partial^c h(p) = \{-at + \lambda n : -2L \leqslant \lambda \leqslant 2L\}, \qquad p \in J.$$
More precisely, the gradients of $h$ on each side of $J$ at $p$ are asymptotically equal to $-at + 2Ln$ and $-at - 2Ln$, respectively, pointing away from $J$.

vi. The norm of the Hessian of $h$ is bounded on each connected component of $W_C \setminus C$, for $C$ and $W_C$ as in item (iv).

vii. $\Gamma \subseteq \operatorname{crit} h$.

Figure 4: The loops $T_0$, $T_1$, $T_2$, used to define the subgradient sequence $\{x_i\}_i$. The loop $T_i$ contains $J$ and, for $i > 0$, it also contains $\Gamma_i$.

This lemma will be proved in Appendix A.
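As a sanity check on items (iv) and (v), here is a minimal numerical sketch of the simplest situation the text allows: a straight component $C$ lying on the line $y = 0$, so that the chart is the identity. It only illustrates formula (11) and is not the construction of Appendix A. The values of `L` and `F_ON_C`, and all function names, are assumptions made for illustration.

```python
L = 1.0        # assumed Lipschitz constant of f on Gamma ∪ J (illustrative value)
F_ON_C = 0.25  # assumed constant value of f along the straight component C


def h(x, y):
    """Model of (11) with an identity chart: h(x, y) = f(x, 0) + 2L|y|."""
    return F_ON_C + 2.0 * L * abs(y)


def gradient(func, x, y, step=1e-6):
    """Central finite-difference gradient; only meaningful away from y = 0."""
    dx = (func(x + step, y) - func(x - step, y)) / (2.0 * step)
    dy = (func(x, y + step) - func(x, y - step)) / (2.0 * step)
    return dx, dy


def one_sided_gradients(x, offset=1e-3):
    """Gradients of h just above and just below the point (x, 0) of C."""
    return gradient(h, x, offset), gradient(h, x, -offset)


above, below = one_sided_gradients(0.5)
print("gradient above C:", above)  # close to (0,  2L)
print("gradient below C:", below)  # close to (0, -2L)

# The Clarke subdifferential at (x, 0) is the segment joining these two limits.
# Their midpoint is (numerically) zero, so the point of C is critical.
midpoint = (0.5 * (above[0] + below[0]), 0.5 * (above[1] + below[1]))
print("midpoint of the two limits:", midpoint)
```

The two one-sided limits play the role of $2Ln$ and $-2Ln$ in item (v), and the vanishing midpoint is the reason every point of such a component is Clarke-critical.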

A skeleton curve for the sequence.

We shall now deﬁne a sequence of smooth loops T , T , . . . that will guide the trajectory of the sequence { x i } i . Figure 4 illustrates the shapes of the ﬁrst elementsof the sequence of closed curves that we now proceed to construct.The ﬁrst one, T , will simply be a small loop around the origin, containing J = T \ ((0 , × (0 , , × [0 , i >

0, the path T i will be equal to Γ i ∪ J together with some small circular arcs glued to closeup the loose ends in such a way that we obtain a smooth loop that does not touch the 4 i +1 smallersquares of side α i +1 involved in the construction of Γ i +1 . Speciﬁcation of the sequence { x i } i . Unlike what we did for the example described in Section2, we will not try here to deﬁne { x i } i explicitly; instead, we will take the lesson from that exampleas to what this sequence should look like. We pick { x i } i to be a sequence of distinct points with (cid:107) x i +1 − x i (cid:107) → T , T , . . . . Thus, the sequence will start near J , it will go around T a few times, and while it is at J , it will start going around T , which it will doa few times, and then T , and so on. 18et L = Lip( f | Γ ∪ J ) and let I , I , · · · ⊂ N be the intervals during which x i will be going aroundeach of the paths T j , respectively. We will choose an initial value i > { x i } ∞ i = i will satisfy:S1. Not self-accumulating. We require the sequence { x i } i to be such that, for each i , there is some r > B r ( x i ) ∩ { x j } j (cid:54) = i = ∅ .S2. If i ∈ I j then 2 iL (cid:54) thickness( W C )for all connected components C of Γ j ∪ J .S3. Bouncing. If j > i, i + 1 ∈ I j , the points x i and x i +1 are on opposite sides of T j .S4. Distance to T j . For i ∈ I j , we require the points x i to remain at a distance (cid:12)(cid:12)(cid:12)(cid:12) dist( x i , T j ) − iL (cid:12)(cid:12)(cid:12)(cid:12) (cid:54) i . S5. Around Γ j ∪ J . Recall from Lemma 12 that h is piecewise smooth near Γ j ∪ J . If j > i ∈ I j and the closest point of T j to x i is y ∈ Γ j ∪ J , and if t is the unit vector tangent to Γ j ∪ J pointing in the clockwise direction, then we require h to be diﬀerentiable at x i and (cid:13)(cid:13)(cid:13)(cid:13) [ x i +1 − x i + 1 iL ∇ h ( x i )] · t − i log i (cid:13)(cid:13)(cid:13)(cid:13) (cid:54) i . S6. Around the circle arcs T j \ (Γ j ∪ J ). 
If j > i ∈ I j , and the point y of T j closest to x i is in T j \ (Γ j ∪ J ), and if t is a unit vector tangent to T j at y pointing in the clockwise direction, werequire 3 iL (cid:54) ( x i +1 − x i ) · t (cid:54) iL S7. Small jumps. For all i in the situation of S5, (cid:107) x i +1 − x i + 1 iL ∇ h ( x i ) (cid:107) (cid:54) iL . For all i in the situation of S6, (cid:107) x i +1 − x i (cid:107) (cid:54) iL . Let us explain how such a sequence can be constructed. First, we choose i > C is the connected component of J ∪ Γ containing J , then 2 /i L (cid:54) thickness( W C ). We thenchoose x i in W C such that the point of C closest to x i is in J , and such that S4 is satisﬁed with j = 0.By induction, assuming that for some i (cid:62) i we have chosen x i satisfying S1-S7, we let x i +1 bea point in the component on the opposite side of T j (thus complying with S3) of the nonempty set X i determined by S4 and S7 together with either S5 or S6, depending on the location of x i . Theset X i is indeed nonempty because the inequality in S4 determines two stripes going parallel to T j ,while S5 and S6 determine stripes perpendicular to T j . So they intersect (with at least one connectedcomponent of the intersection on each side of T j ) as long as the step size is small enough with respectto the curvature of T j ; this can be ensured in the case of j = 0 by increasing i , and in the caseof j > T j − before moving on to T j .19lthough the intersection of the condition in S4 and those of either S5 or S6 may also include pointslocated far from x i , S7 forces the choose a connected component that is directly ahead along T j , andit is impossible that the sequence will jump very far. Thus S3–S7 can be complied with.To see that S1 can be complied with as well, note that, by S4, together with S5 and S6, a ball ofradius r = 1 / i works automatically once the other conditions have been satisﬁed. 
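The following toy model may help to see why such a sequence is only almost a subgradient sequence of $h$. It is a sketch under loudly stated assumptions, not the sequence of the paper: the loop is replaced by the straight line $y = 0$, $h$ is replaced by $2L|y|$, the distance to the line is assumed to be exactly $1/i$, and the tangential drift is taken to be $1/(i \log i)$ as in S5. The step sizes are $\varepsilon_i = 1/(iL)$. All names and constants are hypothetical.

```python
import math

L = 1.0  # assumed Lipschitz constant (illustrative value)


def grad_h(y):
    """Gradient of the model h(x, y) = 2L|y| at a point with y != 0."""
    return 0.0, 2.0 * L * math.copysign(1.0, y)


def toy_bouncing_sequence(first_index, n_points):
    """Points (i, x_i, y_i) alternating sides of the line y = 0 at the
    assumed distance 1/i, while drifting along it by 1/(i log i) per step."""
    points = []
    x, side = 0.0, 1.0
    for i in range(first_index, first_index + n_points):
        points.append((i, x, side / i))
        x += 1.0 / (i * math.log(i))  # tangential drift
        side = -side                  # bounce to the opposite side
    return points


def deviation_from_h_step(points):
    """Return (i, tangential, normal) components of
    x_{i+1} - x_i + eps_i * grad h(x_i), with eps_i = 1/(iL)."""
    deviations = []
    for (i, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        eps = 1.0 / (i * L)
        gx, gy = grad_h(y0)
        deviations.append((i, x1 - x0 + eps * gx, y1 - y0 + eps * gy))
    return deviations


points = toy_bouncing_sequence(first_index=10, n_points=1000)
deviations = deviation_from_h_step(points)
i, tangential, normal = deviations[-1]
print(i, tangential, normal)
```

In this model the normal deviation from an exact subgradient step of $h$ has size $1/(i(i+1))$, and the tangential deviation is the prescribed drift $1/(i \log i)$. Both are small compared with the step length, which is of order $1/i$.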
To ensure S2 istrue, we let the sequence go around each T j a few times until i grows enough that the inequality inS2 becomes true.We remark that the precise form of S6 will not be used explicitly, and its only purpose is to keepthe sequence moving around the circular arcs T j \ (Γ j ∪ J ) at a moderate rate. Construction of f . Choose real numbers { r i } i ⊂ R such that 0 < r i < /i and such that the disk B r i ( x i ) of radius 3 r i centered at x i does not intersect Γ and all the disks B r i ( x i ) are disjoint. Thisis possible because of our speciﬁcation S1.Let ψ : R → [0 ,

1] be a C ∞ function with radial symmetry, ψ ( x ) = ψ ( y ) for (cid:107) x (cid:107) = (cid:107) y (cid:107) , such that ψ ( x ) = 1 for x ∈ B (0), ψ ( x ) = 0 for (cid:107) x (cid:107) (cid:62)

2, and decreases monotonically on rays emanating fromthe origin. Let ψ i ( x ) = ψ (cid:16) x − x i r i (cid:17) , so that ψ i equals 1 on B r i ( x i ) and vanishes outside B r i ( x i ). Denote by Lip( ψ ) > ψ , and by Lip( ∇ ψ ) > ψ i are pairwise disjoint and Lip( ∇ ψ i ) = r i Lip( ∇ ψ ).Let v i = iL ( x i − x i +1 ) , and deﬁne, for h as in Lemma 12, V i ( x ) = ( x − x i ) · v i + h ( x i ) . Proposition 13.

Let
$$f(x) = \Big(1 - \sum_{i=0}^{\infty} \psi_i(x)\Big) h(x) + \sum_{i=0}^{\infty} \psi_i(x) V_i(x). \tag{12}$$
Then we have

i. $f$ is piecewise $C^\infty$ in a tubular neighborhood $W_C$ of each connected component $C$ of $\Gamma_i \cup J$, $i > 0$.

ii. $\{x_i\}_i$ is a subgradient sequence for $f$ with stepsizes $\varepsilon_i = \frac{1}{iL}$. In particular, $\sum_i \varepsilon_i = +\infty$ and $\sum_i \varepsilon_i^2 < +\infty$.

iii. $\operatorname{acc}\{x_i\}_i = \Gamma \cup J$.

iv. Let $p$ be a point in $\Gamma_i \cup J$ for some $i > 0$. Then $\partial^c f(p) = \partial^c h(p)$.

v. The critical set of $f$ contains $\Gamma$, but $J \cap \operatorname{crit} f$ consists only of the two endpoints of $J$.

vi. $f$ is locally Lipschitz and path-differentiable.

Proof. By item (iv) in Lemma 12 we know that $h$ is piecewise $C^\infty$ in a tubular neighborhood $W_C$ of each connected component $C$ of $\Gamma_i \cup J$. Item (i) then follows from the facts that $V_i$ is $C^\infty$ and that the supports of the functions $\psi_i$ are pairwise disjoint, and the form of (12).

From (12) and the fact that $\nabla \psi_j(x_i) = 0 = 1 - \sum_k \psi_k(x_i)$ for all $i$ and all $j$, we have
$$\nabla f(x_i) = \nabla V_i(x_i) = v_i.$$
Thus $\partial^c f(x_i) = \{v_i\}$ and
$$x_i - \varepsilon_i v_i = x_i - \frac{1}{iL}\, iL\, (x_i - x_{i+1}) = x_{i+1},$$
which is the statement of item (ii).

Note that S5 and S6 force the sequence to always advance around each $T_j$ and finish the loop. Item (iii) is then clear from the construction of $\Gamma$ and the loops $T_j \supset J$, together with the specification S4 that forces the sequence $\{x_i\}_i$ to get ever closer to $\Gamma \cup J$.

Let us prove item (iv). Fix $j > 0$ and $p \in \Gamma_j \cup J$, and denote by $C$ the connected component of $\Gamma_j \cup J$ that contains $p$. Consider a point $y$ near $p$. In particular, we may assume that $y$ is not in the situation described in S6. If $y \notin \bigcup_i B_{2r_i}(x_i)$, then $f = h$ on a neighborhood of $y$ and we have nothing to prove. Otherwise, we have $y \in B_{2r_i}(x_i)$ for some i ≥

0, and by S2 we may assume B r i ( x i ) iscontained in the neighborhood W C of item (iv) in Lemma 12. Item (vi) in Lemma 12 means that thederivative of ∇ h (the Hessian of h ) is bounded on W C , which means in particular that ∇ h is Lipschitzin W C ; in other words, there is some K >

0, depending only on C such that, for all z ∈ B r i ( x i ), (cid:107)∇ h ( z ) − ∇ h ( x i ) (cid:107) (cid:54) K (cid:107) z − x i (cid:107) . Note that it follows from S3, S4, and S5 and item (v) of Lemma 12 that, if i is large enough, ε i (cid:13)(cid:13)(cid:13)(cid:13) v i − ∇ h ( x i ) − L log i t (cid:13)(cid:13)(cid:13)(cid:13) (cid:54) i . (13)By (12), the Lipschitzity of ∇ ψ , the fact that 0 (cid:54) ψ i (cid:54)

1, a Taylor expansion with w a point in thesegment joining x i and y , the deﬁnition of K , the Cauchy-Schwarz and triangle inequalities, and (13), (cid:107)∇ f ( y ) − ∇ h ( y ) (cid:107) = (cid:107)∇ [(1 − ψ i ( y )) h ( y ) + ψ i ( y ) V i ( y )] − ∇ h ( y ) (cid:107) = (cid:107)∇ ψ i ( y )( V i ( y ) − h ( y )) + ψ i ( y ) ( ∇ V i ( y ) − ∇ h ( y )) (cid:107) (cid:54) Lip( ∇ ψ i ) | V i ( y ) − h ( y ) | + (cid:107) v i − ∇ h ( y ) (cid:107) (cid:54) r i Lip( ∇ ψ )( (cid:107) h ( x i ) + v i · ( y − x i ) − h ( x i ) − ∇ h ( w ) · ( y − x i ) (cid:107) ) + (cid:107) v i − ∇ h ( y ) (cid:107) (cid:54) r i Lip( ∇ ψ ) (cid:107) v i − ∇ h ( w ) (cid:107) (cid:107) y − x i (cid:107) + (cid:107) v i − ∇ h ( y ) (cid:107) (cid:54) r i Lip( ∇ ψ )(2 r i )( (cid:107) v i − ∇ h ( x i ) (cid:107) + (cid:107)∇ h ( x i ) − ∇ h ( w ) (cid:107) )+ (cid:107) v i − ∇ h ( x i ) (cid:107) + (cid:107)∇ h ( x i ) − ∇ h ( y ) (cid:107) (cid:54) ∇ ψ )( (cid:107) v i − ∇ h ( x i ) (cid:107) + 2 Kr i ) + (cid:107) v i − ∇ h ( x i ) (cid:107) + 2 Kr i (cid:54) (2 Lip( ∇ ψ ) + 1) (cid:18) L log i + 2 Kr i (cid:19) → , as y → p because, in that case, i → + ∞ . So item (iv) follows.21tem (v) follows from item (iv) together with the same being true for h ; see items (v) and (vii) inLemma 12.Item (vi) follows from item (i) in Lemma 12, the form of (12) on R \ (Γ ∪ J ), which ensures thatthe path diﬀerentiability of h is inherited by f on that region, and from item (iv) above, which ensuresthe modiﬁcation (12) of h does not change the path diﬀerentiability property on Γ ∪ J . Lemma 14. ess acc { x i } i = Γ .Proof. We will ﬁrst show that (cid:83) i Γ i ⊆ ess acc { x i } i , and from the fact that ess acc { x i } i is closed it willfollow that Γ is contained in it. We use the notation P ≈ Q to mean that P/Q → j > p ∈ Γ j that is not an endpoint of the connected component C of Γ j containing p , and { N i } i ⊂ N be a subsequence such that lim i x N i = p . 
Let α > p and the closest of the two endpoints of C . Let also { M i } i ⊂ N be a subsequence such that q = lim i x M i is a point on C at arclength α from p , M i < N i for all i , and dist( x k , C ) < /kL for all M i (cid:54) k (cid:54) N i .In view of S4, for each i the sequence x M i , x M i +1 , . . . , x N i is bouncing around the segment of C oflength α that starts at q and ends at p . By item (v) of Lemma 12, we know that ∂ c h on the points of C contains only vectors that are normal to C , so S5 implies that α (cid:54) N i (cid:88) k = M i i log i ≈ log log N i − log log M i . This means that log M i (cid:54) exp(log log N i − α ) = e − α/ log N i . Hence also N i (cid:88) k = M i ε k = N i (cid:88) k = M i kL ≈ L (log N i − log M i ) (cid:62) L (1 − e − α/ ) log N i . Similarly, N i (cid:88) k =1 ε k = N i (cid:88) k =1 kL ≈ L log N i . Thus the lim sup in the deﬁnition (1) of ess acc { x i } i is at least 1 − e − α/ >

0. This proves that p ∈ ess acc { x i } i , and thus also that Γ ⊆ ess acc { x i } i .In view of item (iii) of Proposition 13 and the fact that ess acc { x i } i ⊆ acc { x i } i = Γ ∪ J , we nowneed to show that if p (cid:48) ∈ J \ Γ then p (cid:48) / ∈ ess acc { x i } i . For such p (cid:48) we pick an open ball U containing p (cid:48) such that U ∩ Γ = ∅ and κ U := inf x ∈ U dist(0 , ∂ c f ( x )) > , as is possible because of item (iv) of Proposition 13, together with the fact that f is strictly monotonouson J . Let a > J ∩ U . Then from S5 it follows that if i < i are such that forall i (cid:54) k (cid:54) i we have x k ∈ U , while x i − , x i +1 / ∈ U , then2 a (cid:62) i (cid:88) k = i ε k (cid:107) v k (cid:107) (cid:62) i (cid:88) k = i ε k κ U For (cid:96) >

0, let p (cid:96) denote the number of times the sequence goes around T (cid:96) . If N > I j , so thatthe sequence is bouncing around T j , then (cid:88) x i ∈ Ui (cid:54) N ε k (cid:54) aκ U j (cid:88) (cid:96) =0 p (cid:96) .
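The estimates in this proof rest on two elementary asymptotics: $\sum_{k \leqslant N} 1/k \approx \log N$ (which gives $\sum_{k \leqslant N} \varepsilon_k \approx \frac{1}{L} \log N$) and $\sum_{2 \leqslant k \leqslant N} 1/(k \log k) \approx \log\log N$. The short script below is only a numerical illustration of these two facts, with hypothetical function names. It checks that the differences between the sums and their logarithmic approximations stabilize, which implies that the ratios tend to 1.

```python
import math


def partial_sums(n_max):
    """Return (sum_{k=1}^{n_max} 1/k, sum_{k=2}^{n_max} 1/(k log k))."""
    harmonic = 1.0
    iterated = 0.0
    for k in range(2, n_max + 1):
        harmonic += 1.0 / k
        iterated += 1.0 / (k * math.log(k))
    return harmonic, iterated


def offsets(n_max):
    """Differences between the two sums and log N, log log N respectively."""
    harmonic, iterated = partial_sums(n_max)
    return harmonic - math.log(n_max), iterated - math.log(math.log(n_max))


for n in (10**3, 10**4, 10**5):
    print(n, offsets(n))
```

Since $\log\log N$ grows extremely slowly, the ratio of the second sum to $\log\log N$ is still far from 1 at these values of $N$; the bounded offset is the more informative quantity to look at.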

22n the other hand, to estimate j as a function of N we compute a lower bound of the length of thepath traversed by x , . . . , x N , j − (cid:88) i =1 p i arc length Γ i (cid:62) j − (cid:88) i =1 p i i (cid:88) k =1 (4 α ) k j − (cid:88) i =1 p i (4 α ) i +1 − α − (cid:62) j − (cid:88) i =1 (4 α ) i +1 − α − − α ) ((4 α ) j +1 − α + ( j − − α ))= 12(1 − α ) (4 α ) j +1 + O ( j ) . To turn this lower bound on the length of the path into an lower bound of the number N of steps weuse S5 and the fact that ∇ h is normal to Γ k , so that we have12(1 − α ) (4 α ) j +1 + O ( j ) (cid:54) N (cid:88) k =2 k log k ≈ log log N. Whence j (cid:54) A log log log N for some A >

0, and (1) can be bounded by (cid:80) x k ∈ Uk (cid:54) N ε k (cid:80) Ni =1 ε k (cid:54) a/κ U (log N ) /L j (cid:88) (cid:96) =0 p (cid:96) = O (cid:32) (cid:80) j(cid:96) =0 p (cid:96) e e j (cid:33) . Because of the fractal form of the construction of Γ, we see that the thickness of the tubular neigh-borhoods around the connected components of Γ i ∪ J and those around the connected componentsaround Γ i +1 ∪ J are related by a factor α . From our calculation above we conclude that, the numberof steps it takes to traverse each T j increases rapidly, so that in view of S2, we see that p (cid:96) can beuniformly bounded. This means that (cid:80) j(cid:96) =0 p (cid:96) (cid:54) Cj for some C >

0, and hence, as j → + ∞ , (cid:80) j(cid:96) =0 p (cid:96) e e j (cid:54) Cje e j → . This proves that J \ Γ is not in ess acc { x i } i , and concludes the proof of the lemma. Conclusion.

Claim C1 was proved as item (vi) of Proposition 13.

It follows from item (v) in Proposition 13 and Lemma 14 that $\Gamma = \operatorname{ess\,acc}\{x_i\}_i \subset \operatorname{crit} f$, and since $f(\Gamma) = [0,1]$, $f$ satisfies claim C2.

Claim C3 is true by item (iii) of Proposition 13 and Lemma 14.

Since $f(\Gamma) = [0,1]$ and $\{x_i\}_i$ bounces endlessly around $\Gamma \cup J$ by item (iii) in Proposition 13, the sequence $\{f(x_i)\}_i$ also does not converge, which is claim C4.

Claim C5 follows from Lemmas 11 and 14 and item (iii) of Proposition 13.

Claim C6 requires some analysis. Let $x$ and $y$ be distinct points in $J$ with $f(x) > f(y)$. Let $\{x_{i_k}\}_k$ and $\{x_{i'_k}\}_k$ be subsequences that converge to them, respectively, and such that $i_k < i'_k$ for all $k$. Let $u : [0, T] \to J$ be a parameterization of $J$ such that $\|u'(t)\| = -(f \circ u)'(t)$ (this determines $T > 0$), so that $u$ is a gradient curve, that is, $-u'(t) \in \partial^c f(u(t))$. Then it follows from item (v) in Lemma 12, item (iv) in Proposition 13, and S5 that the subgradient sequence $\{x_i\}_i$ goes along $J$ at about the same speed as the neighboring curve $u$, so a very rough estimate of the amount of time it takes for it to go between $x$ and $y$ is $\sup_k \sum_{p=i_k}^{i'_k} \varepsilon_p \leqslant T$, which is claim C6.

Claim C7 is true because, if we choose the function $Q$ so that its support intersects $J$ but not $\Gamma$, then it follows from item (v) in Lemma 12, item (iv) in Proposition 13, and assumption S5 that the averages in the lim inf in (10) asymptotically approach
$$\left\| \frac{\int_0^T Q(u(t))\, u'(t)\, dt}{\int_0^T Q(u(t))\, dt} \right\| \neq 0,$$
with $u$ as in our discussion of claim C6 above, which immediately implies inequality (10).

Claim C8 follows immediately from item (v) in Lemma 12, and assumptions S4 and S5.

A Proof of Lemma 12

For i > C of J ∪ Γ i , and let W C be its tubular neighborhood with chart ϕ C .On the tubular neighborhood W C , we deﬁne h by (11). Observe that with this deﬁnition, h is C ∞ on each of the two connected components of W C \ C , settling items (iv) and (vi).Since the coordinates given by the charts ϕ C are compatible for the diﬀerent connected components C of the sets J ∪ Γ i , i ∈ N , this deﬁnes h on the closure of the union R = (cid:83) C W C . In particularΓ ⊂ R . From (11), we see that h is smooth on each connected component of R \ Γ.In the following, we will extend h continuously, so the fact that h coincides with f on Γ ∪ J ,item (ii), will follow from the observation that, as we see from (11), it is true on each of connectedcomponent of J ∪ (cid:83) i Γ i .Item (v) also follows directly from (11) because the components of elements of the Clarke subdif-ferential in the t and n directions coincide with the derivatives in the x and y variables, respectively,since (cid:13)(cid:13)(cid:13)(cid:13) ∂ϕ C ∂x ( x, (cid:13)(cid:13)(cid:13)(cid:13) = 1 and (cid:13)(cid:13)(cid:13)(cid:13) ∂ϕ C ∂y ( x, y ) (cid:13)(cid:13)(cid:13)(cid:13) = 1 . It follows from item (v) that C ∩ Γ ⊂ crit h for each connected component C of J ∪ (cid:83) i Γ i , so inorder to conclude that Γ ⊂ crit h , item (vii), we observe that Γ = (cid:83) C (Γ ∩ C ) and recall that the graphof the Clarke subdiﬀerential is closed.Recall Lemma 15 (Whitney partition of unity [2, Lemma 2.5]) . Let K be a compact subset of R n . Thereexists a countable family of functions φ i ∈ C ∞ ( R n \ K ) , i ∈ N , such that1. for each x ∈ R n \ K there are at most n numbers i ∈ N such that x ∈ supp φ i ,2. φ i (cid:62) for all i ∈ N , and (cid:80) i φ i ( x ) = 1 for all x ∈ R n \ K ,3. φ i , K ) (cid:62) diam(supp φ i ) for all i ∈ N , . 
there exist constants C k > , depending only on k and n , such that, if x ∈ R n \ K , then (cid:107) D k φ i ( x ) (cid:107) (cid:54) C k (cid:18) x, K ) | k | (cid:19) where D k g denotes the k th derivative of g . Let { φ i } i be a Whitney partition of unity of R \ R , as in Lemma 15 with K = R . For each i ,choose be a point p i in R minimizing the distance to supp φ i . Although the deﬁnition (11) does notgive h on a neighborhood of each p i not on Γ, we may assume that ∇ h ( p i ) is well deﬁned, perhapsafter shrinking R slightly. Let g i : R → R be the aﬃne function given by g i ( x ) = (cid:40) h ( p i ) + ∇ h ( p i ) · ( x − p i ) , p i / ∈ Γ ,h ( p i ) , p i ∈ Γ . Similar to the proof [2] to Whitney’s extension theorem, for x ∈ R \ R we deﬁne h ( x ) = (cid:88) i g i ( x ) φ i ( x ) . On R \ R , it is clear that h is smooth, because locally it is a ﬁnite sum of smooth functions.On the other hand, on the set ( ∂R ) \ Γ that is the boundary of R with Γ removed, by construction h is diﬀerentiable and its gradient is continuous. To see why, one can use the same technique as inthe well-known proof of Whitney’s extension theorem [2, Theorem 2.3]; we sketch the main ideas. Toshow that h is diﬀerentiable at r ∈ ( ∂R ) \ Γ, it is enough to show that | h ( x ) − h ( r ) − ∇ h ( r ) · ( x − r ) | = O ( (cid:107) x − r (cid:107) )for x ∈ R \ R such that the point r minimizes the distance from x to R . For this, one uses the factthat h is smooth on each connected component of W C \ (Γ ∪ J ), so that for p i near r we have theTaylor estimates | h ( p i ) + ∇ h ( p i ) · ( r − p i ) − h ( r ) | = O ( (cid:107) r − p i (cid:107) )and (cid:107)∇ h ( r ) − ∇ h ( p i ) (cid:107) = O ( (cid:107) r − p i (cid:107) ) . 
These give | h ( x ) − h ( r ) − ∇ h ( r ) · ( x − r ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i g i ( x ) φ i ( x ) − h ( r ) − ∇ h ( r ) · ( x − r ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:54) (cid:88) i φ i ( x ) | g i ( x ) − h ( r ) − ∇ h ( r ) · ( x − r ) | = (cid:88) i φ i ( x ) | h ( p i ) + ∇ h ( p i )( x − p i ) − h ( r ) − ∇ h ( r ) · ( x − r ) | (cid:54) (cid:88) i φ i ( x )( | h ( p i ) + ∇ h ( p i ) · ( r − p i ) − h ( r ) | + |∇ h ( p i ) · ( x − r ) − ∇ h ( r ) · ( x − r ) | )= (cid:88) i φ i ( x )( O ( (cid:107) r − p i (cid:107) ) + O ( (cid:107) r − p i (cid:107)(cid:107) x − r (cid:107) ))= (cid:88) i φ i ( x ) O ( (cid:107) x − r (cid:107) ) = O ( (cid:107) x − r (cid:107) ) , (cid:107) x − p i (cid:107) (cid:54) diam(supp φ i ) + dist(supp φ i , R ) (cid:54) diam(supp φ i ) + dist( x, R ) (cid:54) φ i , R ) + dist( x, R ) (cid:54) x, R ) = 3 (cid:107) x − r (cid:107) and (cid:107) r − p i (cid:107) (cid:54) (cid:107) r − x (cid:107) + (cid:107) x − p i (cid:107) (cid:54) (cid:107) x − r (cid:107) . 
Similarly, since (cid:80) i ∇ φ i ( x ) = 0 because (cid:80) i φ i ( x ) = 1, so that also (cid:80) i ∇ φ i ( x )( ∇ h ( r ) · ( x − r )) = 0,and using the triangle inequality and Lemma 15, (cid:107)∇ h ( r ) − ∇ h ( x ) (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∇ h ( r ) − ∇ (cid:88) i g i ( x ) φ i ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∇ h ( r ) − ∇ (cid:88) i φ i ( x )( h ( p i ) + ∇ h ( p i ) · ( x − p i )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:54) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i φ i ( x )( ∇ h ( r ) − ∇ h ( p i )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∇ φ i ( x )( h ( p i ) + ∇ h ( p i ) · ( x − p i )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i φ i ( x )( ∇ h ( r ) − ∇ h ( p i )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) i ∇ φ i ( x )[( h ( p i ) + ∇ h ( p i ) · ( r − p i ))+ ( ∇ h ( p i ) · ( x − r ) + ∇ h ( r ) · ( x − r ))] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:54) (cid:88) i φ i ( x ) O ( (cid:107) r − p i (cid:107) ) + (cid:88) i (cid:107)∇ φ i ( x ) (cid:107) (cid:0) O ( (cid:107) r − p i (cid:107) )+ O ( (cid:107) p i − r (cid:107) ) O ( (cid:107) x − r (cid:107) ) (cid:1) (cid:54) O ( (cid:107) r − p i (cid:107) ) + 3 C (cid:18) (cid:107) x − r (cid:107) (cid:19) (cid:0) O ( (cid:107) r − p i (cid:107) )+ O ( (cid:107) p i − r (cid:107) ) O ( (cid:107) x − r (cid:107) ) (cid:1) = O ( (cid:107) x − r (cid:107) )Thus h is C on R \ (Γ ∪ J ), which settles item (iii).It also follows from that, together with the fact that from (11) we know that h is Lipschitz withconstant (cid:54) L on R , that h is locally Lipschitz.To prove that h is path-diﬀerentiable, let γ : R → R be a Lipschitz curve. By Lemma 16, γ − (Γ \ (cid:83) i Γ i ) is a set of measure zero. 
The set of points $t \in \mathbb{R}$ such that $\gamma(t) \in \bigcup_i \Gamma_i$ with $\gamma'(t)$ not tangent to $\bigcup_i \Gamma_i$ is countable, as each such $t$ is isolated. If $\gamma(t)$ is in $J \cup \Gamma_i$ for some $i$, and $\gamma'(t)$ is tangent to $J \cup \Gamma_i$, then it follows from item (v) that the chain rule condition for path-differentiability holds at $t$, and this condition also holds on $\mathbb{R}^2 \setminus (\Gamma \cup J)$ because of item (iii). This proves item (i).

Lemma 16. If $\gamma : \mathbb{R} \to \mathbb{R}^2$ is Lipschitz with $\gamma'(t) \neq 0$ for almost every $t$, then $\gamma^{-1}(\Gamma \setminus \bigcup_i \Gamma_i)$ has measure zero.

Proof. Write $\gamma = (\gamma_1, \gamma_2)$ for the two coordinate components of $\gamma$. Let $P_\ell$ be the projection of $\Gamma \setminus \bigcup_i \Gamma_i$ into the $\ell$th coordinate axis, $\ell = 1, 2$. Note that, by the choice of $\alpha$, $P_\ell$ is a Cantor set of measure zero, $|P_\ell| = 0$.

Because of Rademacher's theorem and the fact that $\gamma'(t) \neq 0$ for almost every $t \in \mathbb{R}$, the sets $A_1$ and $A_2$ where the derivatives $\gamma_1'(t)$ and $\gamma_2'(t)$, respectively, are well-defined and nonzero, satisfy that $A_1 \cup A_2$ is a set of full measure. If $B$ is the null set of real numbers $t \in \mathbb{R}$ such that $\gamma'(t)$ is either not defined or equal to zero, then $A_1 \cup A_2 \cup B = \mathbb{R}$.

For $i \in \{1, 2\}$ and $p \in P_i$, $\gamma_i^{-1}(p) \cap A_i$ is countable because the isolated points in this set are only countably many, and the non-isolated points $t \in \gamma_i^{-1}(p) \cap A_i$ either satisfy $\gamma_i'(t) = 0$ (as can be seen by taking the limit in the definition of the derivative restricting to points in $\gamma_i^{-1}(p) \cap A_i$) or $\gamma_i'(t)$ is not defined; in other words, if $t \in \gamma_i^{-1}(p) \cap A_i$ is not isolated, then $t \in B$.

Thus $\gamma_i^{-1}(P_i) \cap A_i$ can be written as a countable, disjoint union $\bigsqcup_{j=1}^{\infty} Q_{ij}$ of measurable sets $Q_{ij} \subset A_i$ such that $\gamma_i(Q_{ij} \cup B) = P_i$ and $\gamma_i$ is injective on $Q_{ij}$. By the change of variable formula [9, p. 99],
$$0 \leqslant \int_{Q_{ij}} \|\gamma_i'(t)\|\, dt = |\gamma_i(Q_{ij})| \leqslant |P_i| = 0,$$
and since this is only possible if $|Q_{ij}| = 0$ because the integrand is strictly positive throughout $Q_{ij}$, all the sets $Q_{ij}$ must be Lebesgue null. As a consequence, $\gamma_i^{-1}(P_i) \cap A_i$ is a countable union of null sets, and it is hence null.

Now,
$$\gamma^{-1}\big(\Gamma \setminus \textstyle\bigcup_i \Gamma_i\big) = \gamma_1^{-1}(P_1) \cap \gamma_2^{-1}(P_2) \subseteq \big(\gamma_1^{-1}(P_1) \cap A_1\big) \cup \big(\gamma_2^{-1}(P_2) \cap A_2\big) \cup B,$$
and the three sets on the right-hand side are null, so this proves the lemma.

Acknowledgements.

The author is deeply grateful for the guidance and support of Jérôme Bolte and Edouard Pauwels. The author acknowledges the support of ANR-3IA Artificial and Natural Intelligence Toulouse Institute.

References

[1] Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.

[2] Edward Bierstone. Differentiable functions. Boletim da Sociedade Brasileira de Matemática - Bulletin/Brazilian Mathematical Society, 11(2):139–189, 1980.

[3] Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet. Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.

[4] Jérôme Bolte and Edouard Pauwels. Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Mathematical Programming, 2020.

[5] Jérôme Bolte, Edouard Pauwels, and Rodolfo Ríos-Zertuche. Long term dynamics of the subgradient method for Lipschitz path differentiable functions. Preprint. arXiv:2006.00098 [math.OC].

[6] Jonathan Borwein, Warren Moors, and Xianfu Wang. Generalized subdifferentials: a Baire categorical approach. Transactions of the American Mathematical Society, 353(10):3875–3893, 2001.

[7] Aris Daniilidis and Dmitriy Drusvyatskiy. Pathological subgradient dynamics. SIAM Journal on Optimization, 30(2):1327–1338, 2020.

[8] Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 01 2019.

[9] Lawrence Craig Evans and Ronald F. Gariepy. Measure theory and fine properties of functions. CRC Press, 2015.

[10] Serge Lang. Differential and Riemannian manifolds, volume 160 of Graduate Texts in Mathematics. Springer, 2012.

[11] Hassler Whitney. A function not constant on a connected set of critical points.