Derivative-Free Superiorization: Principle and Algorithm
YAIR CENSOR†, EDGAR GARDUÑO‡, ELIAS S. HELOU§, AND GABOR T. HERMAN¶

∗ February 3, 2020.
Funding: Edgar Garduño would like to thank the support of DGAPA-UNAM. The work of Yair Censor is supported by the ISF-NSFC joint research program Grant No. 2874/19. Elias S. Helou was partially supported by CNPq grant No. 310893/2019-4.
† Department of Mathematics, University of Haifa, Mt. Carmel, Haifa 3498838, Israel ([email protected])
‡ Departamento de Ciencias de la Computación, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Cd. Universitaria, C.P. 04510, Mexico City, Mexico ([email protected])
§ Department of Applied Mathematics and Statistics, University of São Paulo, São Carlos, São Paulo 13566-590, Brazil ([email protected])
¶ Computer Science Ph.D. Program, The Graduate Center, City University of New York, New York, NY 10016, USA ([email protected])

Abstract.
The superiorization methodology is intended to work with input data of constrained minimization problems, that is, a target function and a set of constraints. However, it is based on an antipodal way of thinking to what leads to constrained minimization methods. Instead of adapting unconstrained minimization algorithms to handling constraints, it adapts feasibility-seeking algorithms to reduce (not necessarily minimize) target function values. This is done by inserting target-function-reducing perturbations into a feasibility-seeking algorithm while retaining its feasibility-seeking ability and without paying a high computational price. A superiorized algorithm that employs component-wise target function reduction steps is presented. This enables derivative-free superiorization (DFS), meaning that superiorization can be applied to target functions that have no calculable partial derivatives or subgradients. The numerical behavior of our derivative-free superiorization algorithm is illustrated on a data set generated by simulating a problem of image reconstruction from projections. We present a tool (we call it a proximity-target curve) for deciding which of two iterative methods is "better" for solving a particular problem. The plots of proximity-target curves of our experiments demonstrate the advantage of the proposed derivative-free superiorization algorithm.
Key words. derivative-free; superiorization; constrained minimization; component-wise perturbations; proximity function; bounded perturbations; regularization
AMS subject classifications.
1. Introduction.

1.1. The superiorization methodology (SM).
In many applications there exist efficient iterative algorithms for producing constraints-compatible solutions. Often these algorithms are perturbation resilient in the sense that, even if certain kinds of changes are made at the end of each iterative step, the algorithms still produce a constraints-compatible solution. This property is exploited in superiorization by using such perturbations to steer an algorithm to an output that is as constraints-compatible as the output of the original algorithm, but is superior (not necessarily optimal) to it with respect to a given target function.

Superiorization has a world-view that is quite different from that of classical constrained optimization. Both in superiorization and in classical constrained optimization there is an assumed domain Ω and a criterion that is specified by a target function φ that maps Ω into R. In classical optimization it is assumed that there is a constraints set C and the task is to find an x ∈ C for which φ(x) is minimal over C. Two difficulties with this approach are: (1) The constraints that arise in a practical problem may not be consistent, so C could be empty and the optimization task as stated would not have a solution. (2) Even for nonempty C, iterative methods of classical constrained optimization typically converge to a solution only in the limit, and some stopping rule is applied to terminate the process. The actual output at that time may not be in C (especially if the iterative algorithm is initialized at a point outside C) and, even if it is in C, it is most unlikely to be a minimizer of φ over C.

Both issues are handled in the superiorization approach investigated here by replacing the constraints set C by a nonnegative real-valued proximity function Pr_T that indicates how incompatible a given x ∈ Ω is with specified constraints T. Then the merit of an actual output x of an algorithm is represented by the smallness of the two numbers Pr_T(x) and φ(x). Roughly, if an iterative algorithm produces an output x, then its superiorized version will produce an output x′ for which Pr_T(x′) is not larger than Pr_T(x), but (as demonstrated in practice) generally φ(x′) is smaller than φ(x).

As an example, let Ω = R^J and consider a set T of constraints of the form

(1.1) ⟨d^i, x⟩ = h_i, i = 1, 2, ..., I,

where d^i ∈ R^J and h_i ∈ R, for all i = 1, 2, ..., I, and ⟨·,·⟩ is the Euclidean inner product in R^J. There may or may not be an x ∈ R^J that satisfies this set of constraints, but we can always define a proximity function for T, for example, by

(1.2) Pr_T(x) := Σ_{i=1}^{I} (⟨d^i, x⟩ − h_i)².

There are several approaches in the literature that attempt to minimize both of the competing objectives Pr_T(x) and φ(x) as a way to handle constrained minimization. The oldest one is the penalty function approach, also useful in the regularization of inverse problems [17]. In that approach, the constrained minimization problem is replaced by the unconstrained minimization of the combination φ(x) + π Pr_T(x), in which π ≥ 0 is a penalty parameter that governs the trade-off between Pr_T(x) and φ(x). None of these approaches are close in their underlying principles to the superiorization methodology employed in this paper.

Our motivating purpose in this paper is to investigate the general applicability of derivative-free superiorization (DFS) as an alternative to previously proposed superiorization approaches. These earlier approaches were based on the generation of nonascending vectors, for target function reduction steps, which mostly required the ability to calculate gradients or subgradients of the target function.
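As a concrete illustration of the proximity function example given in (1.1)–(1.2) above, here is a minimal Python sketch (ours, not part of the original paper); it assumes a NumPy array D whose rows are the vectors d^i and an array h of the right-hand sides:

    import numpy as np

    def proximity(D, h, x):
        """Proximity function (1.2): Pr_T(x) = sum_i (<d^i, x> - h_i)^2."""
        residuals = D @ x - h      # all residuals <d^i, x> - h_i at once
        return float(residuals @ residuals)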
Paralleling the body of knowledge of derivative-free optimization (DFO), see, e.g., [13], we explore a specific DFS algorithm and demonstrate its action numerically. The output of a superiorized version of a constraints-compatibility-seeking algorithm will have a smaller (but not minimal) target function value φ than the output of the same constraints-compatibility-seeking algorithm without perturbations, everything else being equal. Even though superiorization is not an exact minimization method, it aims at reducing, not necessarily minimizing, target function values.

The review of DFO algorithms by Rios and Sahinidis [31] addresses "bound-constrained optimization problems using algorithms that require only the availability of objective function values but no derivative information," with bound constraints imposed on the vector x. The book by Conn, Scheinberg and Vicente [13] deals only with derivative-free unconstrained minimization, except for its last chapter (of 10 pages out of the 275) entitled "Review of constrained and other extensions to derivative-free optimization." Li et al. [26] do not even mention constraints. In [15] the numerical work deals with: "The dimension of the problems [i.e., the size of the vector x] varies between 2 and 16, while the number of constraints are between 1 and 38, exceeding 10 in only 5 cases." In [16] the numerical tests are limited to: "The first case has 80 optimization variables [i.e., the size of the vector x] and only bound constraints, while the second example is a generally constrained production optimization involving 20 optimization variables and 5 general constraints." Similar orders of magnitude for problem sizes appear in the numerical results presented in [1] and also in the book of Audet and Hare [2].

This indicates that (i) much of the literature on derivative-free minimization is concerned with unconstrained minimization or with bound constraints on the variables, and (ii) many, if not all, proposed methods were designed (or, at least, demonstrated) only for small-scale problems. In contrast, the DFS method proposed here can handle any type of constraints for which a separate efficient derivative-free constraints-compatibility-seeking algorithm is available, and it is capable of solving very large problems. In this respect of problem sizes, we discover here, admittedly with a very preliminary demonstration, that DFS can compete well with DFO on large problems. Since the constraints-compatibility-seeking algorithm forms part of the proposed DFS method, the method can use exterior initialization (that is, initializing the iterations at any point in space). Furthermore, very large-scale problems can be accommodated.

The progressive barrier (PB) approach, described in Chapter 12 of the book [2] and originally published in [1], is an alternative to the exterior penalty (EP) approach that we mention in Subsection 5.4 below. However, the PB differs from our DFS method, in spite of some similarities with it, as we explain in Subsection 5.5 below.

A comprehensive overview of the state of the art and current research on superiorization appears in our continuously updated bibliography Internet page, which currently contains 109 items [8]. Research works in this bibliography include a variety of reports, ranging from new applications to new mathematical results on the foundations of superiorization. A special issue entitled "Superiorization: Theory and Applications" of the journal Inverse Problems [11] contains several interesting papers on the theory and practice of the SM, such as [6], [21], [30], to name but a few.
Later papers continue research on perturbation resilience, which lies at the heart of the SM; see, e.g., [3]. An early paper on bounded perturbation resilience is [4]; a recent book containing many results on the behavior of algorithms in the presence of summable perturbations is [32].
In Section 2 we present the basics of the superiorization methodology. We present our DFS algorithm in Section 3 and juxtapose it with an existing superiorization algorithm that uses derivative information. In Section 4 we present a tool (we call it a proximity-target curve) for deciding which of two iterative methods is "better" for solving a particular problem. The experimental demonstration of our DFS algorithm appears in Section 5. In Section 6 we offer a brief discussion and some conclusions.
2. The basics of the superiorization methodology.
We follow the approach of [24]. Ω denotes a nonempty set in the Euclidean space R^J. 𝕋 is a problem set; each problem T ∈ 𝕋 is described by a particular set of constraints such as provided, for example, in (1.1). Pr is a proximity function on 𝕋 such that, for every T ∈ 𝕋, Pr_T : Ω → R_+ (the nonnegative real numbers). Pr_T(x) measures how incompatible x is with the constraints of T. A problem structure is a pair (𝕋, Pr), where 𝕋 is a problem set and Pr is a proximity function on 𝕋. For an x ∈ Ω, we say that x is ε-compatible with T if Pr_T(x) ≤ ε. We assume that we have computer procedures that, for any x ∈ R^J, determine whether x ∈ Ω and, for any x ∈ Ω and T ∈ 𝕋, calculate Pr_T(x). In many applications, each problem T ∈ 𝕋 is determined by a family of sets {C_i}_{i=1}^{I}, where each C_i is a nonempty, often closed and convex, subset of Ω and the problem T is to find a point that is in the intersection of the C_i.

We introduce ∆, such that Ω ⊆ ∆ ⊆ R^J, and a target function φ : ∆ → R, which is referred to as an optimization criterion in [24]. We assume that we have a computer procedure that, for any x ∈ R^J, determines whether x ∈ ∆ and, if so, calculates φ(x). An algorithm P for a problem structure (𝕋, Pr) assigns to each problem T ∈ 𝕋 a computable algorithmic operator P_T : ∆ → Ω. For any initial point x ∈ Ω, P_T produces the infinite sequence ((P_T)^k x)_{k=0}^∞ of points in Ω.

Definition 2.1 (The ε-output of a sequence). For a problem structure (𝕋, Pr), a T ∈ 𝕋, an ε ∈ R_+ and a sequence R := (x^k)_{k=0}^∞ of points in Ω, we use O(T, ε, R) to denote the x ∈ Ω that has the following properties: Pr_T(x) ≤ ε, and there is a nonnegative integer K such that x^K = x and, for all nonnegative integers k < K, Pr_T(x^k) > ε. If there is such an x, then it is unique and we say that O(T, ε, R) is defined; if there is no such x, then we say that O(T, ε, R) is undefined.

If R is an infinite sequence generated by a process that repeatedly applies P_T, then O(T, ε, R) is the output produced by that process when we add to it instructions that make it terminate as soon as it reaches a point that is ε-compatible with T. Roughly, we refer to P as a feasibility-seeking algorithm for a problem structure (𝕋, Pr) that arose from a particular application if, for all T ∈ 𝕋 and ε ∈ R_+ of interest for the application, O(T, ε, R) is defined for all infinite sequences R generated by repeated applications of P_T. Each application of P_T is referred to as a feasibility-seeking step.
Definition 2.2 (Strong perturbation resilience). An algorithm P for a problem structure (𝕋, Pr) is said to be strongly perturbation resilient if, for all T ∈ 𝕋,
1. there is an ε ∈ R_+ such that O(T, ε, ((P_T)^k x)_{k=0}^∞) is defined for every x ∈ Ω;
2. for all ε ∈ R_+ such that O(T, ε, ((P_T)^k x)_{k=0}^∞) is defined for every x ∈ Ω, we also have that O(T, ε′, R) is defined for every ε′ > ε and for every sequence R = (x^k)_{k=0}^∞ of points in Ω generated by

(2.1) x^{k+1} = P_T(x^k + β_k v^k), for all k ≥ 0,

where the β_k v^k are bounded perturbations, meaning that the sequence (β_k)_{k=0}^∞ of nonnegative real numbers is summable (that is, Σ_{k=0}^∞ β_k < ∞), the sequence (v^k)_{k=0}^∞ of vectors in R^J is bounded and, for all k ≥ 0, x^k + β_k v^k ∈ ∆.

Sufficient conditions for strong perturbation resilience appear in [24, Theorem 1]. With respect to the target function φ : ∆ → R, we adopt the convention that a point in ∆ for which the value of φ is smaller is considered superior to a point in ∆ for which the value of φ is larger. The essential idea of the SM is to make use of the perturbations of (2.1) to transform a strongly perturbation resilient algorithm that seeks a constraints-compatible solution (referred to as the Basic Algorithm) into a superiorized version whose outputs are equally good from the point of view of constraints-compatibility, but are superior (not necessarily optimal) with respect to the target function φ. This can be done by making use of the following concept.
Definition 2.3 ([24]; Nonascending vector). Given a function φ : ∆ → R and a point y ∈ R^J, we say that d ∈ R^J is a nonascending vector for φ at y if ‖d‖ ≤ 1 and there is a δ > 0 such that, for all λ ∈ [0, δ], we have

(2.2) φ(y + λd) ≤ φ(y).

Obviously, the zero vector (all of whose components are 0) is always such a vector, but for the SM to work we need a strict inequality to occur in (2.2) frequently enough. Generation of nonascending vectors, used for target function reduction steps, has been based mostly on the following theorem or its variants, such as [19, Theorem 1] and [20, unnumbered Theorem on page 7], which provide sufficient conditions for a nonascending vector.

Theorem 2.4 ([24, Theorem 2]). Let φ : R^J → R be a convex function and let x ∈ R^J. Let g ∈ R^J satisfy the following property: for 1 ≤ j ≤ J, if the jth component g_j of g is not zero, then the partial derivative ∂φ/∂x_j(x) of φ at x exists and its value is g_j. Define d to be the zero vector if ‖g‖ = 0 and to be −g/‖g‖ otherwise. Then d is a nonascending vector for φ at x.

In order to use this theorem, φ must have at least one calculable partial derivative (which is nonzero) at points in the domain of φ. Otherwise, the theorem would apply only to the zero vector, which is a useless nonascending vector because it renders the SM ineffective. To enable application of the SM to target functions that have no calculable partial derivatives or subgradients, we proposed in [10] to search for a point in the neighborhood of x at which the target function exhibits nonascent, by comparing function values at points of a fixed distance from x along the space coordinates. To obtain a sequence of nonascending points without making use of Theorem 2.4, we replaced in [10] the notion of a nonascending vector by the following alternative notion.
Nonascending δ -bound direction Given a target function φ : ∆ → R where ∆ ⊆ R J , a point y ∈ ∆, and apositive δ ∈ R , we say that d ∈ R J is a nonascending δ - bound direction for φ at y if (cid:107) d (cid:107) ≤ δ , y + d ∈ ∆ and φ ( y + d ) ≤ φ ( y ). The collection of all such vectors is calleda nonascending δ - ball and is denoted by B δ,φ ( y ), that is,(2.3) B δ,φ ( y ) := (cid:8) d ∈ R J | (cid:107) d (cid:107) ≤ δ, ( y + d ) ∈ ∆ , φ ( y + d ) ≤ φ ( y ) (cid:9) . The zero vector is contained in each nonascending δ -ball, that is, ∈ B δ,φ ( y ) for each δ > y ∈ ∆. The purpose of this definition is to allow the use, as a direction Y. CENSOR, E. GARDU ˜NO, E. S. HELOU AND G. T. HERMAN of target function decrease, of any vector d ∈ R J for which φ ( y + d ) ≤ φ ( y ) holdslocally only for d , and not throughout a certain interval as in Definition 2.3. Thevector d depends on the value of δ and they may be determined simultaneously in thesuperiorization process, as seen below. This kind of nonascent was referred to as localnonascent in [10, Subsection 2.3]. Obviously, local nonascent is a more general notionsince every nonascending vector according to Definition 2.3 is also a nonascending δ - bound direction according to Definition 2.5 but not vice versa. The advantage ofthis notion is that it is detectable by using only function value calculations.The following easily-proved proposition unifies these approaches in the convexcase. Proposition
Proposition 2.6. Let φ : R^J → R be a convex function and let x ∈ R^J. If d ∈ R^J is a nonascending δ-bound direction for φ at x, then either d = 0 (and hence d is a nonascending vector for φ at x) or d/‖d‖ is a nonascending vector for φ at x.

The idea of calculating δ (equivalently, the step-size γ_ℓ in the superiorized algorithms presented in the next section) simultaneously with a direction of nonascent appeared in a completely different way in [27], where an additional internal loop of a penalized minimization is used to calculate the direction of nonascent; see also [28].
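Definition 2.5 suggests a purely function-value-based search, which is what the algorithm of the next section does component-wise. As a minimal Python sketch under stated assumptions (phi and in_Delta are hypothetical user-supplied callables for the target function and for membership in ∆), a nonascending δ-bound direction can be sought among the 2J signed coordinate directions scaled by δ:

    import numpy as np

    def coordinate_nonascent_direction(phi, in_Delta, y, delta):
        """Search the 2J signed coordinate directions, scaled by delta,
        for a member of the nonascending delta-ball B_{delta,phi}(y) of
        Definition 2.5 that strictly decreases phi; return the zero
        vector (always in B_{delta,phi}(y)) if none is found."""
        J = y.size
        phi_y = phi(y)
        for j in range(J):
            for sign in (1.0, -1.0):
                d = np.zeros(J)
                d[j] = sign * delta            # ||d|| = delta, coordinate j
                z = y + d
                if in_Delta(z) and phi(z) < phi_y:
                    return d                   # strict local nonascent found
        return np.zeros(J)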
3. Specific superiorization approaches.
This section presents two specific approaches to superiorizing a Basic Algorithm that operates by repeated applications of an algorithmic operator P_T starting from some initial point. The first approach produces the superiorized version that is named Algorithm 1 below; it has been published in the literature previously [24, page 5537]. The second approach, named Algorithm 2 below, is novel to this paper.

The two superiorized versions have some things in common. They are both iterative procedures in which k is used as the iteration index. The first two steps of both algorithms set k to 0 and x^k to a given initial vector x̄ ∈ ∆. They both assume that we have available a summable sequence (γ_ℓ)_{ℓ=0}^∞ of nonnegative real numbers (for example, γ_ℓ = a^ℓ, where 0 < a < 1); in both algorithms ℓ is initialized to −1 and is increased by 1 before the first time γ_ℓ is used. In both algorithms the iterative step that produces x^{k+1} from x^k, as in (2.1), is specified within a repeat loop that first performs a user-specified number, N, of perturbation steps, followed by one feasibility-seeking step that uses the algorithmic operator P_T. In more detail, the repeat loop in each of the algorithms has the following form. After initializing the loop index n to 0 and setting x^{k,0} to x^k, it produces one-by-one x^{k,1}, x^{k,2}, ..., x^{k,N} (these are the iterates of the perturbation steps), followed by producing x^{k+1} = P_T x^{k,N} (the feasibility-seeking step). The difference between the two algorithms is in how they perform the perturbations for getting from x^{k,n} to x^{k,n+1}.

We state an important property of Algorithm 1; for a proof see [24, Section II.E].
Theorem 3.1. Suppose that the algorithm P for a problem structure (𝕋, Pr) is strongly perturbation resilient. Suppose further that T ∈ 𝕋 and ε ∈ R_+ are such that O(T, ε, ((P_T)^k x)_{k=0}^∞) is defined for every x ∈ Ω. It is then the case that O(T, ε′, R) is defined for every ε′ > ε and every sequence R = (x^k)_{k=0}^∞ produced by Algorithm 1.
The pseudo-code of Algorithm 1 does not specify how the nonascending vector in Step 8 is to be selected. In publications using Algorithm 1, such details are usually based on a variant of Theorem 2.4, resulting in an algorithm that is not derivative-free.

Algorithm 1 Superiorization using nonascending vectors
1: set k = 0
2: set x^k = x̄
3: set ℓ = −1
4: repeat
5:   set n = 0
6:   set x^{k,n} = x^k
7:   while n < N
8:     set v^{k,n} to be a nonascending vector for φ at x^{k,n}
9:     set loop = true
10:    while loop
11:      set ℓ = ℓ + 1
12:      set z = x^{k,n} + γ_ℓ v^{k,n}
13:      if z ∈ ∆ and φ(z) ≤ φ(x^k) then
14:        set x^{k,n+1} = z
15:        set n = n + 1
16:        set loop = false
17:  set x^{k+1} = P_T x^{k,N}
18:  set k = k + 1

For the specification of Algorithm 2 we let, for 1 ≤ j ≤ J, e^j be the vector in R^J all of whose components are 0, except for the jth component, which is 1. The set of coordinate directions is defined as Γ := {e^j | 1 ≤ j ≤ J} ∪ {−e^j | 1 ≤ j ≤ J}. We assume that (c^m)_{m=0}^∞ is a given sequence of coordinate directions such that any subsequence of length 2J contains Γ.

Algorithm 2 Component-wise derivative-free superiorization
1: set k = 0
2: set x^k = x̄
3: set ℓ = −1
4: set m = −1
5: repeat
6:   set n = 0
7:   set x^{k,n} = x^k
8:   while n < N
9:     set x^{k,n+1} = x^{k,n}
10:    set ℓ = ℓ + 1
11:    set L = −1
12:    while L < 2J
13:      set L = L + 1
14:      set m = m + 1
15:      set z = x^{k,n} + γ_ℓ c^m
16:      if z ∈ ∆ and φ(z) < φ(x^{k,n}) then
17:        set x^{k,n+1} = z
18:        set L = 2J
19:    set n = n + 1
20:  set x^{k+1} = P_T x^{k,N}
21:  set k = k + 1

We make the following comments:
1. Steps 15, 16 and 17 of Algorithm 2 implement nonascending γ_ℓ-bound directions, as in Definition 2.5. In doing so, Algorithm 2 realizes in a component-wise manner the algorithmic framework of [10] (specifically, as expressed in Steps 7 and 8 of Algorithm 1 in that paper).
2. No partial derivatives are used by Algorithm 2.
3. Step 16 of Algorithm 2 is similar to Step 13 of Algorithm 1. One difference is the use of strict inequality in Algorithm 2; the reason for this is that it was found advantageous in some applications of the algorithm. In addition, the while loop of Step 12 of Algorithm 2 is executed at most 2J times, but there is no upper bound on the (known to be finite) number of executions of the while loop due to Step 10 of Algorithm 1. Also, it follows from the pseudo-code of Algorithm 2 that, for all k ≥ 0 and 0 ≤ n ≤ N, φ(x^{k,n}) ≤ φ(x^k), even though there is no explicit check for this as in Step 13 of Algorithm 1.
4. A desirable property of Algorithm 2 is that it cannot get stuck in a particular iteration k, because the value of L increases (Step 13) in every execution of the while loop of Step 12 and the value of n increases (Step 19) in every execution of the while loop of Step 8.
5. Algorithm 2 shares with Algorithm 1 the important property in Theorem 3.1. Stated less formally: "For a strongly perturbation resilient algorithm, if for all initial points from Ω the infinite sequence produced by the algorithm contains an ε-compatible point, then all perturbed sequences produced by the superiorized version of the algorithm contain an ε′-compatible point, for any ε′ > ε."
6. At present there is no mathematical proof to guarantee that the output of a superiorized version of a constraints-compatibility-seeking algorithm will have a smaller target function value φ than the output of the same constraints-compatibility-seeking algorithm without perturbations, everything else being equal. A partial mathematical result toward coping with this lacuna, in the framework of weak superiorization, is provided by Theorem 4.1 in [12].
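For readers who want to experiment with Algorithm 2, the following Python sketch implements one outer iteration (one pass of Steps 6–20) under stated assumptions: the names phi, in_Delta, P_T and gamma are hypothetical user-supplied callables (target function, membership test for ∆, algorithmic operator, and the summable step sizes γ_ℓ), and, for simplicity, the sketch cycles through the coordinate directions in the fixed order (e^1, ..., e^J, −e^1, ..., −e^J) used in Subsection 5.6 below instead of carrying the running index m.

    import numpy as np

    def superiorized_iteration(x, phi, in_Delta, P_T, gamma, ell, N):
        """One outer (k) iteration of Algorithm 2: N component-wise
        perturbation steps followed by the feasibility-seeking step
        x^{k+1} = P_T x^{k,N}. `gamma(ell)` returns the step size
        gamma_ell; the updated index ell is returned to the caller."""
        J = x.size
        for _ in range(N):                 # perturbation steps (the n loop)
            ell += 1
            phi_x = phi(x)
            for m in range(2 * J):         # cycle through the +/- e^j directions
                j = m % J
                sign = 1.0 if m < J else -1.0
                z = x.copy()
                z[j] += sign * gamma(ell)  # z = x^{k,n} + gamma_ell * c^m
                if in_Delta(z) and phi(z) < phi_x:
                    x = z                  # first strict decrease is accepted
                    break
            # if no direction was accepted, x^{k,n+1} = x^{k,n} (Step 9)
        return P_T(x), ell                 # feasibility-seeking step (Step 20)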
4. The proximity-target curve.
We now give a tool for deciding which of two iterative methods is "better" for solving a particular problem. Since an iterative method produces a sequence of points, our definition is based on such sequences. For incarnations of the definitions given in this section, the reader may wish to look ahead to Figure 1. That figure illustrates the notions discussed in this section for two particular finite sequences R := (x^k)_{k=K_lo}^{K_hi} and S := (y^k)_{k=L_lo}^{L_hi}. The details of how those sequences were produced are given below in Subsection 5.6.
Definition 4.1 (Monotone proximity of a finite sequence). For a problem structure (𝕋, Pr), a T ∈ 𝕋, and positive integers K_lo and K_hi > K_lo, the finite sequence R := (x^k)_{k=K_lo}^{K_hi} of points in Ω is said to be of monotone proximity if, for K_lo < k ≤ K_hi, Pr_T(x^{k−1}) > Pr_T(x^k).

(The approach followed in the present paper was termed strong superiorization in [12, Section 6] and [7], to distinguish it from weak superiorization, wherein asymptotic convergence to a point in C is studied instead of ε-compatibility.)
Definition 4.2 (The proximity-target curve of a finite sequence). For a problem structure (𝕋, Pr), a T ∈ 𝕋, a target function φ : Ω → R, and positive integers K_lo and K_hi > K_lo, let R := (x^k)_{k=K_lo}^{K_hi} be a sequence of monotone proximity. Then the proximity-target curve P ⊆ R² associated with R is uniquely defined by:
1. For K_lo ≤ k ≤ K_hi, (Pr_T(x^k), φ(x^k)) ∈ P.
2. For K_lo < k ≤ K_hi, the intersection {(y, x) ∈ R² : Pr_T(x^k) ≤ y ≤ Pr_T(x^{k−1})} ∩ P is the line segment from (Pr_T(x^{k−1}), φ(x^{k−1})) to (Pr_T(x^k), φ(x^k)).
Definition 4.3 (Comparison of proximity-target curves of sequences). For a problem structure (𝕋, Pr), a T ∈ 𝕋, a target function φ : Ω → R, and positive integers K_lo, K_hi > K_lo, L_lo, L_hi > L_lo, let R := (x^k)_{k=K_lo}^{K_hi} and S := (y^k)_{k=L_lo}^{L_hi} be sequences of points in Ω of monotone proximity for which P and Q are their respective associated proximity-target curves. Define

(4.1) t := max(Pr_T(x^{K_hi}), Pr_T(y^{L_hi})), u := min(Pr_T(x^{K_lo}), Pr_T(y^{L_lo})).

Then R is better targeted than S if:
1. t ≤ u, and
2. for any real number h, if t ≤ h ≤ u, (h, v) ∈ P and (h, w) ∈ Q, then v ≤ w.

Let us see how this last definition translates into something that is intuitively desirable. Suppose that we have an iterative algorithm that produces a sequence y^0, y^1, y^2, ..., of which S := (y^k)_{k=L_lo}^{L_hi} is a subsequence. An alternative algorithm that produces a sequence of points of which R := (x^k)_{k=K_lo}^{K_hi} is a subsequence that is better targeted than S has a desirable property: within the range [t, u] of proximity values, the point that is produced by the alternative algorithm with a given proximity value is likely to have a lower (and definitely not higher) value of the target function than the point with that proximity value that is produced by the original algorithm. This property is stronger than what we stated before, namely that superiorization produces an output that is equally good from the point of view of proximity, but is superior with respect to the target function. Here the single output determined by a fixed ε is replaced by a set of potential outputs for any ε ∈ [t, u].
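Definition 4.3 can be checked numerically once the proximity and target values of the two iterate sequences are available. The following Python sketch (our hypothetical helper, not from the paper) builds the two piecewise-linear proximity-target curves by interpolation and tests condition 2 on a fine grid of h values in [t, u], which approximates the continuous condition:

    import numpy as np

    def better_targeted(prox_R, phi_R, prox_S, phi_S, num_samples=1000):
        """Test whether sequence R is better targeted than S (Definition 4.3).
        prox_R and prox_S must be strictly decreasing (monotone proximity);
        phi_R and phi_S hold the corresponding target function values."""
        t = max(prox_R[-1], prox_S[-1])    # as in (4.1)
        u = min(prox_R[0], prox_S[0])
        if t > u:                          # condition 1 fails
            return False
        h = np.linspace(t, u, num_samples)
        # np.interp needs increasing abscissae; the proximity arrays decrease.
        v = np.interp(h, prox_R[::-1], phi_R[::-1])   # curve P at proximity h
        w = np.interp(h, prox_S[::-1], phi_S[::-1])   # curve Q at proximity h
        return bool(np.all(v <= w))        # condition 2 on the sampled grid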
5. Experimental demonstration of derivative-free component-wise superiorization.

5.1. Goal and general methodology.
Our goal is to demonstrate that component-wise superiorization (Algorithm 2) is a viable and efficient DFS method to handle data of constrained-minimization problems (that is, a target function and a set of constraints) when the target function has no calculable partial derivatives.

To ensure the meaningfulness and worthiness of our experiments, we generate the constraints and choose a target function, one that has no calculable partial derivatives, inspired by an application area of constrained optimization, namely image reconstruction from projections in computerized tomography (CT).

For the so-obtained data we consider two runs of Algorithm 2, one with and the other without the component-wise perturbation steps. To be exact, by "without perturbations" we mean that Steps 10–18 in Algorithm 2 are deleted, so that x^{k,N} = x^k, which amounts to running the feasibility-seeking basic algorithm P_T without any perturbations. Everything else is equal in the two runs, such as the initialization point x̄ and all parameters associated with the application of the feasibility-seeking basic algorithm in Step 20. The results are presented below by plots of proximity-target curves, which show that the target function values of Algorithm 2 when run "with perturbations" are systematically lower than those of the same algorithm without the component-wise perturbations.

The numerical behavior of Algorithm 2, as demonstrated by our experiment, makes it a meritorious choice for superiorization in situations involving a derivative-free target function and a set of constraints. To reach the goal described above we proceed in the following stages.
1. Specification of a problem structure (𝕋, Pr) for the experimental demonstration, and generation of constraints, simulated from the application of image reconstruction from projections in computerized tomography.
2. Choice of a ∆ and a derivative-free target function φ for the experiment.
3. Specification of the algorithmic operator P_T to be used in Algorithm 2. This is chosen so that the Basic Algorithm that operates by repeated applications of P_T is a standard sequential iterative projections method for feasibility-seeking for systems of linear equations: a version of the Algebraic Reconstruction Techniques (ART) [22, Chapter 11] that is equivalent to Kaczmarz's projections method [25].
4. Specification of algorithmic details and parameters, such as N and γ_ℓ in Algorithm 2.

5.2. Specification of the problem structure, of ∆ and of the target function. We generate the constraints and choose a target function from the application area of image reconstruction from projections in computerized tomography (CT). The problem structure (𝕋, Pr) for our demonstration has been used in the literature for comparative evaluations of various algorithms for CT [20, 22, 24, 29]. It is of the type described in Section 1 by (1.1) and (1.2). Specifically, vectors x in Ω = R^J represent two-dimensional (2D) images, with each component of x representing the density assigned to one of the pixels in the image. We use J = 235,225, so that x represents a 485 ×
485 image. Our test image (phantom) is represented by the vector x̂, which is a digitization of a picture of a cross-section of a human head [22, Sections 4.1–4.4 and 5.2].

In the problem T that we use for our illustration, each index i = 1, 2, ..., I is associated with a line across the image and the corresponding d^i is a vector in R^J whose jth component is the length of intersection of that line with the jth of the J pixels. There are I = 498,960 such lines, organized into 720 divergent projections with 693 lines in each; this is similar to the standard geometry in [22], but with more lines in each projection. (The term projection has in this field a different meaning than in convex analysis: it stands for a set of estimated line integrals through the image that has to be reconstructed; see [22, page 3].) The h_i have been calculated by simulating the behavior of CT scanning of the head cross-section [22, Section 4.5]. All the above was generated using the SNARK14 programming system for the reconstruction of 2D images from 1D projections [14], giving rise to a system of linear equations (1.1). For the resulting T, we calculated that the proximity of the phantom to the generated constraints is Pr_T(x̂) ≈ 6.

Our choice of the target function φ is as follows. We index the pixels (i.e., the components of a vector x) by j and let Θ denote the set of all indices of pixels that are not in the rightmost column or the bottom row of the 2D pixel array that displays the image. For each j ∈ Θ, let r(j) and b(j) be the index of the pixel to its right and below it in the 2D pixel array, respectively. Denoting by med the function that selects the median value of its three arguments, we define

(5.1) φ(x) := Σ_{j∈Θ} √( |x_j − med{x_j, x_{r(j)}, x_{b(j)}}| ).

Finding partial derivatives for this target function is problematic. On the other hand, when only one pixel value (that is, only one component of the vector) is changed in a vector x to get another vector y, then it is possible to obtain φ(y) from φ(x) by recomputing only three of the terms in the summation on the right-hand side of (5.1). These observations indicate that the use of the derivative-free approach of Steps 10–18 in Algorithm 2 is a viable option, whereas Step 8 of Algorithm 1 is hard to perform unless the trivial nonascending vector v^{k,n} = 0 is selected, which is ineffective. For our chosen phantom we calculated φ(x̂); its value is of the order of 2,000.
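The cheapness of single-component updates of φ, which is what makes the component-wise approach of Algorithm 2 attractive here, can be made concrete. In the following Python sketch (our illustration, with the image stored as a 2D NumPy array), phi_full evaluates (5.1) directly, while phi_after_pixel_change updates φ in constant time by recomputing only the at most three affected terms:

    import numpy as np

    def phi_full(img):
        """Target function (5.1): sum, over all pixels not in the bottom row
        or rightmost column, of sqrt(|x_j - med{x_j, x_r(j), x_b(j)}|)."""
        x = img[:-1, :-1]                   # the pixels j in Theta
        r = img[:-1, 1:]                    # right neighbors x_{r(j)}
        b = img[1:, :-1]                    # below neighbors x_{b(j)}
        med = np.median(np.stack([x, r, b]), axis=0)
        return float(np.sqrt(np.abs(x - med)).sum())

    def phi_term(img, row, col):
        """One term of (5.1), or 0 if (row, col) lies outside Theta."""
        if row < 0 or col < 0 or row >= img.shape[0] - 1 or col >= img.shape[1] - 1:
            return 0.0
        med = np.median([img[row, col], img[row, col + 1], img[row + 1, col]])
        return float(np.sqrt(abs(img[row, col] - med)))

    def phi_after_pixel_change(img, phi_old, row, col, new_value):
        """Update phi in O(1): only the terms for (row, col) itself, its left
        neighbor (row, col-1) and its upper neighbor (row-1, col) change."""
        affected = [(row, col), (row, col - 1), (row - 1, col)]
        phi_new = phi_old - sum(phi_term(img, r, c) for r, c in affected)
        img[row, col] = new_value           # the image is modified in place
        return phi_new + sum(phi_term(img, r, c) for r, c in affected)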
5.3. The algorithmic operator P_T. Our chosen operator, mapping x into P_T x, is specified by Algorithm 3. It depends on a real parameter λ in its Step 4.

Algorithm 3 The algorithmic operator P_T
1: set i = 0
2: set y^i = x
3: while i < I
4:   set y^{i+1} = y^i − λ ((⟨d^i, y^i⟩ − h_i) / ‖d^i‖²) d^i
5:   set i = i + 1
6: set P_T x = y^I

Algorithm 4 ART (as used in this paper)
1: set k = 0
2: set x^k = x̄
3: repeat
4:   set x^{k+1} = P_T x^k
5:   set k = k + 1

Algorithm 4 is a special case of the general class of Algebraic Reconstruction Techniques as discussed in [22, Chapter 11] and is, for λ = 1, equivalent to the original method of Kaczmarz in the seminal paper [25]. For further references on Kaczmarz's method and the Algebraic Reconstruction Techniques see, e.g., [5, page 220], [9, Section 2] and [23]. Note that Algorithm 4 (ART) can be obtained from either Algorithm 1 or Algorithm 2 by removing the perturbation steps in their while loops.
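A minimal Python sketch of the operator P_T of Algorithm 3 follows (our illustration with hypothetical names; D is an I × J array whose rows are the d^i and h holds the right-hand sides; in the experiments below such a matrix is huge and sparse, so a sparse representation would be used in practice):

    import numpy as np

    def make_P_T(D, h, lam):
        """Build the algorithmic operator P_T of Algorithm 3: one sequential
        sweep of relaxed Kaczmarz/ART projections onto the hyperplanes
        <d^i, y> = h_i, with relaxation parameter lam."""
        row_norms_sq = (D ** 2).sum(axis=1)     # ||d^i||^2, precomputed once

        def P_T(x):
            y = x.copy()
            for i in range(D.shape[0]):         # one pass over all constraints
                if row_norms_sq[i] > 0.0:
                    residual = D[i] @ y - h[i]  # <d^i, y> - h_i
                    y -= lam * (residual / row_norms_sq[i]) * D[i]
            return y

        return P_T

With lam = 1 this is one sweep of Kaczmarz's original method; the experiments in Subsection 5.6 below use lam = 0.05.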
5.4. The exterior penalty (EP) approach. One possibility for devising a derivative-free constrained minimization algorithm is to follow the option of using the exterior penalty (EP) function approach mentioned in [13, Chapter 13, Section 13.1, page 242], as applied to the constrained problem

(5.2) min { φ(x) | ⟨d^i, x⟩ = h_i, i = 1, 2, ..., I },

where φ is as in (5.1) and the constraints are as in (1.1). With a user-selected penalization parameter η, the exterior penalty function approach replaces the constrained minimization problem (5.2) by the penalized unconstrained minimization

(5.3) min { ψ(x) | x ∈ R^J },

where

(5.4) ψ(x) := φ(x) + η Pr_T(x),

with φ as in (5.1) and Pr_T(x) as defined in (1.2). By applying the coordinate-search method of [13, Algorithm 7.1] to the penalized unconstrained minimization problem (5.3)–(5.4), we get the next algorithm.

Algorithm 5 Derivative-free constrained minimization by the exterior penalty (EP) approach
1: set k = 0
2: set x^k = x̄
3: set ℓ = −1
4: set m = −1
5: repeat
6:   set x^{k+1} = x^k
7:   set ℓ = ℓ + 1
8:   set L = −1
9:   while L < 2J
10:    set L = L + 1
11:    set m = m + 1
12:    set z = x^k + γ_ℓ c^m
13:    if z ∈ Ω and ψ(z) < ψ(x^k) then
14:      set x^{k+1} = z
15:      set L = 2J
16:  set k = k + 1

From the point of view of keeping the computational cost low, Algorithm 5 can be much more of a challenge than Algorithm 2. The reason for this has been indicated when we stated, near the end of Subsection 5.2, that if only one component is changed in a vector x to get another vector y, then it is possible to obtain φ(y) from φ(x) by recomputing only three of the terms in the summation on the right-hand side of (5.1). When we use ψ of (5.4) instead of φ, there seems to be a need for many more computational steps. This is because the number of terms that change on the right-hand side of (1.2) due to a change in one component of x is of the order of 1,000 for the dataset described in Subsection 5.2 (in the language of image reconstruction from projections, there is at least one line i in each of the 720 projections for which there is a change in the value of ⟨d^i, x⟩ due to changing one component of x). Thus our advocated approach of component-wise superiorization in Algorithm 2 is likely to be orders of magnitude faster for our application area than the more traditional approach of derivative-free constrained minimization by the exterior penalty (EP) approach in Algorithm 5.
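The cost argument of the preceding paragraph can be quantified with a small sketch (our assumption: the constraint matrix is stored as a SciPy sparse matrix in CSC format, which gives direct access to the nonzeros of a column):

    from scipy.sparse import csc_matrix

    def changed_terms_of_Pr(D_csc: csc_matrix, j: int) -> int:
        """Number of terms on the right-hand side of (1.2) whose value changes
        when component j of x changes: one per line i with d^i_j != 0. For the
        CT data of Subsection 5.2 this is on the order of 1,000, versus at
        most three terms of phi in (5.1)."""
        return int(D_csc.indptr[j + 1] - D_csc.indptr[j])   # nonzeros in column j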
5.5. The progressive barrier (PB) approach. The progressive barrier (PB) approach, described in Chapter 12 of [2], was originally published in [1], wherein the history of the approach as a development of the earlier filter methods of Fletcher and Leyffer [18] is presented. The PB approach employs a constraint violation function h(x) [2, Definition 12.1] alongside the target function φ(x), so that the iterates appear in an h versus φ plot, called "a filter", based on the pairs of their h and φ values. The constraint violation function of PB is similar in nature to our proximity function, and the h versus φ filter plot of PB is similar to our proximity-target curve, both specified above. The difference between the PB approach of [1] and our DFS lies in how these objects are used. The PB optimization algorithm defines at each iteration what is a "success" or a "failure" of an iterate based on the current filter plot, and decides accordingly what will be the next iterate. We do not provide full details here but, in a nutshell, the PB algorithm performs at each iteration sophisticated searches of both the target function values and the constraint violation function values.

In contrast to the PB approach, the DFS investigated here takes the world-view of superiorization. It uses a feasibility-seeking algorithm whose properties are already known and perturbs its iterates without losing its feasibility-seeking ability and properties. The component-wise derivative-free search for a nonascending direction of the target function is done in a manner that makes it a perturbation to which the feasibility-seeking algorithm is resilient (that is, the underlying feasibility-seeking algorithm retains its feasibility-seeking nature). Thus, our DFS method searches the target function in a derivative-free fashion and automatically produces feasibility-seeking steps that actively reduce the proximity function. This implies an advantage for the DFS method in handling large-scale problems, because no expensive additional time and computing resources are needed for the feasibility-seeking phase of the DFS algorithm proposed here. Validation of this claim requires further experimental work.

5.6. Specification of algorithmic details and parameters. Our experiments were carried out using the public-domain software package SNARK14 [14]. In all experiments the initial vector x̄ was the 235,225-dimensional zero vector, and the value of the parameter λ in Algorithm 3 was 0.05. Another issue that needs specification is the ordering of the constraints in (1.1), because the output of Algorithm 3 depends not only on the set of constraints, but also on their order. We used in our experiments the so-called efficient ordering, since it has been demonstrated to lead to better results faster when incorporated into ART [22, page 209].

In Algorithm 2, the number N of perturbation steps (for each feasibility-seeking step) was 100,000 and we used γ_ℓ = b a^ℓ, with b = 0.02 and 0 < a < 1. The sequence (c^m)_{m=0}^∞ was obtained by repetitions of the length-2J subsequence (e^1, e^2, ..., e^J, −e^1, −e^2, ..., −e^J).

We applied Algorithm 2 twice, thirty iterations in each case, with and without its component-wise perturbation steps, respectively, under otherwise completely identical conditions. The resulting finite sequences of iterates are both of monotone proximity; the associated proximity-target curves are shown in Figure 1. The ◦s and Es on the plots represent actually calculated values at iterations of each algorithm, which are connected by line segments. For any proximity value on the horizontal axis we can read the target-function value associated with it from the curve.
Fig. 1. Proximity-target curves P and Q of the first 30 iterates of Algorithm 2 with perturbations (E) and without perturbations (◦).

The plots indicate visually the behavior of the algorithms, initialized at the same point, denoted by x^1 = y^1, that appears at the right-most side of the figure. The V-shaped form of the proximity-target curve for Algorithm 2 with perturbations is typical for the behavior of superiorized feasibility-seeking algorithms, showing the initially strong effect of the perturbations, which diminishes as the iterations proceed.

For a more precise interpretation, consider Definition 4.3. In the experiment evaluating the two versions of Algorithm 2, K_lo = L_lo = 1 and K_hi = L_hi = 30. The sequences R = (x^k)_{k=K_lo}^{K_hi} and S = (y^k)_{k=L_lo}^{L_hi} produced by Algorithm 2, with and without perturbations, respectively, are both of monotone proximity. We find that Pr_T(x^{K_lo}) = Pr_T(y^{L_lo}) ≈ 35, so that u ≈ 35, while Pr_T(x^{K_hi}) ≈ 3 and Pr_T(y^{L_hi}) ≈ 4, so that t ≈ 4. Looking at the proximity-target curves P and Q associated with R and S, respectively, Figure 1 clearly illustrates that R is better targeted than S.
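A proximity-target plot in the style of Figure 1 is straightforward to produce from the recorded Pr_T and φ values of the two runs; a minimal matplotlib sketch (our marker choices, not those of the paper) follows:

    import matplotlib.pyplot as plt

    def plot_proximity_target(prox_R, phi_R, prox_S, phi_S):
        """Plot two proximity-target curves: proximity Pr_T on the horizontal
        axis, target function value phi on the vertical axis."""
        plt.plot(prox_R, phi_R, marker="v", label="with perturbations")
        plt.plot(prox_S, phi_S, marker="o", label="without perturbations")
        plt.xlabel("proximity Pr_T")
        plt.ylabel("target function phi")
        plt.legend()
        plt.show()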
6. Discussion and conclusions.
In this paper we investigated the general applicability of derivative-free superiorization (DFS) as an alternative to previously proposed superiorization approaches. In our computational demonstration, we generated the constraints and chose the target function from the application area of image reconstruction from projections in computerized tomography.
Acknowledgments.
We thank Nikolaos Sahinidis and Katya Scheinberg for several informative mail exchanges that helped us to see the general picture better. We are grateful to Sébastien Le Digabel for calling our attention to the work of Charles Audet and coworkers, particularly the Audet and Dennis paper [1] and the book of Audet and Hare [2].
REFERENCES

[1] Audet, C. and Dennis, J.E. Jr., 2009. A progressive barrier for derivative-free nonlinear programming, SIAM Journal on Optimization, 20, 445–472.
[2] Audet, C. and Hare, W., 2017. Derivative-Free and Blackbox Optimization, Springer International Publishing, Cham, Switzerland.
[3] Bargetz, C., Reich, S., and Zalas, R., 2018. Convergence properties of dynamic string-averaging projection methods in the presence of perturbations, Numerical Algorithms, 77, 185–209.
[4] Butnariu, D., Reich, S., and Zaslavski, A.J., 2006. Convergence to fixed points of inexact orbits of Bregman-monotone and of nonexpansive operators in Banach spaces, in: Proceedings of Fixed Point Theory and its Applications, Mexico, Yokohama, 11–32.
[5] Cegielski, A., 2012. Iterative Methods for Fixed Point Problems in Hilbert Spaces, Springer-Verlag.
[6] Cegielski, A. and Al-Musallam, F., 2017. Superiorization with level control, Inverse Problems, 33, 044009.
[7] Censor, Y., 2015. Weak and strong superiorization: Between feasibility-seeking and minimization, Analele Stiintifice ale Universitatii Ovidius Constanta-Seria Matematica, 23, 41–54.
[8] Censor, Y., 2019. Superiorization and Perturbation Resilience of Algorithms: A Bibliography compiled and continuously updated, http://math.haifa.ac.il/yair/bib-superiorization-censor.html, last updated: December 27, 2019.
[9] Censor, Y. and Cegielski, A., 2015. Projection methods: An annotated bibliography of books and reviews, Optimization, 64, 2343–2358.
[10] Censor, Y., Heaton, H., and Schulte, R.W., 2019. Derivative-free superiorization with component-wise perturbations, Numerical Algorithms, 80, 1219–1240.
[11] Censor, Y., Herman, G.T., and Jiang, M. (Editors), 2017. Special issue on Superiorization: Theory and Applications, Inverse Problems, 33, 040301–044014.
[12] Censor, Y. and Zaslavski, A.J., 2015. Strict Fejér monotonicity by superiorization of feasibility-seeking projection methods, Journal of Optimization Theory and Applications, 165, 172–187.
[13] Conn, A.R., Scheinberg, K., and Vicente, L.N., 2009. Introduction to Derivative-Free Optimization, Society for Industrial and Applied Mathematics (SIAM).
[14] Davidi, R., Garduño, E., Herman, G.T., Langthaler, O., Rowland, S.W., Sardana, S., and Ye, Z., 2019. SNARK14: A programming system for the reconstruction of 2D images from 1D projections, available from http://turing.iimas.unam.mx/SNARK14M/SNARK14.pdf.
[15] Diniz-Ehrhardt, M., Martínez, J., and Pedroso, L., 2011. Derivative-free methods for nonlinear programming with general lower-level constraints, Computational and Applied Mathematics, 30, 19–52.
[16] Echeverría Ciaurri, D., Isebor, O., and Durlofsky, L., 2012. Application of derivative-free methodologies to generally constrained oil production optimization problems, Procedia Computer Science, 1, 1301–1310.
[17] Engl, H.W., Hanke, M., and Neubauer, A., 2000. Regularization of Inverse Problems, Kluwer Academic Publishers.
[18] Fletcher, R. and Leyffer, S., 2002. Nonlinear programming without a penalty function, Mathematical Programming, Series A, 91, 239–269.
[19] Garduño, E. and Herman, G.T., 2014. Superiorization of the ML-EM algorithm, IEEE Transactions on Nuclear Science, 61, 162–172.
[20] Garduño, E. and Herman, G.T., 2017. Computerized tomography with total variation and with shearlets, Inverse Problems, 33, 044011.
[21] He, H. and Xu, H.K., 2017. Perturbation resilience and superiorization methodology of averaged mappings, Inverse Problems, 33, 044007.
[22] Herman, G.T., 2009. Fundamentals of Computerized Tomography: Image Reconstruction from Projections, Springer-Verlag, 2nd ed.
[23] Herman, G.T., 2019. Iterative reconstruction techniques and their superiorization for the inversion of the Radon transform, in: R. Ramlau and O. Scherzer, eds., The Radon Transform: The First 100 Years and Beyond, De Gruyter, 217–238.
[24] Herman, G.T., Garduño, E., Davidi, R., and Censor, Y., 2012. Superiorization: An optimization heuristic for medical physics, Medical Physics, 39, 5532–5546.
[25] Kaczmarz, S., 1937. Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin de l'Académie Polonaise des Sciences et Lettres, A35, 355–357.
[26] Li, L., Chen, Y., Liu, Q., Lazic, J., Luo, W., and Li, Y., 2017. Benchmarking and evaluating MATLAB derivative-free optimisers for single-objective applications, in: D.S. Huang, K.H. Jo, and J. Figueroa-García, eds., Intelligent Computing Theories and Application. ICIC 2017, Springer, 75–88.
[27] Luo, S., Zhang, Y., Zhou, T., and Song, J., 2016. Superiorized iteration based on proximal point method and its application to XCT image reconstruction, ArXiv e-prints, https://arxiv.org/abs/1608.03931.
[28] Luo, S., Zhang, Y., Zhou, T., Song, J., and Wang, Y., 2018. XCT image reconstruction by a modified superiorized iteration and theoretical analysis, Optimization Methods and Software, to appear.
[29] Nikazad, T., Davidi, R., and Herman, G.T., 2012. Accelerated perturbation resilient block-iterative projection methods with application to image reconstruction, Inverse Problems, 28, 035005.
[30] Reich, S. and Zaslavski, A.J., 2017. Convergence to approximate solutions and perturbation resilience of iterative algorithms, Inverse Problems, 33, 044005.
[31] Rios, L.M. and Sahinidis, N.V., 2013. Derivative-free optimization: A review of algorithms and comparison of software implementations, Journal of Global Optimization, 56, 1247–1293.
[32] Zaslavski, A.J., 2018.