Estimating the Probability that a Function Observed with Noise is Convex
Nanjing Jian, Shane G. Henderson
School of Operations Research and Information Engineering, Cornell University, [email protected], [email protected]
Consider a real-valued function that can only be observed with stochastic noise at a finite set of design points within a Euclidean space. We wish to determine whether there exists a convex function that goes through the true function values at the design points. We develop an asymptotically consistent Bayesian sequential sampling procedure that estimates the posterior probability of this being true. In each iteration, the posterior probability is estimated using Monte Carlo simulation. We offer three variance reduction methods: change of measure, acceptance-rejection, and conditional Monte Carlo. Numerical experiments suggest that the conditional Monte Carlo method should be preferred.
Key words: convexity detection; Bayesian sequential models; variance reduction; likelihood ratio; conditional Monte Carlo
1. Introduction
Our goal in this paper is to develop a method to determine whether a real-valued function $g : S \subseteq \mathbb{R}^d \to \mathbb{R}$, observed at a finite set of points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r$ in $S$ with noise from a stochastic simulation model, is convex in the sense that a convex function $f$ exists that coincides with $g$ at $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r$. That is, does there exist a convex function $f$ that goes through the points $(\mathbf{x}_1, g(\mathbf{x}_1)), (\mathbf{x}_2, g(\mathbf{x}_2)), \ldots, (\mathbf{x}_r, g(\mathbf{x}_r))$? For example, $g$ could be the expected profit in an inventory problem where the demand $\xi$ is random, the starting inventory $x$ of the day can take integer values $S = \{0, 1, 2, \ldots\}$, and we only observe a simulation estimate of $g$ at $x$. We might choose a few integer values in $S$ to test whether the expected profit is convex as a function of the starting inventory. Or $g$ might represent the (true) expected waiting time for an ambulance as a function of base locations represented in latitude-longitude coordinates, where we observe $g$ with noise through a stochastic simulation model of ambulance operations.
We might then choose a finite set of base-location options in the city to test whether the waiting time is convex with regard to the location of the ambulance bases.

Convexity is a key structural property that can be exploited in many ways. Assuming the minimum is attained, one can use gradient-based methods (for smooth functions) or a cutting-plane based method (for nonsmooth functions) to quickly find the minimum, or bounds on the minimum, e.g., Nesterov (2004), Glynn and Infanger (2013). Even if a function is not globally convex, one might use our methodology to identify regions around local minima in which the restriction of the objective function is convex ("basins of attraction"). Such basins can provide information on the stability of a solution (Vogel 1988).

Our methodology is computationally demanding, so we do not envisage it being used only once prior to the selection of an optimization algorithm, which is then applied to solve an optimization problem only once. Rather, we contend that it is usually the case that optimization models are repeatedly solved, often with the same model structure but with different data. In such cases, one would expect that the presence of convexity (or not) would usually be preserved from one data set to another. So then we believe it appropriate to explore convexity once on one data set using our methodology, and to use the results to inform subsequent effort on many instances of the optimization problem that differ in modest ways, perhaps in terms of input parameters or problem-specific data. In this view, the computational cost of exploring the function once with our methodology is amortized over the subsequent analysis.

Beyond optimization, convexity can also provide insights into qualitative model behavior in simulation applications and elsewhere. This is especially useful when we have a sequence of similar simulation models with different inputs but the same structural properties, as mentioned earlier.

Most studies on convexity detection use a frequentist hypothesis-testing framework, and can be categorized with regard to how the null hypothesis is defined. The first category uses an infinite-dimensional functional approach. It defines the null hypothesis as $g \in C$, where $C$ is the cone of convex functions in an appropriate function space, whereas the alternative hypothesis is $g \notin C$. The representative paper Juditsky and Nemirovski (2002) models $g$ as a Gaussian process assuming smoothness under the null hypothesis, and uses the $L_r$ distance between $g$ and $C$ as the test statistic.
The second category is closer to our paper in that it works with finite-dimensional vectors. The null hypothesis is $\mathbf{g} \in C$, where $\mathbf{g}$ is the restriction of $g$ to a finite set of points, and $C$ is the set of convex functions restricted to the same set of points. In this case, the noisy evaluations of $\mathbf{g}$ are modeled by a Gaussian random vector. Although not targeted specifically at testing convexity, Silvapulle and Sen (2001) describes a general approach for testing whether a finite-dimensional Gaussian vector lies within a closed convex cone. They use the distance between the Gaussian vector and the cone as the test statistic, and show that this distance follows a so-called chi-bar-square distribution, the tail probability of which can be evaluated using simulation.

The third category fits a regression model with Gaussian noise on the observations of $g$ and tests the hypothesis of convexity through the estimated model parameters. This approach is essentially testing whether there exists a convex function that could have generated the observed finite set of function values. In one dimension and under regularity conditions, Baraud et al. (2005) show that testing for the regression parameter is equivalent to testing $\mathbf{g} \in C$, when $C$ is defined with regard to nonnegative second-order Vandermonde determinants. Their test is based on the idea that if a one-dimensional function is convex, then the sample mean of the function values in a partition should be lower than certain linear combinations of the function values in neighboring partitions. Others fit cubic splines to the observations and use the second-order derivatives at the knots to test for convexity, e.g., Diack and Thomas-Agnan (1998), Wang and Meyer (2011) and Meyer (2012). For higher dimensions, Abrevaya and Jiang (2005) work with small and localized sets of data points and count all the possible convex and concave simplices to construct a test statistic. Lau (1978) uses a second-order parametric model for the data points, and is a good survey of early literature.

A closely related field to convexity testing is convex regression, where one fits a regression model to observations under the constraint that the fitted model is convex. Work in this direction includes Judge and Takayama (1966), Allon et al. (2007), Seijo and Sen (2010), Hannah and Dunson (2011), Lim and Glynn (2012), among which Seijo and Sen (2010) provides a review of past work.

In this paper, we give a Bayesian sequential algorithm that iteratively collects noisy evaluations of an unknown function $g$ on a fixed and finite set of design points $\mathbf{x}$, and uses them to estimate the posterior probability that the function, when restricted to the design points, is convex.
Our approach differs from previous research in two main ways. First, since function estimates are obtained via Monte Carlo simulation, we can only observe the function $g$ on a finite set of points $\mathbf{x}$. Thus the best we can provide is a statistical guarantee that there exists a convex function that coincides with the unknown function $g$ at those points. Even if there exists such a convex function, it does not imply that $g$ itself is convex, although nonexistence does imply nonconvexity of $g$. Second, we use a Bayesian conjugate prior model that updates a posterior on the function values every time we collect a set of new samples. The posterior probability of convexity is then estimated separately through Monte Carlo simulation. Instead of having a fixed running time, as is the case with hypothesis testing, our algorithm can be stopped at any stage to output an estimated probability of convexity. Indeed, the Bayesian approach avoids the difficulty in frequentist sequential hypothesis testing where one should condition on the outcomes of previous hypothesis tests when considering the distribution of a test statistic. The Bayesian framework seems to us to be more straightforward in both concept and implementation.

Our overall approach is to successively update a posterior distribution on the vector of (true) function values $\mathbf{g}$. In doing so we assume that the noise in the estimated function values is normally distributed and adopt a conjugate prior so that we can use standard posterior updates. In simulation the normality assumption is common and usually reasonable, since one can average multiple (finite-variance) replications to obtain a single approximately normally distributed replication through the central limit theorem. Normality is not essential to our approach; other distributions could be assumed, but the use of a conjugate prior is key to keeping the computation manageable. We restrict attention to the normal assumption for brevity and simplicity, and because our primary intended application area is in simulation, where the assumption can almost always be at least approximately satisfied (through averaging).

For a given posterior distribution, computing the probability of convexity for a sample from the posterior appears to be difficult. We use Monte Carlo to estimate this probability, providing three methods for reducing the variance of the Monte Carlo estimator of the posterior probability of convexity. The change of measure and acceptance-rejection methods reuse samples obtained in an earlier iteration to construct an unbiased estimator for the current iteration. These two methods can be useful in any sequential algorithm in a Bayesian framework, but need to be used with caution due to heavy-tailed behavior of the likelihood ratio that is needed in these methods.
The conditional Monte Carlo method takes advantage of the spherical property of Gaussian and $t$ distributions. It can be applied to the more general problem of estimating the probability that a Gaussian or $t$-distributed random vector lies in a polyhedron, which might arise, e.g., in solving linear feasibility problems described in Szechtman and Yücesan (2016).

We view the Bayesian methodology developed herein, where we repeatedly estimate the posterior probability of convexity as $n$, the number of simulation replications, increases, as being well suited to exploratory analysis aimed at developing knowledge of the structure of the function $g$. Certainly, a key strength of the Bayesian approach, relative to hypothesis tests, is that it allows an analyst to explore several values of $n$, interactively increasing $n$ until satisfied with the results. Hypothesis tests do not afford this flexibility, so in our opinion are not as well suited to the exploratory analysis we envisage. Our statistical guarantees are weaker than those of formal hypothesis tests, in that the confidence intervals we generate on the posterior probability of convexity hold only marginally in $n$ and not jointly in $n$ (they are not confidence bands). However, plotting the confidence intervals as a function of $n$, as we do, provides a visual sense of whether the function $g$ (restricted to the selected points) is convex or not, along with a sense of the sample size needed to establish convexity or not with some reliability. It is conceivable that one could further develop the Bayesian methodology into a more rigorous test, but we do not pursue that line, again because of our focus on an exploratory tool.

This paper is built upon Jian et al. (2014), with the addition of new results, complete proofs, new variance reduction methods, and more extensive numerical results. Full proofs of several results can be found in the online supplement.
Notation: We use upper-case Latin letters for random variables or sets, and lower-case Latin letters for deterministic variables. Vectors are in bold, and matrices are in upper-case Greek letters. We use $A^T$ to denote the transpose of the matrix $A$. For a set $S$, $S^\circ$ is its interior. We use $\Rightarrow$ for convergence in distribution, and $\mathbb{1}\{B\}$ is the indicator function that takes the value 1 on the event $B$ and 0 otherwise.
2. Problem Statement and Assumptions
Suppose we can obtain noisy evaluations of a real-valued function $g : S \to \mathbb{R}$ over a fixed and finite set of design points $\mathbf{x} = \{\mathbf{x}_i : i = 1, 2, \ldots, r\}$ in $S \subseteq \mathbb{R}^d$. Ideally, we would like to know whether the function $g$ is convex on $S$ or not.
In the absence of any regularity assumptions on $g$, such as are assumed in Juditsky and Nemirovski (2002), it appears that this question cannot be addressed in finite time. The rest of this paper is focused on giving a statistical guarantee on the convexity of the finite-dimensional vector obtained by restricting the function $g$ to the $r$ points in $\mathbf{x}$, i.e., we restrict attention to $\mathbf{g} = (g(\mathbf{x}_1), g(\mathbf{x}_2), \ldots, g(\mathbf{x}_r)) \in \mathbb{R}^r$.
Definition 1. Given a finite set of points $\mathbf{x}$, we define a vector $\mathbf{g}$ to be convex if and only if there exists a convex function $g$ whose values on $\mathbf{x}$ coincide with $\mathbf{g}$, i.e., $g(\mathbf{x}) = \mathbf{g}$.
Definition 2. Define $C = C(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r) \subseteq \mathbb{R}^r$ to be the set of all convex vectors on $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r)$, so that $\mathbf{g}$ is convex if and only if $\mathbf{g} \in C$.

The notion of vector convexity is weaker than (the usual) functional convexity, since if a function $g$ is convex then its restriction $\mathbf{g}$ to $\mathbf{x}$ is convex under our definition. The converse is not true since we can arbitrarily extend $g$.

The set $C$ is a convex cone. In Section 5, we will see that $\mathbf{g} \in C$ if and only if a certain linear system is feasible, and the linear system is then a tool to verify vector convexity. We will also show, in the online supplement, that $\mathbf{g}$ is strictly convex (in the sense that a strictly convex function $g$ exists that agrees with $\mathbf{g}$ on $\mathbf{x}$) if and only if $\mathbf{g} \in C^\circ$, the interior of $C$.

We use a Bayesian approach, regarding $\mathbf{g}$ as an unknown realization from the prior distribution of a random vector $\mathbf{f}$. Let $(\boldsymbol{\xi}_j : j = 1, 2, \ldots)$ be an i.i.d. sequence of $r$-dimensional Gaussian random vectors, each with mean vector $\mathbf{0}$ and covariance matrix $\Gamma$, that is independent of $\mathbf{f}$. Our $j$th observation is then $\mathbf{Y}_j$, where $\mathbf{Y}_j = \mathbf{f} + \boldsymbol{\xi}_j$, $j = 1, 2, \ldots$. We denote the $i$th component of the vector $\mathbf{Y}_j$ by $Y_{ij}$, and interpret it as the $j$th sampled function value at the point $\mathbf{x}_i$.

To elaborate, we view $\mathbf{g}$ as a deterministic vector of function values. In the Bayesian structure within which we work, the vector $\mathbf{g}$ is modeled through a random vector $\mathbf{f}$ that is sampled from a prior distribution at the outset, and we subsequently accumulate evidence on the value of $\mathbf{f}$ (and thus $\mathbf{g}$) through the samples $\mathbf{Y}_1, \mathbf{Y}_2, \ldots$, updating our beliefs through the posterior distribution. In principle, with sufficient replications we can recover the value of $\mathbf{f}$; our progress towards this goal is reflected through the posterior distribution.

We observe the value of $\mathbf{f}$ with additive noise, through the vector outputs of successive simulation replications $\mathbf{Y}_1, \mathbf{Y}_2, \ldots$. Consistent with much simulation literature, conditional on $\mathbf{f}$, the observed values $\mathbf{Y}_j$, $j = 1, 2, \ldots$ are assumed to be Gaussian, which can at least be approximately satisfied by averaging multiple simulation replications to yield a single $\mathbf{Y}_j$.
This is an approximation, but it affords considerable computational advantages, as we will see later.

The covariance matrix $\Gamma$ is not necessarily diagonal, i.e., we do not necessarily constrain the observations to be conditionally independent between sampled points, conditional on $\mathbf{f}$. Thus, common random numbers (CRN) can be employed within our framework. CRN induces positive correlation on the $r$ dimensions of the noise $\boldsymbol{\xi}$, so that the structure of the underlying "true" function $\mathbf{f}$ can be better preserved than would be possible with conditionally independent observations (Chen et al. 2012). For simplicity, we assume throughout that the covariance matrix $\Gamma$ is positive definite.
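To make the sampling model concrete, the following sketch draws observations $\mathbf{Y}_j = \mathbf{f} + \boldsymbol{\xi}_j$ with a non-diagonal covariance of the kind CRN induces. It is written in Python with NumPy rather than the Matlab of our implementation, and all numerical values (design points, function, variances) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

r = 5                                    # number of design points (hypothetical)
x = np.linspace(-2.0, 2.0, r)            # 1-d design points (hypothetical)
g = x ** 2                               # true function values at the points

# A non-diagonal, positive definite Gamma, as might arise under CRN:
# correlation between points decays with the distance between them.
Gamma = 0.01 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

def observe(g, Gamma, rng):
    """One noisy observation Y_j = f + xi_j with xi_j ~ N(0, Gamma)."""
    return rng.multivariate_normal(g, Gamma)

print(observe(g, Gamma, rng))
```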
3. Sequential Algorithm
Initially, before any sampling, we fix the $r$ points $\mathbf{x}_i$, $i = 1, \ldots, r$ and the prior mean $\boldsymbol{\mu}_0$ and covariance matrix $\Lambda_0$ of the assumed Gaussian prior distribution of $\mathbf{f}$. At the beginning of the $n$th iteration ($n = 1, 2, \ldots$), we obtain a new observation $\mathbf{Y}_n$, and use it to update the posterior distribution on the function values, as described in Section 4. The information collected thus far is denoted $\mathcal{A}_n$, which is the sigma field generated from $\mathcal{A}_0$ and $\{\mathbf{Y}_j, j = 1, \ldots, n\}$. Thus, $\{\mathcal{A}_n\}_{n = 0, 1, 2, \ldots}$ is a filtration. Once the posterior distribution has been updated, we separately estimate the posterior probability of convexity, $P(\mathbf{f} \in C \mid \mathcal{A}_n)$, as discussed in Section 5. At the end of each iteration, we can choose either to stop, or to continue with the current posterior as the prior for the next iteration. More precisely, the algorithm is as follows.
Algorithm 1 A sequential method for testing for convexity of the function
Require: The Gaussian prior $P(\mathbf{f} \in \cdot \mid \mathcal{A}_0)$, with hyperparameters $\{\boldsymbol{\mu}_0, \Lambda_0\}$, of the function values $\mathbf{f}$.
1: Initialize $n = 0$.
2: repeat
3: Set $n = n + 1$.
4: Obtain a new vector $\mathbf{y}_n$ of $r$ noisy function values at $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r$.
5: Update the posterior distribution of $\mathbf{f} \mid \mathcal{A}_n$ from the new samples $\mathbf{y}_n$ using $\mathbf{f} \mid \mathcal{A}_{n-1}$ as the prior, as in Section 4.
6: Estimate $p_n = P(\mathbf{f} \in C \mid \mathcal{A}_n)$ from the distribution of $\mathbf{f} \mid \mathcal{A}_n$ using the Monte Carlo method described in Section 5, obtaining a confidence interval $[\hat{p}_n - h_n, \hat{p}_n + h_n]$.
7: until stopped
8: return A confidence interval $[\hat{p}_n - h_n, \hat{p}_n + h_n]$ for $p_n$.

In Step 6, the posterior probability that $\mathbf{f}$ is convex is estimated using Monte Carlo simulation by sampling $m$ times from the posterior distribution $\mathbf{f} \mid \mathcal{A}_n$. Compared to obtaining the samples $\mathbf{Y}_j$, which involves running the full simulation model, sampling from the posterior distribution is computationally inexpensive, entailing sampling from a Gaussian or $t$ distribution, depending on whether the variance is known or unknown. Depending on the computational cost of running the full simulation model, we can also skip the estimation of $p_n$ for some $n$ and enter Step 6 only for selected values of $n$.
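The following sketch illustrates the flow of Algorithm 1 for the known-variance case, in Python with hypothetical inputs (our released implementation is in Matlab). To keep the sketch self-contained it uses sorted one-dimensional design points, for which the convexity check of Section 5 reduces to testing that successive slopes are nondecreasing; in higher dimensions one solves the linear systems LS($i$) instead. The posterior update is equation (1) of Section 4 with $s = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(-1.0, 1.0, 6)            # sorted 1-d design points
g = x ** 2                               # true (convex) function values
Gamma = 0.01 * np.eye(len(x))            # known sampling covariance

def is_convex_vector(v, x):
    """1-d shortcut for the LS(i) checks: convex iff slopes nondecrease."""
    slopes = np.diff(v) / np.diff(x)
    return bool(np.all(np.diff(slopes) >= 0.0))

mu = np.zeros(len(x))                    # non-informative prior: zero mean,
Lam = 100.0 * np.eye(len(x))             # large variance
Gamma_inv = np.linalg.inv(Gamma)

m = 200                                  # posterior Monte Carlo samples
for n in range(1, 51):
    y = rng.multivariate_normal(g, Gamma)              # Step 4: new observation
    Lam_inv_prev = np.linalg.inv(Lam)                  # Step 5: update (1), s = 1
    Lam = np.linalg.inv(Lam_inv_prev + Gamma_inv)
    mu = Lam @ (Lam_inv_prev @ mu + Gamma_inv @ y)
    samples = rng.multivariate_normal(mu, Lam, size=m)  # Step 6: estimate p_n
    ind = np.array([is_convex_vector(s, x) for s in samples], dtype=float)
    p_hat, h = ind.mean(), 1.96 * ind.std(ddof=1) / np.sqrt(m)
    if n % 10 == 0:
        print(f"n={n}: p_hat={p_hat:.3f} +/- {h:.3f}")
```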
4. Posterior Updates
We use a conjugate prior to update our belief about $\mathbf{f} \mid \mathcal{A}_n$ in each iteration $n$. Since we assumed that $\mathbf{Y}_j - \mathbf{f} \sim N(\mathbf{0}, \Gamma)$, this conjugate prior is normal-normal when $\Gamma$ is known, and normal-inverse-Wishart when $\Gamma$ is unknown. In this section we give the updating formulae for the posterior distribution of $\mathbf{f} \mid \mathcal{A}_n$ in Step 5 of Algorithm 1 under these two scenarios, given the prior $\mathbf{f} \mid \mathcal{A}_{n-1}$. The formulae given here are standard; see, e.g., DeGroot (1970), Gelman et al. (2003), or Bernardo and Smith (2008).
First, before any sampling, we select a non-informative Gaussian prior with zero mean and large variance, i.e., $\mathbf{f} \mid \mathcal{A}_0 \sim N(\boldsymbol{\mu}_0, \Lambda_0)$, in which $\boldsymbol{\mu}_0 = \mathbf{0} \in \mathbb{R}^r$ and $\Lambda_0 \in \mathbb{R}^{r \times r}$ is a diagonal matrix with diagonal values that are large relative to the sampling variance (the diagonal of $\Gamma$). Alternative parameters for the Gaussian prior could be used in the presence of more information, and would not change the algorithm.
At iteration $n \geq 1$, the posterior from the last iteration, $\mathbf{f} \mid \mathcal{A}_{n-1} \sim N(\boldsymbol{\mu}_{n-1}, \Lambda_{n-1})$, is used as the prior for the current iteration. We then obtain $s \geq 1$ new samples $(y_{ij}, j = 1, 2, \ldots, s)$ at each of the design points $\mathbf{x}_i$, $i = 1, 2, \ldots, r$. The mean $\boldsymbol{\mu}_n$ and covariance $\Lambda_n$ of the posterior $\mathbf{f} \mid \mathcal{A}_n \sim N(\boldsymbol{\mu}_n, \Lambda_n)$ are updated by
$$\Lambda_n^{-1} = \Lambda_{n-1}^{-1} + s\Gamma^{-1}, \qquad \boldsymbol{\mu}_n = \Lambda_n \left( \Lambda_{n-1}^{-1} \boldsymbol{\mu}_{n-1} + s\Gamma^{-1} \bar{\mathbf{y}} \right), \tag{1}$$
where the $i$-th component of the $r$-dimensional vector $\bar{\mathbf{y}}$ is $s^{-1} \sum_{j=1}^{s} y_{ij}$. One can adaptively choose the sample size $s$ in each iteration, but for simplicity we use $s = 1$, meaning that only one new sample is obtained in each iteration.

The updating equation (1) involves matrix inversion. To reduce the computational effort, we use Cholesky factorization and the Sherman-Morrison-Woodbury formula, as detailed in the online supplement.
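As a concrete illustration of avoiding explicit inversion (a sketch only; the Cholesky-based details of our Matlab implementation are in the online supplement), the following Python snippet verifies on random hypothetical parameters that update (1) with $s = 1$ can be computed through a single linear solve against $\Lambda_{n-1} + \Gamma$, using the Woodbury identity $(\Lambda^{-1} + \Gamma^{-1})^{-1} = \Lambda - \Lambda(\Lambda + \Gamma)^{-1}\Lambda$ and the Kalman-style mean update.

```python
import numpy as np

rng = np.random.default_rng(2)
r = 4
A = rng.normal(size=(r, r)); Lam_prev = A @ A.T + np.eye(r)  # prior covariance
B = rng.normal(size=(r, r)); Gamma = B @ B.T + np.eye(r)     # sampling covariance
mu_prev, ybar = rng.normal(size=r), rng.normal(size=r)

# Direct form of update (1) with s = 1: two explicit inversions.
Lam_direct = np.linalg.inv(np.linalg.inv(Lam_prev) + np.linalg.inv(Gamma))
mu_direct = Lam_direct @ (np.linalg.solve(Lam_prev, mu_prev)
                          + np.linalg.solve(Gamma, ybar))

# Woodbury form: each update needs one solve against Lam_prev + Gamma.
Lam_new = Lam_prev - Lam_prev @ np.linalg.solve(Lam_prev + Gamma, Lam_prev)
mu_new = mu_prev + Lam_prev @ np.linalg.solve(Lam_prev + Gamma, ybar - mu_prev)

assert np.allclose(Lam_direct, Lam_new) and np.allclose(mu_direct, mu_new)
```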
When the sampling variance $\Gamma$ is unknown, the inverse-Wishart distribution provides a conjugate prior. First, we use an uninformative Jeffreys prior, where the prior joint distribution on $\mathbf{f}$ and $\Gamma$ is proportional to $|\Gamma|^{-(r+1)/2}$ (Gelman et al. 2003), and $|A|$ denotes the determinant of the matrix $A$. To construct the Jeffreys prior, an initial set of $r$-dimensional samples $\mathbf{y}_j = (y_{ij}, i = 1, 2, \ldots, r)$, $j = 1, 2, \ldots, s_0$, is used to estimate the parameters of the normal distribution for the mean $\mathbf{f}$ and the inverse-Wishart distribution (Inv-Wishart) for the variance $\Gamma$. The initial sample size $s_0$ can be any positive integer. For a prior that reflects the data without being too costly, we choose $s_0 = r + 1$, where $r$ is the number of design points. This choice also ensures that the inverse-Wishart distribution is concentrated on covariance matrices that are positive definite. More specifically (Gelman et al. 2003),
$$\Gamma \mid \mathcal{A}_0, \mathbf{y} \sim \text{Inv-Wishart}_{\upsilon_0}(\Xi_0^{-1}), \qquad \mathbf{f} \mid \Gamma, \mathcal{A}_0, \mathbf{y} \sim N(\boldsymbol{\mu}_0, \Gamma/\kappa_0), \tag{2}$$
where
$$\boldsymbol{\mu}_0 = \frac{1}{s_0} \sum_{j=1}^{s_0} \mathbf{y}_j = \bar{\mathbf{y}}; \qquad \kappa_0 = s_0; \qquad \upsilon_0 = s_0 - 1; \qquad \Xi_0 = \sum_{j=1}^{s_0} (\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})^T.$$
In iteration $n \geq 1$, we obtain $s$ samples $y_{ij}$ at each of the points $\mathbf{x}_i$, $i = 1, \ldots, r$, and update the posterior
$$\Gamma \mid \mathcal{A}_n \sim \text{Inv-Wishart}_{\upsilon_n}(\Xi_n^{-1}), \qquad \mathbf{f} \mid \Gamma, \mathcal{A}_n \sim N(\boldsymbol{\mu}_n, \Gamma/\kappa_n) \tag{3}$$
by (Gelman et al. 2003)
$$\boldsymbol{\mu}_n = \frac{\kappa_{n-1}}{\kappa_{n-1} + s} \boldsymbol{\mu}_{n-1} + \frac{s}{\kappa_{n-1} + s} \bar{\mathbf{y}}; \qquad \kappa_n = \kappa_{n-1} + s; \qquad \upsilon_n = \upsilon_{n-1} + s;$$
$$\Xi_n = \Xi_{n-1} + S + \frac{\kappa_{n-1} s}{\kappa_{n-1} + s} (\bar{\mathbf{y}} - \boldsymbol{\mu}_{n-1})(\bar{\mathbf{y}} - \boldsymbol{\mu}_{n-1})^T, \tag{4}$$
where the $i$-th component of the $r$-dimensional vector $\bar{\mathbf{y}}$ is defined as $\bar{y}_i = \sum_{j=1}^{s} y_{ij}/s$, $i = 1, \ldots, r$, and the $r \times r$ matrix $S$ is the sum of squared errors $\sum_{j=1}^{s} (\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})^T$. For simplicity we again choose $s = 1$ when updating, so that $\mathbf{y}_j = \bar{\mathbf{y}}$ and $S = 0$.

If a random $r \times r$ matrix $\Gamma$ has the inverse-Wishart distribution with parameters $\upsilon$ and $\Xi^{-1}$, whose density is proportional to $|\Xi|^{\upsilon/2}\, |\Gamma|^{-(\upsilon + r + 1)/2} \exp\{-\mathrm{tr}(\Xi\Gamma^{-1})/2\}$, where $\mathrm{tr}(\cdot)$ is the trace of a matrix, then the inverse $\Gamma^{-1}$ has the Wishart distribution with parameters $\upsilon$ and $\Xi$. The Wishart distribution is a higher-dimensional generalization of the $\chi^2$ distribution, and thus can be expressed similarly as a sum of squares of Gaussian random vectors. To generate a Wishart$_{\upsilon}(\Xi)$ distributed random matrix $\Gamma$, we generate $\upsilon > r$ independent, $r$-dimensional random vectors $\mathbf{W}_i$ distributed as $N(\mathbf{0}, \Xi)$, and return $\Gamma = \sum_{i=1}^{\upsilon} \mathbf{W}_i \mathbf{W}_i^T$.

With the posterior covariance distributed as inverse-Wishart and the posterior mean distributed as Gaussian conditional on the covariance, the marginal distribution of the posterior mean $\mathbf{f} \mid \mathcal{A}_n$ is
$$\mathbf{f} \mid \mathcal{A}_n \sim t_{\upsilon_n - r + 1}\big(\boldsymbol{\mu}_n,\ \Xi_n/(\kappa_n(\upsilon_n - r + 1))\big), \tag{5}$$
where $t_{\upsilon_n - r + 1}(\boldsymbol{\mu}_n, \Xi_n/(\kappa_n(\upsilon_n - r + 1)))$ is a multivariate Student-$t$ distribution with $\upsilon_n - r + 1$ degrees of freedom, location parameter $\boldsymbol{\mu}_n$, and scale matrix $\Xi_n/(\kappa_n(\upsilon_n - r + 1))$. The density function of $\mathbf{f} \mid \mathcal{A}_n$ is thus proportional to
$$\left| \frac{\Xi_n}{\kappa_n(\upsilon_n - r + 1)} \right|^{-1/2} \left\{ 1 + \frac{(\mathbf{f} - \boldsymbol{\mu}_n)^T \left[ \Xi_n/(\kappa_n(\upsilon_n - r + 1)) \right]^{-1} (\mathbf{f} - \boldsymbol{\mu}_n)}{\upsilon_n - r + 1} \right\}^{-(\upsilon_n + 1)/2}$$
(Gelman et al. 2003). One can generate such a random vector by exploiting the elliptical nature of the distribution; see Section 7.3. (Although that section does not explicitly give a generation algorithm, an algorithm should be clear from the arguments given there.)
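For reference, one standard generation algorithm for the marginal posterior (5) uses the scale-mixture representation of the multivariate $t$ distribution. The sketch below is in Python with hypothetical hyperparameter values; it is a minimal illustration, not the Matlab routine in our repository.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_posterior_t(mu_n, Xi_n, kappa_n, upsilon_n, size, rng):
    """Draw from (5): multivariate t with nu = upsilon_n - r + 1 degrees
    of freedom, location mu_n, and scale matrix Xi_n / (kappa_n * nu)."""
    r = len(mu_n)
    nu = upsilon_n - r + 1
    L = np.linalg.cholesky(Xi_n / (kappa_n * nu))
    z = rng.standard_normal((size, r))        # N(0, I) draws
    w = rng.chisquare(nu, size=size)          # chi-square mixing variables
    return mu_n + (z * np.sqrt(nu / w)[:, None]) @ L.T

# Hypothetical posterior hyperparameters for r = 3 design points:
mu_n = np.array([1.0, 0.2, 1.1])
Xi_n = np.array([[2.0, 0.3, 0.1],
                 [0.3, 1.5, 0.2],
                 [0.1, 0.2, 1.8]])
print(sample_posterior_t(mu_n, Xi_n, kappa_n=20, upsilon_n=22, size=5, rng=rng))
```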
5. Convexity
Recall that $\mathbf{g} = (g(\mathbf{x}_1), g(\mathbf{x}_2), \ldots, g(\mathbf{x}_r))$, and the vector $\mathbf{g}$ is defined to be convex if and only if there exists a convex function that coincides with $\mathbf{g}$ on the set of points $\mathbf{x}_1, \ldots, \mathbf{x}_r$. Equivalently, for each $i = 1, \ldots, r$ there exists a hyperplane $\{(\mathbf{x}, \mathbf{a}_i^T \mathbf{x} + b_i) : \mathbf{x} \in \mathbb{R}^d\}$ that goes through $(\mathbf{x}_i, g(\mathbf{x}_i))$ and lies at or below all the other points $(\mathbf{x}_j, g(\mathbf{x}_j))$, $j \neq i$ (Murty 1988, p. 539; Atlason et al. 2004). That is, $\mathbf{g}$ is convex if and only if there exist feasible solutions $\mathbf{a}_i \in \mathbb{R}^d$, $i = 1, \ldots, r$, and $\mathbf{b} \in \mathbb{R}^r$ to the linear system
$$\begin{aligned} \mathbf{a}_i^T \mathbf{x}_i + b_i &= g(\mathbf{x}_i), && \text{for all } i \in \{1, \ldots, r\}, \\ \mathbf{a}_i^T \mathbf{x}_j + b_i &\leq g(\mathbf{x}_j), && \text{for all } i \in \{1, \ldots, r\} \text{ and } j \neq i,\ j \in \{1, \ldots, r\}, \end{aligned} \tag{LS}$$
with $b_i$ being the $i$-th component of $\mathbf{b}$. Let the set of all $\mathbf{g}$ such that the corresponding system (LS) is feasible be $C$, which denotes the set of all convex vectors $\mathbf{g}$ with regard to the $r$ design points $\mathbf{x}_1, \ldots, \mathbf{x}_r$.

This large linear system can also be decomposed into $r$ sub-systems, indexed by $i = 1, \ldots, r$:
$$\begin{aligned} \mathbf{a}_i^T \mathbf{x}_i + b_i &= g(\mathbf{x}_i), \\ \mathbf{a}_i^T \mathbf{x}_j + b_i &\leq g(\mathbf{x}_j), && \text{for all } j \neq i,\ j \in \{1, \ldots, r\}, \end{aligned} \tag{LS($i$)}$$
each with the variables $\mathbf{a}_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$.

Transforming the question of whether a vector is convex into the feasibility of $r$ linear systems allows us to use Monte Carlo simulation to estimate the posterior probability of convexity at the end of each iteration $n$. We first simulate $m$ random samples from the posterior distribution of $\mathbf{f} \mid \mathcal{A}_n$. Then, for each generated sample, we determine feasibility (or lack thereof) of the linear systems LS($i$), $i = 1, \ldots, r$, in sequence. If any linear system is infeasible then we stop (skip the rest of the systems) and conclude that this generated sample is not convex, since one cannot define an appropriate hyperplane. The probability $P(\mathbf{f} \in C \mid \mathcal{A}_n)$ is then estimated by the sample average of the indicators of convexity for each sample, as described more formally in Algorithm 2.
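A minimal sketch of this feasibility check, using Python and scipy.optimize.linprog with a zero objective (pure feasibility) in place of the Matlab solvers of our implementation; the design points and values below are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def vector_is_convex(x, g):
    """Feasibility of LS(i) for each i = 1, ..., r: is there a convex
    function through the points (x_i, g_i)?  x: (r, d) array, g: (r,)."""
    r, d = x.shape
    for i in range(r):
        others = [j for j in range(r) if j != i]
        # Variables (a_i, b_i) in R^{d+1}; zero objective (feasibility only).
        c = np.zeros(d + 1)
        A_eq = np.concatenate([x[i], [1.0]])[None, :]       # a^T x_i + b = g_i
        A_ub = np.hstack([x[others], np.ones((r - 1, 1))])  # a^T x_j + b <= g_j
        res = linprog(c, A_ub=A_ub, b_ub=g[others], A_eq=A_eq, b_eq=[g[i]],
                      bounds=[(None, None)] * (d + 1))
        if not res.success:          # LS(i) infeasible: skip remaining systems
            return False
    return True

x = np.array([[-1.0], [0.0], [1.0]])                   # r = 3 points, d = 1
print(vector_is_convex(x, np.array([1.0, 0.0, 1.0])))  # True: from x^2
print(vector_is_convex(x, np.array([0.0, 1.0, 0.0])))  # False: concave kink
```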
6. Asymptotic Validity of the Main Algorithm
We now establish that the posterior probability of convexity converges to 1 or 0, depending on whether $\mathbf{g}$ is convex or not, with one qualification. If $\mathbf{g}$ is convex but not strictly convex then it lies on the boundary of $C$, and certain arbitrarily small perturbations of the function values $\mathbf{g}$ will yield points outside $C$. Since we estimate the function values $\mathbf{g}$ using simulation, we cannot rule out such perturbations, and so we should not expect the posterior probability of convexity to converge to 1 or 0.
Algorithm 2 Subroutine used in Step 6 of Algorithm 1 to estimate $P(\mathbf{f} \in C \mid \mathcal{A}_n)$.
Require: The posterior marginal density of $\mathbf{f} \mid \mathcal{A}_n$ from (1) or (5).
1: Generate independent samples $\{\mathbf{y}_n^1, \mathbf{y}_n^2, \ldots, \mathbf{y}_n^m\}$ from the posterior marginal density of $\mathbf{f} \mid \mathcal{A}_n$.
2: for $k$ from 1 to $m$ do
3: Set $\mathbb{1}\{\mathbf{y}_n^k \in C\} = 1$.
4: for $i$ from 1 to $r$ do
5: Set $g(\mathbf{x}_i)$ to be the $i$-th component of $\mathbf{y}_n^k$.
6: Solve for the feasibility of LS($i$).
7: if LS($i$) is infeasible then
8: Set $\mathbb{1}\{\mathbf{y}_n^k \in C\} = 0$. BREAK the inner loop and go to the next $k$.
9: end if
10: end for
11: end for
12: return The center $\hat{p}_n = \sum_{k=1}^{m} \mathbb{1}\{\mathbf{y}_n^k \in C\}/m$ and half-width $h_n = 1.96\, s_n/\sqrt{m}$ of a 95% confidence interval for $P(\mathbf{f} \in C \mid \mathcal{A}_n)$, where $s_n$ is the sample standard deviation of $\mathbb{1}\{\mathbf{y}_n^k \in C\}$, $k = 1, \ldots, m$.

The formal statement of convergence is with respect to the probability space containing both the prior from which $\mathbf{f}$ is sampled, and the data. We show that when $\mathbf{f}$ is strictly convex the posterior probability of convexity converges to 1, and when $\mathbf{f}$ is not convex the posterior probability of convexity converges to 0. The remaining case, where $\mathbf{f}$ lies on the boundary of $C$, has probability 0 under our prior, which has a density with respect to Lebesgue measure.
Theorem 1. Let $p_n = P(\mathbf{f} \in C \mid \mathcal{A}_n)$ be the $n$-iteration posterior probability that $\mathbf{f}$ is convex, as in Algorithm 1. As the number of iterations $n \to \infty$, $p_n - \mathbb{1}\{\mathbf{f} \in C\} \to 0$ almost surely.
Jian et al. (2014) contains a sketch of the proof in the known-variance case. We provide a complete proof that covers both the known and unknown variance cases in the online supplement.

The result of Theorem 1 relates to the exact posterior probability of convexity, which we estimate using Monte Carlo. We next show, through a uniform law of large numbers, that the Monte Carlo estimator from Section 5 of the exact probability converges to the same indicator, provided that the Monte Carlo sample sizes increase without bound.
Corollary 1. Let $p_n^m$ be the $m$-sample estimator of $P(\mathbf{f} \in C \mid \mathcal{A}_n)$ from Algorithm 2. As $n \to \infty$ and $m = m(n) \to \infty$, $p_n^m - \mathbb{1}\{\mathbf{f} \in C\} \to 0$ in probability.

Proof. We have $|p_n^m - \mathbb{1}\{\mathbf{f} \in C\}| \leq |p_n^m - p_n| + |p_n - \mathbb{1}\{\mathbf{f} \in C\}|$, where $p_n = P(\mathbf{f} \in C \mid \mathcal{A}_n)$ and $p_n^m = \frac{1}{m} \sum_{k=1}^{m} \mathbb{1}\{\mathbf{y}_n^k \in C\}$. Let $\epsilon > 0$. Then
$$P(|p_n^m - p_n| > \epsilon) = E\, P\left( \left| \frac{1}{m} \sum_{k=1}^{m} \mathbb{1}\{\mathbf{y}_n^k \in C\} - P(\mathbf{f} \in C \mid \mathcal{A}_n) \right| > \epsilon \;\Big|\; \mathcal{A}_n \right) \leq E\left( \frac{\mathrm{Var}(\mathbb{1}\{\mathbf{y}_n^k \in C\} \mid \mathcal{A}_n)}{m\epsilon^2} \right) \leq \frac{1}{4m\epsilon^2} \to 0$$
as $n \to \infty$, since $m = m(n) \to \infty$ as $n \to \infty$. (The first inequality is Chebyshev's, and the second uses the fact that the variance of an indicator is at most $1/4$.) This shows that $p_n^m - p_n \to 0$ in probability as $n \to \infty$. Also, Theorem 1 shows that $|p_n - \mathbb{1}\{\mathbf{f} \in C\}| \to 0$ almost surely as $n \to \infty$. $\square$
7. Variance Reduction Methods
In this section, we improve the vanilla Monte Carlo method through three variance reduction methods. The change of measure and acceptance-rejection methods are likelihood-ratio-based methods that reuse samples generated in an earlier iteration, and the conditional Monte Carlo method reduces the variance through smoothing.
7.1. Change of Measure
Algorithm 2 can be computationally costly due to the need to solve up to $mr$ linear feasibility problems LS($i$), where $m$ is the number of Monte Carlo samples and $r$ is the number of design points. We can reduce the computational effort by reusing samples generated in a previous iteration through a change-of-measure method. The resulting estimator is based on the same principle used in the score-function method for simulation optimization (Rubinstein and Shapiro 1990), and that used in "green simulation" (Feng and Staum 2017).
In iteration $n$, Algorithm 2 generates $m$ i.i.d. samples $\{\mathbf{y}_n^k : k = 1, 2, \ldots, m\}$ from the posterior marginal distribution of $\mathbf{f} \mid \mathcal{A}_n$ and produces $m$ indicators $\{\mathbb{1}\{\mathbf{y}_n^k \in C\} : k = 1, 2, \ldots, m\}$ of convexity. To reuse these samples, in iteration $n + \ell$, we instead output
$$\hat{p}_{n+\ell} = \frac{1}{m} \sum_{k=1}^{m} \mathbb{1}\{\mathbf{y}_n^k \in C\}\, L_{n+\ell,n}(\mathbf{y}_n^k) \tag{6}$$
as an estimate of $p_{n+\ell} = P(\mathbf{f} \in C \mid \mathcal{A}_{n+\ell})$, where $L_{n+\ell,n}(\cdot) = \phi_{n+\ell}(\cdot)/\phi_n(\cdot)$ is the likelihood ratio of the densities of $\mathbf{f} \mid \mathcal{A}_{n+\ell}$ and $\mathbf{f} \mid \mathcal{A}_n$.
Theorem 2. The change of measure estimator $\hat{p}_{n+\ell} = \mathbb{1}\{\mathbf{Y}_n \in C\}\, L_{n+\ell,n}(\mathbf{Y}_n)$ is (conditionally) unbiased and has finite conditional variance, conditional on $\mathcal{A}_{n+\ell}$, for any $n$ and $\ell \geq 0$.

The proof for the known-$\Gamma$ case can be found in Jian et al. (2014), and we provide a proof for the unknown-$\Gamma$ case in the online supplement.

Given that the change of measure estimator is unbiased and has finite variance, it is tempting to generate a single set of samples and reuse it for many iterations to save computational effort. Unfortunately, such an estimator has poor empirical performance. Figure 3 gives an example where the estimated probability of convexity is greater than 1. This happens especially later in the run, when all of the linear systems are feasible and the likelihood ratios $L_{n+\ell,n}(\mathbf{y}_n^k)$ occasionally take very large values.

Occasional large values of the likelihood ratio $L_{n+\ell,n}(\mathbf{y})$ might arise when the sample $\mathbf{y}$ is generated within the tail of $\phi_n$. Indeed, Proposition 1 below shows that in at least one special case, $L_{n+\ell,n}$ has a heavy tail given any sampling trajectory $\mathcal{A}_n$. At first sight, this may appear to contradict Theorem 2, which states that given the posterior information $\mathcal{A}_{n+\ell}$ in iteration $n + \ell$, the change of measure estimator is bounded. But notice that in Theorem 2 we are conditioning on more information than in Proposition 1. In effect, Proposition 1 shows that given the posterior information $\mathcal{A}_n$ in iteration $n$, the change of measure estimator in iteration $n + \ell$ could have poor behavior, depending on the (random) samples that are used to update $\mathbf{f} \mid \mathcal{A}_n$ to $\mathbf{f} \mid \mathcal{A}_{n+\ell}$. Thus there is no contradiction between these two results. Proposition 1 shows that the change of measure estimator could exhibit volatile behavior when extreme values of the likelihood ratio arise at values that were
sampled in iteration $n$. However, we make no claim about how likely such values are to arise. Numerical experiments given later show that the change of measure estimator is indeed volatile.
Proposition 1. When $\Gamma$ is known and $r = 1$, given $\mathcal{A}_n$, $C_n = \sup_{\mathbf{y} \in \mathbb{R}^r} L_{n+1,n}(\mathbf{y})$ asymptotically (as $n \to \infty$) has the same distribution as $e^{\chi^2/2}$, where $\chi^2$ is a non-central chi-square random variable with 1 degree of freedom.

In fact, the proof of Proposition 1 in the online supplement also applies when $n$ is finite. In that case $\ln C_n$, conditional on $\mathcal{A}_n$, is a non-central chi-square random variable scaled by a constant of order $O(1/n)$ and shifted by another constant of order $O(1/n)$. The conclusion of Proposition 1 can be generalized to $r > 1$ when $\Gamma$ is diagonal, since then
$$\ln C_n = \ln\left( \sup_{\mathbf{y} \in \mathbb{R}^r} L_{n+1,n}(\mathbf{y}) \right) = \sup_{\mathbf{y} \in \mathbb{R}^r} \ln \prod_{i=1}^{r} L_{n+1,n}(y_i) = \sum_{i=1}^{r} \sup_{y_i \in \mathbb{R}} \ln L_{n+1,n}(y_i).$$
Proposition 1 then allows us to conclude that, conditional on $\mathcal{A}_n$, this is asymptotically conditionally distributed as $\chi^2_r/2$, where $\chi^2_r$ is a non-central chi-square random variable with $r$ degrees of freedom. Since the tail probability of $\chi^2_r/2$ increases with $r$, we expect this heavy-tail behavior to be more significant as $r$ increases, i.e., as the number of design points increases. We conjecture that the likelihood ratio is similarly heavy-tailed in the cases where $\Gamma$ is known but not necessarily diagonal and where $\Gamma$ is unknown.

In summary, conditional on $\mathcal{A}_{n+\ell}$, the estimator $\hat{p}_{n+\ell}$ is unbiased and has finite variance, but its distribution may be heavy tailed given $\mathcal{A}_n$ only, depending on the samples obtained in the update to $\mathbf{f} \mid \mathcal{A}_{n+\ell}$. Thus this estimator needs to be used with caution. We suggest that if the method is to be used, then one should do so with small $\ell$, e.g., $\ell < 5$, based on simulation experiments described later.
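The volatility is easy to reproduce. In the Python sketch below (hypothetical posterior parameters, known-variance case), the convexity indicators are replaced by the constant 1 so that the estimand is exactly 1; the likelihood-ratio weights nevertheless produce a noticeably variable estimator.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Hypothetical Gaussian posteriors at iterations n and n + l.
mu_n  = np.array([0.9, 0.1, 1.0]);  Lam_n  = 0.05 * np.eye(3)
mu_nl = np.array([1.0, 0.0, 1.1]);  Lam_nl = 0.03 * np.eye(3)

m = 10000
y = rng.multivariate_normal(mu_n, Lam_n, size=m)  # samples from iteration n
ind = np.ones(m)             # stand-in for the indicators 1{y_n^k in C}

# Likelihood ratio L_{n+l,n} = phi_{n+l}/phi_n at the old samples.
lr = (multivariate_normal.pdf(y, mean=mu_nl, cov=Lam_nl)
      / multivariate_normal.pdf(y, mean=mu_n, cov=Lam_n))

print("estimate:", np.mean(ind * lr))    # unbiased for 1
print("std dev :", np.std(ind * lr))     # large: heavy-tailed weights
```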
7.2. Acceptance-Rejection
The change of measure estimator reuses all the samples obtained in an earlier iteration by outputting a Monte Carlo estimator that scales each indicator $\{I_n^k = \mathbb{1}\{\mathbf{Y}_n^k \in C\} : k = 1, 2, \ldots, m\}$ by a likelihood ratio $L_{n+\ell,n}(\mathbf{Y}_n^k) = \phi_{n+\ell}(\mathbf{Y}_n^k)/\phi_n(\mathbf{Y}_n^k)$, where $\phi_n$ is the posterior density. An alternative is to reuse a subset of the samples from the previous iteration through acceptance-rejection.

Suppose that in iteration $n$, we have $m$ i.i.d. Monte Carlo samples $\{\mathbf{y}_n^k : k = 1, 2, \ldots, m\}$ from $\mathbf{f} \mid \mathcal{A}_n$, together with the indicators $\{I_n^k = \mathbb{1}\{\mathbf{y}_n^k \in C\} : k = 1, 2, \ldots, m\}$. Then, at iteration $n + 1$, the $k$-th sample $\mathbf{y}_n^k$ will be accepted (reused) with probability $L_{n+1,n}(\mathbf{y}_n^k)/c$, where $c \geq \sup\{L_{n+1,n}(\mathbf{y}) : \mathbf{y} \in \mathbb{R}^r\}$.
If the accepted indices are $A \subseteq \{1, 2, \ldots, m\}$, then $m - |A|$ additional samples can be generated from $\mathbf{f} \mid \mathcal{A}_{n+1}$ to ensure a total of $m$ samples. The estimator is then just the usual Monte Carlo estimator based on all $m$ samples, i.e.,
$$\hat{p}_{n+1} = \left( \sum_{k \in A} \mathbb{1}\{\mathbf{y}_n^k \in C\} + \sum_{k=1}^{m - |A|} \mathbb{1}\{\mathbf{y}_{n+1}^k \in C\} \right) \Big/ m.$$
When the sampling variance $\Gamma$ is known, optimization shows that $c$ is given by
$$\left\{ \frac{|\Lambda_n|}{|\Lambda_{n+1}|} \exp\left[ (\Lambda_{n+1}^{-1}\boldsymbol{\mu}_{n+1} - \Lambda_n^{-1}\boldsymbol{\mu}_n)^T\, \Gamma\, (\Lambda_{n+1}^{-1}\boldsymbol{\mu}_{n+1} - \Lambda_n^{-1}\boldsymbol{\mu}_n) + \boldsymbol{\mu}_n^T \Lambda_n^{-1} \boldsymbol{\mu}_n - \boldsymbol{\mu}_{n+1}^T \Lambda_{n+1}^{-1} \boldsymbol{\mu}_{n+1} \right] \right\}^{1/2},$$
where the parameters $\boldsymbol{\mu}_n, \Lambda_n, \boldsymbol{\mu}_{n+1}, \Lambda_{n+1}$ are defined as in Section 4. When $\Gamma$ is unknown, $c$ is the maximum of a ratio of polynomials and does not have a closed form, so we calculate it numerically.

The acceptance-rejection estimator is simply an average of i.i.d. samples, like the pure Monte Carlo estimator. The difference lies in how the samples are obtained. The probability of accepting a sample generated in iteration $n$ is $1/c$, so the efficiency of this method is related to the constant $c$. According to Proposition 1, the likelihood ratio $L_{n+1,n}(\mathbf{y})$ can take very large values, meaning that $c$ can often be large. When $c$ is large, very few of the earlier samples might be reused, so the majority of the $m$ samples needed in the $(n+1)$th iteration are new. This may lower the efficiency of the acceptance-rejection method.
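The sketch below (Python; hypothetical known-variance posteriors constructed so that $\Lambda_{n+1}^{-1} = \Lambda_n^{-1} + \Gamma^{-1}$ as in update (1)) computes the closed-form $c$ and performs the accept-reject reuse; it illustrates how a large $c$ translates into few reused samples.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
r = 3
Gamma = 0.05 * np.eye(r)                 # known sampling covariance
mu_n = np.array([0.9, 0.1, 1.0])
Lam_n = 0.04 * np.eye(r)

# One more observation updates the posterior via (1) with s = 1.
ybar = rng.multivariate_normal(mu_n, Lam_n + Gamma)
Lam_next = np.linalg.inv(np.linalg.inv(Lam_n) + np.linalg.inv(Gamma))
mu_next = Lam_next @ (np.linalg.solve(Lam_n, mu_n)
                      + np.linalg.solve(Gamma, ybar))

# Closed-form c = sup_y L_{n+1,n}(y) for the known-variance case.
v = np.linalg.solve(Lam_next, mu_next) - np.linalg.solve(Lam_n, mu_n)
expo = (v @ Gamma @ v + mu_n @ np.linalg.solve(Lam_n, mu_n)
        - mu_next @ np.linalg.solve(Lam_next, mu_next))
c = np.sqrt(np.linalg.det(Lam_n) / np.linalg.det(Lam_next) * np.exp(expo))

# Accept each old sample with probability L_{n+1,n}(y)/c; top up the rest.
m = 1000
y_old = rng.multivariate_normal(mu_n, Lam_n, size=m)
lr = (multivariate_normal.pdf(y_old, mean=mu_next, cov=Lam_next)
      / multivariate_normal.pdf(y_old, mean=mu_n, cov=Lam_n))
accepted = rng.uniform(size=m) < lr / c
y_new = rng.multivariate_normal(mu_next, Lam_next, size=m - accepted.sum())
print(f"c = {c:.2f}; reused {accepted.sum()} of {m} old samples")
```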
7.3. Conditional Monte Carlo
Denote by $S^{r-1}$ the upper hemisphere of the $(r-1)$-dimensional spherical shell, $\{\mathbf{z} \in \mathbb{R}^r : \|\mathbf{z}\| = 1, z_r \geq 0\}$. We can view a sample from the posterior distribution as consisting of both a direction $\mathbf{Z}$ chosen from $S^{r-1}$ and a step size $T$ taking both positive and negative values along that direction, along with the linear transformation to obtain the appropriate scale matrix and then a translation by the mean. We condition on the direction $\mathbf{Z}$, and integrate the posterior over the interval of step sizes $[t_{\min}, t_{\max}]$ that yield points inside the convexity cone $C$. Averaging the results over a number of uniformly generated directions gives the desired estimator.

For convenience, let $E_n(\cdot) = E(\cdot \mid \mathcal{A}_n)$ and $P_n(\cdot) = P(\cdot \mid \mathcal{A}_n)$. We write $\mathbf{X} = T\mathbf{Z}$, so that in the known-variance case $\mathbf{X} \sim N(\mathbf{0}, I)$, and in the unknown-variance case $\mathbf{X} \sim t_{\nu_n}(\mathbf{0}, I)$. Then
$$\begin{aligned} P(\mathbf{f} \in C \mid \mathcal{A}_n) &= E_n\left( \mathbb{1}\{\Lambda_n^{1/2} \mathbf{X} + \boldsymbol{\mu}_n \in C\} \right) = E_n\left( \mathbb{1}\{T \Lambda_n^{1/2} \mathbf{Z} + \boldsymbol{\mu}_n \in C\} \right), \quad \text{for } \mathbf{Z} \text{ uniform on } S^{r-1} \\ &= E_n\left( E_n\left( \mathbb{1}\{T \Lambda_n^{1/2} \mathbf{Z} + \boldsymbol{\mu}_n \in C\} \mid \mathbf{Z} \right) \right) = E_n\left( P_n\left( T \in [t_{\min}(\mathbf{Z}), t_{\max}(\mathbf{Z})] \mid \mathbf{Z} \right) \right) \\ &= E_n\left( F_{T|Z}(t_{\max}(\mathbf{Z})) - F_{T|Z}(t_{\min}(\mathbf{Z})) \right). \end{aligned}$$
Here $F_{T|Z}$ is the conditional distribution function of $T$ given $\mathcal{A}_n$ and $\mathbf{Z}$. (We shall see that $T$ is independent of $\mathbf{Z}$.) Thus, the posterior probability $P(\mathbf{f} \in C \mid \mathcal{A}_n)$ can be estimated using $F_{T|Z}$ and a way to calculate $t_{\max}(\mathbf{Z})$ and $t_{\min}(\mathbf{Z})$. Theorem 3 gives the former, and the linear programs LP($i$) below give the latter.

Theorem 3 (Distribution of $T \mid Z$). When the sampling variance $\Gamma$ is known, $F_{T|Z}(t) = (1 + \mathrm{sign}(t)\, F_{\chi^2_r}(t^2))/2$, where $F_{\chi^2_r}(\cdot)$ is the (cumulative) distribution function of a $\chi^2$ random variable with $r$ degrees of freedom. When $\Gamma$ is unknown, $F_{T|Z}(t) = (1 + \mathrm{sign}(t)\, F_{F(r,\nu_n)}(t^2/r))/2$, where $F_{F(r,\nu_n)}$ is the distribution function of the $F$ distribution with $r$ and $\nu_n$ degrees of freedom.

Proof Sketch. A detailed proof of Theorem 3, based on a change-of-variables technique, is provided in the online supplement. Here we give a short proof that provides richer insight into the result, but relies on a step that is essentially a consequence of the change-of-variables argument. We have
$$\begin{aligned} F_{T|Z}(t) = P_n(T \leq t \mid \mathbf{Z}) &= \begin{cases} P_n(T \leq 0 \mid \mathbf{Z}) + P_n(0 \leq T \leq t \mid \mathbf{Z}), & t \geq 0 \\ P_n(T \leq 0 \mid \mathbf{Z}) - P_n(0 \leq T \leq -t \mid \mathbf{Z}), & t < 0 \end{cases} \\ &= 1/2 + \mathrm{sign}(t)\, P(\|\mathbf{X}\| \leq |t| \mid \mathbf{Z})/2, \quad \text{for } \mathbf{X} = T\mathbf{Z} \\ &= 1/2 + \mathrm{sign}(t)\, P(\|\mathbf{X}\| \leq |t|)/2. \end{aligned} \tag{7}$$
The last step in (7) depends on the independence of $\|\mathbf{X}\|$ and $\mathbf{Z}$, as established in the proof in the online supplement. Intuitively, this result is a consequence of the structure of elliptical distributions, as discussed in, e.g., Joe (2014), in that such random vectors can be generated by independently generating the direction, scaling by a square root of the scale matrix, selecting the distance $T$ along the scaled direction independently of the direction, and finally adding on the mean.

When $\Gamma$ is known, $\mathbf{X} \sim N(\mathbf{0}, I)$, so $\|\mathbf{X}\|^2 \sim \chi^2_r$. When $\Gamma$ is unknown, $\mathbf{X} \sim t_{\nu_n}(\mathbf{0}, I) = \mathbf{N}/\sqrt{Y/\nu_n}$ for independent $\mathbf{N} \sim N(\mathbf{0}, I)$ and $Y \sim \chi^2_{\nu_n}$. Therefore
$$\|\mathbf{X}\|^2 = \frac{\mathbf{N}^T \mathbf{N}}{Y/\nu_n},$$
where $\mathbf{N}^T \mathbf{N} \sim \chi^2_r$, so $\|\mathbf{X}\|^2/r \sim F(r, \nu_n)$. $\square$

To find $t_{\min}(\mathbf{Z})$ and $t_{\max}(\mathbf{Z})$, we can solve linear programs with objectives minimizing or maximizing $t$, with decision variables $t \in \mathbb{R}$, $a \in \mathbb{R}^{r \times d}$, $\mathbf{b} \in \mathbb{R}^r$, and the constraints (LS), replacing $\mathbf{g}$ by $\boldsymbol{\mu}_n + t\, \Lambda_n^{1/2} \mathbf{Z}$:
$$\begin{aligned} t_{\min} = \min t \quad (t_{\max} = \max t) \quad \text{s.t.} \quad & \mathbf{a}_i^T \mathbf{x}_i + b_i = \mu_i + (\Lambda_n^{1/2} \mathbf{Z})_i\, t, && \text{for all } i \in \{1, \ldots, r\}, \\ & \mathbf{a}_i^T \mathbf{x}_j + b_i \leq \mu_j + (\Lambda_n^{1/2} \mathbf{Z})_j\, t, && \text{for all } i \in \{1, \ldots, r\} \text{ and } j \neq i,\ j \in \{1, \ldots, r\}. \end{aligned} \tag{LP}$$
The linear program (LP) can be decomposed into $r$ smaller LPs, with the constraints of LS($i$) and variables $t \in \mathbb{R}$, $\mathbf{a}_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$:
$$\begin{aligned} t_{\min}(i) = \min t \quad (t_{\max}(i) = \max t) \quad \text{s.t.} \quad & \mathbf{a}_i^T \mathbf{x}_i + b_i = \mu_i + (\Lambda_n^{1/2} \mathbf{Z})_i\, t, \\ & \mathbf{a}_i^T \mathbf{x}_j + b_i \leq \mu_j + (\Lambda_n^{1/2} \mathbf{Z})_j\, t, && \text{for all } j \neq i,\ j \in \{1, \ldots, r\}, \end{aligned} \tag{LP($i$)}$$
and then
$$t_{\min} = \max_{i = 1, 2, \ldots, r} t_{\min}(i), \quad \text{and} \quad t_{\max} = \min_{i = 1, 2, \ldots, r} t_{\max}(i).$$
This decomposition does not bring as much speed improvement as LS($i$) does, because all of the decomposed linear programs must be solved.

Now we have all the pieces needed for the conditional Monte Carlo method.
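A sketch of the step-size computation, again using Python and scipy.optimize.linprog on hypothetical inputs: each LP($i$) is solved twice (minimizing and maximizing $t$), unbounded LPs contribute infinite endpoints, and an infeasible LP($i$) yields an empty interval.

```python
import numpy as np
from scipy.optimize import linprog

def t_range(x, mu, LZ):
    """Combine the LP(i): return [t_min, t_max] such that mu + t*LZ is a
    convex vector, where LZ stands for Lambda_n^{1/2} z.  x: (r, d) array."""
    r, d = x.shape
    t_lo, t_hi = -np.inf, np.inf
    for i in range(r):
        others = [j for j in range(r) if j != i]
        # Variables (t, a_i, b_i):  a_i^T x_i + b_i - LZ_i t  = mu_i,
        #                           a_i^T x_j + b_i - LZ_j t <= mu_j, j != i.
        A_eq = np.concatenate([[-LZ[i]], x[i], [1.0]])[None, :]
        A_ub = np.hstack([-LZ[others][:, None], x[others], np.ones((r - 1, 1))])
        bounds = [(None, None)] * (d + 2)
        for sign in (1.0, -1.0):                # min t, then min -t (= max t)
            c = np.zeros(d + 2); c[0] = sign
            res = linprog(c, A_ub=A_ub, b_ub=mu[others], A_eq=A_eq,
                          b_eq=[mu[i]], bounds=bounds)
            if res.status == 2:                 # LP(i) infeasible for all t
                return 0.0, 0.0                 # empty interval
            if sign > 0:
                t_lo = max(t_lo, -np.inf if res.status == 3 else res.fun)
            else:
                t_hi = min(t_hi, np.inf if res.status == 3 else -res.fun)
    return t_lo, t_hi

x = np.array([[-1.0], [0.0], [1.0]])            # r = 3 points in d = 1
mu = np.array([1.0, 0.0, 1.0])                  # posterior mean (convex)
LZ = np.array([0.3, 0.5, 0.2])                  # a scaled direction
print(t_range(x, mu, LZ))                       # approx (-inf, 4.0)
```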
Algorithm 3 A conditional Monte Carlo estimator $\tilde{p}_n$ for $p_n = P(\mathbf{f} \in C \mid \mathcal{A}_n)$.
Require: The posterior distribution of $\mathbf{f} \mid \mathcal{A}_n$ obtained from Algorithm 1, with mean $\boldsymbol{\mu}_n$ and covariance $\Lambda_n$; the number of Monte Carlo samples $m$.
1: for $k = 1, \ldots, m$ do
2: Uniformly generate a vector $\mathbf{z}_k$ on the surface of a unit sphere (by generating a standard Gaussian and normalizing it to a unit vector).
3: Determine the integration boundaries $t_{\min}(\mathbf{z}_k)$ and $t_{\max}(\mathbf{z}_k)$.
4: Set $\tilde{P}_n(k) = F_{T|Z}(t_{\max}(\mathbf{z}_k)) - F_{T|Z}(t_{\min}(\mathbf{z}_k))$.
5: end for
6: Calculate the mean $\bar{p}_n^m$ and standard deviation $s_n^m$ of $(\tilde{P}_n(k) : k = 1, 2, \ldots, m)$.
7: return $\tilde{p}_n = \bar{p}_n^m$ as an estimator of $P(\mathbf{f} \in C \mid \mathcal{A}_n)$, along with the half-width $\tilde{h}_n = 1.96\, s_n^m/\sqrt{m}$ of a 95% confidence interval.
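Putting the pieces together, the sketch below implements Algorithm 3 in Python for the known-variance case, on hypothetical one-dimensional, equally spaced design points where the interval $[t_{\min}(\mathbf{z}), t_{\max}(\mathbf{z})]$ is available in closed form from second differences; in general it would be computed from the LP($i$) above.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)

# Posterior N(mu_n, Lam_n) at r equally spaced 1-d points, where convexity
# of a vector v is equivalent to nonnegative second differences D v >= 0.
r = 6
mu_n = np.linspace(-1.0, 1.0, r) ** 2 + 0.01 * rng.standard_normal(r)
Lam_n = 0.005 * np.eye(r)
L = np.linalg.cholesky(Lam_n)
D = np.diff(np.eye(r), n=2, axis=0)      # second-difference operator

def F_T(t):
    """F_{T|Z}(t) = (1 + sign(t) F_{chi2_r}(t^2)) / 2, known-variance case."""
    return 0.5 * (1.0 + np.sign(t) * chi2.cdf(t * t, df=r))

m = 1000
est = np.empty(m)
for k in range(m):
    z = rng.standard_normal(r)
    z /= np.linalg.norm(z)               # uniform direction on the sphere
    dmu, dz = D @ mu_n, D @ (L @ z)
    # mu_n + t L z is convex iff dmu + t dz >= 0: a t-interval [t_lo, t_hi].
    t_lo = np.max(-dmu[dz > 0] / dz[dz > 0], initial=-np.inf)
    t_hi = np.min(-dmu[dz < 0] / dz[dz < 0], initial=np.inf)
    est[k] = F_T(t_hi) - F_T(t_lo) if t_hi > t_lo else 0.0

print(est.mean(), 1.96 * est.std(ddof=1) / np.sqrt(m))
```

Because each replication returns a conditional probability in $[0, 1]$ rather than a 0-1 indicator, the replication variance is reduced, which is the smoothing effect referred to at the start of this section.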
Relative to the other variance reduction methods, conditional Monte Carlo takes much longer to produce an estimate in each iteration because it needs to solve the two linear programs (LP) and cannot "skip" any of them, as can be done when solving the decomposed feasibility problems LS($i$).
8. Numerical Results
In this section we show numerical results on some test functions, assuming the more realistic case where the sampling variance is unknown. To select the $r$ sample points for a test function in $d$ dimensions, we first sample $d + 1$ points uniformly at random within the (assumed compact) sample space $S$, and for each such random point, we generate a uniform random direction on the surface of the unit sphere. Each point-direction pair defines a line segment within $S$. We then sample 3 points uniformly at random on each line segment. This method generates $r = 3(d+1)$ sample points. We select the points to lie on line segments because doing so seems to improve the performance of the convexity test relative to just sampling points uniformly within $S$. We have used $3(d+1)$ sample points partly to keep the computation minimal, thereby enabling us to relatively easily explore the behavior of the algorithms and estimators in this section. In practice, one would likely use more points, though it is unclear exactly how many points should be chosen. The number of points is likely related to how certain one wishes to be about convexity or the lack thereof. In each iteration of the sequential algorithm, we use $m = 100$ Monte Carlo samples from the posterior distribution to estimate a 95% confidence interval for $p_n$. With each of our estimators, one can easily adjust the sample size $m$ to achieve a desired accuracy in the confidence interval widths of the estimators of $p_n$; using $m = 100$ gave reasonable results in our experiments. We discuss the choice of $n$ in Section 9.
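The point-selection scheme can be sketched as follows in Python; the box bounds and the clipping of each random line to the box are our assumptions about details not fully specified above.

```python
import numpy as np

rng = np.random.default_rng(7)

def design_points(d, rng, lo=-1.0, hi=1.0):
    """r = 3(d+1) design points: d+1 random anchor points in [lo, hi]^d,
    a uniform random direction through each, the line clipped to the box,
    and 3 points sampled uniformly on each resulting segment."""
    pts = []
    for _ in range(d + 1):
        p = rng.uniform(lo, hi, size=d)
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                    # uniform direction
        with np.errstate(divide="ignore"):
            t1, t2 = (lo - p) / u, (hi - p) / u   # box-crossing step sizes
        t_lo = np.max(np.minimum(t1, t2))         # segment staying in box
        t_hi = np.min(np.maximum(t1, t2))
        pts.extend(p + t * u for t in rng.uniform(t_lo, t_hi, size=3))
    return np.array(pts)

print(design_points(d=2, rng=rng).shape)          # (9, 2): r = 3(d+1)
```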
Our procedure is implemented in Matlab and freely available in an online repository (Jian 2017). The repository contains two versions. The first version uses only a standard Matlab installation, solving linear programs using the built-in linprog function (Mathworks 2016). The second version requires the installation of the packages CVX (Grant and Boyd 2014, 2008), which is a package for specifying and solving convex programs, and Gurobi (Gurobi Optimization 2016), a commercial optimization solver.
We suggest the second version if a user has the requisite licenses, since Gurobi seems more robust than linprog. For example, we have found cases where linprog was not able to find a feasible solution, whereas Gurobi did. However, because of the overhead of CVX in setting up the linear program in a format that Gurobi is able to read, linprog is usually faster when the problem dimension is low. When the problem dimension is high, the inefficiency of the interior-point method used by linprog outweighs the overhead of CVX. Figure 1 compares solving times in seconds for these two solvers on the linear programs (LP), tested with the sample function $f(x) = \|x\|^2$, $x \in [-1, 1]^d$, for different values of the problem dimension $d$.
Figure 1 The solving time vs. test-function dimension for the two linear programs in the conditional Monte Carlo method using Gurobi and linprog. Gurobi is faster when the dimension $d$ exceeds 10, where $r = 33$ sample points are used. All test cases are run on a desktop with a 4-core Intel Core i7-3770 3.40 GHz processor with 16 GB memory, running Matlab R2013a on 64-bit Windows 7.
We use $f(x) = \|x\|^2$ as the test function in this section. First, we compare vanilla Monte Carlo with the variance reduction methods of Section 7, showing 95% confidence intervals for the estimated probability of convexity, the time per iteration, and the efficiency per iteration. Here the efficiency of a Monte Carlo estimator $\tilde{p}_n$ is defined as the inverse of the product of the computational time per replication and the variance of one replication; see, e.g., Glynn and Whitt (1992). We first take the dimension $d = 1$, on the sample space $[-1, 1]$. The covariance matrix $\Gamma$ has $0.01$ on the diagonal, and we use a Gaussian kernel of $10^{-2} \exp\{-\|x_i - x_j\|^2/2\}$ for the off-diagonal components. Hence the noise at different points is positively correlated, and the correlation is stronger between closer points (Rasmussen and Williams 2005).
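In code, this covariance is a scaled squared-exponential kernel matrix over the design points; a small Python sketch with hypothetical points:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 6)            # 1-d design points (hypothetical)
sqdist = (x[:, None] - x[None, :]) ** 2
Gamma = 1e-2 * np.exp(-0.5 * sqdist)     # 0.01 diagonal, Gaussian kernel
print(np.round(Gamma, 4))                # correlation decays with distance
```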
We use linprog instead of Gurobi to avoid the time overhead incurred by CVX, since the dimension is $d = 1$. For the change of measure method, a new set of samples is obtained every iteration for the first 30 iterations, and every 5 iterations thereafter. For the acceptance-rejection method we start to reuse samples only after the first 30 iterations. Thus the first 30 iterations of these two methods are exactly the same as vanilla Monte Carlo.

Figure 2 Comparisons of the estimated probability of convexity, the iteration time (in seconds), and the log (base 10) efficiency (left to right) of the vanilla Monte Carlo, change of measure, acceptance-rejection, and conditional Monte Carlo (top to bottom) methods applied to a one-dimensional strictly convex function.
Figure 2 shows that the estimated probability of convexity increases to 1 for all methods. Later in the run, the change of measure method can return greater-than-one estimates due to the poor behavior of the likelihood ratio discussed in Section 7.1. Among all methods, the conditional Monte Carlo method has the smallest variance but takes the longest time to compute. (This difference in computational time becomes more significant for higher-dimensional test functions, as we see below when we experiment with a 30-dimensional function.) For the one-dimensional convex function here, taking both computational time and variance into consideration, we observe that conditional Monte Carlo has the highest overall efficiency. The efficiency of vanilla Monte Carlo is the lowest. The efficiency plots occasionally break when the sample variance of the estimator is 0, which occurs when all of the linear systems (LS) are feasible. This also happens for the change of measure method because that method coincides with vanilla Monte Carlo every 5 iterations after the first 30. The efficiency of the change of measure method and the acceptance-rejection method both increase whenever they reuse the samples from a previous iteration, because the linear feasibility problems need not be solved. The change of measure method occasionally has a very large efficiency because of the small sample variance of the estimate. This happens in later iterations when the posterior density is very concentrated. In this case all the Monte Carlo samples are close to the mean, giving almost identical posterior densities and similar likelihood ratios. When all reused samples are convex (corresponding to the iterations where the vanilla Monte Carlo method has infinite estimated efficiency), the change of measure estimator has almost zero variance. However, due to the heavy-tail behavior of the likelihood ratio, it is risky to trust the change of measure estimator values, as we see when the change of measure method estimates a probability greater than 1. The acceptance-rejection estimator has slightly lower estimated efficiency, but the estimator is more trustworthy in that it is statistically identical to vanilla Monte Carlo.

Consider now the 30-dimensional test function $f(x) = \|x\|^2$, $x \in [-1, 1]^{30}$, with $r = 3(d+1) = 93$ sample points. The covariance matrix $\Gamma$ has diagonal entries $\Gamma_{ii} = 0.01 f(x_i)$, and off-diagonal entries $\Gamma_{ij} = 10^{-2} \exp\{-\|x_i - x_j\|^2/2\}\, f(x_i) f(x_j)$. Hence the variance depends on the function value, and there is also modest positive correlation between any two design points, depending on the distance between them. As before, for the change of measure method, a new set of samples is obtained every iteration for the first 30 iterations, and every 5 iterations thereafter, and for the acceptance-rejection method we start to
reuse samples only after the first 30 iterations. We find that the change of measure and acceptance-rejection methods do not work very well on this example. Indeed, according to Proposition 1, the heavy-tail behavior of the likelihood ratio becomes more severe with more design points. With the likelihood ratio often taking very large values, the change of measure estimates evaluate to large values with wide confidence intervals, as shown in Figure 3 (notice the y-axis scale). For the same reason, the acceptance-rejection method reduces to vanilla Monte Carlo by rejecting almost all previous samples, so we omit that method from the results in Figure 3. In early iterations, conditional Monte Carlo takes more than 6 minutes to generate an estimate using CVX with Gurobi (linprog takes over 1 hour), and the iteration efficiency is around 0.20. In comparison, the vanilla Monte Carlo method takes only 80 seconds per iteration at the beginning of the run by solving the decomposed systems LS($i$), giving around the same level of iteration efficiency. However, towards the end of the 100 iterations, conditional Monte Carlo is able to reduce the variance of the estimated probability so well that its efficiency improves beyond that of vanilla Monte Carlo. Therefore we recommend using conditional Monte Carlo (with CVX + Gurobi) if one can afford the running time, and vanilla Monte Carlo otherwise or when CVX is not installed.
Consider the function $f(x) = -\|x\|^2$, $x \in [-1, 1]^d$. In order to make the problem "harder," we choose the covariance matrix $\Gamma$ to be a multiple of the identity that grows with the dimension $d$. With such a large sampling variance, in early iterations any function will appear to be nonconvex, while after many iterations, the nonconvexities of the (true) function dominate and are detected.
Figure 3 The estimated probability of convexity, the iteration time (in seconds), and the log (base 10) efficiency (left to right) of the vanilla Monte Carlo, change of measure, and conditional Monte Carlo methods (top to bottom) applied to a 30-dimensional strictly convex function.

Figure 4 The estimated probability of convexity for the simple strictly non-convex function in dimensions 3, 5, and 10 (from left to right). The estimates when the function is one-dimensional are not shown because they were effectively identically 0.
Linear functions are convex but lie on the boundary of the cone $C$.
Theorem 1 does not inform us of the likely behavior of our algorithm in this case, because the event that the function lies on the boundary of $C$ has measure 0 in the context of that result. Thus the posterior probability of convexity could converge to any number between 0 and 1, or not converge at all. Here we use a one-dimensional linear function $f(x) = 0$, $x \in [-1, 1]$, with a covariance matrix $\Gamma$ that is constant on the diagonal and 0 on the off-diagonal.

Figure 5 The estimated probability of convexity for a 1-dimensional linear function. The mean does not appear to converge.
As shown in Figure 5, the estimated probability does not converge to 0 or 1, but stays close to 0. When we increase the dimension, e.g., to 5, and keep the sampling variance the same, the estimated probabilities stay at 0 throughout the first 100 iterations. Changing the sampling variance does not change the qualitative nature of the results, because when the function is zero-valued the sampling variance only changes the "scale" of the observations. These results are perhaps to be expected: a linear function appears convex only when the function noise at all design points "happens to" form a strictly convex function.
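The event being estimated in each Monte Carlo sample is whether some convex function passes through a sampled vector of function values at the design points. One standard way to pose that check is as a linear feasibility problem over subgradients, sketched below with scipy.optimize.linprog. This formulation is given for illustration; it is not necessarily the decomposed formulation LS(i) solved in our experiments.

import numpy as np
from scipy.optimize import linprog

def convex_interpolant_exists(X, y):
    # A convex function through (x_i, y_i) exists iff there are subgradients
    # g_1, ..., g_r with y_j >= y_i + g_i . (x_j - x_i) for all ordered pairs
    # (i, j), which is a linear feasibility problem in the g_i.
    r, d = X.shape
    rows, rhs = [], []
    for i in range(r):
        for j in range(r):
            if i != j:
                row = np.zeros(r * d)
                row[i * d:(i + 1) * d] = X[j] - X[i]  # coefficients of g_i
                rows.append(row)
                rhs.append(y[j] - y[i])
    res = linprog(np.zeros(r * d), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * (r * d), method="highs")
    return res.status == 0  # status 0: feasible; status 2: infeasible

Averaging this indicator over posterior draws of the function values yields the vanilla Monte Carlo estimate of the posterior probability of convexity.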
Finally, we have also tested our algorithm on a more realistic example similar to the "Ambulances in a Square" problem from SimOpt (Pasupathy and Henderson 2007). In this problem, patient calls arrive in a one-kilometer unit square [0, 1]^2 according to a Poisson process at a constant rate of 1 call every 2 hours. The (x, y) locations of the calls are i.i.d. and distributed with a density proportional to 1.5 − (|x − 0.5| + |y − 0.5|). Upon receiving a call, the nearest ambulance is dispatched, traveling to the scene at a constant speed of 60 km/h. After arriving at the scene, the ambulance spends a Gamma-distributed scene time with mean 45 minutes and standard deviation 15 minutes, then returns to its base at a speed of 40 km/h if no other call is received. We are interested in the mean response time (the time from when a call is received until an ambulance arrives at the call location) as a function of the location(s) of the ambulance base(s).

We sampled the base locations of the ambulances along (4 × the number of bases + 1) random lines in the unit square, with 3 points sampled on each line. Each base has two coordinates, so this is equivalent to 3(2d + 1) design points, where d is the dimension of the sample space. We use more design points than in our previous test cases because we wanted to try more points (and consequently more computation) on a realistic case. As in our other experiments, we obtain a sample of the mean response time for each set of sampled base locations by running the simulation until 360 calls receive a response (approximately 30 days). The mean response times of the sampled base locations are evaluated using common random numbers, which compare the locations using the exact same random call arrivals and scene times. The convexity of the mean response time as a function of the ambulance base locations is tested with one, two, and three ambulance bases, using the conditional Monte Carlo method. The estimated probabilities vs. iteration are plotted in Figure 6.

Figure 6 The estimated probability of convexity for the mean response time as a function of the base locations, when the number of bases is one, two, and three.
It seems that the mean response time is convex as a function of the base location when there is only one base, while it is not convex for more than one ambulance base. This agrees with our intuition that the location of a single ambulance base should have one global minimizer in the unit square. By plotting the posterior mean function, we found that the minimizer is located near the point [0.5, 0.5], the mode of the call-location density. When there is more than one ambulance base, the objective does not have a single minimizer, due to symmetry and the interactions between bases.
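To indicate the structure of the simulation, here is a deliberately simplified single-base version of the model: one ambulance, first-come-first-served dispatch, Euclidean travel, and the ambulance always returns to base before the next dispatch. These are simplifications for illustration; the full model in our experiments relaxes them.

import numpy as np

rng = np.random.default_rng(2016)

def sample_call_location():
    # Rejection sampler for the call-location density proportional to
    # 1.5 - (|x - 0.5| + |y - 0.5|) on the unit square.
    while True:
        x, y = rng.uniform(0.0, 1.0, size=2)
        if rng.uniform(0.0, 1.5) <= 1.5 - (abs(x - 0.5) + abs(y - 0.5)):
            return np.array([x, y])

def mean_response_time(base, n_calls=360):
    t = 0.0        # simulation clock, in hours
    free_at = 0.0  # time at which the ambulance is next idle at its base
    total_response = 0.0
    for _ in range(n_calls):
        t += rng.exponential(2.0)                        # one call every 2 hours on average
        loc = sample_call_location()
        travel_out = np.linalg.norm(loc - base) / 60.0   # outbound at 60 km/h in a 1 km square
        dispatch = max(t, free_at)                       # queue if the ambulance is busy
        total_response += (dispatch - t) + travel_out
        scene = rng.gamma(shape=9.0, scale=1.0 / 12.0)   # mean 0.75 h, sd 0.25 h
        travel_back = np.linalg.norm(loc - base) / 40.0  # return at 40 km/h
        free_at = dispatch + travel_out + scene + travel_back
    return total_response / n_calls

print(mean_response_time(np.array([0.5, 0.5])))  # mean response time for a central base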
9. Conclusion
Given a function that can be observed at a finite number of points in the presence of noise, we have suggested a sequential algorithm to estimate the posterior probability that the function is convex. The method models the function values on a fixed set of design points using a Bayesian conjugate model, and estimates the probability of convexity by Monte Carlo simulation, using samples of the function values from the posterior distribution. This Bayesian procedure gives sequential estimates of function convexity. It is useful when a function is expensive to evaluate, e.g., the output of a large simulation, or when its values can only be obtained on a constrained set of points, e.g., a function defined on a discrete domain, and is primarily an exploratory tool to help an analyst develop an understanding of the geometry of an optimization problem.

To improve the efficiency of our algorithm we introduced three variance reduction methods – change of measure, acceptance-rejection, and conditional Monte Carlo. The first two methods reuse samples obtained in an earlier iteration to calculate an estimator in the current iteration. However, they both rely on the likelihood ratio of normal or Student-t posterior densities, which we prove can take extreme values due to its heavy-tail behavior. In our computational results, we observe that the change of measure method may give poor (e.g., greater than 1) estimates of the probability, and the acceptance-rejection method rejects most of the earlier samples and reduces to vanilla Monte Carlo when the number of design points is large. Finally, the conditional Monte Carlo method takes the longest time to compute but is the most effective in variance reduction, giving the highest efficiency among all methods. We recommend using it with CVX and Gurobi, especially for high-dimensional functions, to ensure reasonable computational time.

How should one choose n, the simulation runlength at each of the r design points? In our experiments, we increased n until the confidence intervals for p_n appeared to remain near 1 (suggesting convexity) or 0 (non-convexity). We suggest this exploratory procedure as a reasonable rule of thumb, recognizing that it is only a heuristic. More advanced stopping rules that offer some kind of overall statistical guarantee might be possible, and might even lead to the development of a hypothesis-testing procedure. However, such a goal is not in line with the exploratory aims of the present paper, and developing such rules would likely require considerable effort that we view as beyond the scope of this paper.

In our experimentation, we also found the location of the design points to be important when exploring an unknown function. Although we only determine the convexity of a vector of function values at the pre-chosen design points, it is helpful if the points are representative of the sample space S. Without knowing anything about the underlying function, a good starting point is to choose the design points so that they span the entire space. As we gain better knowledge through the sequential procedure, it is possible to expand the set of design points dynamically, and consequently to have different sample sizes at each point. The method of choosing where and how much to sample in each iteration is left as an open problem.

A package containing the main algorithm and all variance reduction alternatives is available on Github (Jian 2017).
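As a compact illustration of the overall estimator described above, the following sketch draws function values from the posterior and checks each draw for a convex interpolant, reusing convex_interpolant_exists from the earlier sketch. A multivariate normal posterior is assumed here (the Student-t case is analogous), and the names are illustrative rather than those of the package.

import numpy as np

def estimate_prob_convex(X, mu, Sigma, n_samples=1000, rng=None):
    # Vanilla Monte Carlo: sample function values at the design points X
    # from the posterior (mean mu, covariance Sigma), and average the
    # indicator that a convex interpolant exists.
    rng = rng or np.random.default_rng()
    draws = rng.multivariate_normal(mu, Sigma, size=n_samples)
    hits = sum(convex_interpolant_exists(X, y) for y in draws)
    return hits / n_samples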
Acknowledgments

We thank the editorial team for very helpful comments. This work was partially supported by National Science Foundation grants CMMI-1200315 and CMMI-1537394, and Army Research Office grant W911NF-17-1-0094.
References
Abrevaya, Jason, Wei Jiang. 2005. A nonparametric approach to measuring and testing curvature. Journal of Business & Economic Statistics. URL http://EconPapers.repec.org/RePEc:bes:jnlbes:v:23:y:2005:p:1-19.
Allon, Gad, Michael Beenstock, Steven Hackman, Ury Passy, Alexander Shapiro. 2007. Nonparametric estimation of concave production technologies by entropic methods. Journal of Applied Econometrics. URL http://dx.doi.org/10.1002/jae.918.
Atlason, J., M. A. Epelman, S. G. Henderson. 2004. Call center staffing with simulation and cutting plane methods. Annals of Operations Research.
Bernardo, J. M., A. F. M. Smith. 1994. Bayesian Theory. John Wiley & Sons, Inc., 240–376. doi:10.1002/9780470316870.ch5. URL http://dx.doi.org/10.1002/9780470316870.ch5.
Chen, X., B. E. Ankenman, B. L. Nelson. 2012. The effects of common random numbers on stochastic kriging metamodels. ACM TOMACS, Article 7.
DeGroot, M. H. 1970. Optimal Statistical Decisions. McGraw-Hill, New York, NY. URL http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+021834997&sourceid=fbw_bibsonomy.
Diack, C. A. T., C. Thomas-Agnan. 1998. A nonparametric test of the non-convexity of regression. Nonparametric Statistics.
Proceedings of the 2015 Winter Simulation Conference. WSC '15, IEEE Press, Piscataway, NJ, USA, 403–413. URL http://dl.acm.org/citation.cfm?id=2888619.2888663.
Gelman, Andrew, John B. Carlin, Hal S. Stern, Donald B. Rubin. 2003. Bayesian Data Analysis. 2nd ed. Chapman & Hall/CRC Texts in Statistical Science, Chapman and Hall/CRC.
Glasserman, Paul. 2004. Monte Carlo Methods in Financial Engineering. Springer, New York.
Glynn, Peter W., Gerd Infanger. 2013. Simulation-based confidence bounds for two-stage stochastic programs. Mathematical Programming. URL http://dx.doi.org/10.1007/s10107-012-0621-0.
Glynn, Peter W., Ward Whitt. 1992. The asymptotic efficiency of simulation estimators. Operations Research. URL http://dx.doi.org/10.1287/opre.40.3.505.
Golub, Gene H., Charles F. Van Loan. 1996. Matrix Computations. 3rd ed. Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press.
Grant, Michael, Stephen Boyd. 2008. Graph implementations for nonsmooth convex programs. V. Blondel, S. Boyd, H. Kimura, eds., Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, Springer-Verlag Limited, 95–110. URL http://stanford.edu/~boyd/graph_dcp.html.
Grant, Michael, Stephen Boyd. 2014. CVX: Matlab software for disciplined convex programming, version 2.1. URL http://cvxr.com/cvx.
Gurobi Optimization, Inc. 2016. Gurobi optimizer reference manual.
Hannah, L. A., D. B. Dunson. 2011. Multivariate convex regression with adaptive partitioning. ArXiv e-prints.
Jian, N. 2017. Matlab package for convexity detection. Created Dec. 22, 2016. URL https://github.com/njian/convexity.
Jian, N., S. G. Henderson, S. R. Hunter. 2014. Sequential detection of convexity from noisy function evaluations. A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, J. A. Miller, eds., Proceedings of the 2014 Winter Simulation Conference. Institute of Electrical and Electronics Engineers, Inc., Piscataway, NJ, 3892–3903. doi:10.1109/WSC.2014.7020215.
Joe, Harry. 2014. Dependence Modeling with Copulas. Chapman and Hall/CRC, New York.
Judge, G. G., T. Takayama. 1966. Inequality restrictions in regression analysis. Journal of the American Statistical Association, 166–181.
Juditsky, A., A. Nemirovski. 2002. On nonparametric tests of positivity/monotonicity/convexity. The Annals of Statistics.
Kaplan, W. 1991. Advanced Calculus. 4th ed. Addison Wesley Publishing Company.
Lau, Lawrence J. 1978. Testing and imposing monotonicity, convexity, and quasi-convexity constraints.
Operations Research. URL http://dblp.uni-trier.de/db/journals/ior/ior60.html.
Mathworks. 2016. Documentation for linprog. Accessed Feb. 17, 2017.
Meyer, Mary C. 2012. Constrained penalized splines. Canadian Journal of Statistics. URL http://dx.doi.org/10.1002/cjs.10137.
Murty, K. G. 1988. Linear Complementarity, Linear and Nonlinear Programming. Heldermann Verlag, Berlin.
Nesterov, Yurii. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic.
Pasupathy, Raghu, Shane G. Henderson. 2007. Ambulance bases. Accessed May 15, 2014. URL http://simopt.org/wiki/index.php?title=Ambulances_in_a_square.
Rasmussen, Carl Edward, Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Rubinstein, R. Y., A. Shapiro. 1990. Optimization of static simulation models by the score function method. Mathematics and Computers in Simulation.
Silvapulle, Mervyn J., Pranab K. Sen. 2001. Constrained Statistical Inference: Order, Inequality, and Shape Restrictions, chap. 3. John Wiley & Sons, Inc., 59–141. doi:10.1002/9781118165614.ch3. URL http://dx.doi.org/10.1002/9781118165614.ch3.
Szechtman, Roberto, Enver Yücesan. 2016. A Bayesian approach to feasibility determination. Proceedings of the 2016 Winter Simulation Conference. WSC '16, IEEE Press, Piscataway, NJ, USA, 782–790. URL http://dl.acm.org/citation.cfm?id=3042094.3042203.
Vogel, S. 1988. Stability results for stochastic programming problems. Optimization.
Canadian Journal of Statistics. URL http://dx.doi.org/10.1002/cjs.10094.
Williams, D. 1991. Probability with Martingales. Cambridge Mathematical Textbooks, Cambridge University Press. URL http://books.google.com/books?id=RnOJeRpk0SEC.