Sample Efficient Graph-Based Optimization with Noisy Observations
Tan Nguyen (Queensland University of Technology), Ali Shameli (Stanford University), Yasin Abbasi-Yadkori (Adobe Research), Anup Rao (Adobe Research), Branislav Kveton (Google Research)
Abstract
We study the sample complexity of optimizing "hill-climbing friendly" functions defined on a graph under noisy observations. We define a notion of convexity, and we show that a variant of best-arm identification can find a near-optimal solution after a small number of queries that is independent of the size of the graph. For functions that have local minima and are nearly convex, we show a sample complexity bound for classical simulated annealing under noisy observations. We show the effectiveness of the greedy algorithm with restarts and of simulated annealing on problems of graph-based nearest neighbor classification as well as a web document re-ranking application.
Note: The first version of this paper appeared in AISTATS 2019. Thanks to community feedback, some typos and a minor issue have been identified. These are fixed in this updated version.
Stochastic optimization of a function defined on a large finite set frequently arises in many practical problems. Instances of this problem include finding the most attractive design for a webpage, the node with maximum influence in a social network, etc. There are a number of approaches to this problem. At one extreme, we can use global optimization methods such as simulated annealing, genetic algorithms, or cross-entropy methods (Rubinstein, 1997, Robert and Casella, 1999). Although limited theoretical performance guarantees are available, these methods perform well in a number of practical applications. At the other extreme, we can use
the best-arm identification algorithms from the bandit literature or hypothesis testing from the statistics community (Mannor and Tsitsiklis, 2004, Even-Dar et al., 2006, Audibert and Bubeck, 2010). We have stronger sample complexity results for this family of algorithms. The sample complexity states that, with high probability, the algorithm will return a near-optimal solution after a small number of observations, and this number typically grows polynomially with the size of the decision set. These methods, however, often perform poorly when the size of the decision set is large. In such problems, it is important to exploit the structure of the specific problem to speed up the optimization. If appropriate features are available and it can be assumed that the objective function is linear in these features, then algorithms designed for bandit linear optimization are applicable (Auer, 2002). More generally, kernel-based or neural network based bandit algorithms are available for problems with non-linear objective functions (Srinivas et al., 2010). We are particularly interested in problems where the item similarity is captured by a graph structure. In general, and with no further conditions, we cannot hope to show non-trivial sample complexity rates. We observe that in many real-world applications, the objective function is easy and hill-climbing friendly, in the sense that from many nodes there exists a monotonic path to the global minimum. We make this property explicit by defining a notion of convexity for functions defined on graphs. Under this condition, we show that a hill-climbing procedure that uses a variant of best-arm identification as a sub-routine has a small sample complexity that is independent of the size of the graph.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).
In the presence of local minima, this greedy procedure might require many restarts, and might not be efficient in practice. Simulated annealing is commonly used in practice for such problems. We also define a notion of nearly convex functions that allows for the existence of shallow local minima. We show that for nearly convex functions, and using an appropriate estimation of function values, the classical simulated annealing procedure finds near-optimal nodes after a small number of function evaluations.

While the asymptotic convergence of simulated annealing is studied extensively in the literature, there are only a few finite-time convergence rates for this important algorithm. Sreenivas Pydi et al. (2018) show finite-time convergence rates for the algorithm, but they consider only deterministic functions, and their rates scale with the size of the graph. These results, which are obtained under very general conditions, do not quite explain the success of the simulated annealing algorithm in large-scale problems. In practice, simulated annealing finds a near-optimal solution with a much smaller number of function evaluations. Our bounds are in terms of the convexity of the function, and the rates do not scale with the size of the graph. Additionally, our results hold in the noisy setting. Bouttier and Gavra (2017) show convergence rates for simulated annealing applied to noisy functions. Their results, however, appear to have several gaps.

A best-arm identification algorithm from the bandit literature can find a near-optimal node, and the time and sample complexity is linear in the number of nodes (Mannor and Tsitsiklis, 2004, Even-Dar et al., 2006, Audibert and Bubeck, 2010, Jamieson and Nowak, 2014, Kaufmann et al., 2016). Such complexity is not acceptable when the size of the graph is very large.
We are interested in designing and analyzing algorithms that find a near-optimal node, and whose sample and computational complexity is sublinear in the number of nodes.

Bandit algorithms for graph functions are studied in Valko et al. (2014). The sample complexity of these algorithms scales with the smoothness of the function, but the computational complexity scales linearly with the number of nodes. We are interested in problems where the graph is very large, and we might have access to only local information. As we will see in the experiments, the Spectral Bandit algorithm of Valko et al. (2014) is not applicable to problems with large graphs. Further, we study a different notion of regularity that is inspired by convexity in continuous optimization. Bandit problems for large action sets are also studied under a number of different notions of regularity (Bubeck et al., 2009, Abbasi-Yadkori et al., 2011, Carpentier and Valko, 2015, Grill et al., 2015).
We cannot hope to achieve a sublinear rate without further conditions on the function. We define a notion of strong convexity for functions defined on a graph. For strongly convex functions, we design an algorithm, called Explore-Descend, to find the optimal node with guaranteed error probability. The Explore-Descend algorithm uses a best-arm identification algorithm as a submodule. The best-arm identification problem is the problem of finding the optimal action given a sampling budget. For nearly convex functions, we show that the classical simulated annealing algorithm finds the global minimum in a reasonable time. For convex functions, Explore-Descend has lower sample complexity, but simulated annealing can handle non-convex functions.

We study the empirical performance of Explore-Descend and simulated annealing in two applications. First, we consider the problem of content optimization in a digital marketing application. In each round, a user visits a webpage that is composed of a number of components. The learning agent chooses the contents of these components and shows the resulting page to the user. The user feedback, in the form of click/no click or conversion/no conversion, is recorded and used to update the decision making policy. The objective is to return a page configuration with a near-optimal click-through rate. We would like to find such a configuration as quickly as possible, with a small number of user interactions. In this problem, each page configuration is a node of the graph, and two nodes are connected if the corresponding page configurations differ in only one component.

Our second application is the problem of nearest neighbor search and classification. Given a set of points $\mathcal{V} = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^d$, a query point $y \in \mathbb{R}^d$, and a distance function $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we are interested in solving $\mathrm{argmin}_{x \in \mathcal{V}} \ell(x, y)$. A trivial solution to this problem is to examine all points in $\mathcal{V}$ and return the point with the smallest distance value. The computational complexity of this method is $O(n)$, which is not practical in large-scale problems. An approximate nearest neighbor method returns an approximate solution in sublinear time. A class of approximate nearest neighbor methods that is particularly suitable for big-data applications is graph-based nearest neighbor search (Arya and Mount, 1993). In a graph-based search, we construct a proximity graph in the training phase, and perform a hill-climbing search in the test phase. To improve the performance, we perform an additional smoothing procedure that replaces the value of a node by the average function values in a vicinity of the node.
In practice, these average values are estimated by performing random walks, and hence the problem is a graph optimization problem with noisy observations. We show that the proposed graph optimization technique outperforms popular nearest neighbor search methods such as KDTree and LSH in two image classification problems. Interestingly, and compared to these methods, the computational complexity of the proposed technique appears to be less sensitive to the size of the training data. This property is particularly appealing in big data applications.

We use $[K]$ to denote the set $\{1, 2, \dots, K\}$. Let $G = (\mathcal{V}, \mathcal{E})$ be a graph with $n$ nodes $\mathcal{V} = \{1, \dots, n\}$ and edge set $\mathcal{E}$. Let $f : \mathcal{V} \to [0, 1]$ be an unknown function defined on $\mathcal{V}$. Let $x^* = \mathrm{argmin}_{x \in \mathcal{V}} f(x)$ be the global minimizer of $f$. The goal is to find a node $x$ with small loss $f(x) - f(x^*)$. In this paper, we study the problem where the evaluation of $f$ is noisy, such that we can only observe $g(x; \eta) = f(x) + \eta$, where $\eta$ is a zero-mean $R$-sub-Gaussian random variable, meaning that for any $\lambda \in \mathbb{R}$, $\mathbb{E}(e^{\lambda \eta}) \le e^{\lambda^2 R^2 / 2}$.

Let $\mathcal{N}_x$ denote the set of neighbors of node $x \in \mathcal{V}$, and let $\overline{\mathcal{N}}_x = \mathcal{N}_x \cup \{x\}$. For simplicity, we assume all nodes have the same degree, and we let $d = |\mathcal{N}_x|$ denote the number of neighbors. We sometimes write $f_x$ to denote the function value $f(x)$. For two nodes $x, y \in \mathcal{V}$, we use $\mathcal{P}_{x,y}$ to denote the set of all paths from $x$ to $y$ in the graph. We use $\mathcal{P}_x$ to denote the set of all paths starting from node $x$. We use $\ell(p)$ to denote the length of a path $p$.

The general discrete optimization problem is hard for an unrestricted function $f$ and graph $G$. As such, we study a restricted class of problems that allows for efficient algorithms. Let $\Delta_{x,z} = f_x - f_z$ be the amount of improvement if we move from node $x$ to the neighbor $z \in \mathcal{N}_x$. We say a path $p = \{x_0 \to x_1 \to \cdots \to x_k\}$ is $m$-strongly convex if

$$\Delta_{x_{i-1},x_i} - \Delta_{x_i,x_{i+1}} \ge m\, \Delta_{x_i,x_{i+1}} > 0, \quad \text{for all } i = 1, \dots, k-1.$$

Sometimes we use $\Delta_{x_i}$ to denote $\Delta_{x_i,x_{i+1}}$ if the $m$-strongly convex path is clear from the context. We use $\Delta_x$ to denote $\max_{z \in \mathcal{N}_x} \Delta_{x,z}$.

Definition 1 (Convex Functions). A function $f : \mathcal{V} \to \mathbb{R}$ defined on a graph $G = (\mathcal{V}, \mathcal{E})$ is (strongly) convex if from any node $x \in \mathcal{V}$, there exists a (strongly) convex path to the global minimum $x^*$.

For a nearly convex function, as defined next, the convexity condition is not satisfied at all points.

Definition 2 (Nearly Convex Functions). Let $\alpha > 0$ be a parameter. Let $C$ be the set of points such that $\Delta_x \ge \alpha (f_x - f_{x^*})$. We say a function $f : \mathcal{V} \to \mathbb{R}$ is $(\alpha, c, r)$-nearly convex if for any $x \in \mathcal{V} \setminus C$, there exists a $y \in C$ and a $p \in \mathcal{P}_{x,y}$ such that $\ell(p) \le r$ and $\max_{z \in p} f(z) - f(x) \le c$. A path that satisfies the above conditions is called a low energy path.

An $m$-strongly convex function is also an $(\frac{m}{m+1}, 0, 0)$-nearly convex function.

Lemma 1. For an $m$-strongly convex function $f$, we have that $f_x - f_{x^*} \le \frac{m+1}{m} \Delta_x$.

Proof. Let $p = \{y_0 = x \to y_1 \to \cdots \to y_k\}$ be a strongly convex path. Then

$$\Delta_{y_k} \le \frac{1}{1+m} \Delta_{y_{k-1}} \le \cdots \le \left(\frac{1}{1+m}\right)^{k} \Delta_x.$$

We have that

$$f_x - f_{y_k} = \sum_{i=1}^{k} (f_{y_{i-1}} - f_{y_i}) = \sum_{i=1}^{k} \Delta_{y_{i-1}} = \Delta_x + \sum_{i=1}^{k-1} \Delta_{y_i} \le \sum_{i=0}^{k-1} \frac{\Delta_x}{(1+m)^i}.$$

We conclude that $f_x - f_{y_k} \le \frac{m+1}{m} \Delta_x$, from which the statement follows.

Concave and nearly concave functions are similarly defined. The convexity condition allows for efficient algorithms. The intuition behind our analysis is that, by strong convexity, if node $x$ is far from the global minimum, the improvement $\Delta_x$ is large. When the degree $d$ is relatively small, local search methods have a sufficiently large probability of hitting a good direction (the strongly convex path). So either we are already close to the global minimum, or we are far from the global minimum and, with a constant probability, we make a large improvement.

We analyze two algorithms for graph-based optimization. The first algorithm is a local search procedure where, in each round, the learner attempts to identify a good neighbor and move there. This algorithm is analyzed under the strong convexity condition. The second algorithm is the well-known simulated annealing procedure with an exponential transition function. We provide high probability error bounds for the greedy method, while the sample complexity of the more complicated simulated annealing is analyzed in expectation.
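As a concrete illustration, the path condition behind these definitions is easy to verify numerically. The following sketch (our own illustration, not code from the paper) checks whether a given sequence of function values along a path satisfies the $m$-strong convexity condition $\Delta_{x_{i-1},x_i} - \Delta_{x_i,x_{i+1}} \ge m\,\Delta_{x_i,x_{i+1}} > 0$:

```python
def is_m_strongly_convex_path(f_values, m):
    """f_values[i] is f(x_i) along the path x_0 -> x_1 -> ... -> x_k.
    Returns True if consecutive improvements Delta_i = f(x_i) - f(x_{i+1})
    are all positive and shrink geometrically: Delta_{i-1} >= (1+m) Delta_i."""
    deltas = [f_values[i] - f_values[i + 1] for i in range(len(f_values) - 1)]
    if not deltas or any(d <= 0 for d in deltas):
        return False
    return all(prev - cur >= m * cur for prev, cur in zip(deltas, deltas[1:]))
```

For instance, the values 1.0, 0.5, 0.25, 0.125 halve the improvement at every step and therefore form a 1-strongly convex path, consistent with Lemma 1's bound $f_x - f_{y_k} \le \frac{m+1}{m}\Delta_x$.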
The greedy approach is an intuitive approach to the graph optimization problem: we start from a random node, and at each node, given a sampling budget, we explore its neighbors to find the best one. The problem that we solve at each node can be viewed as a fixed-budget best-arm identification problem (Audibert and Bubeck, 2010). In a fixed-budget setting, the learner plays actions for a number of rounds in some fashion and returns an approximately optimal arm at the end. An example of an algorithm designed for the fixed-budget setting is the Successive Reject algorithm of Audibert and Bubeck (2010).

Before describing the best-arm identification algorithm, we introduce some notation. Let $K$ be an integer, $[K]$ be the set of arms in a bandit problem, $\mu_i$ be the mean value of arm $i$, and $\widehat{\mu}_{i,t}$ be the empirical mean of arm $i$ after $t$ observations. Let $i^* = \mathrm{argmax}_{i \in [K]} \mu_i$ be the optimal arm (ties broken arbitrarily) and $\mu^* = \mu_{i^*}$ be its mean, and let $\Delta_i$ be the optimality gap of arm $i \ne i^*$, i.e. $\Delta_i = \mu^* - \mu_i$. Without loss of generality, assume that $\Delta_2 \le \Delta_3 \le \cdots \le \Delta_K$. Let $B$ be the budget. Define

$$H = \max_{i \in [K]} i\, \Delta_i^{-2}, \qquad \overline{\log}(K) = \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}, \qquad B_k = \left\lceil \frac{B - K}{\overline{\log}(K)\,(K + 1 - k)} \right\rceil, \qquad B_0 = 0.$$

The Successive Reject algorithm is a procedure with $K - 1$ rounds, where in each round one arm is eliminated. Let $A_k$ be the set of remaining arms at the beginning of round $k \in [K-1]$. In round $k$, each arm $i \in A_k$ is selected for $B_k - B_{k-1}$ rounds. At the end of this round, the arm with the smallest empirical mean is removed.

Theorem 1 (Audibert and Bubeck (2010)). The probability of error of the Successive Reject algorithm with a budget of $B$ is smaller than

$$e_B = \frac{K(K-1)}{2} \exp\left( - \frac{B - K}{\overline{\log}(K)\, H} \right).$$

In the Explore-Descend procedure, we use the above bandit algorithm to find the best neighbor in each round. More specifically, for a node $x \in \mathcal{V}$, we solve a best-arm identification problem with action values $\mu_i = f(x) - f(z_i)$, where $z_i$ is the $i$-th neighbor of node $x$. We then move to the chosen neighbor and repeat the process until the budget is consumed. We call this approach "Explore and Descend" and it is detailed in Algorithm 1.

Algorithm 1: Explore and Descend
Input: graph $G$, starting node $x_0$, budget $T$
Output: $x$ such that $x = x^*$ with high probability
Per round budgets $\{T_1, \dots, T_S\}$ such that $\sum_{s=1}^{S} T_s = T$
for $s = 0, \dots, S-1$ do
    $x_{s+1} \leftarrow$ DescentOracle($G$, $x_s$, $T_{s+1}$)
return $x_S$

The algorithm depends on the subroutine DescentOracle($G$, $x_s$, $T_s$). This subroutine is an implementation of the best-arm bandit algorithm described earlier with the decision set $\overline{\mathcal{N}}_{x_s}$ and budget $T_s$.

Corollary 1.
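The two pieces above, fixed-budget Successive Reject and the Explore-Descend loop around it, can be sketched as follows. This is a minimal illustration under our own conventions (the oracle `g` is a noisy evaluation of $f$, and we include the current node among the arms as a stand-in for the decision set $\overline{\mathcal{N}}_x$); it is not the authors' implementation:

```python
import math

def successive_reject(arms, pull, budget):
    """Fixed-budget best-arm identification (Successive Reject).
    arms: list of arm identifiers; pull(a): one noisy reward for arm a."""
    K = len(arms)
    if K == 1:
        return arms[0]
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    remaining = list(arms)
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    B_prev = 0
    for k in range(1, K):  # K - 1 elimination rounds
        B_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for a in remaining:
            for _ in range(max(0, B_k - B_prev)):
                sums[a] += pull(a)
                counts[a] += 1
        B_prev = max(B_prev, B_k)
        # drop the arm with the smallest empirical mean
        remaining.remove(min(remaining, key=lambda a: sums[a] / max(counts[a], 1)))
    return remaining[0]

def explore_descend(neighbors, g, x0, budgets):
    """Explore and Descend: at each step, run Successive Reject over the
    current node and its neighbors to pick the node with the smallest
    estimated value, then move there. neighbors(x): neighbor list of x;
    g(x): noisy observation of f(x)."""
    x = x0
    for T_s in budgets:
        arms = [x] + list(neighbors(x))
        # reward is -g so that best-arm identification minimizes f
        x = successive_reject(arms, lambda a: -g(a), T_s)
    return x
```

On a noiseless chain graph with a convex function, the procedure walks straight down to the minimizer; with noise, larger per-round budgets $T_s$ trade samples for a smaller per-step error probability, exactly as in Theorem 1.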
Let $f$ be an $m$-strongly convex function, let $x_0 \in \mathcal{V}$ be an arbitrary starting node, and let $\{T_1, \dots, T_S\}$, $\sum_{s=1}^{S} T_s = T$, be the per round budgets. Assume that $S$ is sufficiently large for the steepest descent path from $x_0$ to reach $x^*$. Let $e_T = P(x_S \ne x^*)$ be the probability of error of the Explore-Descend algorithm. Then $e_T$ is upper bounded by the following inequality:

$$e_T \le \frac{d(d-1)}{2} \sum_{s=1}^{S} \exp\left( - \frac{(T_s - d)\,\Delta_{2,s}^2}{d\, \overline{\log}(d)} \right), \quad (1)$$

where $\Delta_{i,s}$ is the $i$-th optimality gap at node $x_s$.

Proof. Let $H_s = \max_{i \in [d]} i\, \Delta_{i,s}^{-2}$. In round $s$, the Explore-Descend algorithm uses Successive Reject for the subroutine DescentOracle($G$, $x_s$, $T_s$). Thus, by Theorem 1, the probability of error in round $s$ is upper bounded by

$$e_{T_s} \le \frac{d(d-1)}{2} \exp\left( - \frac{T_s - d}{H_s\, \overline{\log}(d)} \right).$$

Using the union bound on the above inequality for rounds $s = 1, \dots, S$, and the loose upper bound $H_s \le d\, \Delta_{2,s}^{-2}$, we obtain Equation 1.

On the other hand, we can also apply the Successive Reject algorithm directly on $\mathcal{V}$, the set of all nodes of the graph. In this case, each node in the graph is one arm, and the graph structure is disregarded. The probability of error of Successive Reject, using Theorem 1 with $K = n$ and $B = T$, is upper bounded by the following inequality:

$$e'_T \le \frac{n(n-1)}{2} \exp\left( - \frac{T - n}{H\, \overline{\log}(n)} \right). \quad (2)$$

Using the loose upper bound $H \le n\, \Delta_2^{-2}$, the above can be written as:

$$e'_T \le \frac{n(n-1)}{2} \exp\left( - \frac{(T - n)\,\Delta_2^2}{n\, \overline{\log}(n)} \right). \quad (3)$$

Note that the error bounds of Successive Reject given in Equations 2 and 3 are vacuous when $T \le n$. In contrast, Equation 1 provides meaningful error bounds for Explore-Descend even in this small budget regime, when $T \le n$. Additionally, we see that the error bound for Explore-Descend is independent of the size of the graph. Rather, it depends on $S$, which in turn depends on the convexity constant $m$. A larger $m$ means the function is steeper, and fewer steps (smaller $S$) are required to reach the global optimum.

Given that we have access only to noisy observations, we consider the following Metropolis-Hastings algorithm with exponential weights:

$$P(x \to y) = \begin{cases} \frac{1}{d}\min\left(1,\; e^{\gamma(\widehat{f}_{x,s} - \widehat{f}_{y,s})}\right) & \text{if } y \in \mathcal{N}_x, \\ 1 - \sum_{z \in \mathcal{N}_x} P(x \to z) & \text{if } y = x. \end{cases}$$

To simplify the analysis, we use a fixed, time-independent inverse temperature, although in practice a time-varying inverse temperature might be preferable. We estimate each function value with $s = 2r\gamma^2 R^2$ samples. Next, we provide sample complexity bounds for the above procedure in expectation.

Theorem 2.
For an $m$-strongly convex function $f$, let $\alpha = \frac{m}{m+1}$ and $\{x_0 \to x_1 \to x_2 \to \dots\}$ be the path generated by Simulated Annealing from an arbitrary initial node $x_0$. Given a constant $\varepsilon$ and with the choice of $\gamma = \frac{d}{e\alpha\varepsilon}$, after

$$t \ge \frac{\log\left( \alpha (f_{x_0} - f_{x^*}) / (\varepsilon d) \right)}{\log\left( d/(d - \alpha) \right)}$$

rounds, we have $\mathbb{E}[f(x_{t+1}) - f(x^*)] \le \varepsilon$.

For a nearly convex function, let $\beta = 1 - \frac{\alpha e^{-cr\gamma}}{d^{r+1}}$ and let $F = \max_{y \in \mathcal{V}} f(y) - f(x^*)$. With the choice of $\gamma = 1/c$ and after

$$t \ge \frac{r}{\log(1/\beta)} \log\left( F \alpha \gamma \right)$$

rounds, we have that

$$\mathbb{E}(f(x_{t+r+1}) - f(x^*)) \le \frac{d^{r+1} e^r}{\alpha \gamma}.$$

Proof.
Let $y(x_t)$ be the closest point in $C$ on a low energy path from $x_t$, and let $r(x_t) = r_t$ be the distance to this point from $x_t$. Let $\mathcal{P}'_{x_t}$ be the set of paths of length less than $r_t + 1$ starting at $x_t$ such that at least one node on the path is not the same as $x_t$. Consider the low energy sub-path starting at $x_t$:

$$p = \{z_0 = x_t \to z_1 \to \cdots \to z_{r(x_t)} = y(x_t)\}.$$

Let $z(p)$ be the terminating node of a path $p$. Given that the function $f$ is $(\alpha, c, r)$-nearly convex, and given that the noise is $R$-sub-Gaussian, the probability that this low-energy path is taken by the algorithm satisfies

$$P(p) \le \frac{1}{d^{r_t+1}}\, \mathbb{E}\left( e^{-\sum_{k=1}^{r_t} \gamma (\widehat{f}(z_k) - \widehat{f}(z_{k-1}))} \right)$$
$$= \frac{1}{d^{r_t+1}}\, \mathbb{E}\left( \exp\left( -\sum_{k=1}^{r_t} \gamma (\widehat{f}(z_k) - f(z_k)) + \sum_{k=1}^{r_t} \gamma (\widehat{f}(z_{k-1}) - f(z_{k-1})) + \sum_{k=1}^{r_t} \gamma (f(z_{k-1}) - f(z_k)) \right) \right)$$
$$\le \frac{1}{d^{r_t+1}}\, e^{-\gamma (f(z(p)) - f(z_0)) + 2 r_t \gamma^2 R^2 / s}.$$

Because $p$ is a low energy path, we have $P(p) \ge \frac{1}{d^{r_t+1}} e^{-c r_t \gamma}$. Let $c_t = f(y(x_t)) - f(x_t)$, and $P' = \sum_{p' \in \mathcal{P}'_{x_t}} P(p')$. Notice that given $x_t$, $c_t$ and $y(x_t)$ are deterministic. We write

$$\mathbb{E}(f(x_{t+r_t+1}) \mid x_t) \le P(p)\left( f(x_t) + c_t - \Delta_{y(x_t)} \right) + \sum_{p' \in \mathcal{P}'_{x_t}} P(p')\, f(z(p')) + (1 - P(p) - P')\, f(x_t).$$

The first term on the right side corresponds to the event that the state follows the low energy path to the state $y(x_t)$ for $r_t$ rounds, and then goes to the best immediate neighbor of state $y(x_t)$. The second term corresponds to the event that the state is not the same as $x_t$ after $r_t + 1$ rounds. Finally, the last term corresponds to the event that the state stays at $x_t$ for the next $r_t + 1$ rounds. If $\Delta_{y(x_t)} \le c_t$, then by Definition 2, we already have

$$f(y(x_t)) - f(x^*) \le \frac{\Delta_{y(x_t)}}{\alpha} \le \frac{c}{\alpha} \;\Rightarrow\; f(x_t) - f(x^*) \le (r_t + 1/\alpha)\, c.$$
Otherwise, if $c_t \le \Delta_{y(x_t)}$, we continue as follows:

$$\mathbb{E}(f(x_{t+r_t+1}) \mid x_t) \le f(x_t) + \frac{e^{-c r_t \gamma}}{d^{r_t+1}}\left( c_t - \Delta_{y(x_t)} \right) + \sum_{p' \in \mathcal{P}'_{x_t}} P(p')\,(f(z(p')) - f(x_t))$$
$$\le f(x_t) + \frac{e^{-c r_t \gamma}}{d^{r_t+1}}\left( c_t - \Delta_{y(x_t)} \right) + \sum_{p' \in \mathcal{P}'_{x_t},\, f(z(p')) \ge f(x_t)} P(p')\,(f(z(p')) - f(x_t))$$
$$\le f(x_t) + \frac{e^{-c r_t \gamma}}{d^{r_t+1}}\left( c_t - \Delta_{y(x_t)} \right) + \sum_{p' \in \mathcal{P}'_{x_t},\, f_{z(p')} \ge f_{x_t}} \frac{e^{-\gamma (f(z(p')) - f(x_t)) + 2 r_t \gamma^2 R^2 / s}}{d^{r_t+1}}\, (f_{z(p')} - f_{x_t})$$
$$\le f(x_t) + \frac{e^{-c r_t \gamma}}{d^{r_t+1}}\left( c_t - \Delta_{y(x_t)} \right) + \frac{e^{2 r_t \gamma^2 R^2 / s}}{\gamma e},$$

where the last step follows from the inequality $e^{-\gamma b}\, b \le 1/(\gamma e)$. By Definition 2, $\Delta_{y(x_t)} \ge \alpha (f(y(x_t)) - f(x^*))$. Thus,

$$c_t - \Delta_{y(x_t)} \le f(y(x_t)) - f(x_t) - \alpha (f(y(x_t)) - f(x^*))$$
$$= -(f(x_t) - f(x^*)) - f(x^*) + \alpha f(x^*) + (1 - \alpha) f(y(x_t))$$
$$= -(f(x_t) - f(x^*)) + (1 - \alpha)(f(y(x_t)) - f(x^*))$$
$$\le -\alpha (f(x_t) - f(x^*)) + (1 - \alpha)\, c\, r_t,$$

where we used $f(y(x_t)) \le f(x_t) + c\, r_t$ in the last step. Let $k = t + r_t + 1$. We have

$$\mathbb{E}(f(x_k) \mid x_t) - f(x^*) \le \left( 1 - \frac{\alpha e^{-c r_t \gamma}}{d^{r_t+1}} \right)(f(x_t) - f(x^*)) + \frac{c\, r_t\, e^{-c r_t \gamma} (1 - \alpha)}{d^{r_t+1}} + \frac{e^{2 r_t \gamma^2 R^2 / s}}{\gamma e}$$
$$\le \left( 1 - \frac{\alpha e^{-c r \gamma}}{d^{r+1}} \right)(f(x_t) - f(x^*)) + \frac{c}{e\,(1 + \log d)} + \frac{e^{2 r \gamma^2 R^2 / s}}{\gamma e}$$
$$\le \beta^{\lfloor t/r \rfloor}(f(x_0) - f(x^*)) + \frac{1}{1 - \beta}\left( \frac{c}{e\,(1 + \log d)} + \frac{e^{2 r \gamma^2 R^2 / s}}{\gamma e} \right),$$

where the second step holds by $r_t \le r$, $\gamma c = 1$, and $\max_r c\, r\, e^{-\gamma c r} d^{-r} \le \frac{c}{e\,(1 + \log d)}$, and $\beta$ is defined in the theorem statement.
Using the fact that $s = 2r\gamma^2 R^2$ and $\gamma c = 1$, a simple calculation shows that after

$$t \ge \frac{r}{\log(1/\beta)} \log\left( F \alpha \gamma \right)$$

rounds, we have that $\mathbb{E}(f(x_{t+r+1}) - f(x^*)) \le \frac{d^{r+1} e^r}{\alpha \gamma}$. If $c = 0$, then we get that

$$\mathbb{E}(f(x_k) \mid x_t) - f(x^*) \le \left( 1 - \frac{\alpha}{d^{r+1}} \right)(f(x_t) - f(x^*)) + \frac{1}{\gamma} \le \beta^{\lfloor t/r \rfloor}(f(x_0) - f(x^*)) + \frac{1}{(1 - \beta)\, \gamma}.$$

After $t \ge \frac{r}{\log(1/\beta)} \log\left( F \alpha \gamma \right)$ rounds, we have that $\mathbb{E}(f(x_{t+r+1}) - f(x^*)) \le \frac{d^{r+1}}{\alpha \gamma}$.

For a strongly convex function, following a similar argument, we have that

$$\mathbb{E}(f(x_{t+1}) - f(x^*)) \le \left( 1 - \frac{\alpha}{d} \right)(f(x_t) - f(x^*)) + \frac{1}{\gamma e}.$$

Thus, given $\varepsilon$ and with the choice of $\gamma$ in the theorem statement, after

$$t \ge \frac{\log\left( \alpha (f_{x_0} - f_{x^*}) / (\varepsilon d) \right)}{\log\left( d/(d - \alpha) \right)}$$

rounds, we have $\mathbb{E}[f(x_{t+1}) - f(x^*)] \le \varepsilon$.

For nearly convex problems, the error bound in this theorem is meaningful only if the parameter $r$ is small. In our experiment data, this value is $r = 1$ or $r = 2$.

We implemented the Explore-Descend and Simulated Annealing algorithms and compare them with the Spectral Bandit algorithm of Valko et al. (2014) and the Successive Reject algorithm of Audibert and Bubeck (2010). The Successive Reject algorithm works by treating all nodes of the graph as one big multi-armed bandit problem. Spectral Bandit uses the adjacency matrix to compute the Laplacian and its eigenvectors. Both of these algorithms therefore require global information about the graph, while our algorithms assume only local information: from one node, one can only access its neighbors.

We note that the Spectral Bandit algorithm is originally designed to minimize cumulative regret, which partly explains its poor performance in a best-arm identification problem. In the fixed budget best-arm identification setting, we run Spectral Bandit until the budget is consumed and output the most frequently pulled arm, which is, in our experiments, better than taking the arm with the best empirical mean. The algorithm has a high time complexity because of its reliance on operations on matrices and vectors of dimension $n$. As Spectral Bandit does not scale well with the size of the graph due to these matrix operations, we generated a small synthetic graph in order to evaluate it.
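For concreteness, the simulated annealing procedure with the exponential-weight Metropolis-Hastings rule defined earlier can be sketched as follows. This is a minimal illustration with a fixed inverse temperature $\gamma$; the function names and the estimate-on-arrival scheme are our simplifications, not the authors' code:

```python
import math
import random

def simulated_annealing(neighbors, g, x0, gamma, steps, pulls=1, rng=random):
    """Metropolis-Hastings with exponential weights: propose a uniformly
    random neighbor y of x and accept it with probability
    min(1, exp(gamma * (f_hat(x) - f_hat(y)))), where f_hat averages
    `pulls` noisy observations g(.) of the function value."""
    def f_hat(x):
        return sum(g(x) for _ in range(pulls)) / pulls

    x = x0
    fx = f_hat(x)
    for _ in range(steps):
        y = rng.choice(neighbors(x))
        fy = f_hat(y)
        # downhill moves are always accepted; uphill moves with small prob.
        if rng.random() < min(1.0, math.exp(gamma * (fx - fy))):
            x, fx = y, fy
    return x
```

With a large $\gamma$ the chain behaves almost greedily; with a small $\gamma$ it escapes shallow local minima more easily, which is the trade-off the nearly convex analysis quantifies.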
First, we evaluated the different algorithms on synthetic graphs, which are generated as follows. Each node of the graph is a point $(x, y)$, where $x, y \in \{-D, \dots, -1, 0, 1, \dots, D\}$ for some $D \in \mathbb{Z}^+$. Each node is connected to all of its eight immediate neighbors on the plane. Additionally, random edges are added, such that the degree of each node is 15. The mean function value $f(x, y)$ takes its maximum value 0.5 at the origin and decreases quadratically with the distance from it, so that $f(x, y) \in [0, 0.5]$. This mean is unknown to the algorithms, i.e., when the algorithm requests the value of $f(x, y)$, a stochastic value is returned: 1 with probability $p = f(x, y)$ and 0 with probability $1 - p$. It is easy to see that this graph is concave by Definition 1.

The results of the experiment are presented in Figure 1. The performance measure is the average sub-optimality gap, i.e., $f(x^*) - f(\hat{x})$, where $\hat{x}$ is the solution returned by the algorithm. The average is taken over 1000 trials. Overall, our algorithms significantly outperform Spectral Bandit and Successive Reject both in terms of run-time and sub-optimality gap, especially when the budget is smaller than the graph size, which is our intended setting. For Simulated Annealing, we use a fixed inverse temperature $\gamma = 250$. The results could be further improved by optimizing a schedule for this parameter. Interestingly, the number of pulls per function evaluation also has a significant impact on the performance, which can be seen from the plots for Sim Annealing 1 (1 pull) and Sim Annealing 5 (5 pulls) in Figure 1. As for Explore-Descend, we simply allocate the budget equally to each node in the descending path, with the maximum path length chosen separately for D = 10 and D = 100. This algorithm is the fastest and also offers the best performance. Source code is available at https://github.com/tan1889/graph-opt

To demonstrate the performance of our algorithms on real-world non-concave problems, we used data from the Yandex Personalized Web Search Challenge to build a graph as follows. Each query forms a graph, whose nodes correspond to lists of 5 items (documents) for the given query. Two nodes are connected if they have 4 or more items in common at the same positions. The value of a node is a Bernoulli random variable with mean equal to the click-through rate of the associated list. The goal is to find the node with maximum click-through rate, i.e., the most relevant list. We chose the query that generated the largest possible graph (query no. 8107157) of 4514 nodes. As there were many small partitions in this graph, we took the largest partition as the input for our experiment. The resulting graph has 3992 nodes, with degrees varying from 1 to 171 (mean 35).

Figure 1: Comparing the performance of different algorithms on synthetic data. Two left figures: small synthetic graph, D = 10 (441 nodes). Two right figures: large synthetic graph, D = 100 (40401 nodes). The figures show average sub-optimality gaps over 1000 trials and run-time per trial of the algorithms. Same legend (on the second figure) for all figures.
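The synthetic benchmark above can be reproduced in outline as follows. The quadratic mean function below is an assumed stand-in consistent with the description (maximum 0.5 at the origin, values in [0, 0.5]), and the random-edge top-up only approximates the exact degree-15 construction:

```python
import itertools
import random

def make_synthetic_graph(D, degree=15, rng=random):
    """(2D+1) x (2D+1) grid where each node is connected to its eight
    immediate neighbors, plus random extra edges until every node has at
    least the target degree. Observations are Bernoulli with mean f(v)."""
    nodes = list(itertools.product(range(-D, D + 1), repeat=2))
    adj = {v: set() for v in nodes}
    for (x, y) in nodes:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                u = (x + dx, y + dy)
                if (dx, dy) != (0, 0) and u in adj:
                    adj[(x, y)].add(u)
                    adj[u].add((x, y))
    for v in nodes:  # top up with random edges to reach the target degree
        while len(adj[v]) < degree:
            u = rng.choice(nodes)
            if u != v:
                adj[v].add(u)
                adj[u].add(v)

    # Assumed stand-in for the paper's mean function: a concave quadratic
    # bump with f(0, 0) = 0.5 and f = 0 at the corners.
    def f(v):
        x, y = v
        return 0.5 * (1.0 - (x * x + y * y) / (2.0 * D * D))

    def observe(v):  # Bernoulli feedback with mean f(v)
        return 1.0 if rng.random() < f(v) else 0.0

    return adj, f, observe
```

Any of the algorithms above can then be run against `observe` while `f` is used only to measure the sub-optimality gap of the returned node.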
Figure 2: Comparing the performance of algorithms on web document re-ranking data. Left: average sub-optimality gap over 1000 trials. Right: run-time (s) per trial. Same legend for both figures.

For non-concave graphs, Explore-Descend needs to make multiple restarts. We set the number of restarts proportionally to the budget, and allocate the budget equally between all restarts, then, for each restart, equally between the nodes in the path. All other parameters are the same as before. The results of the experiment are presented in Figure 2. In the intended setting, our algorithms significantly outperform Successive Reject. For a very small budget, Simulated Annealing is better than Explore-Descend, but this is reversed as the budget gets bigger. Additionally, for this graph we do not see the big advantage of Sim Annealing 5 over Sim Annealing 1 that we saw before. Outside of the intended setting, Successive Reject quickly becomes the best algorithm when the budget gets larger than the graph size. However, this algorithm requires global information about the graph, which may not always be available.
In this section, we use the proposed graph-based optimization methods in a graph-based nearest neighbor search problem. The graph-based nearest neighbor method takes a training set, constructs a proximity graph on this set, and, when queried in the test phase, performs a hill-climbing search to find an approximate nearest neighbor. More details are given in Appendix A. This procedure is particularly well suited to big-data problems; in such problems, points will have close neighbors, and so the geodesic distance with respect to even a simple metric such as the Euclidean metric should provide an appropriate metric. Further, as we will show, the computational complexity of the graph-based method in the test phase scales gracefully with the size of the training set, a property that is particularly important in big data applications. The intuition is that, in big-data problems, although a descent path might be longer, the objective function is generally smoother and hence easier to minimize.

We apply the local search and simulated annealing methods, along with an additional smoothing, to the problem of nearest neighbor search. For Simulated Annealing, we call the resulting algorithm SGNN, for Smoothed Graph-based Nearest Neighbor search. The pseudocode of the algorithm and more details are in Appendix A. Explore-Descend is denoted by E&D in these experiments. We compared the proposed methods with state-of-the-art nearest neighbor search methods (KDTree and LSH) in two image classification problems (MNIST and COIL-100). In an approximate nearest neighbor search problem, it is crucial to have sublinear time complexity, and thus Spectral Bandit and Successive Reject are not applicable here.

Figure 3 (a-d) shows the accuracy of the different methods on different portions of the MNIST dataset. The graphs in these experiments are not concave, but they are $(\alpha, c = 0, r)$-nearly concave by Definition 2 for a small $r$. The results for COIL-100 are shown in Appendix A. As the size of the training set increases, the prediction accuracy of all methods improves. Figure 3 (e-h) shows that the test phase runtime of the SGNN method grows more modestly for larger datasets. In contrast, KDTree becomes much slower for larger training datasets. The LSH method generally performs poorly, and it is hard to make it competitive with the other methods. When using all training data, the SGNN method has roughly the same accuracy as KDTree, but less than 20% of its test phase runtime.

Figure 3: Prediction accuracy and running time of the different methods on the MNIST dataset as the size of the training set increases. (a,e) Using 25% of training data. (b,f) Using 50% of training data. (c,g) Using 75% of training data. (d,h) Using 100% of training data.

We studied the sample complexity of stochastic optimization of graph functions. We defined a notion of convexity for graph functions, and we showed that under the convexity condition, a greedy algorithm and simulated annealing enjoy sample complexity bounds that are independent of the size of the graph. An interesting direction for future work is the study of cumulative regret in this setting.

We showed the effectiveness of the proposed techniques in a web document re-ranking problem as well as a graph-based nearest neighbor search problem. The computational complexity of the resulting nearest neighbor method scales gracefully with the size of the dataset, which is particularly appealing in big-data applications. Further quantification of this property remains for future work.
Acknowledgement
TN was supported by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).
References
Yasin Abbasi-Yadkori, David Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2011.
Sunil Arya and David M. Mount. Algorithms for fast vector quantization. In IEEE Data Compression Conference, 1993.
Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23th Conference on Learning Theory, Haifa, Israel, June 2010. URL https://hal-enpc.archives-ouvertes.fr/hal-00654404.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397–422, 2002.
C. Bouttier and I. Gavra. Convergence rate of a simulated annealing algorithm with noisy observations. ArXiv e-prints, March 2017.
M.R. Brito, E.L. Chavez, A.J. Quiroz, and J.E. Yukich. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33–42, 1997.
Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in X-armed bandits. In NIPS, 2009.
Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, 2015.
Jie Chen, Haw-ren Fang, and Yousef Saad. Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. Journal of Machine Learning Research, 10:1989–2012, 2009.
P. R. Christian and G. Casella. Monte Carlo Statistical Methods. Springer New York, 1999.
Michael Connor and Piyush Kumar. Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics, 16(4):599–608, 2010.
Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, 2011.
David Eppstein, Michael S. Paterson, and F. Frances Yao. On nearest-neighbor graphs. Discrete & Computational Geometry, 17(3):263–282, 1997.
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 7:1079–1105, 2006.
Jean-Bastien Grill, Michal Valko, and Rémi Munos. Black-box optimization of noisy functions with unknown smoothness. In NIPS, 2015.
Kiana Hajebi, Yasin Abbasi-Yadkori, Hossein Shahbazi, and Hong Zhang. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In IJCAI, 2011.
Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Conference on Information Sciences and Systems (CISS), pages 1–6, 2014.
Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. JMLR, 17:1–42, 2016.
Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. JMLR, 5:623–648, 2004.
Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM, 44(1):1–29, 1997.
Erion Plaku and Lydia E. Kavraki. Distributed computation of the kNN graph for large high-dimensional point sets. Journal of Parallel and Distributed Computing, 67(3):346–359, 2007.
R.Y. Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99:89–112, 1997.
M. Sreenivas Pydi, V. Jog, and P.-L. Loh. Graph-based ascent algorithms for function maximization. ArXiv e-prints, 2018.
Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
Michal Valko, Rémi Munos, Branislav Kveton, and Tomás Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, 2014.
Jing Wang, Jingdong Wang, Gang Zeng, Zhuowen Tu, Rui Gan, and Shipeng Li. Scalable k-NN graph construction for visual descriptors. In CVPR, 2012.
Input: number of random restarts I, number of hill-climbing steps J, length of random walks T.
Initialize the set U = {}
for i := 1, ..., I do
    Initialize x = random point in S
    x′ = SMOOTHED-SIMULATED-ANNEALING(x, T, J)
    U = U ∪ {x′}
end for
Return the best element in U

Figure 4: The Optimization Method with Random Restarts
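As a concrete illustration, the random-restart wrapper of Figure 4 can be sketched in Python. This is a minimal sketch, not the authors' code; `local_search` stands in for the SMOOTHED-SIMULATED-ANNEALING subroutine of Figure 5 and is assumed to return a (node, value) pair.

```python
import random

def optimize_with_restarts(nodes, local_search, num_restarts, rng=None):
    """Run a local-search subroutine from several random starting points
    (Figure 4) and return the best (node, value) pair found."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    results = set()                 # the set U of candidate solutions
    for _ in range(num_restarts):
        x = rng.choice(nodes)            # initialize x = random point in S
        results.add(local_search(x))     # x' = SMOOTHED-SIMULATED-ANNEALING(x, T, J)
    # return the element of U with the smallest objective value
    return min(results, key=lambda pair: pair[1])
```

Storing results in a set mirrors the pseudocode's U = U ∪ {x′} and deduplicates restarts that converge to the same node.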
Input: starting point x ∈ S, number of hill-climbing steps J, length of random walks T.
for j := 1, ..., J do
    Perform a random walk of length T from x. Let y be the stopping state. Let f̂(x, s) = f(y)
    Let u be a neighbor of x chosen uniformly at random.
    Perform a random walk of length T from u. Let v be the stopping state. Let f̂(u, s) = f(v)
    if f(v) ≤ f(y) then
        Update x = u
    else
        Temperature τ = 1 − j/J
        With probability e^{(f(y)−f(v))/τ}, update x = u
    end if
end for
Return x

Figure 5: The SMOOTHED-SIMULATED-ANNEALING Subroutine
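The subroutine of Figure 5 can likewise be sketched in Python. This is a reimplementation under our own assumptions, not the authors' code: the graph is a dict from nodes to neighbor lists, and we explicitly guard the zero-temperature case at the final step (where τ = 1 − j/J reaches 0), which the pseudocode leaves implicit.

```python
import math
import random

def smoothed_simulated_annealing(graph, f, x, T, J, rng=None):
    """Simulated annealing on a graph (Figure 5), where every node is
    evaluated through a length-T random walk that smooths f.

    graph: dict mapping each node to a list of its neighbors.
    f: objective function to minimize (evaluated at walk endpoints).
    """
    rng = rng or random.Random(0)

    def smoothed(start):
        # random walk of length T; report f at the stopping state
        node = start
        for _ in range(T):
            node = rng.choice(graph[node])
        return f(node)

    for j in range(1, J + 1):
        fy = smoothed(x)              # smoothed estimate for x
        u = rng.choice(graph[x])      # uniformly random neighbor of x
        fv = smoothed(u)              # smoothed estimate for u
        if fv <= fy:
            x = u                     # greedy descent step
        else:
            tau = 1.0 - j / J         # temperature, decreasing to 0
            # accept an uphill move with probability exp((f(y) - f(v)) / tau);
            # at tau = 0 (the last step) we never accept an uphill move
            if tau > 0 and rng.random() < math.exp((fy - fv) / tau):
                x = u
    return x
```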
A More Details for Experiments
First, we explain the graph-based nearest neighbor search. Let N be a positive integer. Let G be a proximity graph constructed on a dataset V in an offline phase, i.e., V is the set of nodes of G, and each point in V is connected to its N nearest neighbors with respect to some distance metric ℓ : R^d × R^d → R. In our experiments, we use the Euclidean metric. Given the graph G and a query point y ∈ R^d, the problem reduces to minimizing the function f = ℓ(·, y) over the graph G. The algorithm is shown in Figures 4 and 5. The SGNN continues for a fixed number of iterations. In our experiments, we run the simulated annealing procedure for log n rounds, where n is the size of the training set; see Figure 5 for pseudocode. Finally, the SGNN runs the simulated annealing procedure several times and returns the best outcome of these runs. The resulting algorithm with random restarts is shown in Figure 4. The above algorithm returns an approximate nearest neighbor point. To find K nearest neighbors for K > 1, we simply return the best K elements in the last line of Figure 4. We use K = 50 approximate nearest neighbors to predict a class for each given query. We construct a directed graph G by connecting each node to its N = 30 closest nodes in Euclidean distance. For smoothing, we tried random walks of length T = 1 and T = 2. This means that, to evaluate a node, we run a random walk of length T from that node and return the observed value at the stopping point as an estimate of the value of the node. This operation smooths the function and generally improves the performance. The SGNN method with T = 1 is denoted by SGNN(1), and SGNN with T = 0, i.e., pure simulated annealing on the graph, is denoted by SGNN(0). For the SGNN algorithm, the number of rounds is J = log(training size) in each restart.

Graph-based nearest neighbor search has been studied by Arya and Mount (1993), Brito et al. (1997), Eppstein et al. (1997), Miller et al. (1997), Plaku and Kavraki (2007), Chen et al. (2009), Connor and Kumar (2010), Dong et al. (2011), Hajebi et al. (2011), and Wang et al. (2012). In the worst case, construction of the proximity graph has complexity O(n²), but this is an offline operation. The choice of N impacts the prediction accuracy and computational complexity: a smaller N means a lighter training-phase computation and a heavier test-phase computation (as we need more random restarts to achieve a given prediction accuracy). Having a very large N will also make the test-phase computation heavy.

We used the MNIST and COIL-100 datasets, which are standard datasets for image classification. The MNIST dataset is a black-and-white image dataset consisting of 60000 training images and 10000 test images in 10 classes. Each image is 28 × 28 pixels. The COIL-100 dataset is a colored image dataset consisting of 100 objects, with 72 images of each object taken at every 5 degrees of rotation. Each image is 128 × 128 pixels. We use 80% of the images for training and 20% for testing.

For the LSH and KDTree algorithms, we use the implementations in the scikit-learn library with the following parameters. For LSH, we use LSHForest with min_hash_match=4 and K = 50, meaning that the indices of the 50 closest neighbors are returned. The KDTree method always significantly outperforms LSH. For SGNN, we pick the number of restarts so that all methods have similar prediction accuracy.

Figure 6: Prediction accuracy and running time of different methods on the COIL-100 dataset as the size of the training set increases. (a,e) Using 25% of training data. (b,f) Using 50% of training data. (c,g) Using 75% of training data. (d,h) Using 100% of training data.

Figure 6 (a-d) shows the accuracy of the different methods on different portions of the COIL-100 dataset. As the size of the training set increases, the prediction accuracy of all methods improves.
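The offline construction step described above can be sketched as follows. This is a brute-force O(n²) illustration of our own, not the construction code used in the experiments: each point is connected to its N nearest neighbors in Euclidean distance, and the directed graph is returned as an adjacency dict.

```python
import numpy as np

def build_proximity_graph(points, num_neighbors):
    """Brute-force construction of the directed proximity graph: connect
    each point to its num_neighbors nearest neighbors in Euclidean
    distance. Quadratic time, but this is a one-off offline operation."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # pairwise squared Euclidean distances, shape (n, n)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.einsum('ijk,ijk->ij', diffs, diffs)
    np.fill_diagonal(dists, np.inf)   # a point is not its own neighbor
    graph = {}
    for i in range(n):
        # indices of the num_neighbors closest points to point i
        graph[i] = [int(j) for j in np.argsort(dists[i])[:num_neighbors]]
    return graph
```

For large datasets, approximate constructions such as those of Chen et al. (2009) or Dong et al. (2011) would replace this quadratic loop.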
Figure 6 (e-h) shows that the test-phase runtime of the SGNN method grows more modestly for larger datasets. In contrast, KDTree becomes much slower for larger training datasets. When using all training data, the proposed method has roughly the same accuracy, while having less than 50% of the test-phase runtime of KDTree. Using the exact nearest neighbor search, we get the following prediction accuracy results (the error bands are 95% bootstrapped confidence intervals): with full data, accuracy is . ± . ; with 3/4 of data, accuracy is . ± . ; with 1/2 of data, accuracy is . ± . ; and with 1/4 of data, accuracy is . ± . .

Next, we study how the performance of SGNN changes with the length of the random walks. We choose T = 2 and compare the different methods on the same datasets. The results are shown in Figure 7. The SGNN(2) method outperforms the competitors. Interestingly, SGNN(2) also outperforms the exact nearest neighbor algorithm on the MNIST dataset. This result might appear counter-intuitive, but we explain it as follows. Given that we use a simple metric (Euclidean distance), the exact K nearest neighbors are not necessarily appropriate candidates for making a prediction: although the exact nearest neighbor algorithm finds the global minimum, the neighbors of the global minimum on the graph might have large values. On the other hand, the SGNN(2) method finds points that have small values and whose neighbors also have small values. This stability acts as an implicit regularization in the SGNN(2) algorithm, leading to improved performance. These results show the advantages of using graph-based nearest neighbor algorithms; as the size of the training set increases, the proposed method is much faster than KDTree.
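The smoothing effect discussed above can be made concrete with a small sketch. This function is our own illustration: the algorithm itself uses a single walk per evaluation, and the `num_samples` averaging is an addition of ours, included only to make the smoothing visible; walk_length = 0 recovers the raw value, as in SGNN(0), while larger walk lengths mix in neighbor values, as in SGNN(1) and SGNN(2).

```python
import random

def smoothed_estimate(graph, f, node, walk_length, num_samples=100, rng=None):
    """Estimate the smoothed value of a node by averaging f at the
    endpoints of several length-T random walks on the graph."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_samples):
        v = node
        for _ in range(walk_length):
            v = rng.choice(graph[v])   # one step of the random walk
        total += f(v)
    return total / num_samples
```

A node whose neighbors all have small values keeps a small smoothed estimate, while an isolated low point surrounded by high values does not, which is the stability property credited above for SGNN(2).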