Distributional Analysis∗

Tim Roughgarden†

July 28, 2020
Abstract

In distributional or average-case analysis, the goal is to design an algorithm with good-on-average performance with respect to a specific probability distribution. Distributional analysis can be useful for the study of general-purpose algorithms on “non-pathological” inputs, and for the design of specialized algorithms in applications in which there is detailed understanding of the relevant input distribution. For some problems, however, pure distributional analysis encourages “overfitting” an algorithmic solution to a particular distributional assumption, and a more robust analysis framework is called for. This chapter presents numerous examples of the pros and cons of distributional analysis, highlighting some of its greatest hits while also setting the stage for the hybrids of worst- and average-case analysis studied in later chapters.

1 Introduction

Part I of this book covered refinements of worst-case analysis that do not impose any assumptions on the possible inputs. Part II described several deterministic models of data, in which inputs to a problem were restricted to those with properties that are plausibly shared by all “real-world” inputs. This chapter, and a majority of the remaining chapters in the book, consider models that include a probability distribution over inputs.
In its purest form, the goal in distributional analysis is to analyze the average performance of algorithms with respect to a specific input distribution, and perhaps also to design new algorithms that perform particularly well for this distribution. What do we hope to gain from such an analysis?

• In applications in which the input distribution is well understood (e.g., due to lots of recent and representative data), distributional analysis is well suited both to predict the performance of existing algorithms and to design algorithms specialized to the input distribution.

• When there is a large gap between the empirical and worst-case performance of an algorithm, an input distribution can serve as a metaphor for “non-pathological” inputs. Even if the input distribution cannot be taken literally, a good average-case bound is a plausibility argument for the algorithm’s empirical performance. The three examples in Section 2 are in this spirit.

• Optimizing performance with respect to a specific input distribution can lead to new algorithmic ideas that are useful much more broadly. The examples in Sections 3 and 4 have this flavor.

And what could go wrong?

• Carrying out an average-case analysis of an algorithm might be analytically tractable only for the simplest (and not necessarily realistic) input distributions.

• Optimizing performance with respect to a specific input distribution can lead to “overfitting,” meaning algorithmic solutions that are overly reliant on the details of the distributional assumptions and have brittle performance guarantees (which may not hold if the distributional assumptions are violated).

• Pursuing distribution-specific optimizations can distract from the pursuit of more robust and broadly useful algorithmic ideas.

This chapter has two goals. The first is to celebrate a few classical results in the average-case analysis of algorithms, which constitute some of the earliest work on alternatives to worst-case analysis. Our coverage here is far from encyclopedic, with the discussion confined to a sampling of relatively simple results for well-known problems that contribute to the chapter’s overarching narrative. The second goal is to examine critically such average-case results, thereby motivating the more robust models of distributional analysis outlined in Section 5 and studied in detail later in the book.

∗Chapter 8 of the book Beyond the Worst-Case Analysis of Algorithms (Roughgarden, 2020).
†Department of Computer Science, Columbia University. Supported in part by NSF award CCF-1813188 and ARO award W911NF1910294. Email: [email protected].
1.2 An Optimal Stopping Problem

The pros and cons of distributional analysis are evident in a famous example from optimal stopping theory, which is interesting in its own right and also relevant to some of the random-order models described in Chapter 11. Consider a game with n stages. Nonnegative prizes arrive online, with v_i denoting the value of the prize that appears in stage i. At each stage, an algorithm must decide between accepting the current prize (which terminates the game) and proceeding to the next stage after discarding it. This involves a difficult trade-off between the risk of being too ambitious (and skipping over what turns out to be the highest-value prize) and of not being ambitious enough (settling for a modest-value prize instead of waiting for a better one).

Suppose we posit specific distributions D_1, D_2, ..., D_n, known in advance to the algorithm designer, such that the value v_i of the stage-i prize is drawn independently from D_i. (The D_i’s may or may not be identical.) An algorithm learns the realization v_i of a prize value only at stage i. We can then speak about an optimal algorithm for the problem, meaning an online algorithm that achieves the maximum-possible expected prize value, where the expectation is with respect to the assumed distributions D_1, D_2, ..., D_n.

The optimal algorithm for a given sequence of prize value distributions is easy enough to specify, by working backward in time. If an algorithm finds itself at stage n without having accepted a prize, it should definitely accept the final prize. (Recall that all prizes have nonnegative values.) At an earlier stage i, the algorithm should accept the stage-i prize if and only if v_i is at least the expected prize value obtained by the (inductively defined) optimal strategy for stages i + 1, i + 2, ..., n.
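The backward-in-time computation can be sketched in a few lines of Python. This is an illustrative sketch (not from the chapter), under the simplifying assumption that each D_i is a finite discrete distribution, represented as a list of (value, probability) pairs; the function name is my own.

```python
# Sketch of the optimal stopping rule via backward induction (illustrative;
# assumes each distribution D_i is discrete: a list of (value, probability) pairs).
def optimal_thresholds(distributions):
    """Return thresholds t_1, ..., t_n: accept the stage-i prize iff v_i >= t_i.

    t_i equals the expected value earned by the optimal strategy over stages
    i+1, ..., n, so in particular t_n = 0 (the final prize is always accepted).
    """
    expected_to_go = 0.0  # expected value of optimal play after the last stage
    thresholds = [0.0] * len(distributions)
    for i in reversed(range(len(distributions))):
        thresholds[i] = expected_to_go
        # Optimal play at stage i earns max(v_i, continuation value), in expectation.
        expected_to_go = sum(p * max(v, expected_to_go)
                             for v, p in distributions[i])
    return thresholds
```

For example, with D_1 a 50/50 coin flip between prizes 1 and 0 and D_2 a deterministic prize of 0.6, the rule accepts the stage-1 prize only if it beats the continuation value of 0.6.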
The solution above illustrates the primary advantages of distributional analysis: an unequivocal definition of an “optimal” algorithm, and the possibility of a crisp characterization of such an algorithm (as a function of the input distributions).

The disadvantages of average-case analysis are also on display, and there are several reasons why one might reject this optimal algorithm.

1. The algorithm takes the distributional assumptions literally and its description depends in a detailed way on the assumed distributions. It is unclear how robust the optimality guarantee is to misspecifications of these distributions, or to a reordering of the distributions.

2. The algorithm is relatively complicated, in that it is defined by n different parameters (one threshold for each stage).

3. The algorithm does not provide any qualitative advice about how to tackle similar problems (other than “work backward”).

The third point is particularly relevant when studying a problem chosen as a deliberate simplification of a “real-world” problem that is too messy to analyze directly. In this case, an optimal solution to the simpler problem is useful only inasmuch as it suggests a plausibly effective solution to the more general problem. For our optimal stopping problem, could there be non-trivial guarantees for simpler, more intuitive, and more robust algorithms?

1.3 Threshold Rules and the Prophet Inequality

Returning to the optimal stopping problem of Section 1.2, a threshold stopping rule is defined by a single parameter, a threshold t. The corresponding online algorithm accepts the first prize i with value satisfying v_i ≥ t (if any). Such a rule is clearly suboptimal, as it doesn’t even necessarily accept the prize at stage n. Nevertheless, the following prophet inequality proves that there is a threshold strategy with an intuitive threshold that performs approximately as well as a fully clairvoyant prophet.

Theorem 1.1 (Samuel-Cahn (1984)). For every sequence D = D_1, D_2, . . .
, D_n of independent prize value distributions, there is a threshold rule that guarantees expected reward at least (1/2) · E_{v∼D}[max_i v_i], where v denotes (v_1, ..., v_n). This guarantee holds, in particular, for the threshold t at which there is a 50/50 chance that the rule accepts one of the n prizes.

Proof.
Let z⁺ denote max{z, 0}. Consider a threshold strategy with threshold t (to be chosen later). The plan is to prove a lower bound on the expected value of this strategy and an upper bound on the expected value of a prophet such that the two bounds are easy to compare.

What value does the t-threshold strategy obtain? Let q(t) denote the probability of the failure mode in which the threshold strategy accepts no prize at all; in this case, it obtains zero value. With the remaining probability 1 − q(t), the rule obtains value at least t. To improve this lower bound, consider the case in which exactly one prize i satisfies v_i ≥ t; then, the rule also gets “extra credit” of v_i − t above and beyond its baseline value of t. (See Chapter 11 for an analogous result for a related problem, the secretary problem. The difficulty when two prizes i and j exceed the threshold is that this extra credit is either v_i − t or v_j − t, whichever appeared earlier; the proof avoids reasoning about the ordering of the distributions by crediting the rule only with the baseline value of t in this case.) We can therefore bound the expected value of the t-threshold strategy from below by

(1 − q(t)) · t + Σ_{i=1}^{n} E_v[v_i − t | v_i ≥ t, v_j < t ∀ j ≠ i] · Pr[v_i ≥ t] · Pr[v_j < t ∀ j ≠ i]    (1)

= (1 − q(t)) · t + Σ_{i=1}^{n} E_v[v_i − t | v_i ≥ t] · Pr[v_i ≥ t] · Pr[v_j < t ∀ j ≠ i]    (2)

≥ (1 − q(t)) · t + q(t) · Σ_{i=1}^{n} E_v[(v_i − t)⁺],    (3)

where we use the independence of the D_i’s in (1) to factor the two probability terms and in (2) to drop the conditioning on the event that v_j < t for every j ≠ i. In (2), the first two factors of the i-th summand equal E[(v_i − t)⁺]; in (3), we use that q(t) = Pr[v_j < t ∀ j] ≤ Pr[v_j < t ∀ j ≠ i].

Now we produce an upper bound on the prophet’s expected value E_{v∼D}[max_i v_i] that is easy to compare to (3).
The expression E_v[max_i v_i] doesn’t reference the strategy’s threshold t, so we add and subtract it to derive

E_v[max_{i=1}^{n} v_i] = E_v[t + max_{i=1}^{n} (v_i − t)]
 ≤ t + E_v[max_{i=1}^{n} (v_i − t)⁺]
 ≤ t + Σ_{i=1}^{n} E_v[(v_i − t)⁺].    (4)

Comparing (3) and (4), we can complete the proof by setting t so that q(t) = 1/2, with a 50/50 chance of accepting a prize.

The drawback of this threshold rule relative to the optimal online algorithm is clear: it does not guarantee as much expected value. Nonetheless, this solution possesses several attractive properties that are not satisfied by the optimal algorithm:

1. The threshold rule recommended by Theorem 1.1 depends on the prize value distributions D_1, D_2, ..., D_n only inasmuch as it depends on the number t for which there is a 50/50 probability that at least one realized value exceeds t. For example, reordering the distributions arbitrarily does not change the recommended threshold rule.

2. A threshold rule is simple in that it is defined by only one parameter. Intuitively, a single-parameter rule is less prone to “overfitting” to the assumed distributions than a more highly parameterized algorithm like the (n-parameter) optimal algorithm.
3. Theorem 1.1 gives flexible qualitative advice about how to approach such problems: Start with threshold rules, and don’t be too risk-averse (i.e., choose an ambitious enough threshold that receiving no prize is a distinct possibility).

(If there is no such t because of point masses in the D_i’s, then a minor extension of the argument yields the same result; see Exercise 1. See Chapter 29 on data-driven algorithm design for a formalization of the intuition in point 2.)

2 Average-Case Justifications of Classical Algorithms
Distributional assumptions can guide the design of algorithms, as with the optimal stopping problem introduced in Section 1.2. Distributional analysis can also be used to analyze a general-purpose algorithm, with the goal of explaining mathematically why its empirical performance is much better than its worst-case performance. In these applications, the assumed probability distribution over inputs should not be taken literally; rather, it serves as a metaphor for “real-world” or “non-pathological” inputs. This section gives the flavor of work along these lines by describing one result for each of three classical problems: sorting, hashing, and bin packing.
2.1 QuickSort

Recall the QuickSort algorithm from undergraduate algorithms which, given an array of n elements from a totally ordered set, works as follows:

• Designate one of the n array entries as a “pivot” element p.

• Partition the input array around the pivot element p, meaning rearrange the array entries so that all entries less than p appear before p in the array and all entries greater than p appear after p in the array.

• Recursively sort the subarray comprising the elements less than p.

• Recursively sort the subarray comprising the elements greater than p.

The second step of the algorithm is easy to implement in Θ(n) time. There are many ways to choose the pivot element, and the running time of the algorithm varies between Θ(n log n) and Θ(n²), depending on these choices. One way to enforce the best-case scenario is to explicitly compute the median element and use it as the pivot. A simpler and more practical solution is to choose the pivot element uniformly at random; most of the time, it will be close enough to the median that both recursive calls are on significantly smaller inputs. A still simpler solution, which is common in practice, is to always use the first array element as the pivot element. This deterministic version of QuickSort runs in Θ(n²) time on already-sorted arrays, but empirically its running time is Θ(n log n) on almost all other inputs. One way to formalize this observation is to analyze the algorithm’s expected running time on a random input. As a comparison-based sorting algorithm, the running time of QuickSort depends only on the relative order of the array entries, so we can assume without loss of generality that the input is a permutation of {1, 2, ..., n} and identify a “random input” with a random permutation. With any of the standard implementations of the partitioning subroutine, the average-case running time of this deterministic QuickSort algorithm is at most a constant factor larger than its best-case running time.

Theorem 2.1 (Hoare (1962)).
The expected running time of the deterministic QuickSort algorithm on a random permutation of {1, 2, ..., n} is O(n log n).

Proof. We sketch one of the standard proofs. Assume that the partitioning subroutine only makes comparisons that involve the pivot element; this is the case for all of the textbook implementations. Each recursive call is given a subarray consisting of the elements from some interval {i, i + 1, ..., j}; conditioned on this interval, the relative order of its elements in the subarray is uniformly random. (In the best-case scenario, every pivot element is the median element of the subarray, leading to the recurrence T(n) = 2T(n/2) + Θ(n) with solution Θ(n log n). In the worst-case scenario, every pivot element is the minimum or maximum element of the subarray, leading to the recurrence T(n) = T(n − 1) + Θ(n) with solution Θ(n²).)

Fix two elements i and j with i < j. These elements are passed to the same sequence of recursive calls (along with i + 1, i + 2, ..., j − 1) until the first time that some element of {i, i + 1, ..., j} is chosen as a pivot element. At this point, i and j are either compared to each other (if i or j was the chosen pivot) or not (otherwise); in any case, they are never compared to each other again in the future. With all subarray orderings equally likely, the probability that i and j are compared is exactly 2/(j − i + 1). By the linearity of expectation, the expected total number of comparisons is then Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 2/(j − i + 1) = O(n log n), and the expected running time of the algorithm is at most a constant factor larger.

2.2 Linear Probing

A hash table is a data structure that supports fast insertions and lookups. Under the hood, most hash table implementations maintain an array A of some length n and use a hash function h to map each inserted object x to an array entry h(x) ∈ {1, 2, ..., n}. A fundamental issue in hash table design is how to resolve collisions, meaning pairs x, y of distinct inserted objects for which h(x) = h(y). Linear probing is a specific way of resolving collisions:

1. Initially, all entries of A are empty.

2. Store a newly inserted object x in the first empty entry in the sequence A[h(x)], A[h(x) + 1], A[h(x) + 2], ..., wrapping around to the beginning of the array, if necessary.

3. To search for an object x, scan the entries A[h(x)], A[h(x) + 1], A[h(x) + 2], ... until encountering x (a successful search) or an empty slot (an unsuccessful search), wrapping around to the beginning of the array, if necessary.

That is, the hash function indicates the starting position for an insertion or lookup operation, and the operation scans to the right until it finds the desired object or an empty position. The running
The runningtime of an insertion or lookup operation is proportional to the length of this scan.The bigger the fraction α of the hash table that is occupied (called its load ), the fewer emptyarray entries and the longer the scans. To get calibrated, imagine searching for an empty arrayentry using independent and uniformly random probes. The number of attempts until a success isthen a geometric random variable with success probability 1 − α , which has expected value − α .With linear probing, however, objects tend to clump together in consecutive slots, resulting inslower operation times. How much slower?Non-trivial mathematical guarantees for hash tables are possible only under assumptions thatrule out data sets that are pathologically correlated with the table’s hash function; for this reason,hash tables have long constituted one of the killer applications of average-case analysis. Commonassumptions include asserting some amount of randomness in the data (as in average-case analysis),in the choice of hash function (as in randomized algorithms), or both (as in Chapter 26). Forexample, assuming that the data and hash function are such that every hash value h ( x ) is anindependent and uniform draw from { , , . . . , n } , the expected time of insertions and lookupsscales with − α ) . This result played an important role in the genesis of the mathematical analysis of algorithms. Donald E. Knuth,its discoverer, wrote: “I first formulated the following derivation in 1962. . . Ever since that day, the analysis ofalgorithms has in fact been one of the major themes in my life.” .3 Bin Packing The bin packing problem played a central role in the early development of the average-case analysisof algorithms; this section presents one representative result. 
Here, the average-case analysis is of the quality of the solution output by a heuristic (as with the prophet inequality), not of its running time (unlike our QuickSort and linear probing examples).

In the bin packing problem, the input is n items with sizes s_1, s_2, ..., s_n ∈ [0, 1], and the goal is to pack the items into as few bins as possible, where each bin has capacity 1. The problem is NP-hard, so every polynomial-time algorithm produces suboptimal solutions in some cases (assuming P ≠ NP). Many practical bin packing heuristics have been studied extensively from both worst-case and average-case viewpoints. (See Chapter 11 for an analysis of bin packing heuristics in random-order models.) One example is the first-fit decreasing (FFD) algorithm:

• Sort and reindex the items so that s_1 ≥ s_2 ≥ · · · ≥ s_n.

• For i = 1, 2, ..., n:

  – If there is an existing bin with room for item i (i.e., with current total size at most 1 − s_i), add i to the first such bin.

  – Otherwise, start a new bin and add i to it.

For example, consider an input consisting of 6 items with size 1/2 + ǫ, 6 items with size 1/4 + 2ǫ, 6 items with size 1/4 + ǫ, and 12 items with size 1/4 − 2ǫ. The FFD algorithm uses 11 bins while an optimal solution packs them perfectly into 9 bins (Exercise 3). Duplicating this set of 30 items as many times as necessary shows that there are arbitrarily large inputs for which the FFD algorithm uses 11/9 times as many bins as an optimal solution. Conversely, the FFD algorithm never uses more than 11/9 times the minimum-possible number of bins plus an additive constant (see the Notes for details).

The factor of 11/9 ≈ 1.22 is quite good as worst-case approximation ratios go, but empirically the FFD algorithm usually produces a solution that is extremely close to optimal. One approach to better theoretical bounds is distributional analysis. For bin-packing algorithms, the natural starting point is the case in which item sizes are independent draws from the uniform distribution on [0, 1].

Theorem 2.2 (Frederickson (1980)). For every ǫ > 0, for n items with sizes distributed independently and uniformly in [0, 1], with probability 1 − o(1) as n → ∞, the FFD algorithm uses less than (1 + ǫ) times as many bins as an optimal solution.

In other words, the typical approximation ratio of the FFD algorithm tends to 1 as the input size grows large.

We outline a two-step proof of Theorem 2.2. The first step shows that the guarantee holds for a less natural algorithm that we call the truncate and match (TM) algorithm. The second step shows that the FFD algorithm never uses more bins than the TM algorithm.

The truncate and match algorithm works as follows. (For clarity, we omit ceilings and floors; see Exercise 4 for the motivation behind the size cutoff in the first step.)

• Pack every item with size at least 1 − n^{−1/3} in its own bin.

• Sort and reindex the remaining k items so that s_1 ≥ s_2 ≥ · · · ≥ s_k. (Assume for simplicity that k is even.)

• For each i = 1, ..., k/
2, put items i and k − i + 1 into a common bin if possible; otherwise, put them in separate bins.

To explain the intuition behind the TM algorithm, consider the expected order statistics (i.e., expected minimum, expected second-minimum, etc.) of n independent samples from the uniform distribution on [0, 1]. These expected order statistics divide [0, 1] evenly into n + 1 subintervals; the expected minimum is 1/(n + 1), the expected second-minimum 2/(n + 1), and so on. Thus, at least in an expected sense, the first and last items together should fill up a bin exactly, as should the second and second-to-last items, and so on. Moreover, as n grows large, the difference between the realized order statistics and their expectations should become small. Setting aside a small number of the largest items in the first step then corrects for any (small) deviations from these expectations with negligible additional cost. See Exercise 4 for details.

We leave the second step of the proof of Theorem 2.2 as Exercise 5.

Lemma 2.1.
For every bin packing input, the FFD algorithm uses at most as many bins as the TM algorithm.
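As an illustration of Theorem 2.2’s message, here is a short sketch of FFD (illustrative code, not from the chapter; the function name is my own), together with a small experiment on uniformly random item sizes. The total item volume serves as a crude lower bound on the optimal number of bins.

```python
import random

def first_fit_decreasing(sizes):
    """Pack items into unit-capacity bins with FFD; return the list of bin loads."""
    bins = []
    for s in sorted(sizes, reverse=True):
        for i, load in enumerate(bins):
            if load + s <= 1.0:       # first existing bin with room for the item
                bins[i] = load + s
                break
        else:
            bins.append(s)            # no existing bin had room: open a new bin
    return bins

rng = random.Random(0)
sizes = [rng.random() for _ in range(2000)]
used = len(first_fit_decreasing(sizes))
volume = sum(sizes)                   # any packing needs at least this many bins
# For uniform sizes, used / volume is typically within a few percent of 1.
```

The ratio `used / volume` overestimates FFD’s true approximation ratio (since the volume bound is itself loose), so observing it close to 1 is consistent with the near-optimality promised by Theorem 2.2.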
The description of the general-purpose FFD algorithm is not tailored to a distributional assumption, but the proof of Theorem 2.2 is fairly specific to uniform-type distributions. This is emblematic of one of the drawbacks of average-case analysis: often, it is analytically tractable only under quite specific distributional assumptions.
3 Random Points in Euclidean Space

Another classical application domain for average-case analysis is computational geometry, with the input comprising random points from some subset of Euclidean space. We highlight two representative results for fundamental problems in two dimensions, one concerning the running time of an always-correct convex hull algorithm and one about the solution quality of an efficient heuristic for the NP-hard Traveling Salesman Problem.
3.1 2D Convex Hull

A typical textbook on computational geometry begins with the 2D convex hull problem. The input consists of a set S of n points in the plane (in the unit square [0, 1] × [0, 1], say), and the goal is to identify the points of S that lie on the convex hull of S. (Recall that the convex hull of a set S of points is the smallest convex set containing them, or equivalently the set of all convex combinations of points from S. In two dimensions, imagine the points as nails in a board, and the convex hull as a taut rubber band that encloses them.) There are several algorithms that solve the 2D convex hull problem in Θ(n log n) time. Can we do better—perhaps even linear time—when the points are drawn from a distribution, such as the uniform distribution on the square?

Theorem 3.1 (Bentley and Shamos (1978)). There is an algorithm that solves the 2D convex hull problem in expected O(n) time for n points drawn independently and uniformly from the unit square.

The algorithm is a simple divide-and-conquer algorithm. Given points S = {p_1, p_2, ..., p_n} drawn independently and uniformly from the unit square:

• If the input S contains at most 5 points, compute the convex hull by brute force. Return the points of S on the convex hull, sorted by x-coordinate.

• Otherwise, let S_1 = {p_1, ..., p_{n/2}} and S_2 = {p_{n/2+1}, ..., p_n} denote the first and second halves of S. (Assume for simplicity that n is even.)

• Recursively compute the convex hull C_1 of S_1, with its points sorted by x-coordinate.

• Recursively compute the convex hull C_2 of S_2, with its points sorted by x-coordinate.

• Merge C_1 and C_2 into the convex hull C of S. Return C, with the points of C sorted by x-coordinate.

For every set S and partition of S into S_1 and S_2, every point on the convex hull of S is on the convex hull of either S_1 or S_2. Correctness of the algorithm follows immediately. The last step is easy to implement in time linear in |C_1| + |C_2|; see Exercise 6.
Because the subproblems S_1 and S_2 are themselves uniformly random points from the unit square (with the sorting occurring only after the recursive computation completes), the expected running time of the algorithm is governed by the recurrence

T(n) ≤ 2 · T(n/2) + O(E[|C_1| + |C_2|]).

Theorem 3.1 follows immediately from this recurrence and the following combinatorial bound.
Lemma 3.1 (Rényi and Sulanke (1963)). The expected size of the convex hull of n points drawn independently and uniformly from the unit square is O(log n).

Proof. Imagine drawing the input points in two phases, with n/2 points S_i drawn in phase i for i = 1, 2. An elementary argument shows that the convex hull of the points in S_1 occupies, in expectation, at least a 1 − O((log n)/n) fraction of the unit square (Exercise 7). Each point of the second phase thus lies in the interior of the convex hull of S_1 (and hence of S_1 ∪ S_2) except with probability O((log n)/n), so the expected number of points from S_2 on the convex hull of S_1 ∪ S_2 is O(log n). By symmetry, the same is true of S_1.

3.2 The Euclidean Traveling Salesman Problem

In the
Traveling Salesman Problem (TSP), the input consists of n points and distances between them, and the goal is to compute a tour of the points (visiting each point once and returning to the starting point) with the minimum-possible total length. In Euclidean TSP, the points lie in Euclidean space and all distances are straight-line distances. This problem is NP-hard, even in two dimensions. The main result of this section is analogous to Theorem 2.2 in Section 2.3 for the bin packing problem—a polynomial-time algorithm that, when the input points are drawn independently and uniformly from the unit square, has approximation ratio tending to 1 (with high probability) as n tends to infinity.

The algorithm, which we call the Stitch algorithm, works as follows. (Again, we ignore ceilings and floors.)

• Divide the unit square evenly into s = n/ln n subsquares, each with side length √((ln n)/n).

• For each subsquare i = 1, 2, ..., s, containing the points P_i:

  – If |P_i| ≤ 6 ln n, compute the optimal tour T_i of P_i using dynamic programming.

  – Otherwise, return an arbitrary tour T_i of P_i.

• Choose an arbitrary representative point from each non-empty set P_i, and let R denote the set of representatives.

• Construct a tour T of R by visiting points from left to right in the bottommost row of subsquares, right to left in the second-to-bottom row, and so on, returning to the starting point after visiting all the points in the topmost row.

• Shortcut the union of the subtours T ∪ T_1 ∪ · · · ∪ T_s to a single tour of all n points, and return it.

This algorithm runs in polynomial time with probability 1 and returns a tour of the input points. As for the approximation guarantee:
Theorem 3.2 (Karp (1977)). For every ǫ > 0, for n points distributed independently and uniformly in the unit square, with probability 1 − o(1) as n → ∞, the Stitch algorithm returns a tour with total length less than (1 + ǫ) times that of an optimal tour.

Proving Theorem 3.2 requires understanding the typical length of an optimal tour of random points in the unit square and then bounding from above the difference between the lengths of the tour returned by the Stitch algorithm and of the optimal tour. The first step is not difficult (Exercise 8).
Lemma 3.2.
There is a constant c > 0 such that, with probability 1 − o(1) as n → ∞, the length of an optimal tour of n points drawn independently and uniformly from the unit square is at least c√n.

Lemma 3.2 implies that proving Theorem 3.2 reduces to showing that (with high probability) the difference between the lengths of Stitch’s tour and the optimal tour is o(√n). For the second step, we start with a simple consequence of the Chernoff bound (see Exercise 9).

Lemma 3.3.
In the Stitch algorithm, with probability 1 − o(1) as n → ∞, every subsquare contains at most 6 ln n points.

It is also easy to bound the length of the tour T of the representative points R in the Stitch algorithm (see Exercise 10).

Lemma 3.4.
There is a constant c′ such that, for every input, the length of the tour T in the Stitch algorithm is at most c′ · √s = c′ · √(n/ln n).

The key lemma states that an optimal tour can be massaged into subtours for all of the subsquares without much additional cost.

(Regarding the dynamic program used within each subsquare: given k points, label them {1, 2, ..., k}. There is one subproblem for each subset S of points and point j ∈ S, whose solution is the minimum-length path that starts at the point 1, ends at the point j, and visits every point of S exactly once. Each of the O(2^k · k) subproblems can be solved in O(k) time by trying all possibilities for the final hop of the optimal path. When k = O(log n), this running time of O(2^k · k²) is polynomial in n. Regarding the shortcutting step: the union of the s + 1 subtours can be viewed as a connected Eulerian graph, which then admits a closed Eulerian walk, using every edge of the graph exactly once. This walk can be transformed into a tour of the points with only smaller length by skipping repeated visits to a point.)

Lemma 3.5. Let T* denote an optimal tour of the n input points, and let L_i denote the length of the portion of T* that lies within the subsquare i ∈ {1, 2, ..., s} defined by the Stitch algorithm. For every subsquare i = 1, 2, ..., s, there exists a tour of the points P_i in the subsquare of length at most

L_i + 6√((ln n)/n).    (5)

The key point in Lemma 3.5 is that the upper bound in (5) depends only on the size of the subsquare, and not on the number of times that the optimal tour T* crosses its boundaries.

Before proving Lemma 3.5, we observe that Lemmas 3.2–3.5 easily imply Theorem 3.2. Indeed, with high probability:

1. The optimal tour has length L* ≥ c√n.

2. Every subsquare in the Stitch algorithm contains at most 6 ln n points, and hence the algorithm computes an optimal tour of the points in each subsquare (with length at most (5)).

3. Thus, recalling that s = n/ln n, the total length of Stitch’s tour is at most

Σ_{i=1}^{s} (L_i + 6√((ln n)/n)) + c′ · √(n/ln n) = L* + O(√(n/ln n)) = (1 + o(1)) · L*.

Finally, we prove Lemma 3.5.
Proof (Lemma 3.5). Fix a subsquare i with a non-empty set P_i of points. The optimal tour T* visits every point in P_i while crossing the boundary of the subsquare an even number 2t of times; denote these crossing points by Q_i = {y_1, y_2, ..., y_{2t}}, indexed in clockwise order around the subsquare’s perimeter (starting from the lower left corner). Now form a connected Eulerian multigraph G = (V, E) with vertices V = P_i ∪ Q_i by adding the following edges:

• Add the portions of T* that lie inside the subsquare (giving points of P_i a degree of 2 and points of Q_i a degree of 1).

• Let M_1 (respectively, M_2) denote the perfect matching of Q_i that matches each y_j with j odd (respectively, with j even) to y_{j+1}. (In M_2, y_{2t} is matched with y_1.) Add two copies of the cheaper matching to the edge set E and one copy of the more expensive matching (boosting the degree of points of Q_i to 4 while also ensuring connectivity).

The total length of the edges contributed by the first ingredient is L_i. The total length of the edges in M_1 ∪ M_2 is at most the perimeter of the subsquare, which is 4√((ln n)/n). The second copy of the cheaper matching adds at most 2√((ln n)/n) to the total length of the edges in G. As with the shortcutting step of the Stitch algorithm, because G is connected and Eulerian, we can extract from it a tour of P_i ∪ Q_i (and hence of P_i) that has total length at most that of the edges of G, which is at most L_i + 6√((ln n)/n).

To what extent are the two divide-and-conquer algorithms of this section tailored to the distributional assumption that the input points are drawn independently and uniformly at random from the unit square? For the convex hull algorithm in Section 3.1, the consequence of an incorrect distributional assumption is mild; its worst-case running time is governed by the recurrence T(n) ≤ 2T(n/2) + O(n) and hence is O(n log n), which is close to linear.
Also, analogs of Lemma 3.1 (and hence Theorem 3.1) can be shown to hold for a number of other distributions. The Stitch algorithm in Section 3.2, with its fixed dissection of the unit square into equal-size subsquares, may appear hopelessly tied to the assumption of a uniform distribution. But minor modifications to it result in more robust algorithms, for example by using an adaptive dissection, which recursively divides each square along either the median x-coordinate or the median y-coordinate of the points in the square. Indeed, this idea paved the way for later algorithms that obtained polynomial-time approximation schemes (i.e., (1 + ε)-approximations for arbitrarily small constant ε) even for the worst-case version of Euclidean TSP (see the Notes).

Zooming out, our discussion of these two examples touches on one of the biggest risks of average-case analysis: distributional assumptions can lead to algorithms that are unduly tailored to those assumptions. On the other hand, even when this is the case, the high-level ideas behind the algorithms can prove useful much more broadly.

Most of our average-case models so far concern random numerical data. This section studies random combinatorial structures, and specifically different probability distributions over graphs.
This section reviews the most well-studied model of random graphs, the Erdős–Rényi random graph model. This model is a family {G_{n,p}} of distributions, indexed by the number n of vertices and the edge density p. A sample from the distribution G_{n,p} is a graph G = (V, E) with |V| = n in which each of the n(n − 1)/2 possible edges is present independently with probability p. The special case of p = 1/2 is the uniform distribution over all n-vertex graphs. This is an example of an “oblivious random model,” meaning that it is defined independently of any particular optimization problem.

The assumption of uniformly random data may have felt like cheating already in our previous examples, but it is particularly problematic for many computational problems on graphs. Not only is this distributional assumption extremely specific, it also fails to meaningfully differentiate between different algorithms. We illustrate this point with two problems that are discussed at length in Chapters 9 and 10.
Example: Minimum bisection.
In the graph bisection problem, the input is an undirected graph G = (V, E) with an even number of vertices, and the goal is to identify a bisection (i.e., a partition of V into two equal-size groups) with the fewest crossing edges.

To see why this problem is algorithmically uninteresting in the Erdős–Rényi random graph model, take p = 1/2 and let n tend to infinity. In a random sample from G_{n,p}, for every bisection (S, S̄) of the set V of n vertices, the expected number of edges of E crossing it is n²/8. A straightforward application of the Chernoff bound shows that, with probability 1 − o(1) as n → ∞, the number of edges crossing every bisection is (1 ± o(1))·n²/8 (Exercise 11). Thus even an algorithm that computes a maximum bisection is an almost optimal algorithm for computing a minimum bisection! (The Erdős–Rényi model also fails to replicate the statistical properties commonly observed in “real-world” graphs; see Chapter 28 for further discussion.)

Example: Maximum clique.

In the maximum clique problem, the goal (given an undirected graph) is to identify the largest subset of vertices that are mutually adjacent. In a random graph in the G_{n,1/2} model, the size of the maximum clique is very likely to be ≈ 2·log_2 n. (In fact, the size of the maximum clique turns out to be incredibly concentrated; see the Notes.) To see heuristically why this is true, note that for an integer k, the expected number of cliques on k vertices in a random graph from G_{n,1/2} is exactly

    (n choose k) · 2^{−k(k−1)/2} ≈ n^k · 2^{−k²/2},

which is 1 precisely when k = 2·log_2 n. That is, 2·log_2 n is approximately the largest k for which we expect to see at least one k-clique.

On the other hand, while there are several polynomial-time algorithms (including the obvious greedy algorithm) that compute, with high probability, a clique of size ≈ log_2 n in a random graph from G_{n,1/2}, no such algorithm is known to do better. The Erdős–Rényi model fails to distinguish between different efficient heuristics for the maximum clique problem.
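For concreteness, here is one way to simulate the greedy heuristic just mentioned on a sample from G_{n,1/2}; the function names and graph representation are our choices, not the chapter's.

```python
import random

def sample_gnp(n, p, rng):
    """Sample an Erdos-Renyi graph G(n, p) as a dict of adjacency sets."""
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def greedy_clique(adj):
    """Single pass over the vertices, keeping each vertex that is adjacent
    to every vertex kept so far. On G(n, 1/2) the result has size roughly
    log2(n) with high probability, about half the likely maximum of
    roughly 2*log2(n)."""
    clique = []
    for v in adj:
        if all(v in adj[u] for u in clique):
            clique.append(v)
    return clique
```

Running this for n = 1024 typically returns a clique of size near log_2 1024 = 10, while the maximum clique is likely around 20; no polynomial-time algorithm is known that closes this factor-of-two gap.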
Chapters 5 and 6 study deterministic models of data in which the optimal solution to an optimization problem must be “clearly optimal” in some sense, with the motivation of zeroing in on the instances with a “meaningful” solution (such as an informative clustering of data points).
Planted graph models implement the same stability idea in the context of random graphs, by positing probability distributions over inputs which generate (with high probability) graphs in which an optimal solution “sticks out.” The goal is then to devise a polynomial-time algorithm that recovers the optimal solution with high probability, under the weakest possible assumptions on the input distribution. Unlike an oblivious random model such as the Erdős–Rényi model, planted models are generally defined with a particular computational problem in mind.

Algorithms for planted models generally fall into three categories, listed roughly in order of increasing complexity and power.

1. Combinatorial approaches. We leave the term “combinatorial” safely undefined, but basically it refers to algorithms that work directly with the graph, rather than resorting to any continuous methods. For example, an algorithm that looks only at vertex degrees, subgraphs, shortest paths, and so on would be considered combinatorial.

2. Spectral algorithms. Here “spectral” means an algorithm that computes and uses the eigenvectors of a suitable matrix derived from the input graph. Spectral algorithms often achieve optimal recovery guarantees for planted models.

3. Semidefinite programming (SDP). Algorithms that use semidefinite programming have proved useful for extending guarantees for spectral algorithms in planted models to hold also in semi-random models (see Chapters 9 and 10).
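To make the second category concrete, here is a minimal spectral sketch in Python (our illustration, not the chapter's): split the vertices according to the signs of the eigenvector associated with the second-largest eigenvalue of the adjacency matrix. On graphs with two planted clusters, of the kind defined in the next example, this sign pattern correlates with the hidden partition.

```python
import numpy as np

def spectral_split(A):
    """Bisect the vertices of the graph with adjacency matrix A by the
    signs of the eigenvector for the second-largest eigenvalue. (The top
    eigenvector roughly tracks vertex degrees; the second one tracks
    community structure, when there is any.)"""
    eigenvalues, eigenvectors = np.linalg.eigh(A)  # ascending eigenvalues
    v = eigenvectors[:, -2]                        # second-largest
    return v >= 0                                  # one boolean label per vertex
```

Note that np.linalg.eigh computes the full spectrum in O(n³) time; for large sparse graphs one would instead use an iterative eigensolver for the top few eigenvectors.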
Example: Planted bisection.
In the planted bisection problem, a graph is generated according to the following random process (for a fixed vertex set V, with |V| even, and parameters p, q ∈ [0, 1]):

1. Choose a partition (S, T) of V with |S| = |T| uniformly at random.

2. Independently for each pair (i, j) of vertices inside the same cluster (S or T), include the edge (i, j) with probability p.

3. Independently for each pair (i, j) of vertices in different clusters, include the edge (i, j) with probability q.

Thus the expected edge density inside the clusters is p, and between the clusters is q. The difficulty of recovering the planted bisection (S, T) clearly depends on the gap between p and q. The problem is impossible if p = q and trivial if p = 1 and q = 0. Thus the key question in this model is: how big does the gap p − q need to be before exact recovery is possible in polynomial time (with high probability)?

When p, q, and p − q are all bounded below by a constant independent of n, the problem is easily solved by combinatorial approaches (Exercise 12); unfortunately, these do not resemble algorithms that perform well in practice. We can make the problem more difficult by allowing p, q, and p − q to go to 0 with n. Here, semidefinite programming-based algorithms work for an impressively wide range of parameter values. For example:

Theorem 4.1 (Abbe et al. (2016); Hajek et al. (2016)). If p = α ln n/n and q = β ln n/n with α > β, then:

(a) If √α − √β ≥ √2, there is a polynomial-time algorithm that recovers the planted partition (S, T) with probability 1 − o(1) as n → ∞.

(b) If √α − √β < √2, then no algorithm recovers the planted partition with constant probability as n → ∞.

In this parameter regime, semidefinite programming algorithms provably achieve information-theoretically optimal recovery guarantees. Thus, switching from the p, q, p − q = Ω(1) parameter regime to the p, q, p − q = o(1) regime is valuable not because we literally believe that the latter is more faithful to “real-world” instances, but rather because it encourages better algorithm design.

Example: Planted clique.
The planted clique problem with parameters k and n concerns the following distribution over graphs.

1. Fix a vertex set V with n vertices. Sample a graph from G_{n,1/2}: independently for each pair (i, j) of vertices, include the edge (i, j) with probability 1/2.

2. Choose a random subset Q ⊆ V of k vertices.

3. Add all remaining edges between pairs of vertices in Q.

Once k is significantly bigger than ≈ 2·log_2 n, the likely size of a maximum clique in a random graph from G_{n,1/2}, the planted clique Q is with high probability the maximum clique of the graph. How big does k need to be before it becomes visible to a polynomial-time algorithm?

When k = Ω(√(n log n)), the problem is trivial, with the k highest-degree vertices constituting the planted clique Q. To see why this is true, think first about the sampled Erdős–Rényi random graph, before the clique Q is planted. The expected degree of each vertex is ≈ n/2, with standard deviation ≈ √n/2. Textbook large deviation inequalities show that, with high probability, the degree of every vertex is within ≈ √(2 ln n) standard deviations of its expectation (Figure 1). Planting a clique Q of size a·√(n log n), for a sufficiently large constant a, then boosts the degrees of all of the clique vertices enough that they catapult past the degrees of all of the non-clique vertices.

(The planted bisection model of the previous example is a special case of the stochastic block model studied in Chapter 10.)

[Figure 1: The degree distribution of a random graph from G_{n,1/2}, before planting the k-clique Q. If k = Ω(√(n log n)), then the planted clique will consist of the k vertices with the highest degrees.]

The “highest degrees” algorithm is not very useful in practice. What went wrong? The same thing that often goes wrong with pure average-case analysis: the solution is brittle and overly tailored to a specific distributional assumption. How can we change the input model to encourage the design of algorithms with more robust guarantees?

One idea is to mimic what worked well for the planted bisection problem, and to study a more difficult parameter regime that forces us to develop more useful algorithms. For the planted clique problem, there are non-trivial algorithms, including spectral algorithms, that recover the planted clique Q with high probability provided k = Ω(√n) (see the Notes).

There is a happy ending to the study of both the planted bisection and planted clique problems: with the right choice of parameter regimes, these models drive us toward non-trivial algorithms that might plausibly be useful starting points for the design of practical algorithms. Still, both results seem to emerge from “threading the needle” in the parameter space. Could there be a better alternative, in the form of input models that explicitly encourage the design of robustly good algorithms?
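The “k highest-degree vertices” observation is easy to verify by simulation. The sketch below is ours, with k chosen comfortably above √(n log n); it samples from the planted clique distribution and recovers Q from degrees alone.

```python
import random

def planted_clique_graph(n, k, rng):
    """Sample G(n, 1/2), then plant a clique on a uniformly random
    k-subset Q of the vertices. Returns (adjacency matrix, Q)."""
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = (rng.random() < 0.5)
    Q = set(rng.sample(range(n), k))
    for a in Q:
        for b in Q:
            if a != b:
                adj[a][b] = True
    return adj, Q

def top_k_degrees(adj, k):
    """Return the k vertices of highest degree."""
    by_degree = sorted(range(len(adj)), key=lambda v: sum(adj[v]), reverse=True)
    return set(by_degree[:k])
```

With n = 500 and k = 180, the degree boost of roughly k/2 dwarfs the O(√(n log n)) fluctuations of the other degrees, so the top-k set essentially always coincides with Q; shrink k toward 2·log_2 n and the method collapses, which is where spectral and SDP approaches take over.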
Many of the remaining chapters in this book pursue different hybrids of worst- and average-case analysis, in search of a “sweet spot” for algorithm analysis that both encourages robustly good algorithms (like worst-case analysis) and allows for strong provable guarantees (like average-case analysis). Most of these models assume that there is in fact a probability distribution over inputs (as in average-case analysis), but that this distribution is a priori unknown to an algorithm. The goal is then to design algorithms that work well no matter what the input distribution is (perhaps with some restrictions on the class of possible distributions). Indeed, several of the average-case guarantees in this chapter can be viewed as applying simultaneously (i.e., in the worst case) across a restricted but still infinite family of input distributions:

• The 1/2-approximation in the prophet inequality (Theorem 1.1) for a threshold-t rule applies simultaneously to all distribution sequences D_1, D_2, ..., D_n such that Pr_{v∼D}[max_i v_i ≥ t] = 1/2 (e.g., all possible reorderings of one such sequence).

• The guarantees for our algorithms for the bin packing (Theorem 2.2), convex hull (Theorem 3.1), and Euclidean TSP (Theorem 3.2) problems hold more generally for all input distributions that are sufficiently close to uniform.

The general research agenda in robust distributional analysis is to prove approximate optimality guarantees for algorithms for as many different computational problems and as rich a class of input distributions as possible. Work in the area can be divided into two categories, both well represented in this book, depending on whether an algorithm observes one or many samples from the unknown input distribution. We conclude this chapter with an overview of what's to come.
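The prophet-inequality guarantee in the first bullet above is easy to see in simulation. The sketch below is ours: it uses n i.i.d. Uniform[0, 1] prizes, for which the threshold satisfying Pr[max_i v_i ≥ t] = 1/2 is t = (1/2)^{1/n}, and compares the threshold rule's average value to the prophet's.

```python
import random

def threshold_rule(prizes, t):
    """Accept the first prize with value at least t; otherwise take nothing."""
    for v in prizes:
        if v >= t:
            return v
    return 0.0

def compare(n, trials, rng):
    """Average value of the threshold rule vs. the prophet (who always
    takes the maximum), over random Uniform[0, 1] prize sequences."""
    t = 0.5 ** (1.0 / n)  # Pr[max of n uniforms >= t] = 1 - t**n = 1/2
    rule_total, prophet_total = 0.0, 0.0
    for _ in range(trials):
        prizes = [rng.random() for _ in range(n)]
        rule_total += threshold_rule(prizes, t)
        prophet_total += max(prizes)
    return rule_total / trials, prophet_total / trials
```

For n = 5, the rule earns roughly 0.47 per round against the prophet's E[max] = 5/6 ≈ 0.83, comfortably above the guaranteed factor of 1/2.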
In single-sample models, an algorithm is designed with knowledge only of a class D of possible input distributions, and receives only a single input drawn from an unknown and adversarially chosen distribution from D. In these models, the algorithm cannot hope to learn anything non-trivial about the input distribution. Instead, the goal is to design an algorithm that, for every input distribution D ∈ D, has expected performance close to that of the optimal algorithm specifically tailored for D. Examples include:

• The semi-random models in Chapters 9–11 and 17 and the smoothed analysis models in Chapters 13–15 and 19. In these models, nature and an adversary collaborate to produce an input, and each fixed adversary strategy induces a particular input distribution. Performing well with respect to the adversary in these models is equivalent to performing well simultaneously across all of the induced input distributions.

• The effectiveness of simple hash functions with pseudorandom data (Chapter 26). The main result in this chapter is a guarantee for universal hashing that holds simultaneously across all data distributions with sufficient entropy.

• Prior-independent auctions (Chapter 27), which are auctions that achieve near-optimal expected revenue simultaneously across a wide class of valuation distributions.

In multi-sample models, an algorithm observes multiple samples from an unknown input distribution D ∈ D, and the goal is to efficiently identify a near-optimal algorithm for D from as few samples as possible. Examples include:

• Self-improving algorithms (Chapter 12) and data-driven algorithm design (Chapter 29), in which the goal is to design an algorithm that, when presented with independent samples from an unknown input distribution, quickly converges to an approximately best-in-class algorithm for that distribution.
• Supervised learning (Chapters 16 and 22), in which the goal is to identify the expected loss-minimizing hypothesis (from a given hypothesis class) for an unknown data distribution given samples from that distribution.

• Distribution testing (Chapter 23), in which the goal is to make accurate inferences about an unknown distribution from a limited number of samples.
Notes
The prophet inequality (Theorem 1.1) is due to Samuel-Cahn (1984). The pros and cons of threshold rules versus optimal online algorithms are discussed also by Hartline (2017). QuickSort and its original analysis are due to Hoare (1962). The (1 − α)^{−2} bound for linear probing with load α and random data, as well as the corresponding quote in Section 2.2, are in Knuth (1998). A good (if outdated) entry point to the literature on bin packing is Coffman, Jr. et al. (1996). The lower bound for the FFD algorithm in Exercise 3 is from Johnson et al. (1974). The first upper bound of the form (11/9)·OPT + O(1) for the number of bins used by the FFD algorithm, where OPT denotes the minimum-possible number of bins, is due to Johnson (1973). The exact worst-case bound for FFD was pinned down recently by Dósa et al. (2013). The average-case guarantee in Theorem 2.2 is a variation on one by Frederickson (1980), who proved that the expected difference between the number of bins used by FFD and an optimal solution is O(n^{2/3}). A more sophisticated argument pins down the exact asymptotic order of this expectation (Coffman, Jr. et al., 1991).

The linear expected time algorithm for 2D convex hulls (Theorem 3.1) is by Bentley and Shamos (1978). Lemma 3.1 was first proved by Rényi and Sulanke (1963); the proof outlined here follows Har-Peled (1998). Exercise 6 is solved by Andrews (1979). The asymptotic optimality of the Stitch algorithm for Euclidean TSP (Theorem 3.2) is due to Karp (1977), who also gave an alternative solution based on the adaptive dissections mentioned in Section 3.3. A good general reference for this topic is Karp and Steele (1985). The worst-case approximation schemes mentioned in Section 3.3 are due to Arora (1998) and Mitchell (1999).

The Erdős–Rényi random graph model is from Erdős and Rényi (1960). The size of the maximum clique in a random graph drawn from G_{n,1/2} was characterized by Matula (1976); with high probability it is either k or k + 1, where k is an integer roughly equal to 2·log_2 n. Grimmett and McDiarmid (1975) proved that the greedy algorithm finds, with high probability, a clique of size roughly log_2 n in a random graph from G_{n,1/2}. The planted bisection model described here was proposed by Bui et al. (1987) and is also a special case of the stochastic block model defined by Holland et al. (1983). Part (b) of Theorem 4.1 and a weaker version of part (a) were proved by Abbe et al. (2016); the stated version of part (a) is due to Hajek et al. (2016). The planted clique model was suggested by Jerrum (1992). Kučera (1995) noted that the “top-k degrees” algorithm works with high probability when k = Ω(√(n log n)). The first polynomial-time algorithm for the planted clique problem with k = O(√n) was the spectral algorithm of Alon et al. (1998). Barak et al. (2016) supplied evidence, in the form of a sum-of-squares lower bound, that the planted clique problem is intractable when k = o(√n).

The versions of the Chernoff bound stated in Exercises 4(a) and 9 can be found, for example, in Mitzenmacher and Upfal (2017).

Acknowledgments
I thank Anupam Gupta, C. Seshadhri, and Sahil Singla for helpful comments on a preliminary draft of this chapter.
References
Abbe, E., A. S. Bandeira, and G. Hall (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62(1), 471–487.
Alon, N., M. Krivelevich, and B. Sudakov (1998). Finding a large hidden clique in a random graph. Random Structures & Algorithms 13(3-4), 457–466.
Andrews, A. M. (1979). Another efficient algorithm for convex hulls in two dimensions. Information Processing Letters 9(5), 216–219.
Arora, S. (1998). Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. Journal of the ACM 45(5), 753–782.
Barak, B., S. B. Hopkins, J. A. Kelner, P. Kothari, A. Moitra, and A. Potechin (2016). A nearly tight sum-of-squares lower bound for the planted clique problem. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 428–437.
Bentley, J. L. and M. I. Shamos (1978). Divide and conquer for linear expected time. Information Processing Letters 7(2), 87–91.
Bui, T. N., S. Chaudhuri, F. T. Leighton, and M. Sipser (1987). Graph bisection algorithms with good average case behavior. Combinatorica 7(2), 171–191.
Coffman, Jr., E. G., C. Courcoubetis, M. R. Garey, D. S. Johnson, L. A. McGeoch, P. W. Shor, R. R. Weber, and M. Yannakakis (1991). Fundamental discrepancies between average-case analyses under discrete and continuous distributions: A bin packing case study. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing (STOC), pp. 230–240.
Coffman, Jr., E. G., M. R. Garey, and D. S. Johnson (1996). Approximation algorithms for bin packing: A survey. In D. Hochbaum (Ed.), Approximation Algorithms for NP-Hard Problems, Chapter 2, pp. 46–93. PWS.
Dósa, G., R. Li, X. Han, and Z. Tuza (2013). Tight absolute bound for first fit decreasing bin-packing: FFD(L) ≤ 11/9·OPT(L) + 6/9. Theoretical Computer Science 510, 13–61.
Erdős, P. and A. Rényi (1960). On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci. 5, 17–61.
Frederickson, G. N. (1980). Probabilistic analysis for simple one- and two-dimensional bin packing algorithms. Information Processing Letters 11(4-5), 156–161.
Grimmett, G. and C. J. H. McDiarmid (1975). On colouring random graphs. Mathematical Proceedings of the Cambridge Philosophical Society 77, 313–324.
Hajek, B., Y. Wu, and J. Xu (2016). Achieving exact cluster recovery threshold via semidefinite programming: Extensions. IEEE Transactions on Information Theory 62(10), 5918–5937.
Har-Peled, S. (1998). On the expected complexity of random convex hulls. Technical Report 330/98, School of Mathematical Sciences, Tel Aviv University.
Hartline, J. D. (2017). Mechanism design and approximation. Book in preparation.
Hoare, C. A. R. (1962). Quicksort. The Computer Journal 5(1), 10–15.
Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social Networks 5(2), 109–137.
Jerrum, M. (1992). Large cliques elude the Metropolis process. Random Structures and Algorithms 3(4), 347–359.
Johnson, D. S. (1973). Near-Optimal Bin Packing Algorithms. Ph.D. thesis, MIT.
Johnson, D. S., A. Demers, J. D. Ullman, M. R. Garey, and R. L. Graham (1974). Worst-case performance bounds for simple one-dimensional packing algorithms. SIAM Journal on Computing 3(4), 299–325.
Karp, R. M. (1977). Probabilistic analysis of partitioning algorithms for the traveling-salesman problem in the plane. Mathematics of Operations Research 2(3), 209–224.
Karp, R. M. and J. M. Steele (1985). Probabilistic analysis of heuristics. In E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys (Eds.), The Traveling Salesman Problem, Chapter 6, pp. 181–205. John Wiley & Sons.
Knuth, D. E. (1998). The Art of Computer Programming: Sorting and Searching, Volume 3. Addison-Wesley. Second edition.
Kučera, L. (1995). Expected complexity of graph partitioning problems. Discrete Applied Mathematics 57(2-3), 193–212.
Matula, D. W. (1976). The largest clique size in a random graph. Technical Report 7608, Department of Computer Science, Southern Methodist University.
Mitchell, J. S. B. (1999). Guillotine subdivisions approximate polygonal subdivisions: A simple polynomial-time approximation scheme for geometric TSP, k-MST, and related problems. SIAM Journal on Computing 28(4), 1298–1309.
Mitzenmacher, M. and E. Upfal (2017). Probability and Computing. Cambridge. Second edition.
Rényi, A. and R. Sulanke (1963). Über die konvexe Hülle von n zufällig gewählten Punkten. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 2, 75–84.
Roughgarden, T. (Ed.) (2020). Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press.
Samuel-Cahn, E. (1984). Comparison of threshold stop rules and maximum for independent nonnegative random variables. Annals of Probability 12(4), 1213–1216.
Exercises
1. Extend the prophet inequality (Theorem 1.1) to the case in which there is no threshold t with q(t) = 1/2, where q(t) is the probability that no prize meets the threshold.

[Hint: Define t such that Pr[π_i > t for all i] ≤ 1/2 ≤ Pr[π_i ≥ t for all i]. Show that at least one of the two corresponding strategies (taking the first prize with value at least t, or the first with value exceeding t) satisfies the requirement.]

2. The prophet inequality (Theorem 1.1) provides an approximation guarantee of 1/2 relative to the expected prize value obtained by a prophet, which is at least (and possibly more than) the expected prize value obtained by an optimal online algorithm. Show by examples that the latter quantity can range from 50% to 100% of the former.

3. Prove that for a bin packing instance consisting of 6 items with size 1/2 + ε, 6 items with size 1/4 + 2ε, 6 items with size 1/4 + ε, and 12 items with size 1/4 − 2ε, the first-fit decreasing algorithm uses 11 bins and an optimal solution uses 9 bins.

4. This exercise and the next outline a proof of Theorem 2.2. Divide the interval [0, 1] evenly into n^{1/3} intervals, with I_j denoting the subinterval [(j − 1)/n^{1/3}, j/n^{1/3}] for j = 1, 2, ..., n^{1/3}. Let P_j denote the items with size in I_j.

(a) One version of the Chernoff bound states that, for every sequence X_1, X_2, ..., X_n of Bernoulli (0-1) random variables with means p_1, p_2, ..., p_n and every δ ∈ (0, 1),

    Pr[|X − μ| ≥ δμ] ≤ 2e^{−μδ²/3},

where X and μ denote Σ_{i=1}^{n} X_i and Σ_{i=1}^{n} p_i, respectively. Use this bound to prove that

    |P_j| ∈ [n^{2/3} − √n, n^{2/3} + √n] for all j = 1, 2, ..., n^{1/3}    (6)

with probability 1 − o(1) as n → ∞.

(b) Assuming (6), prove that the sum Σ_{i=1}^{n} s_i is at least n/2 − c_1·n^{2/3} for some constant c_1 > 0.

(c) Assuming (6), prove that all but O(n^{2/3}) of the items can be matched into pairs such that each pair of items i and k − i + 1 fits in a single bin (where items are indexed in nonincreasing order of size and k denotes the number of matched items).

(d) Conclude that there is a constant c_2 > 0 such that, with probability 1 − o(1) as n → ∞, the FFD algorithm uses at most n/2 + c_2·n^{2/3} = (1 + o(1))·OPT bins, where OPT denotes the number of bins used by an optimal solution.

5. Prove Lemma 2.1.

6. Give an algorithm that, given a set S of n points from the square sorted by x-coordinate, computes the convex hull of S in O(n) time.

[Hint: compute the lower and upper parts of the convex hull separately.]

7. Prove that, in expectation, the convex hull of n points drawn independently and uniformly at random from the unit square occupies a 1 − O(log n / n) fraction of the square.

8. Prove Lemma 3.2.

[Hint: Chop the unit square evenly into n subsquares of side length n^{−1/2}, and each subsquare further into 9 mini-squares of side length (1/3)·n^{−1/2}. For a given subsquare, what is the probability that the input includes one point from its center mini-square and none from the other 8 mini-squares?]

9. Another variation of the Chernoff bound states that, for every sequence X_1, X_2, ..., X_n of Bernoulli (0-1) random variables with means p_1, p_2, ..., p_n and every t ≥ 6μ,

    Pr[X ≥ t] ≤ 2^{−t},

where X and μ denote Σ_{i=1}^{n} X_i and Σ_{i=1}^{n} p_i, respectively. Use this bound to prove Lemma 3.3.

10. Prove Lemma 3.4.

11. Use the Chernoff bound from Exercise 4(a) to prove that, with probability approaching 1 as n → ∞, every bisection of a random graph from G_{n,p} has (1 ± o(1))·pn²/4 crossing edges.

12. Consider the planted bisection problem with parameters p = c_1 and q = p − c_2 for constants c_1, c_2 > 0. Consider the following simple combinatorial algorithm for recovering a planted bisection:

• Choose a vertex v arbitrarily.

• Let A denote the n/2 vertices that have the fewest common neighbors with v.

• Let B denote the rest of the vertices (including v) and return (A, B).

Prove that, with high probability over the random choice of G (approaching 1 as n → ∞), this algorithm exactly recovers the planted bisection.

[Hint: compute the expected number of common neighbors for pairs of vertices on the same and on different sides of the planted partition. Use the Chernoff bound.]

13. Consider the planted clique problem (Section 4.2) with planted clique size k ≥ c·log_2 n for a sufficiently large constant c. Design an algorithm that runs in n^{O(log n)} time and, with probability 1 − o(1) as n → ∞, recovers the planted clique.
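The counts in Exercise 3 can be checked numerically. The sketch below is ours: it implements first-fit decreasing with integer sizes, taking ε = 0.002 and scaling everything by 1000 so that the bin capacity is exactly 1000 and no floating-point rounding can interfere.

```python
from collections import Counter

def first_fit_decreasing(sizes, capacity):
    """Sort items by nonincreasing size, then put each item into the first
    open bin with enough residual capacity, opening a new bin if none fits.
    Returns the number of bins used."""
    loads = []
    for s in sorted(sizes, reverse=True):
        for i, load in enumerate(loads):
            if load + s <= capacity:
                loads[i] += s
                break
        else:
            loads.append(s)
    return len(loads)

# Exercise 3's instance, scaled by 1000 with eps = 2: sizes 1/2 + eps,
# 1/4 + 2*eps, 1/4 + eps, and 1/4 - 2*eps.
sizes = [502] * 6 + [254] * 6 + [252] * 6 + [246] * 12

assert first_fit_decreasing(sizes, 1000) == 11

# An optimal packing uses 9 bins: six bins of {1/2+eps, 1/4+eps, 1/4-2eps}
# and three bins of {1/4+2eps, 1/4+2eps, 1/4-2eps, 1/4-2eps}.
optimal = [[502, 252, 246]] * 6 + [[254, 254, 246, 246]] * 3
assert all(sum(b) <= 1000 for b in optimal)
assert Counter(x for b in optimal for x in b) == Counter(sizes)
```

The trace matches the exercise: FFD opens six bins for the large items, fills them with the 1/4 + 2ε items, and is then forced to open five more bins, while the optimal solution packs everything into nine exactly full bins.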