Beyond the Worst-Case Analysis of Algorithms (Introduction)
Tim Roughgarden

July 28, 2020
Abstract
One of the primary goals of the mathematical analysis of algorithms is to provide guidance about which algorithm is the "best" for solving a given computational problem. Worst-case analysis summarizes the performance profile of an algorithm by its worst performance on any input of a given size, implicitly advocating for the algorithm with the best-possible worst-case performance. Strong worst-case guarantees are the holy grail of algorithm design, providing an application-agnostic certification of an algorithm's robustly good performance. However, for many fundamental problems and performance measures, such guarantees are impossible and a more nuanced analysis approach is called for. This chapter surveys several alternatives to worst-case analysis that are discussed in detail later in the book.
Comparing different algorithms is hard. For almost any pair of algorithms and measure of algorithm performance, each algorithm will perform better than the other on some inputs. For example, the MergeSort algorithm takes Θ(n log n) time to sort length-n arrays, whether the input is already sorted or not, while the running time of the InsertionSort algorithm is Θ(n) on already-sorted arrays but Θ(n²) in general.

The difficulty is not specific to running time analysis. In general, consider a computational problem Π and a performance measure
Perf, with
Perf(A, z) quantifying the "performance" of an algorithm A for Π on an input z ∈ Π. For example, Π could be the Traveling Salesman Problem (TSP), A could be a polynomial-time heuristic for the problem, and Perf(A, z) could be the approximation ratio of A—i.e., the ratio of the lengths of A's output tour and an optimal tour—on the TSP instance z. Or Π could be the problem of testing primality, A a randomized polynomial-time primality-testing algorithm, and Perf(A, z) the probability (over A's internal randomness) that the algorithm correctly decides if the positive integer z is prime.
(This chapter is Chapter 1 of the book Beyond the Worst-Case Analysis of Algorithms (Roughgarden, 2020). The author is at the Department of Computer Science, Columbia University, supported in part by NSF award CCF-1813188 and ARO award W911NF1910294. Email: [email protected].)

(A quick reminder about asymptotic notation in the analysis of algorithms: for nonnegative real-valued functions T(n) and f(n) defined on the natural numbers, we write T(n) = O(f(n)) if there are positive constants c and n_0 such that T(n) ≤ c · f(n) for all n ≥ n_0; T(n) = Ω(f(n)) if there exist positive c and n_0 with T(n) ≥ c · f(n) for all n ≥ n_0; and T(n) = Θ(f(n)) if T(n) is both O(f(n)) and Ω(f(n)).)

(In the Traveling Salesman Problem, the input is a complete undirected graph (V, E) with a nonnegative cost c(v, w) for each edge (v, w) ∈ E, and the goal is to compute an ordering v_1, v_2, . . . , v_n of the vertices V that minimizes the length ∑_{i=1}^{n} c(v_i, v_{i+1}) of the corresponding tour (with v_{n+1} interpreted as v_1).)
Worst-case analysis is a specific modeling choice in the analysis of algorithms, where the performance profile {Perf(A, z)}_{z ∈ Π} of an algorithm is summarized by its worst performance on any input of a given size (i.e., min_{z : |z| = n} Perf(A, z) or max_{z : |z| = n} Perf(A, z), depending on the measure, where |z| denotes the size of the input z). The "better" algorithm is then the one with superior worst-case performance. MergeSort, with its worst-case asymptotic running time of Θ(n log n) for length-n arrays, is better in this sense than InsertionSort, which has a worst-case running time of Θ(n²).

While crude, worst-case analysis can be tremendously useful and, for several reasons, it has been the dominant paradigm for algorithm analysis in theoretical computer science.

1. A good worst-case guarantee is the best-case scenario for an algorithm, certifying its general-purpose utility and absolving its users from understanding which inputs are most relevant to their applications. Thus worst-case analysis is particularly well suited for "general-purpose" algorithms that are expected to work well across a range of application domains (like the default sorting routine of a programming language).

2. Worst-case analysis is often more analytically tractable to carry out than its alternatives, such as average-case analysis with respect to a probability distribution over inputs.

3. For a remarkable number of fundamental computational problems, there are algorithms with excellent worst-case performance guarantees. For example, the lion's share of an undergraduate algorithms course comprises algorithms that run in linear or near-linear time in the worst case.

Before critiquing the worst-case analysis approach, it's worth taking a step back to clarify why we want rigorous methods to reason about algorithm performance. There are at least three possible goals:
1. Performance prediction.
The first goal is to explain or predict the empirical performance of algorithms. In some cases, the analyst acts as a natural scientist, taking an observed phenomenon like "the simplex method for linear programming is fast" as ground truth, and seeking a transparent mathematical model that explains it. In others, the analyst plays the role of an engineer, seeking a theory that gives accurate advice about whether or not an algorithm will perform well in an application of interest.
2. Identify optimal algorithms.
The second goal is to rank different algorithms according to their performance, and ideally to single out one algorithm as "optimal." At the very least, given two algorithms A and B for the same problem, a method for algorithmic analysis should offer an opinion about which one is "better." (Worst-case analysis is also the dominant paradigm in complexity theory, where it has led to the development of NP-completeness and many other fundamental concepts.)

3. Develop new algorithms.
The third goal is to provide a well-defined framework in which to brainstorm new algorithms. Once a measure of algorithm performance has been declared, the Pavlovian response of most computer scientists is to seek out new algorithms that improve on the state-of-the-art with respect to this measure. The focusing effect catalyzed by such yardsticks should not be underestimated.

When proving or interpreting results in algorithm design and analysis, it's important to be clear in one's mind about which of these goals the work is trying to achieve.

What's the report card for worst-case analysis with respect to these three goals?

1. Worst-case analysis gives an accurate performance prediction only for algorithms that exhibit little variation in performance across inputs of a given size. This is the case for many of the greatest hits of algorithms covered in an undergraduate course, including the running times of near-linear-time algorithms and of many canonical dynamic programming algorithms. For many more complex problems, however, the predictions of worst-case analysis are overly pessimistic (see Section 2).

2. For the second goal, worst-case analysis earns a middling grade—it gives good advice about which algorithm to use for some important problems (like many of those in an undergraduate course) and bad advice for others (see Section 2).

3. Worst-case analysis has served as a tremendously useful brainstorming organizer. For over a half-century, researchers striving to optimize worst-case algorithm performance have been led to thousands of new algorithms, many of them practically useful.
For many problems a bit beyond the scope of an undergraduate course, the downside of worst-case analysis rears its ugly head. This section reviews four famous examples where worst-case analysis gives misleading or useless advice about how to solve a problem. These examples motivate the alternatives to worst-case analysis that are surveyed in Section 4 and described in detail in later chapters of the book.
Perhaps the most famous failure of worst-case analysis concerns linear programming, the problem of optimizing a linear function subject to linear constraints (Figure 1). Dantzig proposed in the 1940s an algorithm for solving linear programs called the simplex method. The simplex method solves linear programs using greedy local search on the vertices of the solution set boundary, and variants of it remain in wide use to this day. The enduring appeal of the simplex method stems from its consistently superb performance in practice. Its running time typically scales modestly with the input size, and it routinely solves linear programs with millions of decision variables and constraints. This robust empirical performance suggested that the simplex method might well solve every linear program in a polynomial amount of time.

Klee and Minty (1972) showed by example that there are contrived linear programs that force the simplex method to run in time exponential in the number of decision variables (for all of the common "pivot rules" for choosing the next vertex). This illustrates the first potential pitfall of worst-case analysis: overly pessimistic performance predictions that cannot be taken at face value. The running time of the simplex method is polynomial for all practical purposes, despite the exponential prediction of worst-case analysis.

[Figure 1: A two-dimensional linear programming problem.]

To add insult to injury, the first worst-case polynomial-time algorithm for linear programming, the ellipsoid method, is not competitive with the simplex method in practice. (Interior-point methods, developed five years later, lead to algorithms that both run in worst-case polynomial time and are competitive with the simplex method in practice.) Taken at face value, worst-case analysis recommends the ellipsoid method over the empirically superior simplex method. One framework for narrowing the gap between these theoretical predictions and empirical observations is smoothed analysis, the subject of Part IV of this book; see Section 4.4 for an overview.
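To make the linear programming example concrete, the following sketch (an illustration in Python with SciPy, not taken from the chapter) builds and solves a small member of the contrived family behind the Klee-Minty result; the particular cube formulation used here is one standard textbook variant and should be treated as an assumption. Modern solvers such as the HiGHS backend invoked below dispatch these instances quickly, so the sketch only illustrates what a linear program and the worst-case family look like, not slow simplex behavior.

```python
# Illustrative sketch (not from the chapter): a small linear program from one
# standard formulation of the Klee-Minty cube, solved with SciPy's linprog.
# The formulation below is an assumption chosen for illustration:
#   maximize   sum_j 2^(n-j) * x_j
#   subject to 2 * sum_{j<i} 2^(i-j) * x_j + x_i <= 5^i   (i = 1, ..., n),  x >= 0.
import numpy as np
from scipy.optimize import linprog


def klee_minty(n):
    c = np.array([2.0 ** (n - j) for j in range(1, n + 1)])
    A = np.zeros((n, n))
    b = np.array([5.0 ** i for i in range(1, n + 1)])
    for i in range(1, n + 1):
        for j in range(1, i):
            A[i - 1, j - 1] = 2.0 ** (i - j + 1)
        A[i - 1, i - 1] = 1.0
    return c, A, b


n = 6
c, A, b = klee_minty(n)
# linprog minimizes, so negate the objective; variables are nonnegative by default.
res = linprog(-c, A_ub=A, b_ub=b, method="highs")
print("optimal value:", -res.fun)           # 5^n for this formulation
print("optimal vertex:", np.round(res.x))   # (0, ..., 0, 5^n)
```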
NP-Hard Optimization Problems
Clustering is a form of unsupervised learning (finding patterns in unlabeled data), where the informal goal is to partition a set of points into "coherent groups" (Figure 2). One popular way to coax this goal into a well-defined computational problem is to posit a numerical objective function over clusterings of the point set, and then seek the clustering with the best objective function value. For example, the goal could be to choose k cluster centers to minimize the sum of the distances between points and their nearest centers (the k-median objective) or the sum of the squared such distances (the k-means objective). Almost all natural optimization problems that are defined over clusterings are NP-hard. (Recall that a polynomial-time algorithm for an NP-hard problem would yield a polynomial-time algorithm for every problem in NP—for every problem with efficiently verifiable solutions. Assuming the widely-believed P ≠ NP conjecture, every algorithm for an NP-hard problem either returns an incorrect answer for some inputs or runs in super-polynomial time for some inputs (or both).)

In practice, clustering is not viewed as a particularly difficult problem. Lightweight clustering algorithms, like Lloyd's algorithm for k-means and its variants, regularly return the intuitively "correct" clusterings of real-world point sets.
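To fix ideas, here is a minimal sketch of Lloyd's algorithm for the k-means objective (an illustration added for concreteness, not code from the chapter); it alternates between assigning each point to its nearest center and recentering, and the toy data set at the end is invented for the example.

```python
# Minimal sketch of Lloyd's algorithm for the k-means objective (illustrative only).
import numpy as np


def lloyd_kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the nearest current center for every point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # a local optimum of the k-means objective
        centers = new_centers
    return centers, labels


# Toy data with an intuitively obvious 2-clustering (made up for the example).
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
centers, labels = lloyd_kmeans(pts, k=2)
```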
How can we reconcile the worst-case intractability of clustering problems with the empirical success of relatively simple algorithms? One possible explanation is that clustering is hard only when it doesn't matter. For example, if the difficult instances of an NP-hard clustering problem look like a bunch of random unstructured points, who cares? The common use case for a clustering algorithm is for points that represent objects for which a "meaningful clustering" is expected to exist. (More generally, optimization problems are more likely to be NP-hard than polynomial-time solvable. In many cases, even computing an approximately optimal solution is an NP-hard problem. Whenever an efficient algorithm for such a problem performs better on real-world instances than (worst-case) complexity theory would suggest, there's an opportunity for a refined and more accurate theoretical analysis.)

The unreasonable effectiveness of modern machine learning algorithms has thrown the gauntlet down to researchers in algorithm analysis, and there is perhaps no other problem domain that calls out as loudly for a "beyond worst-case" approach.

To illustrate some of the challenges, consider a canonical supervised learning problem, where a learning algorithm is given a data set of object-label pairs and the goal is to produce a classifier that accurately predicts the label of as-yet-unseen objects (e.g., whether or not an image contains a cat). Over the past decade, aided by massive data sets and computational power, neural networks have achieved impressive levels of performance across a range of prediction tasks. Their empirical success flies in the face of conventional wisdom in multiple ways. First, there is a computational mystery: Neural network training usually boils down to fitting parameters (weights and biases) to minimize a nonconvex loss function, for example to minimize the number of classification errors the model makes on the training set. In the past such problems were written off as computationally intractable, but first-order methods (i.e., variants of gradient descent) often converge quickly to a local optimum or even to a global optimum. Why?

Second, there is a statistical mystery: Modern neural networks are typically over-parameterized, meaning that the number of parameters to fit is considerably larger than the size of the training data set. Over-parameterized models are vulnerable to large generalization error (i.e., overfitting), since they can effectively memorize the training data without learning anything that helps classify as-yet-unseen data points. Nevertheless, state-of-the-art neural networks generalize shockingly well—why? The answer likely hinges on special properties of both real-world data sets and the optimization algorithms used for neural network training (principally stochastic gradient descent). Part V of this book covers the state-of-the-art explanations of these and other mysteries in the empirical performance of machine learning algorithms.

The beyond worst-case viewpoint can also contribute to machine learning by "stress-testing" the existing theory and providing a road map for more robust guarantees.
While work in beyond worst-case analysis makes strong assumptions relative to the norm in theoretical computer science, these assumptions are usually weaker than the norm in statistical machine learning. Research in the latter field often resembles average-case analysis, for example when data points are modeled as independent and identically distributed samples from some underlying structured distribution. The semi-random models described in Parts III and IV of this book serve as role models for blending adversarial and average-case modeling to encourage the design of algorithms with robustly good performance.
Online algorithms are algorithms that must process their input as it arrives over time. For example, consider the online paging problem, where there is a system with a small fast memory (the cache) and a big slow memory. Data is organized into blocks called pages, with up to k different pages fitting in the cache at once. A page request results in either a cache hit (if the page is already in the cache) or a cache miss (if not). On a cache miss, the requested page must be brought into the cache. If the cache is already full, then some page in it must be evicted. A cache replacement policy is an algorithm for making these eviction decisions. Any systems textbook will recommend aspiring to the least recently used (LRU) policy, which evicts the page whose most recent reference is furthest in the past. The same textbook will explain why: Real-world page request sequences tend to exhibit locality of reference, meaning that recently requested pages are likely to be requested again soon. The LRU policy uses the recent past as a prediction for the near future. Empirically, it typically suffers fewer cache misses than competing policies like first-in first-out (FIFO).

Worst-case analysis, straightforwardly applied, provides no useful insights about the performance of different cache replacement policies. For every deterministic policy and cache size k, there is a pathological page request sequence that triggers a page fault rate of 100%, even though the optimal clairvoyant replacement policy (known as Bélády's furthest-in-the-future algorithm) would have a page fault rate of at most 1/k (Exercise 1). This observation is troublesome both for its absurdly pessimistic performance prediction and for its failure to differentiate between competing replacement policies (like LRU vs. FIFO). One solution, described in Section 3, is to choose an appropriately fine-grained parameterization of the input space and to assess and compare algorithms using parameterized guarantees.
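For concreteness, the following sketch (an illustration, not part of the chapter) simulates the fault rates of the LRU and FIFO policies on a request sequence; the toy sequence, which mixes a small hot set of pages with occasional cold pages, is made up to mimic locality of reference.

```python
# Illustrative simulation of LRU and FIFO fault rates on a toy request sequence.
import random
from collections import OrderedDict, deque


def fault_rate(requests, k, policy="LRU"):
    cache = OrderedDict()   # keys ordered from least to most recently used (for LRU)
    fifo = deque()          # arrival order of cached pages (for FIFO)
    faults = 0
    for page in requests:
        if page in cache:
            if policy == "LRU":
                cache.move_to_end(page)              # refresh recency on a hit
            continue
        faults += 1
        if len(cache) == k:                          # cache full: evict one page
            victim = next(iter(cache)) if policy == "LRU" else fifo.popleft()
            del cache[victim]
        cache[page] = True
        if policy == "FIFO":
            fifo.append(page)
    return faults / len(requests)


rng = random.Random(0)
# 90% of requests hit a small hot set (locality of reference); 10% are cold pages.
requests = [rng.randrange(8) if rng.random() < 0.9 else rng.randrange(8, 100)
            for _ in range(10_000)]
for policy in ("LRU", "FIFO"):
    print(policy, round(fault_rate(requests, k=10, policy=policy), 3))
```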
We should celebrate the fact that worst-case analysis works so well for so many fundamental computational problems, while at the same time recognizing that the cherry-picked successes highlighted in undergraduate algorithms can paint a potentially misleading picture about the range of its practical relevance. The preceding four examples highlight the chief weaknesses of the worst-case analysis framework.

1. Overly pessimistic performance predictions.
By design, worst-case analysis gives a pessimistic estimate of an algorithm's empirical performance. In the preceding four examples, the gap between the two is embarrassingly large.
2. Can rank algorithms inaccurately.
Overly pessimistic performance summaries can derail worst-case analysis from identifying the right algorithm to use in practice. In the online paging problem, it cannot distinguish between the FIFO and LRU policies; for linear programming, it implicitly suggests that the ellipsoid method is superior to the simplex method.
3. No data model.
If worst-case analysis has an implicit model of data, then it's the "Murphy's Law" data model, where the instance to be solved is an adversarially selected function of the chosen algorithm. (Murphy's Law: If anything can go wrong, it will.) Outside of security applications, this algorithm-dependent model of data is a rather paranoid and incoherent way to think about a computational problem.

In many applications, the algorithm of choice is superior precisely because of properties of data in the application domain, like meaningful solutions in clustering problems or locality of reference in online paging. Pure worst-case analysis provides no language for articulating such domain-specific properties of data. In this sense, the strength of worst-case analysis is also its weakness.

These drawbacks show the importance of alternatives to worst-case analysis, in the form of models that articulate properties of "relevant" inputs and algorithms that possess rigorous and meaningful algorithmic guarantees for inputs with these properties. Research in "beyond worst-case analysis" is a conversation between models and algorithms, with each informing the development of the other. It has both a scientific dimension, where the goal is to formulate transparent mathematical models that explain empirically observed phenomena about algorithm performance, and an engineering dimension, where the goals are to provide accurate guidance about which algorithm to use for a problem and to design new algorithms that perform particularly well on the relevant inputs.

Concretely, what might a result that goes "beyond worst-case analysis" look like? The next section covers in detail an exemplary result by Albers et al. (2005) for the online paging problem introduced in Section 2.4. The rest of the book offers dozens of further examples.
Returning to the online paging example in Section 2.4, perhaps we shouldn't be surprised that worst-case analysis fails to advocate LRU over FIFO. The empirical superiority of LRU is due to the special structure in real-world page request sequences (locality of reference), which is outside the language of pure worst-case analysis.

The key idea for obtaining meaningful performance guarantees for and comparisons between online paging algorithms is to parameterize page request sequences according to how much locality of reference they exhibit, and then prove parameterized worst-case guarantees. Refining worst-case analysis in this way leads to dramatically more informative results. Part I of the book describes many other applications of such fine-grained input parameterizations; see Section 4.1 for an overview.

How should we measure locality in a page request sequence? One tried and true method is the working set model, which is parameterized by a function f from the positive integers N to N that describes how many different page requests are possible in a window of a given length. Formally, we say that a page sequence σ conforms to f if for every positive integer n and every set of n consecutive page requests in σ, there are requests for at most f(n) distinct pages. For example, the identity function f(n) = n imposes no restrictions on the page request sequence. A sequence can only conform to a sublinear function like f(n) = ⌈√n⌉ or f(n) = ⌈1 + log₂ n⌉ if it exhibits locality of reference. (The notation ⌈x⌉ means the number x, rounded up to the nearest integer.) We can assume without loss of generality that f(1) = 1, f(2) = 2, and f(n + 1) ∈ {f(n), f(n) + 1} for all n (Exercise 2).

[Figure 3: An approximately concave function f, with values f(1), f(2), . . . = 1, 2, 3, 3, 4, 4, 4, 5, . . ., so that m_1 = 1, m_2 = 1, m_3 = 2, m_4 = 3, . . .]

We adopt as our performance measure Perf(A, z) the fault rate of an online algorithm A on the page request sequence z—the fraction of requests in z on which A suffers a page fault. We next state a performance guarantee for the fault rate of the LRU policy with a cache size of k that is parameterized by a number α_f(k) ∈ [0, 1]. The parameter α_f(k) is defined below in (1); intuitively, it will be close to 0 for slow-growing functions f (i.e., functions that impose strong locality of reference) and close to 1 for functions f that grow quickly (e.g., near-linearly). This performance guarantee requires that the function f is approximately concave in the sense that the number m_y of inputs with value y under f (that is, |f^{-1}(y)|) is nondecreasing in y (Figure 3).

Theorem 3.1 (Albers et al. (2005)). With α_f(k) defined as in (1) below:

(a) For every approximately concave function f, cache size k ≥ 2, and deterministic cache replacement policy, there are arbitrarily long page request sequences conforming to f for which the policy's page fault rate is at least α_f(k).

(b) For every approximately concave function f, cache size k ≥ 2, and page request sequence that conforms to f, the page fault rate of the LRU policy is at most α_f(k) plus an additive term that goes to 0 with the sequence length.

(c) There exists a choice of an approximately concave function f, a cache size k ≥ 2, and an arbitrarily long page request sequence that conforms to f, such that the page fault rate of the FIFO policy is bounded away from α_f(k).

Parts (a) and (b) prove the worst-case optimality of the LRU policy in a strong and fine-grained sense, f-by-f and k-by-k.
Part (c) differentiates LRU from FIFO, as the latter is suboptimal for some (in fact, many) choices of f and k.

The guarantees in Theorem 3.1 are so good that they are meaningful even when taken at face value—for strongly sublinear f's, α_f(k) goes to 0 reasonably quickly with k. The precise definition of α_f(k) for k ≥ 2 is

    α_f(k) = (k − 1) / (f^{-1}(k + 1) − 2),     (1)

where we abuse notation and interpret f^{-1}(y) as the smallest value of x such that f(x) = y. That is, f^{-1}(y) denotes the smallest window length in which page requests for y distinct pages might appear. As expected, for the function f(n) = n we have α_f(k) = 1 for all k. (With no restriction on the input sequence, an adversary can force a 100% fault rate.) If f(n) = ⌈√n⌉, however, then α_f(k) scales with 1/√k. Thus with a cache size of 10,000, the page fault rate is always at most 1%. If f(n) = ⌈1 + log₂ n⌉, then α_f(k) goes to 0 even faster with k, roughly as k/2^k.
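In code, the working-set condition and the parameter α_f(k) of definition (1) look as follows; this sketch is an added illustration of the definitions (with the example function f(n) = ⌈√n⌉ and a tiny request sequence chosen arbitrarily), not part of the original analysis.

```python
# Illustrative sketch of the working-set model and of definition (1).
import math


def f_sqrt(n):
    # Example approximately concave function f(n) = ceil(sqrt(n)).
    return math.ceil(math.sqrt(n))


def conforms(seq, f):
    """True if every window of n consecutive requests contains at most f(n) distinct pages."""
    for n in range(1, len(seq) + 1):
        for start in range(len(seq) - n + 1):
            if len(set(seq[start:start + n])) > f(n):
                return False
    return True


def f_inverse(f, y):
    """Smallest window length x with f(x) = y (assumes f(n+1) is f(n) or f(n)+1)."""
    x = 1
    while f(x) != y:
        x += 1
    return x


def alpha(f, k):
    # Definition (1): the parameterized fault-rate bound for cache size k.
    return (k - 1) / (f_inverse(f, k + 1) - 2)


print(alpha(f_sqrt, 4))                            # bound on LRU's fault rate for k = 4
print(conforms([1, 1, 2, 2, 1, 1, 2, 2], f_sqrt))  # a sequence with strong locality
```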
This section proves the first two parts of Theorem 3.1; part (c) is left as Exercise 4.

Part (a). To prove the lower bound in part (a), fix an approximately concave function f and a cache size k ≥ 2.
Fix a deterministic cache replacement policy A. We construct a page sequence σ that uses only k + 1 distinct pages, so at any given time step there is exactly one page missing from the algorithm's cache. (Assume that the algorithm begins with the first k pages in its cache.) The sequence comprises k − 1 blocks, where the j-th block consists of m_{j+1} consecutive requests for the same page p_j, where p_j is the unique page missing from the algorithm A's cache at the start of the block. (Recall that m_y is the number of values of x such that f(x) = y.) This sequence conforms to f (Exercise 3).

By the choice of the p_j's, A incurs a page fault on the first request of a block, and not on any of the other (duplicate) requests of that block. Thus, algorithm A suffers exactly k − 1 page faults. The length of the page request sequence is m_2 + m_3 + · · · + m_k. Because m_1 = 1, this sum equals (∑_{j=1}^{k} m_j) − 1 which, using the definition of the m_j's, equals (f^{-1}(k + 1) − 1) − 1 = f^{-1}(k + 1) − 2.
The algorithm's page fault rate on this sequence matches the definition (1) of α_f(k), as required. More generally, repeating the construction over and over again produces arbitrarily long page request sequences for which the algorithm has page fault rate α_f(k).
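The construction can also be made concrete. The sketch below (illustrative only) instantiates the adversarial sequence against the LRU policy, though any deterministic policy could be plugged in, using the same example function f(n) = ⌈√n⌉ as above; by the argument just given, it produces k − 1 faults on a sequence of length f^{-1}(k + 1) − 2.

```python
# Illustrative sketch of the part (a) construction, instantiated against LRU.
import math
from collections import OrderedDict


def f_sqrt(n):
    return math.ceil(math.sqrt(n))


def m(f, y):
    """The quantity m_y: the number of window lengths x with f(x) = y."""
    count, x = 0, 1
    while f(x) <= y:
        count += (f(x) == y)
        x += 1
    return count


def adversarial_sequence(f, k):
    """k-1 blocks; block j requests, m_{j+1} times, the page missing from LRU's cache."""
    cache = OrderedDict((page, True) for page in range(k))  # pages 0..k-1 start in cache
    pages = set(range(k + 1))                               # only k+1 distinct pages are used
    seq = []
    for j in range(1, k):
        missing = (pages - set(cache)).pop()                # the unique page not in the cache
        for _ in range(m(f, j + 1)):
            seq.append(missing)
            if missing in cache:
                cache.move_to_end(missing)                  # hit: refresh recency
            else:
                cache.popitem(last=False)                   # fault: evict least recently used
                cache[missing] = True
    return seq


k = 5
seq = adversarial_sequence(f_sqrt, k)
print(len(seq), "requests,", k - 1, "faults")  # length f_inverse(k+1) - 2, fault rate alpha_f(k)
```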
Part (b). To prove a matching upper bound for the LRU policy, fix an approximately concave function f, a cache size k ≥ 2, and a sequence σ that conforms to f.
Our fault rate target α_f(k) is a major clue to the proof (recall (1)): we should be looking to partition the sequence σ into blocks of length at least f^{-1}(k + 1) − 2, with at most k − 1 faults in each. So consider the groups of k − 1 consecutive faults made by the LRU policy on σ. Each such group defines a block, beginning with the first fault of the group, and ending with the page request that immediately precedes the beginning of the next group of faults (see Figure 4).

Claim:
Consider a block other than the first or last. Consider the page requests in this block, together with the requests immediately before and after this block. These requests are for at least k + 1 distinct pages.

The claim immediately implies that every block contains at least f^{-1}(k + 1) − 2 requests: the block together with the two adjacent requests forms a window with k + 1 distinct pages, and because σ conforms to f, such a window has length at least f^{-1}(k + 1). Since each block has at most k − 1 faults, the page fault rate is at most α_f(k) (ignoring the vanishing additive error due to the first and last blocks), proving Theorem 3.1(b).

We proceed to the proof of the claim. Note that, in light of Theorem 3.1(c), it is essential that the proof uses properties of the LRU policy not shared by FIFO. Fix a block other than the first or last, and let p be the page requested immediately prior to this block. This request could have been a page fault, or not (cf., Figure 4). In any case, p is in the cache when this block begins. Consider the k − 1 faults contained in the block, together with the k-th fault that occurs immediately after the block. We consider three cases.

First, if the k faults occurred on distinct pages that are all different from p, we have identified our k + 1 distinct requests (p and the k faults). For the second case, suppose that two of the k faults were for the same page q ≠ p. How could this have happened? The page q was brought into the cache after the first fault on q, and was not evicted until there were k requests for distinct pages other than q after this page fault. This gives k + 1 distinct page requests (q and the k other distinct requests between the two faults on q). Finally, suppose that one of these k faults was on the page p. Because p was requested just before the first of these faults, the LRU algorithm, subsequent to this request and prior to evicting p, must have received requests for k distinct pages other than p. These requests, together with that for p, give the desired k + 1 distinct page requests. (The first two arguments apply also to the FIFO policy, but the third does not. Suppose p was already in the cache when it was requested just prior to the block. Under FIFO, this request does not "reset p's clock"; if it was originally brought into the cache long ago, FIFO might well evict p on the block's very first fault.)

Theorem 3.1 is an example of a "parameterized analysis" of an algorithm, where the performance guarantee is expressed as a function of parameters of the input other than its size. A parameter like α_f(k) measures the "easiness" of an input, much like matrix condition numbers in linear algebra. We will see many more examples of parameterized analyses later in the book.

There are several reasons to aspire toward parameterized performance guarantees.

1. A parameterized guarantee is a mathematically stronger statement, containing strictly more information about an algorithm's performance than a worst-case guarantee parameterized solely by the input size.

2. A parameterized analysis can explain why an algorithm has good "real-world" performance even when its worst-case performance is poor. The approach is to first show that the algorithm performs well for "easy" values of the parameter (e.g., for f and k such that α_f(k) is close to 0), and then make a case that "real-world" instances are "easy" in this sense (e.g., have enough locality of reference to conform to a function f with a small value of α_f(k)). The latter argument can be made empirically (e.g., by computing the parameter on representative benchmarks) or mathematically (e.g., by positing a generative model and proving that it typically generates "easy" inputs). Results in smoothed analysis (see Section 4.4 and Part IV) typically follow this two-step approach.

3. A parameterized performance guarantee suggests when—for which inputs, and which application domains—a given algorithm should be used. (Namely, on the inputs where the performance of the algorithm is good!)
Such advice is useful to someone who has no time or interest in developing their own algorithm from scratch, and merely wishes to be an educated client of existing algorithms. (For a familiar example, parameterizing the running time of graph algorithms by both the number of vertices and the number of edges provides guidance about which algorithms should be used for sparse graphs and which ones for dense graphs.)
4. Fine-grained performance characterizations can differentiate algorithms when worst-case analysis cannot (as with LRU vs. FIFO).

5. Formulating a good parameter often forces the analyst to articulate a form of structure in data, like the "amount of locality" in a page request sequence. Ideas for new algorithms that explicitly exploit such structure often follow soon thereafter. (The parameter α_f(k) showed up only in our analysis of the LRU policy; in other applications, the chosen parameter also guides the design of algorithms for the problem.)

Useful parameters come in several flavors. The parameter α_f(k) in Theorem 3.1 is derived directly from the input to the problem, and later chapters contain many more examples of such input-based parameters. It is also common to parameterize algorithm performance by properties of
an optimal solution. In parameterized algorithms (Chapter 2), the most well-studied such parameter is the size of an optimal solution. Another solution-based parameterization, popular in machine learning applications, is by the "margin," meaning the extent to which the optimal solution is particularly "pronounced"; see Exercise 7 for the canonical example of the analysis of the perceptron algorithm.

"Input size" is well defined for every computational problem, and this is one of the reasons why performance guarantees parameterized by input size are so ubiquitous. By contrast, the parameter α_f(k) used in Theorem 3.1 is specifically tailored to the online paging problem; in exchange, the performance guarantee is unusually accurate and meaningful. Alas, there are no silver bullets in parameterized analysis, or in algorithm analysis more generally, and the most enlightening analysis approach is often problem-specific. Worst-case analysis can inform the choice of an appropriate analysis framework for a problem by highlighting the problem's most difficult (and often unrealistic) instances.

This book has six parts, four on "core theory" and two on "applications." Each of the following sections summarizes the chapters in one of the parts.
Part I of the book hews closest to traditional worst-case analysis. No assumptions are imposed on the input; as in worst-case analysis, there is no commitment to a "model of data." The innovative ideas in these chapters concern novel and problem-specific ways of expressing algorithm performance. Our online paging example (Section 3) falls squarely in this domain.

Chapter 2, by Fomin, Lokshtanov, Saurabh, and Zehavi, provides an overview of the relatively mature field of parameterized algorithms. The goal here is to understand how the running time of algorithms and the complexity of computational problems depend on parameters other than the input size. For example, for which
NP-hard problems Π and parameters k is Π "fixed-parameter tractable" with respect to k, meaning solvable in time f(k) · n^{O(1)} for some function f that is independent of the input size n? The field has developed a number of powerful approaches to designing fixed-parameter tractable algorithms, as well as lower bound techniques for ruling out the existence of such algorithms (under appropriate complexity assumptions).

Chapter 3, by Barbay, searches for instance-optimal algorithms—algorithms that for every input perform better than every other algorithm (up to a constant factor). Such an input-by-input guarantee is essentially the strongest notion of optimality one could hope for. Remarkably, there are several fundamental problems, for example in low-dimensional computational geometry, that admit an instance-optimal algorithm. Proofs of instance optimality involve input-by-input matching upper and lower bounds, and this typically requires a very fine-grained parameterization of the input space.

Chapter 4, by Roughgarden, concerns resource augmentation. This concept makes sense for problems that have a natural notion of a "resource," with the performance of an algorithm improving as it is given more resources. Examples include the size of a cache (with larger caches leading to fewer faults), the capacity of a network (with higher-capacity networks leading to less congestion), and the speed of a processor (with faster processors leading to earlier job completion times). A resource augmentation guarantee then states that the performance of an algorithm of interest is always close to that achieved by an all-powerful algorithm that is handicapped by slightly less resources.

4.2 Deterministic Models of Data

Part II of the book proposes deterministic models of data for several
NP-hard clustering and sparse recovery problems, which effectively posit conditions that are conceivably satisfied by "real-world" inputs. This work fits into the long-standing tradition of identifying "islands of tractability," meaning polynomial-time solvable special cases of
NP-hard problems. 20th-century research on tractable special cases focused primarily on syntactic and easily-checked restrictions (e.g., graph planarity or Horn satisfiability). The chapters in Part II and some of the related application chapters consider conditions that are not necessarily easy to check, but for which there is a plausible narrative about why "real-world instances" might satisfy them, at least approximately.

Chapter 5, by Makarychev and Makarychev, studies perturbation stability in several different computational problems. A perturbation-stable instance satisfies a property that is effectively a uniqueness condition on steroids, stating that the optimal solution remains invariant to sufficiently small perturbations of the numbers in the input. The larger the perturbations that are tolerated, the stronger the condition on the instance and the easier the computational problem. Many problems have "stability thresholds," an allowable perturbation size at which the complexity of the problem switches suddenly from
NP-hard to polynomial-time solvable. To the extent that we're comfortable identifying "instances with a meaningful clustering" with perturbation-stable instances, the positive results in this chapter give a precise sense in which clustering is hard only when it doesn't matter (cf., Section 2.2). As a bonus, many of these positive results are achieved by algorithms that resemble popular approaches in practice, like single-linkage clustering and local search.

Chapter 6, by Blum, proposes an alternative condition called approximation stability, stating that every solution with a near-optimal objective function value closely resembles the optimal solution. That is, any solution that is structurally different from the optimal solution has significantly worse objective function value. This condition is particularly appropriate for problems like clustering, where the objective function is only a means to an end and the real goal is to recover some type of "ground-truth" clustering. This chapter demonstrates that many
NP-hard problems become provably easier for approximation-stable instances.

Chapter 7, by Price, provides a glimpse of the vast literature on sparse recovery, where the goal is to reverse engineer a "sparse" object from a small number of clues about it. This area is more strongly associated with applied mathematics than with theoretical computer science and algorithms, but there are compelling parallels between it and the topics of the preceding two chapters. For example, consider the canonical problem in compressive sensing, where the goal is to recover an unknown sparse signal z (a vector of length n) from a small number m of linear measurements of it. If z can be arbitrary, then the problem is hopeless unless m = n. But many real-world signals have most of their mass concentrated on k coordinates for small k (and an appropriate basis), and the results surveyed in this chapter show that, for such "natural" signals, the problem can be solved efficiently even when m is only modestly bigger than k (and much smaller than n).

Part III of the book is about semi-random models—hybrids of worst- and average-case analysis in which nature and an adversary collaborate to produce an instance of a problem. For many problems, such hybrid frameworks are a "sweet spot" for algorithm analysis, with the worst-case dimension encouraging the design of robustly good algorithms and the average-case dimension allowing for strong provable guarantees.

Chapter 8, by Roughgarden, sets the stage with a review of pure average-case or distributional analysis, along with some of its killer applications and biggest weaknesses. Work in this area adopts a specific probability distribution over the inputs of a problem, and analyzes the expectation (or some other statistic) of the performance of an algorithm with respect to this distribution. One use of distributional analysis is to show that a general-purpose algorithm has good performance on non-pathological inputs (e.g., deterministic QuickSort on randomly ordered arrays). One key drawback of distributional analysis is that it can encourage the design of algorithms that are brittle and overly tailored to the assumed input distribution. The semi-random models of the subsequent chapters are designed to ameliorate this issue.

Chapter 9, by Feige, introduces several planted models and their semi-random counterparts. For example, in the planted clique problem, a clique of size k is planted in an otherwise uniformly random graph. How large does k need to be, as a function of the number of vertices, before the planted clique can be recovered in polynomial time (with high probability)? In a semi-random version of a planted model, an adversary can modify the random input in a restricted way. For example, in the clique problem, an adversary might be allowed to remove edges not in the clique; such changes intuitively make the planted clique only "more obviously optimal," but nevertheless can foil overly simplistic algorithms. One rule of thumb that emerges from this line of work, and also recurs in the next chapter, is that spectral algorithms tend to work well for planted models but the heavier machinery of semidefinite programming seems required for their semi-random counterparts. This chapter also investigates random and semi-random models for Boolean formulas, including refutation algorithms that certify that a given input formula is not satisfiable.

Chapter 10, by Moitra, drills down on a specific and extensively-studied planted model, the stochastic block model.
The vertices of a graph are partitioned into groups, and each potential edge of the graph is present independently with a probability that depends only on the groups that contain its endpoints. The algorithmic goal is to recover the groups from the (unlabeled) graph. One important special case is the planted bisection problem, where the vertices are split into two equal-size sets A and B and each edge is present independently with probability p (if both endpoints are in the same group) or q < p (otherwise). How big does the gap p − q need to be before the planted bisection (A, B) can be recovered, either statistically (i.e., with unbounded computational power) or with a polynomial-time algorithm? When p and q are sufficiently small, the relevant goal becomes partial recovery, meaning a proposed classification of the vertices with accuracy better than random guessing. In the semi-random version of the model, an adversary can remove edges crossing the bisection and add edges internal to each of the groups. For partial recovery, this semi-random version is provably more difficult than the original model.

Chapter 11, by Gupta and Singla, describes results for a number of online algorithms in random-order models. These are semi-random models in which an adversary decides on an input, and nature then presents this input to an online algorithm, one piece at a time and in random order. The canonical example here is the secretary problem, where an arbitrary finite set of numbers is presented to an algorithm in random order, and the goal is to design a stopping rule with the maximum-possible probability of stopping on the largest number of the sequence. Analogous random-order models have proved useful for overcoming worst-case lower bounds for the online versions of a number of combinatorial optimization problems, including bin packing, facility location, and network design.

Chapter 12, by Seshadhri, is a survey of the field of self-improving algorithms. The goal here is to design an algorithm that, when presented with a sequence of independent samples drawn from an unknown input distribution, quickly converges to the optimal algorithm for that distribution. For example, for many distributions over length-n arrays, there are sorting algorithms that make less than Θ(n log n) comparisons on average. Could there be a "master algorithm" that replicates the performance of a distribution-optimal sorter from only a limited number of samples from the distribution? This chapter gives a positive answer under the assumption that array entries are drawn independently (from possibly different distributions), along with analogous positive results for several fundamental problems in low-dimensional computational geometry.

Part IV of the book focuses on the semi-random models studied in smoothed analysis. In smoothed analysis, an adversary chooses an arbitrary input, and this input is then perturbed slightly by nature. The performance of an algorithm is then assessed by its worst-case expected performance, where the worst case is over the adversary's input choice and the expectation is over the random perturbation. This analysis framework can be applied to any problem where "small random perturbations" make sense, including most problems with real-valued inputs. It can be applied to any measure of algorithm performance, but has proven most effective for running time analyses of algorithms that seem to run in super-polynomial time only on highly contrived inputs (like the simplex method).
As with other semi-random models, smoothed analysis has the benefit of potentially escaping worst-case inputs, especially if they are "isolated" in the input space, while avoiding overfitting a solution to a specific distributional assumption. There is also a plausible narrative about why "real-world" inputs are captured by this framework: Whatever problem you'd like to solve, there are inevitable inaccuracies in its formulation from measurement errors, uncertainty, and so on.

Chapter 13, by Manthey, details several applications of smoothed analysis to the analysis of local search algorithms for combinatorial optimization problems. For example, the 2-opt heuristic for the Traveling Salesman Problem is a local search algorithm that begins with an arbitrary tour and repeatedly improves the current solution using local moves that swap one pair of edges for another. In practice, local search algorithms like the 2-opt heuristic almost always converge to a locally optimal solution in a small number of steps. Delicate constructions show that the 2-opt heuristic, and many other local search algorithms, require an exponential number of steps to converge in the worst case. The results in this chapter use smoothed analysis to narrow the gap between worst-case analysis and empirically observed performance, establishing that many local search algorithms (including the 2-opt heuristic) have polynomial smoothed complexity.

Chapter 14, by Dadush and Huiberts, surveys the first and most famous killer application of smoothed analysis, the Spielman-Teng analysis of the running time of the simplex method for linear programming. As discussed in Section 2.1, the running time of the simplex method is exponential in the worst case but almost always polynomial in practice. This chapter develops intuition for and outlines a proof of the fact that the simplex method, implemented with the shadow vertex pivot rule, has polynomial smoothed complexity with respect to small Gaussian perturbations of the entries of the constraint matrix. The chapter also shows how to interpret the successive shortest-path algorithm for the minimum-cost maximum-flow problem as an instantiation of this version of the simplex method.

Chapter 15, by Röglin, presents a third application of smoothed analysis, to the size of
Paretocurves for multi-objective optimization problems . For example, consider the knapsack problem,where the input consists of n items with values and sizes. One subset of the items dominatesanother if it has both a larger overall value and a smaller overall size, and the Pareto curve isdefined as the set of undominated solutions. Pareto curves matter for algorithm design becausemany algorithms for multi-objective optimization problems (like the Nemhauser-Ullmann knapsackalgorithm) run in time polynomial in the size of the Pareto curve. For many problems, the Paretocurve has exponential size in the worst case but expected polynomial size in a smoothed analysismodel. This chapter also presents a satisfyingly strong connection between smoothed polynomialcomplexity and worst-case pseudopolynomial complexity for linear binary optimization problems.14 .5 Applications in Machine Learning and Statistics Part V of the book gives a number of examples of how the paradigms in Parts I–IV have beenapplied to problems in machine learning and statistics .Chapter 16, by Balcan and Haghtalab, considers one of the most basic problems in supervisedlearning, that of learning an unknown halfspace . This problem is relatively easy in the noiseless casebut becomes notoriously difficult in the worst case in the presence of adversarial noise. This chaptersurveys a number of positive statistical and computational results for the problem under additionalassumptions on the data-generating distribution. One type of assumption imposes structure, suchas log-concavity, on the marginal distribution over data points (i.e., ignoring their labels). A secondtype restricts the power of the adversary who introduces the noise, for example by allowing theadversary to mislabel a point only with a probability that is bounded away from 1 / robust high-dimensional statistics , where the goal is to design learning algorithms that have provable guaranteeseven when a small constant fraction of the data points has been adversarially corrupted. Forexample, consider the problem of estimating the mean µ of an unknown one-dimensional Gaussiandistribution N ( µ, σ ), where the input consists of (1 − ǫ ) n samples from the distribution and ǫn additional points defined by an adversary. The empirical mean of the data points is a goodestimator of the true mean when there is no adversary, but adversarial outliers can distort theempirical mean arbitrarily. The median of the input points, however, remains a good estimatorof the true mean even with a small fraction of corrupted data points. What about in more thanone dimension? Among other results, this chapter describes a robust and efficiently computableestimator for learning the mean of a high-dimensional Gaussian distribution.Chapter 18, by Dasgupta and Kpotufe, investigates the twin topics of nearest neighbor searchand classification . The former is algorithmic, and the goal is to design a data structure that enablesfast nearest neighbor queries. The latter is statistical, and the goal is to understand the amount ofdata required before the nearest neighbor classifier enjoys provable accuracy guarantees. In bothcases, novel parameterizations are the key to narrowing the gap between worst-case analysis andempirically observed performance—for search, a parameterization of the data set; for classification,of the allowable target functions.Chapter 19, by Vijayaraghavan, is about computing a low-rank tensor decomposition . 
Forexample, given an m × n × p { T i,j,k } , the goal is to express T as a linearcombination of the minimum-possible number of rank-one tensors (where a rank-one tensor hasentries of the form { u i · v j · w k } for some vectors u ∈ R m , v ∈ R n , and w ∈ R p ). Efficientalgorithms for this problem are an increasingly important tool in the design of learning algorithms;see also Chapters 20 and 21. This problem is N P -hard in general. Jennrich’s algorithm solves inpolynomial time the special case of the problem in which the three sets of vectors in the low-rankdecomposition (the u ’s, the v ’s, and the w ’s) are linearly independent. This result does not addressthe overcomplete regime, meaning tensors that have rank larger than dimension. (Unlike matrices,the rank of a tensor can be much larger than its smallest dimension.) For this regime, the chaptershows that a generalization of Jennrich’s algorithm has smoothed polynomial complexity.Chapter 20, by Ge and Moitra, concerns topic modeling , which is a basic problem in unsupervisedlearning. The goal here is to process a large unlabeled corpus of documents and produce a list ofmeaningful topics and an assignment of each document to a mixture of topics. One approach to theproblem is to reduce it to nonnegative matrix factorization (NMF)—the analog of a singular valuedecomposition of a matrix, with the additional constraint that both matrix factors are nonnegative.The NMF problem is hard in general, but this chapter proposes a condition on inputs, which isreasonable in a topic modeling context, under which the problem can be solved quickly in both15heory and practice. The key assumption is that each topic has at least one “anchor word,” thepresence of which strongly indicates that the document is at least partly about that topic.Chapter 21, by Ma, studies the computational mystery outlined in Section 2.3: Why are localmethods like stochastic gradient descent so effective in solving the nonconvex optimization problemsthat arise in supervised learning, such as computing the loss-minimizing parameters for a givenneural network architecture? This chapter surveys the quickly evolving state-of-the-art on thistopic, including a number of different restrictions on problems under which local methods haveprovable guarantees. For example, some natural problems have a nonconvex objective functionthat satisfies the “strict saddle condition,” which asserts that at every saddle point (i.e., a pointwith zero gradient that is neither a minimum nor a maximum) there is a direction with strictlynegative curvature. Under this condition, variants of gradient descent provably converge to a localminimum (and, for some problems, a global minimum).Chapter 22, by Hardt, tackles the statistical mystery discussed in Section 2.3: Why do overpa-rameterized models like deep neural networks, which have many more parameters than training datapoints, so often generalize well in practice? While the jury is still out, this chapter surveys severalof the leading explanations for this phenomenon, ranging from properties of optimization algo-rithms like stochastic gradient descent (including algorithmic stability and implicit regularization)to properties of data sets (such as margin-based guarantees).Chapter 23, by G. Valiant and P. Valiant, presents two instance optimality results for distributiontesting and learning . 
The chapter first considers the problem of learning a discretely supported distribution from independent samples, and describes an algorithm that learns the distribution nearly as accurately as would an optimal algorithm with advance knowledge of the true multiset of (unlabeled) probabilities of the distribution. This algorithm is instance optimal in the sense that, whatever the structure of the distribution, the learning algorithm will perform almost as well as an algorithm specifically tailored for that structure. The chapter then explores the problem of identity testing: given the description of a reference probability distribution, p, supported on a countable set, and sample access to an unknown distribution, q, the goal is to distinguish whether p = q versus the case that p and q have total variation distance at least ε. This chapter presents a testing algorithm that has optimal sample complexity simultaneously for every distribution p and ε, up to constant factors.

The final part of the book, Part VI, gathers a number of additional applications of the ideas and techniques introduced in Parts I–III.

Chapter 24, by Karlin and Koutsoupias, surveys alternatives to worst-case analysis in the competitive analysis of online algorithms. There is a long tradition in online algorithms of exploring alternative analysis frameworks, and accordingly this chapter connects to many of the themes of Parts I–III. For example, the chapter includes results on deterministic models of data (e.g., the access graph model for restricting the allowable page request sequences) and semi-random models (e.g., the diffuse adversary model to blend worst- and average-case analysis). (Indeed, the title of this book is a riff on that of a paper in the competitive analysis of online algorithms (Koutsoupias and Papadimitriou, 2000).)

Chapter 25, by Ganesh and Vardi, explores the mysteries posed by the empirical performance of
Boolean satisfiability (SAT) solvers. Solvers based on backtracking algorithms such as the Davis-Putnam-Logemann-Loveland (DPLL) algorithm frequently solve SAT instances with millions of variables and clauses in a reasonable amount of time. This chapter provides an introduction to conflict-driven clause-learning (CDCL) solvers and their connections to proof systems, followed by
a high-level overview of the state-of-the-art parameterizations of SAT formulas, including input-based parameters (such as parameters derived from the variable-incidence graph of an instance) and output-based parameters (such as the proof complexity in the proof system associated with CDCL solvers).

Chapter 26, by Chung, Mitzenmacher, and Vadhan, uses ideas from pseudorandomness to explain why simple hash functions work so well in practice. Well-designed hash functions are practical proxies for random functions—simple enough to be efficiently implementable, but complex enough to "look random." In the theoretical analysis of hash functions and their applications, one generally assumes that a hash function is chosen at random from a restricted family, such as a set of universal or k-wise independent functions for small k. For some statistics, such as the expected number of collisions under a random hash function, small families of hash functions provably perform as well as completely random functions. For others, such as the expected insertion time in a hash table with linear probing, simple hash functions are provably worse than random functions (for worst-case data). The running theme of this chapter is that a little randomness in the data, in the form of a lower bound on the entropy of the (otherwise adversarial) data-generating distribution, compensates for any missing randomness in a universal family of hash functions.

Chapter 27, by Talgam-Cohen, presents an application of the beyond worst-case viewpoint in algorithmic game theory, to prior-independent auctions. For example, consider the problem of designing a single-item auction, which solicits bids from bidders and then decides which bidder (if any) wins the item and what everybody pays. The traditional approach in economics to designing revenue-maximizing auctions is average-case, meaning that the setup includes a commonly known distribution over each bidder's willingness to pay for the item. An auction designer can then implement an auction that maximizes the expected revenue with respect to the assumed distributions (e.g., by setting a distribution-dependent reserve price). As with many average-case frameworks, this approach can lead to impractical solutions that are overly tailored to the assumed distributions. A semi-random variant of the model allows an adversary to pick its favorite distribution out of a rich class, from which nature chooses a random sample for each bidder. This chapter presents prior-independent auctions, both with and without a type of resource augmentation, that achieve near-optimal expected revenue simultaneously across all distributions in the class.

Chapter 28, by Roughgarden and Seshadhri, takes a beyond worst-case approach to the analysis of social networks. Most research in social network analysis revolves around a collection of competing generative models—probability distributions over graphs designed to replicate the most common features observed in such networks. The results in this chapter dispense with generative models and instead provide algorithmic or structural guarantees under deterministic combinatorial restrictions on a graph—that is, for restricted classes of graphs. The restrictions are motivated by the most uncontroversial properties of social and information networks, such as heavy-tailed degree distributions and strong triadic closure properties.
Chapter 27, by Talgam-Cohen, presents an application of the beyond worst-case viewpoint in algorithmic game theory, to prior-independent auctions. For example, consider the problem of designing a single-item auction, which solicits bids from bidders and then decides which bidder (if any) wins the item and what everybody pays. The traditional approach in economics to designing revenue-maximizing auctions is average-case, meaning that the setup includes a commonly known distribution over each bidder's willingness to pay for the item. An auction designer can then implement an auction that maximizes the expected revenue with respect to the assumed distributions (e.g., by setting a distribution-dependent reserve price). As with many average-case frameworks, this approach can lead to impractical solutions that are overly tailored to the assumed distributions. A semi-random variant of the model allows an adversary to pick its favorite distribution out of a rich class, from which nature chooses a random sample for each bidder. This chapter presents prior-independent auctions, both with and without a type of resource augmentation, that achieve near-optimal expected revenue simultaneously across all distributions in the class.

Chapter 28, by Roughgarden and Seshadhri, takes a beyond worst-case approach to the analysis of social networks. Most research in social network analysis revolves around a collection of competing generative models—probability distributions over graphs designed to replicate the most common features observed in such networks. The results in this chapter dispense with generative models and instead provide algorithmic or structural guarantees under deterministic combinatorial restrictions on a graph—that is, for restricted classes of graphs. The restrictions are motivated by the most uncontroversial properties of social and information networks, such as heavy-tailed degree distributions and strong triadic closure properties. Results for these graph classes effectively apply to all "plausible" generative models of social networks.

Chapter 29, by Balcan, reports on the emerging area of data-driven algorithm design. The idea here is to model the problem of selecting the best-in-class algorithm for a given application domain as an offline or online learning problem, in the spirit of the aforementioned work on self-improving algorithms. For example, in the offline version of the problem, there is an unknown distribution D over inputs, a class C of allowable algorithms, and the goal is to identify from samples the algorithm in C with the best expected performance with respect to D. The distribution D captures the details of the application domain, the samples correspond to benchmark instances representative of the domain, and the restriction to the class C is a concession to the reality that it is often more practical to be an educated client of already-implemented algorithms than to design a new algorithm from scratch. For many computational problems and algorithm classes C, it is possible to learn an (almost) best-in-class algorithm from a modest number of representative instances.

Chapter 30, by Mitzenmacher and Vassilvitskii, is an introduction to algorithms with predictions. For example, in the online paging problem (Section 2.4), the LRU policy makes predictions about future page requests based on the recent past. If its predictions were perfect, the algorithm would be optimal. What if a good but imperfect predictor is available, such as one computed by a machine learning algorithm using past data? An ideal solution would be a generic online algorithm that, given a predictor as a "black box": (i) is optimal when predictions are perfect; (ii) has gracefully degrading performance as the predictor error increases; and (iii) with an arbitrarily poor predictor, defaults to the optimal worst-case guarantee. This chapter investigates the extent to which properties (i)–(iii) can be achieved by predictor-augmented data structures and algorithms for several different problems.

Notes

This chapter is based in part on Roughgarden (2019). The simplex method (Section 2.1) is described, for example, in Dantzig (1963); Khachiyan (1979) proved that the ellipsoid method solves linear programming problems in polynomial time; and the first polynomial-time interior-point method was developed by Karmarkar (1984). Lloyd's algorithm for k-means (Section 2.2) appears in Lloyd (1962). The phrase "clustering is hard only when it doesn't matter" (Section 2.2) is credited to Naftali Tishby by Daniely et al. (2012). The competitive analysis of online algorithms (Section 2.4) was pioneered by Sleator and Tarjan (1985); indeed, the title of this book is a riff on that of a paper in this area (Koutsoupias and Papadimitriou, 2000). Bélády's algorithm (Section 2.4) appears in Bélády (1967). The working set model in Section 3.1 was formulated by Denning (1968). Theorem 3.1 is due to Albers et al. (2005), as is Exercise 5. Exercise 6 is folklore. The result in Exercise 7 is due to Block (1962) and Novikoff (1962).

Acknowledgments
I thank Jérémy Barbay, Daniel Kane, and Salil Vadhan for helpful comments on a preliminary draft of this chapter.
References
Albers, S., L. M. Favrholdt, and O. Giel (2005). On paging with locality of reference. Journal of Computer and System Sciences 70(2), 145–175.

Bélády, L. A. (1967). A study of replacement algorithms for a virtual storage computer. IBM Systems Journal 5(2), 78–101.

Block, H. D. (1962). The perceptron: A model for brain functioning. Reviews of Modern Physics 34, 123–135.

Daniely, A., N. Linial, and M. Saks (2012). Clustering is difficult only when it does not matter. arXiv:1205.4891.

Dantzig, G. B. (1963). Linear Programming and Extensions. Princeton University Press.

Denning, P. J. (1968). The working set model for program behavior. Communications of the ACM 11(5), 323–333.

Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica 4, 373–395.

Khachiyan, L. G. (1979). A polynomial algorithm in linear programming. Soviet Mathematics Doklady 20(1), 191–194.

Klee, V. and G. J. Minty (1972). How good is the simplex algorithm? In O. Shisha (Ed.), Inequalities III, pp. 159–175. New York: Academic Press Inc.

Koutsoupias, E. and C. H. Papadimitriou (2000). Beyond competitive analysis. SIAM Journal on Computing 30(1), 300–317.

Lloyd, S. P. (1962). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–136.

Novikoff, A. (1962). On convergence proofs for perceptrons. In Proceedings of the Symposium on Mathematical Theory of Automata, Volume 12, pp. 615–622.

Roughgarden, T. (2019). Beyond worst-case analysis. Communications of the ACM 62(3), 88–96.

Roughgarden, T. (Ed.) (2020). Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press.

Sleator, D. D. and R. E. Tarjan (1985). Amortized efficiency of list update and paging rules. Communications of the ACM 28(2), 202–208.
Exercises
1. Prove that for every deterministic cache replacement policy and cache size k, there is an adversarial page request sequence such that the policy faults on every request, and such that an optimal clairvoyant policy would fault on at most a 1/k fraction of the requests.
[Hint: Use only k + 1 distinct pages, and the fact that the optimal policy always evicts the page that will be requested furthest in the future.]

2. Let f : N → N be a function of the type described in Section 3, with f(n) denoting the maximum allowable number of distinct page requests in any window of length n.
(a) Prove that there is a nondecreasing function f′ : N → N with f′(1) = 1 and f′(n + 1) ∈ {f′(n), f′(n) + 1} for all n such that a page request sequence conforms to f′ if and only if it conforms to f.
(b) Prove that parts (a) and (b) of Theorem 3.1 hold trivially if f′(2) = 1.

3. Prove that the page request sequence constructed in the proof of Theorem 3.1(a) conforms to the given approximately concave function f.

4. Prove Theorem 3.1(c).
[Hint: Many different choices of f and k work. For example, take k = 4, a set {0, 1, 2, 3, 4} of 5 pages, the function f shown in Figure 5, and a page request sequence consisting of an arbitrarily large number of identical blocks of the eight page requests 1 0 2 0 3 0 4 0.]

Figure 5: The function f (with values f(n) = 1, 2, 3, 3, 4, 4, 5, 5, . . . for n = 1, 2, 3, . . .) used to construct a bad page request sequence for FIFO (Exercise 4).
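To make the construction in Exercise 1 concrete, here is a minimal simulation sketch (not from the chapter) for one particular deterministic policy, LRU; the choice of policy and the parameters are illustrative assumptions.

# The adversary always requests the page missing from LRU's cache of size k,
# using only k + 1 distinct pages, so LRU faults on every request; the same
# sequence is then replayed against the clairvoyant furthest-in-the-future
# policy, whose fault rate is roughly 1/k.

def adversarial_requests_against_lru(k, num_requests):
    pages = list(range(k + 1))
    cache, requests = [], []          # every request misses, so list order = LRU order
    for _ in range(num_requests):
        p = next(q for q in pages if q not in cache)   # some page outside the cache
        requests.append(p)
        if len(cache) == k:
            cache.pop(0)              # evict the least recently used page
        cache.append(p)
    return requests                   # LRU faults on all of these by construction

def furthest_in_future_faults(requests, k):
    """Fault count of the clairvoyant policy on `requests` with cache size k."""
    def next_use(q, start):
        for j in range(start, len(requests)):
            if requests[j] == q:
                return j
        return float("inf")           # never requested again: evict it first

    cache, faults = set(), 0
    for i, p in enumerate(requests):
        if p in cache:
            continue
        faults += 1
        if len(cache) == k:
            cache.remove(max(cache, key=lambda q: next_use(q, i + 1)))
        cache.add(p)
    return faults

k, n = 4, 10_000
requests = adversarial_requests_against_lru(k, n)
print(furthest_in_future_faults(requests, k) / n)   # roughly 1/k, versus 1.0 for LRU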
Input: n unit vectors x_1, . . . , x_n ∈ R^d with labels b_1, . . . , b_n ∈ {−1, +1}.

(a) Initialize t to 1 and w_1 to the all-zero vector.

(b) While there is a point x_i such that sgn(w_t · x_i) ≠ b_i, set w_{t+1} = w_t + b_i x_i and increment t.

Figure 6: The Perceptron Algorithm.
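The pseudocode in Figure 6 translates almost line for line into executable form. The following is a minimal sketch; the NumPy representation and the synthetic example data are assumptions for illustration and are not part of the original figure.

import numpy as np

def perceptron(X, b, max_iters=100_000):
    """X: n x d array whose rows are unit vectors; b: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])               # step (a): w_1 is the all-zero vector
    for t in range(1, max_iters + 1):
        mistakes = np.sign(X @ w) != b      # points with sgn(w_t . x_i) != b_i
        if not mistakes.any():
            return w, t - 1                 # no misclassified point remains
        i = int(np.argmax(mistakes))        # pick some misclassified point
        w = w + b[i] * X[i]                 # step (b): w_{t+1} = w_t + b_i x_i
    return w, max_iters

# Usage: a linearly separable data set with a positive margin.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:100, 0] = np.abs(X[:100, 0]) + 1.0       # positive points: first coordinate >= 1
X[100:, 0] = -np.abs(X[100:, 0]) - 1.0      # negative points: first coordinate <= -1
X /= np.linalg.norm(X, axis=1, keepdims=True)   # rescale each point to a unit vector
b = np.array([+1] * 100 + [-1] * 100)
w, updates = perceptron(X, b)
print(updates, bool(np.all(np.sign(X @ w) == b)))   # few updates; all points separated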
5. Prove the following analog of Theorem 3.1(b) for the FIFO replacement policy: for every cache size k ≥ 2 and every approximately concave function f with f(1) = 1, f(2) = 2, and f(n + 1) ∈ {f(n), f(n) + 1} for all n ≥ 2, the page fault rate of the FIFO policy on every request sequence that conforms to f is at most
$$\frac{k}{f^{-1}(k+1)-1}. \qquad (2)$$
[Hint: Make minor modifications to the proof of Theorem 3.1(b). The expression in (2) suggests defining phases such that (i) the FIFO policy makes at most k faults per phase; and (ii) a phase plus one additional request comprises requests for at least k + 1 distinct pages.]

6. An instance of the knapsack problem consists of n items with nonnegative values v_1, . . . , v_n and sizes s_1, . . . , s_n, and a knapsack capacity C. The goal is to compute a subset S ⊆ {1, 2, . . . , n} of items that fits in the knapsack (i.e., with $\sum_{i \in S} s_i \le C$) and, subject to this, has the maximum total value $\sum_{i \in S} v_i$.
One simple greedy algorithm for the problem reindexes the items in nonincreasing order of density v_i/s_i and then returns the largest prefix {1, 2, . . . , j} of items that fits in the knapsack (i.e., with $\sum_{i=1}^{j} s_i \le C$). Parameterize a knapsack instance by the ratio α of the largest size of an item and the knapsack capacity, and prove a parameterized guarantee for the greedy algorithm: the total value of its solution is at least 1 − α times that of an optimal solution. (A code sketch of this greedy rule appears after the exercises.)

7. The perceptron algorithm is one of the most classical machine learning algorithms (Figure 6). The input to the algorithm is n points in R^d, with a label b_i ∈ {−1, +1} for each point x_i. The goal is to compute a separating hyperplane: a hyperplane with all of the positively labeled points on one side, and all of the negatively labeled points on the other. Assume that there exists a separating hyperplane, and moreover that some such hyperplane passes through the origin. (The second assumption is without loss of generality, as it can be enforced by adding an extra "dummy coordinate" to the data points.) We are then free to scale each data point x_i so that ‖x_i‖ = 1—this does not change which side of a hyperplane x_i is on.
Parameterize the input by its margin µ, defined as
$$\mu = \max_{w : \|w\|=1} \; \min_{i=1,\dots,n} |w \cdot x_i|,$$
where w ranges over the unit normal vectors of all separating hyperplanes. Let w* attain the maximum. Geometrically, the parameter µ is the smallest cosine of an angle defined by a point x_i and the normal vector w*.
(a) Prove that the squared norm of w_t grows slowly with the number of iterations t: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$ for every t ≥ 1.
(b) Prove that the projection of w_t onto w* grows significantly with every iteration: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \mu$ for every t ≥ 1.
(c) Conclude that the number of iterations t never exceeds $1/\mu^2$.
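As a companion to Exercise 6, here is a minimal sketch (not from the chapter) of the density-based greedy heuristic; the item values, sizes, and capacity in the usage example are illustrative only.

def greedy_knapsack(values, sizes, C):
    """Sort items by density v_i/s_i and take the largest feasible prefix."""
    order = sorted(range(len(values)), key=lambda i: values[i] / sizes[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + sizes[i] > C:
            break                         # return the largest prefix that fits
        chosen.append(i)
        used += sizes[i]
    return chosen, sum(values[i] for i in chosen)

# Usage: three items and capacity 10, so alpha = 6/10 and the guarantee is a
# (1 - alpha) = 40% fraction of optimal. Greedy returns value 60, while the
# optimal solution (items 0 and 2) has value 90, consistent with the bound.
print(greedy_knapsack(values=[60, 50, 30], sizes=[6, 5, 3], C=10))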