# Lessons from the German Tank Problem


George Clark, Alex Gonye, and Steven J. Miller

**Abstract.** During World War II the German army used tanks to devastating advantage. The Allies needed accurate estimates of their tank production and deployment. They used two approaches to find these values: spies, and statistics. This note describes the statistical approach. Assuming the tanks are labeled consecutively starting at 1, if we observe $k$ serial numbers from an unknown number $N$ of tanks, with the maximum observed value $m$, then the best estimate for $N$ is $m(1 + 1/k) - 1$. This is now known as the German Tank Problem, and is a terrific example of the applicability of mathematics and statistics in the real world. The first part of the paper reproduces known results, specifically deriving this estimate and comparing its effectiveness to that of the spies. The second part presents a result we have not found in print elsewhere, the generalization to the case where the smallest value is not necessarily 1. We emphasize in detail why we are able to obtain such clean, closed-form expressions for the estimates, and conclude with an appendix highlighting how to use this problem to teach regression and how statistics can help us find functional relationships.

## Contents

1. Introduction
2. Derivation with a known minimum
   2.1. The probability that the sample maximum is $m$
   2.2. The best guess for $\hat{N}$
3. Derivation with an unknown minimum
   3.1. The probability that the spread is $s$
   3.2. The best guess for $\hat{N}$
4. Comparison of Approaches
Appendix A. The German Tank Problem and Linear Regression
   A.1. Theory of Regression
   A.2. Issues in Applying to the German Tank Problem
   A.3. Resolving Implementation Issues
   A.4. Determining the Functional Form
References

*Date*: January 22, 2021.

2010 *Mathematics Subject Classification*.

*Key words and phrases*: German Tank Problem, Binomial Identities, Regression.

The third named author was partially supported by NSF Grant DMS1561945. Parts of this paper were given by the first two named authors as their senior colloquium; we thank our colleagues in the department for comments. The third author has presented this topic at several conferences and programs, and thanks the participants for many suggestions; for video see https://youtu.be/I3ngtIYjw3w. We thank our colleagues at Williams and audiences at several talks over the years, and the referee, for valuable feedback that improved the exposition, and Jason Zhou for pointing out some typos in an earlier draft.

**Figure 1.** Left: Europe before the start of major hostilities. Right: Europe in 1942. Images from Wikimedia Commons, author San Jose.

## 1. Introduction

In this paper we revisit a famous and historically important problem which has since become a staple in many probability and statistics classes: the German Tank Problem. This case study illustrates that one does not need to use the most advanced mathematics to have a tremendous impact on real world problems; the challenge is frequently in creatively using what one knows.

By the end of 1941 most of continental Europe had fallen to Nazi Germany and the other Axis powers, and by 1942 their forces had begun their significant advance on the Eastern Front deep into the Soviet Union; see Figure 1 for an illustration of their rapid progress, or for an animation of territorial gains day by day. A key component of their rapid conquests lay in their revolutionary use of tanks in modern warfare. While other militaries, most notably France, used tanks as a modern, armored form of cavalry, the Germans were the first to fully utilize tanks' speed and strength to their advantage. Tanks would move rapidly and punch through enemy lines, creating gaps which German infantry would stream through. Once through the holes in the line, the Germans would wreak havoc on lines of communication, creating logistical nightmares for the combatants left on the front lines. This lightning fast warfare has been dubbed Blitzkrieg (or lightning war) by many historians.

With the Nazis utilizing tanks to such devastating effect, it was essential for the Allies to stop them. A key component of the solution was figuring out how many tanks the Germans were building, or had deployed in various theaters, in order to properly allocate resources. As expected, they tried spying (both with agents and through decrypting intercepted messages) to accurately estimate these numbers. It was essential that the Allies obtain accurate values, as there is a tremendous danger in both under- and over-estimating the enemy's strength. The consequence of underestimating is clear, as one could suddenly be outnumbered in battle.
Overestimating is also bad, as it can lead to undue caution and failure to exploit advantages, or to committing too many resources in one theater and thus not having enough elsewhere. (Another great example is the famous Battle of Midway and the role the cryptographers played in figuring out the Japanese target; see for example [1].) The U.S. Civil War provides a terrific example of these consequences, where General George McClellan would not have his Union army take the field against the Confederates because, in his estimation, his forces were greatly outnumbered, though in fact they were not. This situation led to one of President Abraham Lincoln's many great quips: "If General McClellan isn't going to use his army, I'd like to borrow it for a time." (There are many different phrasings of this remark; this one is taken from https://thehistoriansmanifesto.wordpress.com/2013/05/13/best-abraham-lincoln-quotes/.) Considering how close Pickett's charge came to succeeding at Gettysburg, or what would have happened if Sherman hadn't taken Atlanta before the 1864 elections (where McClellan, now the Democratic nominee for president, was running against Lincoln on a platform of a negotiated peace), the paralysis from incorrect analysis could have changed the outcome of the war.

Returning to World War II and the problem of determining the number of tanks produced and facing them in the field, the Allies understandably wanted a way to evaluate the effectiveness of their estimates. During a battle, they realized that the destroyed and captured tanks had serial numbers on their gearboxes which could help with this problem. Assuming that the serial numbers are sequential and start with 1, given a string of observed numbers one can try to estimate the largest value. This discovery contributed to the birth of the statistical method, the use of a population maximum formula, and changed both the war and science.

The original variant of the problem assumes the first tank is numbered 1, there are an unknown number $N$ produced (or in theater), that the numbers are consecutive, and that $k$ values are observed, with the largest being $m$. (There are advantages in consecutive labeling; for example, it simplifies some maintenance issues, as it is clear which of two tanks is older.) Given this information, the goal is to find $\hat{N}$, the best estimate of $N$. The formula they derived is
$$\hat{N} = m\left(1 + \frac{1}{k}\right) - 1.$$
For the reader unfamiliar with this subject, we deliberately do not state here in the introduction how well this formula does versus what was done by spies and espionage. (As this paper is appearing in a mathematics journal and not a bulletin of a spy agency, we invite the reader to conjecture which method did better.) While this story has been well told before (see for example [3, 4, 5, 6]), our contribution is to extend the analysis to consider the more general case, namely what happens when we do not know the smallest value. To our knowledge this result has not been isolated in the literature; we derive in §3 that if $s$ is the spread between the smallest and the largest of the observed $k$ serial numbers, then
$$\hat{N} = s\left(1 + \frac{2}{k-1}\right) - 1.$$
In Appendix A we show how to use regression to show that these are reasonable formulas, and thus the German Tank Problem can also be used to introduce some problems and subtleties in regression analysis, as well as serve as an introduction to mathematical modeling.

As it is rare to have a clean, closed-form expression such as the ones above, we briefly remark on our fortune. The key observation is that we have a combinatorial problem where certain binomial identities are available, and these lead to tremendous simplifications.

## 2. Derivation with a known minimum

In this section we prove
$$\hat{N} = m\left(1 + \frac{1}{k}\right) - 1$$
when we observe $k$ tanks, the largest labeled $m$, knowing that the smallest number is 1 and the tanks are consecutively numbered. Before proving it, as a smell test we look at some extreme cases. First, we never obtain an estimate that is less than the largest observed number. Second, if there are many tanks and we observe just one (so $k = 1$), then $\hat{N}$ is approximately $2m$. This is very reasonable, and essentially just means that if we only have one data point, it's a good guess that it was in the middle. Further, as $k$ increases the amount we must inflate our observed maximum value decreases. For example, when $k = 2$ we inflate $m$ by approximately a factor of $3/2$, or in other words this is saying our observed maximum value is probably about two-thirds of the true value. Finally, if $k$ equals the number of tanks $N$, then $m$ must also equal $N$, and the formula simplifies to $\hat{N} = N$.

We break the proof into two parts. While we are fortunate in that we are able to obtain a closed-form expression, if we have a good guess as to the relationship we can use statistics to test its reasonableness; we do that in Appendix A. For the proof we first determine the probability that the observed largest value is $m$. Next we compute the expected value, and show how to pass from that to an estimate for $N$. We need two combinatorial results.

The first is Pascal's identity:
$$\binom{n+1}{r} \ = \ \binom{n}{r} + \binom{n}{r-1}. \tag{2.1}$$
There are many approaches to proving this; the easiest is to interpret both sides as two different ways of counting how many ways we can choose $r$ people from a group of $n+1$ people, where exactly $n$ of these people are in one set and exactly one person is in another set. It is easier to see if we rewrite it as
$$\binom{n+1}{r} \ = \ \binom{1}{0}\binom{n}{r} + \binom{1}{1}\binom{n}{r-1};$$
this is permissible because $\binom{1}{0} = \binom{1}{1} = 1$. Note the left side is choosing $r$ people from the combined group of $n+1$ people, while on the right the first summand corresponds to not choosing the person from the group with just one person, and the second summand to requiring that we choose that person. ✷

The second identity involves sums of binomial coefficients:
$$\sum_{m=k}^{N} \binom{m}{k} \ = \ \binom{N+1}{k+1}. \tag{2.2}$$
We can prove this by induction on $N$, noting that $k$ is fixed. The base case is readily established. Letting $N = k$, we find
$$\sum_{m=k}^{N} \binom{m}{k} \ = \ \sum_{m=k}^{k} \binom{m}{k} \ = \ \binom{k}{k} \ = \ 1 \ = \ \binom{k+1}{k+1}.$$
For the inductive step, we assume
$$\sum_{m=k}^{N} \binom{m}{k} \ = \ \binom{N+1}{k+1}.$$
Then
$$\sum_{m=k}^{N+1} \binom{m}{k} \ = \ \left(\sum_{m=k}^{N} \binom{m}{k}\right) + \binom{N+1}{k} \ = \ \binom{N+1}{k+1} + \binom{N+1}{k} \ = \ \binom{N+2}{k+1},$$
where the last equality follows from Pascal's identity, (2.1). This completes the proof. ✷

While this identity suffices for the original formulation of the German Tank Problem, when we do not know the starting serial number the combinatorics become slightly more involved, and we need a straightforward generalization:
$$\sum_{\ell=a}^{b} \binom{\ell}{a} \ = \ \binom{b+1}{a+1}; \tag{2.3}$$
the proof follows similarly.

### 2.1. The probability that the sample maximum is $m$

Let $M$ be the random variable for the maximum number observed, and let $m$ be the value we see. Note that there is zero probability of observing a value smaller than $k$ or larger than $N$. We claim for $k \le m \le N$ that
$$\mathrm{Prob}(M = m) \ = \ \frac{\binom{m}{k} - \binom{m-1}{k}}{\binom{N}{k}} \ = \ \frac{\binom{m-1}{k-1}}{\binom{N}{k}}.$$
We give two proofs. The first is to note there are $\binom{N}{k}$ ways to choose $k$ numbers from $N$ when order does not matter. The probability that the largest observed is exactly $m$ equals the probability the largest is at most $m$ minus the probability the largest is at most $m-1$. The first probability is just $\binom{m}{k}/\binom{N}{k}$, as if the largest value is at most $m$ then all $k$ observed numbers must be taken from $\{1, 2, \ldots, m\}$. A similar argument gives that the second probability is $\binom{m-1}{k}/\binom{N}{k}$, and the claim now follows by using Pascal's identity to simplify the difference of the binomial coefficients.

We could also argue as follows. If the largest is $m$ then we have to choose that serial number, and now we must choose $k-1$ tanks from the $m-1$ smaller values; thus we find the probability is just $\binom{m-1}{k-1}/\binom{N}{k}$. ✷

**Remark 2.1.**

Interestingly, we can use the two equivalent arguments above as yet another way to prove Pascal's identity.

### 2.2. The best guess for $\hat{N}$

We now compute the best guess for $N$ by first finding the expected value of $M$. Recall the expected value of a random variable $M$ is the sum of all the possible values of $M$ times the probability of observing that value. We write $E[M]$ for this quantity, and thus we must compute
$$E[M] \ := \ \sum_{m=k}^{N} m \,\mathrm{Prob}(M = m)$$
(note we only need to worry about $m$ in this range, as for all other $m$ the probability is zero and thus does not contribute). Once we find a formula for $E[M]$ we will convert that to one for the expected number of tanks.

Our first step is to substitute in the probability that $M$ equals $m$, obtaining
$$E[M] \ = \ \sum_{m=k}^{N} m \frac{\binom{m-1}{k-1}}{\binom{N}{k}}.$$
Fortunately this sum can be simplified into a nice closed-form expression; it is this simplification that allows us to obtain a simple formula for $\hat{N}$. We expand the binomial coefficients in the expression for $E[M]$ and then use our second combinatorial identity, (2.2), to simplify the sum of $\binom{m}{k}$ which emerges as we manipulate the quantities below. We find
$$
\begin{aligned}
E[M] \ &= \ \sum_{m=k}^{N} m \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
\ = \ \sum_{m=k}^{N} m \frac{(m-1)!}{(m-k)!(k-1)!} \cdot \frac{k!(N-k)!}{N!}\\
&= \ \sum_{m=k}^{N} \frac{m!\,k}{k!(m-k)!} \cdot \frac{k!(N-k)!}{N!}
\ = \ \frac{k \cdot k!(N-k)!}{N!} \sum_{m=k}^{N} \binom{m}{k}\\
&= \ \frac{k \cdot k!(N-k)!}{N!} \binom{N+1}{k+1}
\ = \ \frac{k \cdot k!(N-k)!}{N!} \cdot \frac{(N+1)!}{(k+1)!(N-k)!}
\ = \ \frac{k(N+1)}{k+1}.
\end{aligned}
$$
As we have such a clean expression, it's trivial to solve for $N$ in terms of $k$ and $E[M]$:
$$N \ = \ E[M]\left(1 + \frac{1}{k}\right) - 1.$$
Thus if we substitute in $m$ (our observed value for $M$) as our best guess for $E[M]$, we obtain our estimate for the number of tanks produced:
$$\hat{N} \ = \ m\left(1 + \frac{1}{k}\right) - 1,$$
completing the proof. ✷

**Remark 2.2.**

A more advanced analysis can prove additional results about our estimator, for example, whether or not it is unbiased.
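In the spirit of Remark 2.2, unbiasedness can at least be checked numerically. The sketch below is our own illustration (not code from the paper): it sums the estimator $m(1 + 1/k) - 1$ against the exact distribution $\mathrm{Prob}(M = m) = \binom{m-1}{k-1}/\binom{N}{k}$ from §2.1, and the result equals $N$.

```python
from math import comb

def expected_estimate(N, k):
    """Compute E[N_hat] = sum over m of (m(1 + 1/k) - 1) * Prob(M = m).

    Prob(M = m) = C(m-1, k-1) / C(N, k) from Section 2.1; since
    E[M] = k(N+1)/(k+1), this sum works out to exactly N, so the
    estimator is unbiased.
    """
    return sum((m * (1 + 1 / k) - 1) * comb(m - 1, k - 1)
               for m in range(k, N + 1)) / comb(N, k)
```

For instance, `expected_estimate(20, 4)` returns 20 up to floating point error.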

**Remark 2.3.**

There are many ways to see this formula is reasonable. The first is to try extreme cases, such as $k = N$ (which forces $m$ to be $N$ and gives $N$ as the answer), or to try $k = 1$. In that case we expect our one observation to be around $N/2$, and thus a formula whose best guess is roughly double the observation is logical. We can also get close to this formula by trying to guess the functional form (for more details see Appendix A). We know our best guess must be at least $m$, so let's write it as $m + f(m, k)$. For a fixed $k$, as $m$ increases we might expect our guess to increase, while for fixed $m$, as $k$ increases we would expect a smaller boost. These heuristics suggest $f(m, k)$ increases with $m$ and decreases with $k$; the simplest such function is $bm/k$ for some constant $b$. This leads to a guess of $m + bm/k$, and again looking at extreme cases we get very close to the correct formula.

## 3. Derivation with an unknown minimum

Not surprisingly, when we do not know the lowest serial number the resulting algebra becomes more involved; fortunately, though, with a bit of work we are still able to get nice closed-form expressions for the needed sums and obtain again a clean answer for the estimated number of tanks. We still assume the tanks are numbered sequentially, and focus on the spread (the difference between the largest and smallest observed values). Similar to the previous section, we derive a formula to inflate the observed spread to be a good estimate of the total number of tanks. We first set some notation:

- the minimum tank serial value, $N_1$,
- the maximum tank serial value, $N_2$,
- the total number of tanks, $N$ (so $N = N_2 - N_1 + 1$),
- the observed minimum value, $m_1$ (with corresponding random variable $M_1$),
- the observed maximum value, $m_2$ (with corresponding random variable $M_2$),
- the observed spread, $s$ (with corresponding random variable $S$).

As $s = m_2 - m_1$, in the arguments below we can focus on just $s$ and $S$. We will prove the best guess is
$$\hat{N} \ = \ s\left(1 + \frac{2}{k-1}\right) - 1.$$

**Remark 3.1.**

There are two differences between this formula and the case when the smallest serial number is known. The first is that we divide by $k - 1$ and not $k$; however, as we cannot estimate a spread with one observation, this is reasonable. Note the similarity here with the sample standard deviation, where we divide by one less than the number of observations; while one point suffices to estimate a mean, we need at least two for the variance. The second difference is that we have a factor of 2, which can be interpreted as movement in both directions.

### 3.1. The probability that the spread is $s$

We claim that if we observe $k$ tanks then for $k - 1 \le s \le N_2 - N_1$ we have
$$\mathrm{Prob}(S = s) \ = \ \frac{\sum_{m=N_1}^{N_2 - s} \binom{s-1}{k-2}}{\binom{N_2 - N_1 + 1}{k}} \ = \ \frac{(N_2 - N_1 + 1 - s)\binom{s-1}{k-2}}{\binom{N_2 - N_1 + 1}{k}} \ = \ \frac{(N - s)\binom{s-1}{k-2}}{\binom{N}{k}},$$
and for all other $s$ the probability is zero.

To see this, note that the spread $s$ must be at least $k - 1$ (as we have $k$ observations), and cannot be larger than $N_2 - N_1$. If we want a spread of $s$ and the smallest observed value is $m$, then the largest is $m + s$. We must choose exactly $k - 2$ of the $s - 1$ numbers in $\{m+1, m+2, \ldots, m+s-1\}$; there are $\binom{s-1}{k-2}$ ways to do so. This proves the first equality, the sum over $m$. As all the summands are the same we get the second equality, and the third follows from our definition of $N$. ✷

### 3.2. The best guess for $\hat{N}$

We argue similarly as in the previous section. In the algebra below we will use our second binomial identity, (2.2); relabeling the parameters, it is
$$\sum_{\ell=a}^{b} \binom{\ell}{a} \ = \ \binom{b+1}{a+1}. \tag{3.1}$$
We begin by computing the expected value of the spread. We include all the details of the algebra; the idea is to manipulate the expressions and pull out terms that are independent of the summation variable, and rewrite expressions so that we can identify binomial coefficients and then apply our combinatorial results.
We have
$$
\begin{aligned}
E[S] \ &= \ \sum_{s=k-1}^{N-1} s \,\mathrm{Prob}(S = s)
\ = \ \sum_{s=k-1}^{N-1} s (N - s) \frac{\binom{s-1}{k-2}}{\binom{N}{k}}\\
&= \ \binom{N}{k}^{-1} N \sum_{s=k-1}^{N-1} s \frac{(s-1)!}{(s-k+1)!(k-2)!}
\;-\; \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} s^2 \frac{(s-1)!}{(s-k+1)!(k-2)!}\\
&=: \ T_1 - T_2.
\end{aligned}
$$
We first simplify $T_1$; below we always try to multiply by 1 in such a way that we can combine ratios of factorials into binomial coefficients:
$$
T_1 \ = \ \binom{N}{k}^{-1} N \sum_{s=k-1}^{N-1} \frac{s!\,(k-1)}{(s-k+1)!(k-1)!}
\ = \ \binom{N}{k}^{-1} N (k-1) \sum_{s=k-1}^{N-1} \binom{s}{k-1}
\ = \ \binom{N}{k}^{-1} N (k-1) \binom{N}{k}
\ = \ N(k-1),
$$
where we used (3.1) with $a = k-1$ and $b = N-1$.

Turning to $T_2$ we argue similarly, at one point replacing $s$ with $(s+1) - 1$ to assist in collecting factors into a binomial coefficient:
$$
\begin{aligned}
T_2 \ &= \ \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} s \frac{s!\,(k-1)}{(s-k+1)!(k-1)!}
\ = \ \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} ((s+1) - 1) \frac{s!\,(k-1)}{(s-k+1)!(k-1)!}\\
&= \ \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} \frac{(s+1)!\,(k-1)}{(s+1-k)!(k-1)!}
\;-\; \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} \frac{s!\,(k-1)}{(s-k+1)!(k-1)!}\\
&= \ \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} k(k-1) \binom{s+1}{k}
\;-\; \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} (k-1) \binom{s}{k-1}
\ =: \ T_3 - T_4.
\end{aligned}
$$
We can immediately evaluate $T_4$ by using (2.3) with $a = k-1$ and $b = N-1$, and find
$$T_4 \ = \ \binom{N}{k}^{-1} (k-1) \binom{N}{k} \ = \ k-1.$$
Thus all that remains is analyzing $T_3$:
$$T_3 \ = \ \binom{N}{k}^{-1} \sum_{s=k-1}^{N-1} k(k-1) \binom{s+1}{k}.$$
We pull $k(k-1)$ outside the sum, and letting $w = s+1$ we see that
$$T_3 \ = \ \binom{N}{k}^{-1} k(k-1) \sum_{w=k}^{N} \binom{w}{k},$$
and then from (2.3) with $a = k$ and $b = N$ we obtain
$$T_3 \ = \ \binom{N}{k}^{-1} k(k-1) \binom{N+1}{k+1}.$$
Thus substituting everything back yields
$$E[S] \ = \ T_1 - T_3 + T_4 \ = \ N(k-1) + (k-1) - \binom{N}{k}^{-1} k(k-1) \binom{N+1}{k+1}.$$
We can simplify the right hand side:
$$
\begin{aligned}
(N+1)(k-1) - k(k-1)\,\frac{(N+1)!}{(N-k)!(k+1)!} \cdot \frac{k!(N-k)!}{N!}
\ &= \ (N+1)(k-1) - k(k-1)\,\frac{N+1}{k+1}\\
&= \ (N+1)(k-1)\left(1 - \frac{k}{k+1}\right)
\ = \ (N+1)\,\frac{k-1}{k+1},
\end{aligned}
$$
and thus obtain
$$E[S] \ = \ (N+1)\,\frac{k-1}{k+1}.$$
The analysis is completed as before, where we pass from our observation of $s$ for $S$ to a prediction $\hat{N}$ for $N$:
$$\hat{N} \ = \ \frac{k+1}{k-1}\,s - 1 \ = \ s\left(1 + \frac{2}{k-1}\right) - 1,$$
where the final equality is due to rewriting the algebra to mirror more closely the formula from the case where the first tank is numbered 1. Note that this formula passes the same smell checks the other did; for example, $\frac{2s}{k-1} - 1$ is always at least 1 (remember $s \geq k-1$ and $k$ is at least 2), and thus the lowest estimate we can get for the number of tanks is $s + 1$.

## 4. Comparison of Approaches

So, which did better: statistics or spies? Once the Allies won the war, they could look into the records of Albert Speer, the Nazi Minister of Armaments, to see the exact number of tanks produced each month; see Figure 2.

**Figure 2.** Comparison of estimates from statistics and spies to the true values. Table from [6].

The meticulous German record keeping comes in handy for the vindication of the statisticians; these estimates were astoundingly more accurate. While certainly not perfect (an underestimation of 30 tanks could have pretty dire consequences when high command is allocating resources), the statistical analysis was tremendously superior to the intelligence estimates, which were off by factors of 5 or more. We mentioned earlier the lessons to be learned from McClellan's caution. He was the first General of the Army of the Potomac (which was the Union army headquartered near Washington), and he repeatedly missed opportunities to deliver a debilitating blow to General Robert E. Lee's Army of Northern Virginia, most famously during Lee's retreat from Antietam. Despite vastly outnumbering Lee in men and supplies, McClellan chronically overestimated Lee's forces, causing him to be overly cautious and far too timid a commander. Ultimately the Civil War would drag on for four years, costing over 650,000 American lives, and one wonders how the outcome would have been different if McClellan had been more willing to take the field.

We encourage the reader to write some simple code to simulate both problems discussed here (or see [5]), namely when we know and when we don't know the number of the lowest tank. These problems provide a valuable warning on how easy it is to accidentally convey information. In many situations today numbers are randomly generated to prevent such an analysis. Alternatively, sometimes numbers are deliberately started higher to fool an observer into thinking that more is present than actually is (examples frequently seen are the counting done during a workout, putting money in the tip jar at the start of a shift to encourage future patrons to be generous, or checkbooks starting with the first check at 100 or higher so the recipient does not believe it is from a new account).
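Taking up the suggestion above, a minimal simulation of both problems might look as follows; the function name, parameter choices, and seed are our own illustration rather than anything from the paper.

```python
import random

def simulate(N, k, trials=10_000, seed=0):
    """Average both estimators over many samples of k distinct serials from {1, ..., N}."""
    rng = random.Random(seed)
    total_known = total_unknown = 0.0
    for _ in range(trials):
        sample = rng.sample(range(1, N + 1), k)
        m = max(sample)                                 # observed maximum
        s = m - min(sample)                             # observed spread
        total_known += m * (1 + 1 / k) - 1              # known minimum (Section 2)
        total_unknown += s * (1 + 2 / (k - 1)) - 1      # unknown minimum (Section 3)
    return total_known / trials, total_unknown / trials
```

Since both estimators are unbiased, both averages should land near the true $N$; for example, `simulate(200, 5)` should return two values close to 200.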

## Appendix A. The German Tank Problem and Linear Regression

The German Tank Problem is frequently used in probability or discrete math classes, as it illustrates the power of those two disciplines to use binomial identities to great advantage. It's also seen in statistics classes in discussing how to find good estimators of population values. Focusing on these examples, however, neglects another great setting where it may be used effectively: as an application of the power of Linear Regression (or the Method of Least Squares). We quickly review how these methods yield the best-fit line or hyperplane, and then generalize to certain non-linear relationships. We show how simulations can be used to provide support for formulas. This is extremely important, as often we are unable to prove conjectured relationships. Returning to World War II, the Allies could run trials (say drawing numbered pieces of paper from a bag) to model the real world problem, and use the gathered data to sniff out the relationship among $m$, $k$ and $N$. Additionally, we use this appendix as an opportunity to discuss some of the issues that can arise when implementing the Method of Least Squares to find the best fit line. While these do not occur in most applications, it is worth knowing that they can happen and seeing solutions.

### A.1. Theory of Regression

Suppose we believe there are choices of $a$ and $b$ such that given an input $x$ we should observe $y = ax + b$, but we don't know what these values are. We could observe a large number of pairs of data $\{x_i, y_i\}_{i=1}^{I}$, and use these to find the values of $a$ and $b$ that minimize the sum of the squares of the errors between the observed and predicted values:
$$E(a, b) \ = \ \sum_{i=1}^{I} \left(y_i - (ax_i + b)\right)^2.$$
(We cannot just add the errors, as then a positive error could cancel with a negative error. We could take the sum of the absolute values, but the absolute value function is not differentiable; it is to have calculus available that we measure errors by sums of squares.) By setting $\partial E/\partial a = \partial E/\partial b = 0$, after some algebra we find the best fit values are
$$\begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} \ = \ \begin{pmatrix} \sum_{i=1}^{I} x_i^2 & \sum_{i=1}^{I} x_i \\ \sum_{i=1}^{I} x_i & \sum_{i=1}^{I} 1 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{i=1}^{I} x_i y_i \\ \sum_{i=1}^{I} y_i \end{pmatrix};$$
see for example the supplemental material online for [2]. What matters is that the relation is linear in the unknown parameters $a$ and $b$ (or more generally $a_1, \ldots, a_\ell$); similar formulas hold for
$$y \ = \ a_1 f_1(x_1) + \cdots + a_\ell f_\ell(x_\ell).$$
For a linear-algebraic approach to regression see for example [3].

Regression is a rich subject; we wish to find the best fit parameters relating $N$ to $m$ and $k$; however, we'll shortly see that our initial guess at a relationship is non-linear. Fortunately, by taking logarithms, we can convert many non-linear relations to linear ones, and thus the formulas above are available again. The idea is that by doing extensive simulations we can gather enough data to make a good conjecture on the relationship. Sometimes, as will be the case with the German Tank Problem, we are able to do a phenomenal job in predicting the functional form and coefficients, while other times we can only get some values with confidence.

To highlight these features we first quickly review a well-known problem: the Birthday Paradox (see for example [2]).
The standard formulation assumes we have a year with $D$ days, and asks how many people we need in a room to have a 50% chance that at least two share a birthday, under the assumption that the birthdays are independent and uniformly distributed from 1 to $D$. A straightforward analysis shows the answer is approximately $D^{1/2}\sqrt{\log 4}$. We now consider the closely related but less well-known problem of the expected number of people $P$ we need in a room before there is a match. Based on the first problem it is reasonable to expect the answer to also be on the order of $D^{1/2}$, but what is the constant factor? We can try a relation of the form $P = BD^a$, and then taking logs (and setting $b = \log B$) we would get $\log P = a \log D + b$. See Figure 3.

The two simulations both have similar values for $a$, with both of them consistent with an exponent of 1/2. Unfortunately the values for $b$ wildly differ, though of the two parameters we care more about $a$, as it tells us how our answer changes with the number of days. There is an important lesson here: data analysis can often suggest much of the answer, but it is not always the full story, and there is a role for theory in supplementing such analysis.

### A.2. Issues in Applying to the German Tank Problem

(Two remarks deferred from above: the matrix in the least squares formula is invertible, and hence there is a unique solution, so long as at least two of the $x_i$'s differ. One can see this through some algebra, where the determinant of the matrix is essentially the variance of the $x_i$'s; if they are not all equal then the variance is positive. If the $x_i$'s are all equal the inverse does not exist, but in such a case we should not be able to predict how $y$ varies with $x$, as we are not varying $x$! Also, as a nice exercise, use linearity of expectation to show that we expect at least two people to share a birthday when $P = \sqrt{2D}$.)

Building on this lesson, we return to the German Tank Problem. What is a reasonable choice for $N$ as a function of $m$ and $k$? Clearly $N$ is at least $m$, so we try $N = m + f(m, k)$, which transfers the problem to estimating $f(m, k)$. We expect that as $m$ increases this should increase, and as $k$ increases it should decrease. Looking at extreme cases is useful; if $k = N$ then $f(N, N)$ should vanish, as then $m$ must equal $N$.

3. Plot of best ﬁt line for P as a function of D . We twice ran 10,000simulations with D chosen from , to , . Best ﬁt values were a ≈ . , b ≈ − . (left) and a ≈ . , b ≈ . (right).


**Figure 4.** Plot of best fit line for $N - m$ as a function of $m/k$. We ran 10,000 simulations with $N$ chosen from $[100, \ldots]$ and $k$ from $[10, \ldots]$. Best fit values for $N - m = a(m/k) + b$ for this simulation were $a \approx \ldots$, $b \approx \ldots$.

The simplest function that fits this is $f(m, k) = b \cdot m/k$ with $b$ as our free parameter, and we are led to conjecturing a relationship of the form
$$N \ = \ m + b\,\frac{m}{k} \ = \ m\left(1 + \frac{b}{k}\right).$$
Note that this guess is quite close to the true answer, but because the observed quantities $m$ and $k$ appear as they do, it is not a standard regression problem. We could try to fix this by looking at $N - m$, the number of tanks we need to add to our observed largest value to get the true number. We could then try to write this as a linear function of the ratio $m/k$:
$$N - m \ = \ a\,\frac{m}{k} + b,$$
where we allowed ourselves a constant term to increase our flexibility in what we can model. While for $a = -b = 1$ this reproduces the correct formula, finding the best fit values leads to a terrible fit, as evidenced in Figure 4.

Why is the agreement so poor, given that proper choices exist? The problem is the way $m$ and $k$ interact, and in the set-up above we have the observed quantity $m$ both as an input variable and as an output in the relation. We thus need a way to separate $m$ and $k$, keeping both on the input side. As remarked, we can do this through logarithms; we discuss another approach in the next subsection.

### A.3. Resolving Implementation Issues

We look at our best fit line for two choices of $k$: the left side of Figure 5 does $k = 1$ while Figure 6 is $k = 5$. Both of these show a terrible fit of $N$ as a linear function of $m$ (for a fixed $k$). In particular, when $k = 1$ we expect $N$ to be $2m - 1$, but our best fit line is about $0.\ldots\,m + 2875$; this is absurd, as for large $m$ we predict $N$ to be less than $m$! Note, however, the situation is completely different if instead we plot $m$ against $N$ (the right hand side of those figures). Clearly if $N$ linearly depends on $m$ then $m$ linearly depends on $N$. When we do the fits this way, the results are excellent.

5. Left: Plot of N vs maximum observed tank m for ﬁxed k = 1 . Theory: N = 2 m − , best ﬁt N = . m + 2875 . Right: Plot of maximum observed tank m vs N for ﬁxed k = 1 . Theory: m = . N + . , best ﬁt m = . N + 10 . . Least Squares Best Fit Line vs Theory: NumTanks = a MaxObs + b Least Squares Best Fit Line vs Theory: MaxObs = c NumTanks + d F IGURE

6. Left: Plot of N vs maximum observed tank m for ﬁxed k = 5 . Theory: N = 1 . m − , best ﬁt N = 1 . m + 749 . Right: Plot of maximum observed tank m vs N for ﬁxed k = 5 . Theory: m = . N + . , best ﬁt m = . N + 25 . .Note that from the point of view of an experiment, it makes more sense to plot m as the depen-dent variable and N as the independent, input variable. The reason is the way we simulate; we ﬁxa k and an N and then choose k distinct numbers uniformly from { , . . . , N } . e end with another approach which works well, and allows us to view N as a function of m .Instead of plotting each pair ( m, N ) for a ﬁxed k , we instead ﬁx k , choose an N , and then do 100trials. For each trial we record the largest serial number m , and then we average these, and plot ( m, N ) where m is the average. This greatly decreases the variability, and we now obtain a nearlyperfect straight line and ﬁt; see Figure 7. M(cid:2)(cid:3)(cid:4)(cid:5)(cid:6) (cid:7)(cid:8)
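This averaging scheme is easy to sketch in code. In the Python sketch below, the fixed k and the grid of true N values are illustrative choices, not necessarily those used for Figure 7.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5                              # illustrative fixed sample size
N_grid = np.arange(100, 2001, 25)  # illustrative grid of true N values

# For each N run 100 trials; in each trial draw k distinct serials
# from {1, ..., N} and record the maximum, then average the maxima.
avg_max = np.array([
    np.mean([rng.choice(N, size=k, replace=False).max() + 1
             for _ in range(100)])
    for N in N_grid])

# Averaging suppresses most of the noise, so N versus the averaged
# maximum is nearly a perfect line; the theory here predicts slope
# (k+1)/k = 1.2 and intercept -1.
slope, intercept = np.polyfit(avg_max, N_grid, 1)
print(f"N = {slope:.3f} * avg_max + {intercept:.1f}")
```

The fitted slope should land very close to the theoretical (k+1)/k, illustrating why averaging before fitting yields such clean lines.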

Figure 7. Plot of N versus the maximum observed tank m (averaged over 100 trials per N) for fixed k = 1 (least squares best fit line, NumTanks = a MaxObs + b). Theory: N = 1.…m − 1; best fit: N = 1.…m + 171.….

A.4. Determining the Functional Form.
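The fit developed in this subsection, regressing log N on log m and 1/k with C fixed at 1, can be sketched numerically as follows; the sampling ranges below are illustrative assumptions, not necessarily those used for Figure 8.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observations: true N, sample size k, observed maximum m.
# (The ranges for N and k are illustrative.)
trials = 10_000
N_vals = rng.integers(100, 2001, size=trials)
k_vals = rng.integers(10, 101, size=trials)
m_vals = np.array([rng.choice(N, size=k, replace=False).max() + 1
                   for N, k in zip(N_vals, k_vals)])

# With C = 1 (so c = 0), fit log N ~ a*log(m) + b/k by least squares;
# this is a two-parameter regression with no constant term.
X = np.column_stack([np.log(m_vals), 1.0 / k_vals])
(a, b), *_ = np.linalg.lstsq(X, np.log(N_vals), rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Both fitted parameters should come out close to 1, supporting the conjectured form N = m(1 + 1/k).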

We consider the more general relation

N = C m^a (1 + b/k),

where we expect C = a = b = 1; note this will not have the −1 summand we know should be there, but for large m that should have negligible impact. Letting C = e^c for notational convenience, we find

log N = c + a log(m) + log(1 + b/k).

If x is large then log(1 + 1/x) ≈ 1/x, so we try the approximation

log N ≈ c + a log(m) + b/k.

Figure 8 shows the analysis when C = 1 (so c = 0), as the analysis then reduces to the usual case with two unknown parameters. We chose to take C = 1 from the lesson we learned in the analysis of the Birthday Problem. The best fit values of the parameters are a = 0.… and b = 0.…, which are reasonably close to a = b = 1. Thus these numerics strongly support our conjectured relation N = m(1 + 1/k), and show the power of statistics. While we were able to see the arguments needed to prove this relation essentially holds, imagine we could not prove it but still had our heuristic arguments and analysis of extreme cases suggesting it is true. By simulating data and running the regression, we see that our formula does a stupendous job explaining our observations, and we thus gain confidence to use it in the field.

Figure 8. Plot of log N against log m and 1/k. We ran 10,000 simulations with N chosen from [100, …] and k from [10, …]. The data is well approximated by a plane (which we do not draw, in order to prevent our image from being too cluttered).

We end with one last approach. Let us guess a relationship of the form N = a(k)m + b(k), where a(k) = 1 + f(k) (we write a(k) as 1 + f(k) since we know there must be at least m tanks). We can fix k and find the best fit values of a(k) and b(k). In Figure 9 we plot the best fit slope a(k) versus k, as well as a log-log plot. For the log-log plot we look at a(k) − 1, subtracting off the known component. We see a beautiful linear relation, and thus even if we did not know the answer should be m plus a constant times m/k, the data suggests it beautifully! Specifically, we found the best fit line was log(a(k) − 1) = −0.999 log(k) − 0.…, suggesting that a(k) = 1 + 1/k; we obtain the correct functional form just by running simulations!

Figure 9. Left: plot of a(k), the slope in N = a(k)m + b, versus k. Right: log-log plot of a(k) − 1 versus k. In log(a(k) − 1) versus log(k), the theory is log(a(k) − 1) = −log(k); the best fit line is log(a(k) − 1) = −0.999 log(k) − 0.….

Email address: [email protected]

Department of Mathematics and Statistics, Williams College, Williamstown, MA 01267

Email address: [email protected]

Department of Mathematics and Statistics, Williams College, Williamstown, MA 01267

Email address: [email protected], [email protected]

Department of Mathematics and Statistics, Williams College, Williamstown, MA 01267

Current address: Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA 15213