It's All on the Square: The Importance of the Sum of Squares and Making the General Linear Model Simple
Alexander Nussbaum, St. John's University / Analytic Medtek Consultants LLC
Richard Seides, Seton Hall University

Statistics is one of the most valuable of disciplines. Science is based on proof, and science alone produces results; other approaches do not. Statistics is the only acceptable language of proof in science. Yet statistics is difficult to understand for a large percentage of those who will be evaluating and even doing research. Reasons for this difficulty may be that statistical reasoning runs counter to the way people ordinarily think, as well as a widespread phobia of numbers. Adding to the difficulty, undergraduate textbooks tend to present statistical tests as an unorganized conglomeration of unrelated procedures, and this leads students to miss that all of the parametric procedures they are studying in an introductory course are ultimately doing the same thing and stem from common sources. In statistics, precisely because the material is complex, the presentation must be simple! This article endeavors to do just that.
The Problem in Teaching Statistics
Statistics is among the most valuable subjects for anyone to learn, but unfortunately it is difficult to teach. It requires a probabilistic method of thinking that is different from that in other courses -- even other math courses. Science produces knowledge about empirical reality, and statistics is the language of proof in science. As put by the father of statistics, Sir Francis Galton (1889), "Some people hate the very name of statistics, but I find them full of beauty and interest…They are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." Over 120 years later, statistics is still universally seen as the basis of proof in science.
There is a big difference between complexity of material and simplicity of presentation. Only a presentation that strives to simplify statistics to the greatest extent possible, and maximizes the use of everyday language and vivid examples, can possibly succeed in explaining the complexity of statistics to a group of tentative students. The situation statistics faces in common with science is well illustrated by a quote from computer scientist and musician Jaron Lanier (Brockman, 2003):
The arts and humanities have been perpetually faced with the challenge of making simple things complicated. So there exist preposterously garbled academic books about philosophy and art…. Science faces the opposite problem. (p. 368)

Parametric Statistics and the Concept of the Sum of Squares

What we want to do here is provide a simple explanation of what the parametric statistical procedures undergraduate students will be taught are doing, and also emphasize the underlying identicalness of all these procedures. Simply put, parametric statistics assume a shape for the population distribution, generally a normal distribution, and make inferences about the measures of that distribution, such as the mean. We believe that the key to presenting introductory statistics in its simplest and most comprehensible form is an approach that emphasizes the importance of understanding the idea of the Sum of Squares [1]. Not only is this not currently commonly done, some textbooks go as far as presenting the definitional formula of the sum of squares when going over the standard deviation, and then shift to a computational formula when doing ANOVA or correlation [2]. Students thus fail to grasp that we are still computing the same thing, the Sum of Squares. When students understand what the Sum of Squares is doing, they will better understand various statistical procedures. In most parametric statistics the total variability (how much scores vary from one another) is calculated, and the concern is determining what portion of that variability is due to the variables under study.
The Sum of Squares
The sum of squares, short for the sum of squared deviations from the mean, is omnipresent throughout statistics. Statistics is all about variability: how much numbers vary from one another. Our basic measure of variability is called the “variance,” which is the sum of squares divided by the number of scores that are free to vary -- N, the total number of scores, in a population, and n-1, i.e., degrees of freedom, in a sample. The concept of degrees of freedom will be explained below. Undergraduate students, graduate students, and even some faculty members have a hard time understanding statistics, and early in a statistics course the critical concepts of Sum of Squares, variance, and standard deviation pose difficulty.

For our basic measure of variability we want a gauge of average differences from the mean. Thus, we subtract the mean from each score. It may at first blush seem we could merely add up these differences and divide by how many scores there are; however, the sum of deviations from the mean is always zero. We get around this by squaring each deviation, hence the sum of squares [4]. The definitional formula for the Sum of Squares is:

$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$

This indicates subtracting the mean from every score, squaring the result, and then adding all the squared deviations from the mean. The sample mean is symbolized x̄ ("x-bar"). The i = 1 means you start summing with the first score, the n being the total sample size, and you continue summing until the last score.

When more data are added, the sum of squares will generally increase simply because there are now more scores that can differ from the mean. So we divide by degrees of freedom, giving us the variance. The idea of degrees of freedom is rather simple. If you are told a mean is based on 3 scores and given the mean, there is no way for you to know what the scores are. However, if you are told two of the scores, the third is locked in; you therefore know it. Likewise, if a mean is based on ten scores and you are given nine, the last one is known. Thus with n scores and a mean, n-1 are free to vary; the last is locked in place. A degree of freedom is lost when the sample mean, which is only an estimate of the population mean, must be used to calculate the sum of squares [5]. While we account for how many scores could have produced variability by dividing by degrees of freedom, it is important to understand that the Sum of Squares by itself is a measure of variability. Textbooks generally do not seem to make the above clear, which prevents using the Sum of Squares to demonstrate the unity of all the parametric statistics covered in an undergraduate course. A close analogy is hits in baseball: the number of hits by a player tends to grow with the number of at-bats, which is why we divide to get a batting average, just as we divide the Sum of Squares by degrees of freedom to get the variance.
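To make this concrete, here is a minimal sketch in Python (assuming NumPy is available), using the four scores from the worked example later in this article; it shows that the deviations from the mean sum to zero and that dividing the Sum of Squares by degrees of freedom yields the variance:

    # A minimal sketch of the definitional formula: subtract the mean from
    # every score, square each deviation, and sum.
    import numpy as np

    x = np.array([11.0, 7.0, 30.0, 20.0])   # the four scores used in the example below
    deviations = x - x.mean()                # the mean is 17.0

    print(deviations.sum())                  # ~0.0: deviations from the mean always sum to zero
    ss = (deviations ** 2).sum()             # Sum of Squares = 314.0
    variance = ss / (len(x) - 1)             # divide by n - 1 (degrees of freedom) for a sample
    std_dev = np.sqrt(variance)              # standard deviation is the square root of variance
    print(ss, variance, std_dev)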
History

The Sum of Squares is essential to the standard deviation, and understanding the normal distribution requires understanding the standard deviation. By 1733 Abraham de Moivre had described the normal distribution (Hald, 2003), which means he understood the parameter of standard deviation, though he did not name it as such (McGrath, 1999). The discovery and first use of the Method of Least Squares, which is at the heart of regression, turned into a dispute over priority between Carl Gauss and Adrien-Marie Legendre (Stigler, 1981). Legendre published the method in 1805; Gauss had a claim to have discovered it in 1794. The method of least squares is the technique of minimizing the sum of the squared errors. In correlation, for example, the regression line, our best formula for predicting one variable from another, is the line that minimizes the squared distances of all the data points to the line; that is, it minimizes the sum of the squared actual-minus-predicted scores. In any case, this concept utilizing the Sum of Squares is just over 200 years old. The term “standard deviation” was coined by Karl Pearson in 1893 (Gillispie, 1970), although as we just saw the idea is much older. Concepts very close to the current standard deviation have been referred to as “error of the mean square” or “mean error” since the early 19th century, Carl Gauss using the latter term in 1821 (Sprott, 1978).
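The logic of least squares can be demonstrated directly. Below is a minimal sketch in Python (assuming NumPy; the x and y values are hypothetical numbers invented for illustration) showing that the fitted regression line produces a smaller sum of squared actual-minus-predicted scores than an arbitrary competing line:

    # Least squares: the regression line minimizes the sum of squared errors.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # hypothetical data

    slope, intercept = np.polyfit(x, y, 1)        # least-squares fit of a straight line
    predicted = slope * x + intercept
    ss_error = ((y - predicted) ** 2).sum()       # sum of squared (actual - predicted)

    # Any other line yields a larger sum of squared errors:
    ss_other = ((y - (2.5 * x + 0.0)) ** 2).sum()
    print(ss_error < ss_other)                    # True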
Partitioning the Sum of Squares: What Does it Mean?

Endeavored below is a simple way of explaining to introductory statistics students what we are doing in an ANOVA. Say we have four scores: 11, 7, 30, and 20. By how much do they vary? We want to know how much each score differs from the mean -- this seems an intuitive way to get variability -- so we subtract the mean from every score. We find that the sum of deviations from the mean is always zero (this may not be immediately intuitive but is a consequence of what the mean, or average, actually is). So we square each deviation and add them together. We now have the sum of squared deviations from the mean, or the Sum of Squares. Obviously, at this point we could divide the Sum of Squares by N or n-1 in order to get the variance and then take the square root to get the standard deviation. However, for simplicity's sake, only the Sum of Squares as the measure of variability will be discussed. Consider our limited data set below (the mean is 17):

(11-17)² = 36
(07-17)² = 100
(30-17)² = 169
(20-17)² = 9
Σ = 314

Our Sum of Squares is 314. This is the measure of total variability in our four scores. Let us say, however, that these four scores came from two groups: the scores in group one were 11 and 7, and the scores in group two were 30 and 20. This can represent a very simple experiment -- say, group one worked in a low noise environment and group two in a high noise environment. By how much do the 11 and 7 differ? The mean of group one is 9: (11-9)² = 4, (7-9)² = 4. The Sum of Squares for group one is 8 (4 + 4). By how much do the 30 and 20 differ? The mean of group two is 25: (30-25)² = 25, (20-25)² = 25; 25 + 25 = 50, so the Sum of Squares for group two is 50. The Sum of Squares within both groups combined is 58. (Note we are now using each group's own mean, not the overall or grand mean.)

Why do the scores in group one differ from one another? Why do the scores in group two differ from one another? The answer can be explained by subject (participant) differences and/or error in measurement. Think about what we have said so far. The total variability is 314. The variability due to subject differences is 58, which means the variability due to group (due to being in a low noise versus high noise environment, in our example) is 314 - 58, which equals 256 SS.

Let us look at the ratio of variability due to the group (the treatment) to the total variability, which is the total SS. This is simply all we will be doing in regression analysis or analysis of variance, which are the same thing. We have just covered the underlying logic of parametric statistics. The ratio of the variability due to treatment (high noise vs. low noise) over all the variability is 256/314, or .815. We call this number r², the coefficient of determination.

By partitioning the Sum of Squares, it is now easy to perform an Analysis of Variance (ANOVA). We can enter the following into our ANOVA table: SS between = 256, SS within = 58, and SS total = 314. With two groups, df between is going to be 1 (it is the number of groups minus 1), and df within is (n of group one - 1) + (n of group two - 1), here 1 + 1 = 2. Between groups represents differences in scores (variability) due to what group they came from, and within groups represents differences in scores (variability) due to subject differences. T-tests are a special case of ANOVA with just two means, and t² = F; if this were treated as an independent-groups design for which a t was computed, t would equal 2.97. Our full ANOVA table is:

Source           Sum of Squares   df   Mean Square   F       Sig.
Between Groups   256.000          1    256.000       8.828   .097
Within Groups    58.000           2    29.000
Total            314.000          3
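The whole partition can be verified in a few lines. Here is a minimal sketch in Python (assuming NumPy and SciPy are available) that reproduces the numbers in the table above and cross-checks the F ratio against SciPy's one-way ANOVA:

    # Partitioning the Sum of Squares for the two-group example.
    import numpy as np
    from scipy import stats

    group1 = np.array([11.0, 7.0])     # low noise group
    group2 = np.array([30.0, 20.0])    # high noise group
    scores = np.concatenate([group1, group2])

    ss_total = ((scores - scores.mean()) ** 2).sum()                        # 314.0
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in (group1, group2))  # 58.0
    ss_between = ss_total - ss_within                                       # 256.0

    df_between = 2 - 1                  # number of groups - 1
    df_within = (2 - 1) + (2 - 1)       # (n1 - 1) + (n2 - 1)
    f = (ss_between / df_between) / (ss_within / df_within)

    print(ss_between / ss_total)        # 0.815..., r-squared, the coefficient of determination
    print(f, np.sqrt(f))                # F = 8.828, and its square root t = 2.971

    f_check, p = stats.f_oneway(group1, group2)
    print(f_check, p)                   # 8.828, .097 -- matching the ANOVA table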
Notes

[1] The authors recognize that there have been techniques developed for handling parametric data which do not involve the sum of squares; however, these are generally not covered in undergraduate statistics courses. There are methods of parameter estimation, such as maximum likelihood estimation, that do not employ the least squares method; however, in addition to being beyond what undergraduates can reasonably be expected to encounter in a statistics course, the two methods obtain similar results if the assumption of normality is met.

[2] Before the days of calculators the computational formula was somewhat easier to use, because it did not require subtracting the mean from each score; however, it obscures the fact that the same quantity, the Sum of Squares, is being computed.

References

Brockman, J. (Ed.) (2003) The New Humanists: Science at the Edge. New York: Barnes & Noble Books.

Galton, F. (1889) Natural Inheritance. London: Macmillan.

Gillispie, C.C. (Ed.) (1970) Dictionary of Scientific Biography. New York: Charles Scribner's Sons.

Gorard, S. (2005) Revisiting a 90-year-old debate: the advantages of the mean deviation. British Journal of Educational Studies.

Hald, A. (2003) A History of Probability and Statistics and Their Applications Before 1750. Wiley.

McGrath, K.A. (Ed.) (1999) World of Scientific Discovery.

Sprott, D.A. (1978) Gauss's contributions to statistics. Historia Mathematica, 5(2), 183-203.

Stigler, S.M. (1981) Gauss and the invention of least squares. Annals of Statistics, 9(3), 465-474.

Tukey, J. (1960) A survey of sampling from contaminated distributions. In Olkin, I., Ghurye, S., Hoeffding, W., Madow, W., & Mann, H. (Eds.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford: Stanford University Press.

Vince, R. (2007) The Handbook of Portfolio Mathematics. New York: John Wiley & Sons.