In statistics, the Pearson correlation coefficient (PCC) measures the strength and direction of the linear relationship between two sets of data. It is the ratio of the covariance of two variables to the product of their standard deviations, which makes it a normalized covariance whose value always lies between −1 and 1. It can therefore help us understand the relationship between variables, but only in terms of linear correlation.
"If the Pearson correlation coefficient of two variables is 1, then there is a perfect positive correlation between them."
For example, suppose we examine the relationship between the age and height of students in an elementary school. We would expect the Pearson correlation coefficient for these two variables to be greater than 0 but less than 1: older children tend to be taller, but height is not perfectly determined by age.
The Pearson correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s, although the mathematical formula had already been derived and published by Auguste Bravais in 1844. The naming of the coefficient is therefore an example of Stigler's Law, which states that no scientific discovery is named after its original discoverer.
“The development of statistics is not only the evolution of numbers, but also the exploration of the stories behind the data.”
From a geometric point of view, the correlation coefficient is the cosine of the angle between two vectors, each obtained by subtracting the mean from one of the two sets of data. This is why its value always lies between −1 and 1. Its magnitude is exactly 1 when all the data points lie on a single straight line: +1 if the line slopes upward and −1 if it slopes downward.
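The geometric reading can be checked numerically. The following is a minimal Python sketch, not a reference implementation: the function names and the data values are hypothetical, chosen only for illustration. It centers each data set on its mean and then takes the cosine of the angle between the two resulting vectors.

```python
import math


def centered(values):
    """Return the values with their mean subtracted."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def cosine_of_angle(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_as_cosine(x, y):
    """Pearson's r computed as the cosine between the centered vectors."""
    return cosine_of_angle(centered(x), centered(y))


# Hypothetical data: two sets of points that lie exactly on straight lines.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_line = [2.0 * v + 1.0 for v in x]    # rising line
y_down = [-3.0 * v + 10.0 for v in x]  # falling line

print(pearson_as_cosine(x, y_line))  # close to +1, up to floating-point rounding
print(pearson_as_cosine(x, y_down))  # close to -1, up to floating-point rounding
```

Because a cosine can never leave the interval from −1 to 1, this view also makes the bounds of the coefficient immediate.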
Pearson's correlation coefficient is defined as the covariance of two variables divided by the product of their standard deviations. The definition involves a "product moment", that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the "product-moment" qualifier in the coefficient's full name.
When applied to a population, the Pearson correlation coefficient is often denoted by the Greek letter ρ (rho) and is called the population correlation coefficient or the population Pearson correlation coefficient. For a pair of random variables (X, Y), it is the covariance of X and Y divided by the product of their standard deviations. The fully expanded form, written in terms of expected values, is more cumbersome, so the compact ratio is the form usually quoted.
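For reference, the ratio form of the population coefficient can be written in LaTeX as follows. The symbols μ_X and μ_Y (the means of X and Y), σ_X and σ_Y (their standard deviations), and E (the expectation) do not appear in the text above and are introduced here as standard notation.

```latex
\rho_{X,Y}
  = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{\operatorname{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y}
```

The expression is defined only when both standard deviations are finite and nonzero.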
“Covariance is the key to understanding the interactions between variables.”
When the Pearson correlation coefficient is computed from a sample, it is usually denoted r and may be called the sample correlation coefficient or the sample Pearson correlation coefficient. It is obtained by replacing the covariance and variances in the definition with their sample estimates, so it serves as an estimate of the relationship between the two variables.
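The sample calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function name is hypothetical, and the ages and heights are invented values that merely echo the elementary school example.

```python
import math


def sample_pearson_r(x, y):
    """Sample Pearson r: sample covariance over the product of sample standard deviations."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sample estimates, each using the n - 1 denominator.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)

    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")

    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical data: ages in years and heights in centimeters.
ages = [6, 7, 8, 9, 10, 11]
heights_cm = [115, 122, 126, 135, 138, 146]

r = sample_pearson_r(ages, heights_cm)
print(r)  # strongly positive, but below 1 because the points are not exactly on a line
```

The n − 1 factors cancel in the ratio, so using n instead would give the same value of r; they are kept here only so that the intermediate quantities match the usual sample covariance and variance.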
Although the Pearson correlation coefficient is widely used, it reflects only linear relationships and can miss other types of association, so it must be used with care. The value obtained also depends on which data are included and how they are analyzed, so interpreting the coefficient matters as much as calculating it.
“Data cannot speak for itself, but its potential meaning is revealed through proper interpretation.”
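The restriction to linear relationships can be made concrete with a small Python sketch. The function name and data are hypothetical and chosen for illustration: one variable is completely determined by the other through a parabola, yet the coefficient reports no linear association at all.

```python
import math


def pearson_r(x, y):
    """Pearson's r from sums of centered products (assumes equal, nonzero-variance inputs)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, symmetric around zero.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_curve = [v * v for v in x]         # y depends entirely on x, but not linearly
y_line = [2.0 * v + 1.0 for v in x]  # y depends linearly on x

print(pearson_r(x, y_curve))  # zero, up to floating-point rounding
print(pearson_r(x, y_line))   # close to 1
```

A coefficient near zero therefore rules out a linear trend only; it does not show that the two variables are unrelated.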
Ultimately, the Pearson correlation coefficient is a powerful tool for understanding the relationship between variables, but it should always be used with critical thinking. When two variables in your own life seem to be related, have you considered whether other factors might be shaping that relationship?