In this video I answer the common question of why we divide by n-1 when calculating variance from a sample, an adjustment known as Bessel’s Correction. I focus on a conceptual understanding of why this adjustment is needed and why n-1 is the appropriate adjustment on average, rather than making up a population and possible samples to illustrate it. I show why x-bar tends to underestimate the squared deviations, then provide two arguments for why n-1 adjusts for this: one based on degrees of freedom, and the other based on estimating the average amount of bias in the sample variance.
Video Transcript
Hi, I’m Michael Corayer and this is Psych Exam Review. In this video I’m going to explain why we tend to divide by n minus 1 when estimating variance rather than dividing by n. This is known as Bessel’s Correction.
Now there are a lot of videos out there that will make up a population, then make up some sample from that population, and try to show you that dividing by n minus 1 gives a better estimate than dividing by n in that particular case. But I don’t think this necessarily helps you understand the underlying logic. I think if you really understand why n minus 1 is going to give you a better estimate than n on average, then you can apply this to any data set. So I’m going to focus on explaining this conceptually rather than using a made-up population or some made-up samples.
The first part of our conceptual understanding will be to see why using X bar instead of mu will almost always give us an underestimate of our sum of squared deviations. This is really the source of our problem, and then we can see why using n minus 1 rather than n is the appropriate adjustment for that amount of underestimate. To understand why using X bar will tend to underestimate our deviations, let’s imagine that we have a normally distributed population with a population mean at mu. And let’s assume that we don’t know what mu is, which is generally the case. We don’t usually know our true population mean, but we don’t need to know it in order to see that using X bar will tend to give us an underestimate of our deviations.
So now let’s imagine that we have some sample from our population and this gives us an X bar that’s below mu. Now we can think about what that’s going to do to our deviations. What we can realize is that there’s a point exactly midway between X bar and mu: from that point and below, all of those scores will underestimate the deviation, because all of those scores are now closer to X bar than they are to mu. And any scores above that point are going to overestimate the deviation, because now they’re farther from X bar than they would be from mu. So the question is, well, how do we know if our sample has more underestimates or more overestimates?
And the answer is, we know because we know where X bar is. In order for X bar to be where it is, the sample must contain more of the lower scores. And so if we think about drawing our sample distribution, we’ll assume that it’s a normal distribution, although it doesn’t have to be; so long as X bar is where it is, we can see that more of the scores in that sample are going to be underestimates than overestimates. And so when we sum up all those deviations we’re going to end up with an underestimate of the deviations from mu.
We can also see that the exact same thing will happen if our X bar is an overestimate of mu. In this case there’ll be a point midway between mu and X bar, and any scores above that point are going to underestimate the deviation because they’re closer to X bar than to mu, and any scores below that point will now be the overestimates, because they’re farther from X bar than they are from mu. But again, we must have more scores in the underestimate territory and fewer scores that are overestimates in order to have X bar where it is. And so once again, in total, we’re going to end up with an underestimate because X bar differs from mu. So any time that X bar differs from mu, we’ll get an underestimate of the deviations.
Now it is also possible that you could get an X bar that actually equals mu, and if this happens then you won’t be underestimating the deviations. You’ll be getting the deviations just right, because comparing to X bar is exactly the same as comparing to mu. But this is going to happen in a very small minority of the possible samples you might draw from the population. And when it does happen, when you get an X bar that equals mu, you won’t know that it happened, because you don’t actually know what mu is, so you won’t be able to say for sure that that’s what happened. What’s generally safer is to assume that your X bar probably differs from mu, because most samples will give an X bar that differs from mu, and therefore you should just always assume you have an underestimate. That’s going to be more accurate in most cases and incorrect in only a very small number of cases.
We can also note here that as our sample size gets larger and larger then our X bar will be a better and better approximation of mu. So as our sample size gets larger, X bar gets closer and closer to mu and what this means is that as our sample size gets larger and larger our underestimate of the deviations will get smaller, and smaller, and smaller.
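To make this concrete, here’s a minimal Python sketch (not from the video) that simulates many samples from a made-up normal population with mu = 0 and sigma = 1. It checks both claims: the squared deviations measured from X bar never exceed the squared deviations measured from mu, and the average shortfall shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # made-up population values for this simulation
reps = 20_000                 # number of samples to draw at each sample size

for n in (5, 50, 500):
    samples = rng.normal(mu, sigma, size=(reps, n))
    x_bar = samples.mean(axis=1, keepdims=True)
    msd_about_xbar = ((samples - x_bar) ** 2).mean(axis=1)  # what we can actually compute
    msd_about_mu = ((samples - mu) ** 2).mean(axis=1)       # what we'd compute if we knew mu
    # The X-bar version is never larger, and the average shortfall shrinks as n grows
    print(n, (msd_about_xbar <= msd_about_mu).all(), (msd_about_mu - msd_about_xbar).mean())
```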
So now we have the reasoning for why our sum of squared deviations using X bar is probably going to be an underestimate. And so now we can start thinking about, how do we adjust for that? And we’re going to look at two different ways of thinking about this. One that’s based on the idea of degrees of freedom, and another that’s based on trying to estimate the amount of bias that our underestimate will cause.
So let’s start with this idea of degrees of freedom. You might see n minus 1 referred to as our degrees of freedom, but you might not be sure what that really means. So this is related to the idea that if we know X bar then we know something about what our sample is composed of, right? We saw that in order for X bar to be where it is then the sample must be composed of scores that will put it there. Now we can think about this as degrees of freedom. So degrees of freedom are the values that we have that are free to vary; they could be any value at all. And what we see is if we know X bar then not all of the values in our sample are free to vary, only n minus one of them are. The very last score has to be a particular score in order to get to the X bar we said we have. So all the other scores can be anything they want but our final score has to bring us to X bar.
To demonstrate this, imagine that I tell you that you have a sample of 5 scores and your X bar for the sample is 5. You can pick any values you want, but you have to get to an X bar of 5. And what you’ll realize is that you can freely pick the first 4 values; they really can be anything that you want. But the 5th value is going to be the one that you need to get to that X bar. So if you say, well, my first value is 0, my next value is 3, my next value is 7, my next value is 9, then I say, well, you have to get to X bar now, so what’s your 5th value? What does it have to be? If you start adding those up you say, “Well, I’m at 19, and I know I need to get to a total of 25 because 25 divided by 5 will give me an X bar of 5,” and so you realize that final score has to be 6. There’s no other value that it could possibly be, and so it was not free to vary. This is why, in that case, your degrees of freedom would be 4 rather than 5. So it’s n minus one because you know X bar.
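Here’s a tiny snippet, just as an illustration, that runs the same arithmetic as the example above: once the first four values and X bar are fixed, the fifth value is forced.

```python
# With X bar fixed at 5 for a sample of 5 scores, the first four values are free,
# but the fifth is determined by the total they must reach.
x_bar, n = 5, 5
free_values = [0, 3, 7, 9]                  # chosen freely
forced_last = x_bar * n - sum(free_values)  # must bring the total to 25
print(forced_last)  # 6 -- the only value that works, so only n - 1 scores were free to vary
```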
So what we’re saying is that the nth value in our sample wasn’t free to vary; it had to be that value in order to get to the X bar that we’ve been using for all our deviations. So once we have n minus 1 scores and we know X bar, that last value is sort of a given, and so it’s not really adding any new information. It’s information that was already contained in the other n minus 1 scores and X bar. So then when we think about dividing up the contribution that each of these values made to our understanding of the variance, we might say we shouldn’t divide by n. If we divide by n, we’re saying they all contributed equally. But they didn’t. The last one didn’t really give us anything new. It still has a deviation from X bar that we want to include, but it was just the deviation that had to be there, right? It had to be this value for x, therefore it had to be this deviation from X bar. And so we include it as a deviation, but when it comes to spreading the contribution amongst all the scores, only n minus 1 of them gave us new information, and so we should only divide by n minus 1, our degrees of freedom.
We can also see that using n minus 1 takes our sample size into account. If we only have 5 scores in our sample then we probably have a pretty large underestimate and so the difference between dividing by 4 and dividing by 5 is fairly large. But if we had a sample size of 10,000 then dividing by 9,999 will only make a very small change. But that’s okay because we only need a very small change because if our sample size is 10,000 then our underestimate is probably really small. We’re probably already very close to the correct estimate of the variance and so we only need a minor adjustment.
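As a quick numeric illustration (not in the video), the correction factor n over (n minus 1) shows exactly how the size of the adjustment shrinks with n:

```python
# Switching from dividing by n to dividing by n - 1 inflates the estimate
# by a factor of n / (n - 1), which approaches 1 as n grows.
for n in (5, 10_000):
    print(n, n / (n - 1))   # 1.25 for n = 5, but only about 1.0001 for n = 10,000
```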
So this is sort of the degrees of freedom argument for why we would use n minus 1. But it doesn’t directly address how much bias we actually have, or whether n minus 1 is the appropriate adjustment for that amount of bias.
In order to address this, we have to try to think about how much bias we have. We could say that, on average, our biased sample variance gives us the population variance minus some amount of bias. And we know that that amount of bias is going to depend on the sample size that we have. As our sample size gets larger, and larger, and larger, our bias gets smaller, and smaller, and smaller. So how can we express this in mathematical terms? What can we put in here to represent our bias that will also take our sample size into account?
Let’s imagine that we had a sample size of 1. In this case X would be equal to X bar, right? It’s the average of itself. We only have one score, and so if we try to calculate the deviation we’ll get zero. X minus X bar is just the same number minus itself, which gives you 0. And so our estimate of the variance, if we only have one score, is always going to be 0. And that makes sense, because one score doesn’t vary. But then we can say, well, how much would we be off by? What would our amount of bias be? And we see that with a sample size of one we’ll always estimate the variance to be 0, and so we’ll be off by the entire population variance. There’s a population, it has some variance, we estimated that variance to be zero, and we’re wrong by whatever the population variance actually is. And so we could say that if our sample size is one, then we’re going to be off by Sigma squared. We’re going to be off by the whole population variance.
But what if we had two scores? Well, now we have two values for X, we have an X bar between them, and we can think about the deviation from X1 to X bar and from X2 to X bar. We can think about those as two contributions to the variance, right? And we’re taking the average of them. So in the case where we had a sample size of one we were off by Sigma squared, but if we think about a sample of 2, and we consider all the possible different samples of n equals 2 that we could get, on average we should cut our bias in half. And so we could say that our bias is Sigma squared divided by 2.
And then if we think, well, if my sample size were 10, I’d be calculating 10 deviations from X bar and taking the average of those 10, and across all the possible samples of 10 that I could collect I should cut my bias by a factor of 10. So my bias would be Sigma squared divided by 10. And so we see that we can represent the bias as Sigma squared divided by n.
This makes sense because as n gets larger, and larger, and larger, our bias gets smaller, and smaller, and smaller. And as n approaches infinity then we have no bias at all. If we can actually sample every possible value of x then we won’t have any bias. At that point our biased estimate is actually going to be equal to the population variance because there’s no bias.
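As a rough check of this claim, here’s a small simulation sketch (my own, using a made-up normal population where sigma is 2, so Sigma squared is 4) showing that the average shortfall of the biased sample variance comes out close to Sigma squared divided by n:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 10.0, 2.0        # made-up population values, so sigma squared = 4
reps = 200_000               # number of samples drawn at each sample size

for n in (2, 5, 20):
    samples = rng.normal(mu, sigma, size=(reps, n))
    x_bar = samples.mean(axis=1, keepdims=True)
    biased_var = ((samples - x_bar) ** 2).mean(axis=1)   # divides by n
    # Average shortfall from sigma squared, next to sigma squared / n for comparison
    print(n, sigma**2 - biased_var.mean(), sigma**2 / n)
```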
Now we can take this mathematical representation of the bias and put it into the equation that we had earlier. We can say that our biased sample variance equals the population variance, Sigma squared, minus the bias, which is Sigma squared over n. Now we can do some algebra with this. We can say, okay, our biased sample variance equals Sigma squared times (1 minus 1 over n); we’ve factored out our Sigma squared there. Then we can get Sigma squared by itself if we divide by (1 minus 1 over n), so we divide both sides by that, and we see that now we have the biased sample variance divided by (1 minus 1 over n) equals the population variance.
Then we can realize that dividing by (1 minus 1 over n) is exactly the same as multiplying by n over (n minus 1). You can test that by just putting in a value for n. If you put in 4, you get 1 minus 1/4, so 3/4, and dividing by three-fourths is exactly the same as multiplying by four-thirds. So now we have that our biased sample variance times n over (n minus 1) equals the population variance.
Now we can write out our formula for the biased sample variance. We can see that the sum of (x minus X bar) squared, divided by n, times n over (n minus 1), equals the population variance. The n’s cancel, and so we get that the population variance equals the sum of (x minus X bar) squared over (n minus 1), on average.
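For readers who prefer to see the algebra written out as equations, here it is, writing s² with a "biased" subscript for the biased sample variance; as discussed above, the first line holds on average across samples rather than for any single sample:

```latex
\begin{align*}
s^2_{\text{biased}} &= \sigma^2 - \frac{\sigma^2}{n}
                     = \sigma^2\left(1 - \frac{1}{n}\right)\\[4pt]
\sigma^2 &= \frac{s^2_{\text{biased}}}{1 - \frac{1}{n}}
          = s^2_{\text{biased}} \cdot \frac{n}{n-1}
          = \frac{\sum (x - \bar{x})^2}{n} \cdot \frac{n}{n-1}
          = \frac{\sum (x - \bar{x})^2}{n-1}
\end{align*}
```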
So this shows us that dividing by n minus 1 is the appropriate amount of adjustment for the estimate of how much bias we have. Now of course this isn’t true in all cases. This doesn’t always work. It’s possible you could have a very small sample size but you happen to get X bar equal to mu and so you actually don’t have an underestimate, and therefore you don’t have any bias. And yet, because you have a small sample size and you’re going to use this adjustment you’re going to be adjusting for bias that isn’t actually there. As a result you’ll get an overestimate of the variance. But this is only going to happen in a very small minority of cases and given that we usually don’t know mu we don’t know when it happens. So the safe bet is to assume that we always have some bias, Sigma squared divided by n, and the most appropriate way to adjust for that on average is to just divide using n minus 1 rather than n.
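As one last sanity check (again using a made-up population with Sigma squared equal to 4), here’s a short simulation showing that, averaged across many samples, dividing by n minus 1 lands on the population variance while dividing by n falls short:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 10.0, 2.0, 5, 200_000   # made-up population; sigma squared = 4

samples = rng.normal(mu, sigma, size=(reps, n))
print(np.var(samples, axis=1, ddof=0).mean())  # divides by n: about 3.2, an underestimate
print(np.var(samples, axis=1, ddof=1).mean())  # divides by n - 1: about 4.0, on target
```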
So I hope that has helped you to understand the conceptual logic of why we have an underestimate and why n minus 1 will usually be the appropriate adjustment for that underestimate. Let me know if you found this helpful in the comments, ask other questions that you might have, like and subscribe for more, and make sure to check out the hundreds of other psychology tutorials that I have on the channel. Thanks for watching!