In this video I explain the concept of statistical significance, including alpha levels, probability values (p-values) and the critical region or rejection region of a comparison distribution.
Video Transcript
Hi I’m Michael Corayer and this is Psych Exam Review. In the video on research and null hypotheses I said that we want a result to be unlikely in order to decide to reject the null hypothesis. But just how unlikely does a result have to be? This brings us to the concepts of statistical significance and the calculation of probability values or p values.
So we’ll start with this idea of statistical significance, or saying that a result is significant. Now the first thing to note is that when we say a result is significant we don’t mean that it’s important or meaningful or substantial. We just mean that it has met certain mathematical criteria in terms of how unlikely it is.
So what are these criteria? Well, deciding on these is actually a fairly complicated process, but there is a simple conventional answer: a result can be considered significant if it would occur in only 5% of similar samples by chance alone. If it meets this criterion we can say that it’s significant, or more specifically, statistically significant at the 0.05 level.
This probability of 5% is what’s known as our alpha level, using the Greek letter alpha. So if we set our alpha to 0.05 then we’re saying that we only consider results to be significant if they would occur in just 5% of similar samples by chance alone. If instead we were to set our alpha level to 0.01 that would mean that in order to reach significance we’d have to have a result that would only occur in 1% of similar-sized samples by chance alone.
Now an important point here is that we need to set our alpha level prior to analyzing our data. So we can’t change it in order to get a significant result. We can’t say “well I’d like to have an alpha level of 0.01 but my result didn’t quite reach that to be significant so I’m just going to change and use an alpha level of 0.05 and now it’s significant” right? We have to set it in advance.
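The decision rule itself is simple enough to sketch in a few lines of Python (a hypothetical helper, not anything from the video): the alpha level is fixed before the analysis, and the p value is only compared against it afterwards.

```python
alpha = 0.05  # chosen BEFORE looking at the data, and never changed afterwards

def is_significant(p_value, alpha=0.05):
    """Reject the null hypothesis only if p falls below the preset alpha level."""
    return p_value < alpha
```

With alpha set to 0.05, a p value of 0.032 counts as significant; the same result would not reach significance if alpha had been set to 0.01 in advance.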
While the alpha level sets a threshold for reaching significance, we also have what are called probability values or p values. And what a probability value does is it tells us the probability that’s associated with our particular result for our statistic and results that are more extreme, extending beyond that point. And so it’s actually finding the area under a curve, and the curve is the distribution of possible results that we could get. And then we look at the point that we actually got and then we just calculate the area under the curve beyond that point and that gives us the probability value.
This is exactly the same thing that we did when we looked at z-scores and probabilities associated with a normal distribution. So we had the idea that we could look up a particular z value and then we could find the area under the curve above or below that point and that would tell us the probability of getting a z score in that range. And that’s exactly what we’re doing with a probability value.
The only difference is we might not be looking at z scores. We might be looking at some other calculation but we’re still doing the same thing; we’re looking at the distribution of that calculation, the values we could possibly get when the null hypothesis is true and how frequent they would be, and then we look at our particular result and then we find the area under the curve from our result and extending to values that are more extreme. So the smaller the area under the curve the lower the probability that we would observe those kinds of results by chance alone when the null hypothesis is true.
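As a sketch of that area-under-the-curve idea, here is how the upper-tail probability for a z score can be computed with only Python’s standard library (the `upper_tail_p` name is just an illustrative helper):

```python
import math

def upper_tail_p(z):
    """Area under the standard normal curve at z and beyond (the upper tail)."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# e.g. a z score of 1.645 leaves about 5% of the curve in the upper tail,
# and a z score of 0 leaves exactly half of the curve above it
```

The farther the result is from the center of the distribution, the smaller this tail area, which is exactly why more extreme results get smaller p values.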
Now if you look at older papers you’ll see they often reported the probability value as something like p less than 0.05 and they didn’t give a specific answer to what their actual probability value was. And the reason for that is that calculating that area under the curve for a specific value was a slow and painstaking process and so you didn’t want to have to do that by hand if you didn’t have to. And if you knew what your threshold was, if you knew your alpha level and you knew the statistic result that was associated with that particular alpha level, and you saw that your result was more extreme or less extreme, then you could figure out whether you had reached significance or not, without actually calculating the area under the curve.
And so in older papers they would often do that. They would say “okay, here’s the threshold, our statistic fell beyond it, therefore it’s significant. Therefore we can reject the null hypothesis, and the specific area under the curve isn’t really all that important; it’s not going to change our decision.” Whereas in modern papers you’ll see that researchers often report the precise p value associated with their result, and the reason for that is that it’s now very easy to calculate: you just put some things into your software, a few clicks, and here’s your precise p value. And so in modern papers you’ll still see mention of the alpha level, they might say “p less than 0.05 was what we used as our significance level,” but then they’ll also report something like p = 0.032. And that would mean that if you conducted this same type of study many, many times and the null hypothesis were actually true, then you’d only observe results like that about 3.2% of the time.
Now this brings us to the most important thing to keep in mind about probability values and that is they are just that; they’re just probability values, okay? They can only tell us the probability of the result that we observed or results that are more extreme. And they can only do that if we assume that the null hypothesis is true, right? In order to have that distribution curve that we’re finding the area underneath we have to assume that the null hypothesis is true because when the null isn’t true we don’t really know what the curve looks like, right? So we draw the curve assuming that the null is true and we calculate the probability based on that. And so the probability value, no matter how small it is, no matter how low the p value is, it can’t tell us whether or not the null is actually true, right? It’s just not possible for us to know that.
And it can’t tell us the probability associated with the null. Sometimes students have the misconception that p = 0.05 means there’s a 5% chance the null is true. That’s an incorrect interpretation, right? The probability is just about your test statistic, about your particular calculation. It’s the probability of that result or one more extreme. It’s not the probability of the null and it’s not the probability of the research hypothesis; it’s not any of those things. All it says is: if you assume the null is true, which we had to do to do this calculation, then here’s how often you would observe results at least that extreme.
So if we look at a hypothetical comparison distribution for a test statistic here, this represents the frequency of all possible values we might get for our calculation if the null hypothesis is true. And this curve is what allows us to calculate probabilities. Now the alpha level is going to set our threshold for marking off a point in the distribution, and in the case of an alpha level of 0.05 we want this threshold to indicate where the most extreme 5% of values would be. This would indicate the region where only 5% of possible values would occur by chance. And we can call this the “critical region” or you may also see this referred to as the “rejection region,” as values that fall in this region will allow us to reject the null hypothesis.
Now in this example the distribution of the statistic is a normal distribution and right away we can notice that means we have a few options for deciding what we mean by the most extreme 5% of scores. Do we mean the top 5% or the bottom 5% or a combination of both?
And this brings us to the difference between a one-tailed, or directional, test and a two-tailed, or nondirectional, test. In a one-tailed test we define the most extreme 5% of scores as located in only one tail of the distribution; either the upper or the lower end. If our hypothesis predicts a difference between groups in a particular direction then we can set our entire critical region on one side of the distribution. And so by placing the entire 5% rejection region on one side this actually makes the threshold for reaching significance a little bit closer to the center of the distribution and therefore makes it a bit easier to reach significance, provided that we’ve predicted in the correct direction. So any result that falls in this 5% of the curve here would be unlikely enough that we could reject the null hypothesis.
Now the downside to a one-tailed test, however, is if we happen to get a result in the opposite direction of what we’ve predicted, even if it’s a very unlikely result, then we still can’t reject the null hypothesis, because as we can see there’s no rejection region on that side of the distribution. There’s no possible results in this other tail that can lead us to reject the null hypothesis, no matter how unlikely those results might be.
Now in a two-tailed test what we do is we spread the critical region evenly across both sides of the distribution by dividing the alpha level in half and placing one half on each side. So in the case of an alpha level of 0.05 we would have a critical region of the most extreme 2.5% of the curve on each side of the distribution. And that would still give us a total of 5% of the overall area under the curve. Now you can notice that this moves the threshold for significance a little bit farther away from the center of the distribution compared to a one-tailed test, and so this does make it a little bit harder to reach significance. But now we can reject the null hypothesis if we get an unlikely result in either direction.
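Those two cutoffs can be computed directly with Python’s standard library, as a sketch (assuming a standard normal comparison distribution, as in the example in the video):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal comparison distribution

# One-tailed test, alpha = 0.05: the entire 5% sits in one tail,
# so the cutoff is closer to the center of the distribution.
crit_one_tailed = nd.inv_cdf(0.95)    # about 1.645

# Two-tailed test, alpha = 0.05: 2.5% in each tail,
# so the cutoff moves farther from the center.
crit_two_tailed = nd.inv_cdf(0.975)   # about 1.960
```

The one-tailed cutoff is smaller than the two-tailed one, which is exactly the trade-off described above: significance is easier to reach, but only in the predicted direction.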
So now let’s look at the difference between the alpha level threshold for the rejection region and the specific p value of a result. In the case of a one-tailed test here we can see that the threshold for an alpha level of 0.05 means that 5% of the curve would be to the right of this threshold. In this case the probability value would represent the area under the curve starting at our particular test result and extending to the right. So a p value of 0.03 would mean that 3% of the area under the curve extends beyond this point which was our result for our statistic.
In the case of a two-tailed test the p value is also spread across both tails of the distribution, and so it’s the area under the curve that is at least as extreme as our result or more extreme, and we have to include that on both sides of the distribution. We have to think of “extreme” as how far we are from the center of the distribution. So if we look at one tail here we can see the threshold for 2.5%; that would be an alpha level of 0.05 split across both halves of the distribution. But now we can see that if we got a result where 1.5% of the curve was at that result or more extreme, then that would actually be a p value of 0.03, because we have to include the corresponding values in the other tail as well; they are at least as extreme as our value, just in the opposite direction.
And so since the curve is symmetrical, the area under the curve is going to be the same on both sides. So if we have 1.5% on this side of the curve then we have to include the 1.5% on the other side that’s just as extreme as our result, just in the opposite direction. And that’s how we get to a total of 3% of the area of the curve being at least as extreme as our result, and that’s how we’d get a p value of 0.03. And again, that would be less than our alpha level of 0.05 and so this would indicate a significant result.
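The 1.5%-per-tail example can be reproduced in a few lines; this is a sketch assuming a standard normal comparison distribution, and the observed z of 2.17 is just an illustrative value chosen because it leaves about 1.5% of the curve in the upper tail:

```python
from statistics import NormalDist

nd = NormalDist()
z = 2.17  # hypothetical observed test statistic

one_tailed_p = 1 - nd.cdf(z)             # upper-tail area only, about 0.015
two_tailed_p = 2 * (1 - nd.cdf(abs(z)))  # both tails counted, about 0.030
```

Doubling the one-tailed area works here precisely because the normal curve is symmetrical, so the opposite tail contributes an identical amount.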
Now since these thresholds for the rejection region will vary depending on whether you’re doing a one-tailed or a two-tailed test, it’s important to determine that in advance. You have to decide which of those is appropriate for the data that you’re looking at and you have to do it before you analyze the data. And so researchers can’t say, you know, “well we want to do a two-tailed test,” and then they get the result and they find it doesn’t quite reach the rejection region and so they just change their analysis to a one-tailed test, right? They move the entire 5% to one side of the curve and they say “Oh, well, now it’s a significant result,” right? So just like you have to set your alpha level in advance, before you analyze the data, you have to do the same for determining whether it’s appropriate to do a one-tailed or a two-tailed test.
So what does significance really mean? This is a very complicated and difficult question to think about and the reason for that is that all of these calculations and probabilities, you know, we’re finding areas under a curve, we’re thinking about the distribution of possible results that we might get by chance if the null hypothesis is true, all of this is long run probability. And what that means is it’s thinking about if you did this study many, many times in exactly the same way. Here’s how often you’d expect to see results at least as extreme as what you observed, right? That’s what the calculations can tell us. The problem is that it’s really hard to think that way when we look at an individual study.
So let’s say that I do a study and I find a p value of 0.03, and so I have a significant result. And I might think, well, that means that if I did this study exactly the same way with the same sample size 100 times, and the null hypothesis were true, only about 3 of those 100 would show a result at least this extreme by chance alone.
But it doesn’t answer the question: is my study one of those 3 that happen by chance, or is something else causing this difference? Is it not chance but a real effect? And unfortunately all of these calculations can’t really help me to figure that out. The only way to really think about that is to actually do 100 studies and then see: okay, we’d expect to get 3 with this result and we got 15. Oh, that suggests chance alone isn’t a good enough explanation. But the problem is, it’s really hard to do 100 studies in the same way with the same sample sizes and to actually think about that long run probability, right? That just doesn’t happen very often. We might have multiple studies, but they have different sample sizes and maybe they use different manipulations or different measurements, and so we really struggle to think about what our results actually mean on an individual basis.
We have to remember that the probability values are not single event probabilities. They don’t tell us the probability for our individual study, they tell us the probability if we did the study many times. And, like I said, unfortunately that doesn’t happen very often, right? So we have to keep that in mind when we interpret results because any result could happen by chance. Any p value is possible. You could do an individual study and you could just happen to get a very, very low p value even though it’s chance, because it could always be chance. And so the fact that you got a really low p value doesn’t really help you be certain that you have a real effect.
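One way to make that long-run idea concrete is a small simulation. This is a rough sketch, not a real analysis: it uses a simple z-style statistic rather than a proper t test, the group size and number of studies are arbitrary, and both groups are drawn from the same population so the null hypothesis is true by construction.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
nd = NormalDist()

def simulate_study(n=50):
    """One two-group study in which the null is TRUE (both groups identical)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = ((stdev(a) ** 2 + stdev(b) ** 2) / n) ** 0.5
    z = (mean(a) - mean(b)) / se          # simple z-style statistic; a real
    return 2 * (1 - nd.cdf(abs(z)))       # analysis would use a t test

p_values = [simulate_study() for _ in range(2000)]
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
# In the long run, roughly 5% of these null studies come out "significant"
```

Even though there is no real effect anywhere in this simulation, about 1 study in 20 still crosses the 0.05 threshold, which is exactly what the alpha level promises in the long run and exactly why any single significant result could still be chance.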
You always have that possibility in the back of your mind that, like, well this could just be the one chance event. And the only way to try to figure out if that’s true is to do some replications, to do the study over again; see, well, okay, if I get that kind of result again like that would be even more unlikely. And if I get it again and again and again then now we’re entering the sort of situation where it’s unlikely chance is a good explanation, like, there might be something else going on here. There might be some real treatment effect. But the only way to figure that out is to do many, many studies. And so we want to avoid the temptation to look at a single study and say “ah here’s, you know, the evidence that this is true.” Well, it could always be chance.
And so the only way to get around that is to look at multiple studies and this is actually a good thing. It’s very annoying when it comes to trying to interpret an individual study, but the good thing is, it prevents us from making hasty conclusions, right? It prevents us from saying “Well this one time this happened therefore it’s real; therefore there’s a significant effect, therefore there is a reason to believe that this treatment actually works.” We can’t really draw those sorts of conclusions; we shouldn’t draw those sorts of conclusions, right? We always have to think in the context of, you know, an entire body of research on a topic.
And the other thing to keep in mind here is that there’s a difference between what we call statistical significance and “practical significance.” And this involves going back to our definition that we started with. I said that significant doesn’t mean important or meaningful or substantial, and we have to remember that when we look at a result. So we might find a result that is meeting certain mathematical criteria; that doesn’t mean that it really matters.
So for example let’s say that I do a really large study of patients with depression, I compare some different treatments, I have the patients complete a depression rating scale, and I find mathematically that there’s a significant difference between my groups. Let’s say it’s one point on this depression rating scale. If I have large enough groups, a one-point average difference might be unlikely by chance alone; it might be statistically significant. But that doesn’t mean that it’s meaningful; it doesn’t mean it’s a substantial effect. In other words, patients who differ by one point on a depression rating scale might not really feel all that different. So we could have statistical significance; we’d have a result that says, given how large our groups were, we wouldn’t expect to see an average difference of one point by chance. But does one point really matter? Are these people really better off? Well, that’s much harder to judge.
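To see how group size alone can push a fixed one-point difference into statistical significance, here is a rough sketch; the standard deviation of 10 on the rating scale and the group sizes are made-up illustrative numbers, not figures from any real depression study.

```python
from statistics import NormalDist

nd = NormalDist()

def z_for_mean_difference(diff, sd, n_per_group):
    """z statistic for a given mean difference between two equal-sized groups."""
    se = (2 * sd ** 2 / n_per_group) ** 0.5
    return diff / se

# The same 1-point difference (assumed sd = 10), with different group sizes:
z_small = z_for_mean_difference(1, 10, 25)     # about 0.35: nowhere near significant
z_large = z_for_mean_difference(1, 10, 2000)   # about 3.16: p well below 0.05
```

The effect itself never changed; only the sample size did. That is why statistical significance, on its own, can’t tell you whether a one-point difference is worth caring about.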
And that brings us to what we call practical significance; thinking about whether the effect that might meet mathematical criteria for significance is something that we should actually care about. Does that mean I should invest lots of resources in using this treatment for other patients with depression? Well, that’s a hard question to answer and it’s one that, like I said, unfortunately a single study isn’t going to be able to provide the answer to. It’s not going to be able to tell us for sure that this is definitely what we should do, this, this treatment is definitely superior. You might say “well, it led to an unlikely result mathematically” but does that really matter? It’s very hard to say.
Okay so with all of these things in mind I thought we would look at a few practice questions. These are conceptual questions to help you to think about the most important concepts related to statistical significance, alpha level, and probability values.
What does it mean to say that a result is statistically significant?
What does the alpha level refer to?
What are the differences between a one-tailed or directional test and a two-tailed or nondirectional test?
What is the most commonly used alpha level in psychological research and what does it mean if a test statistic falls in the critical region for this alpha level?
What’s the difference between single event and long run probability and why is this so important for interpreting the results of a study?
What is the difference between statistical and practical significance and why is this important to keep in mind when interpreting results?
Hopefully you now have a better sense of how to answer these conceptual questions and I’ve written some sample answers that you can find in the video description box below.
Okay so that’s an overview of the concepts of statistical significance, probability values, and the alpha level and critical region. I hope you found this helpful; if so, let me know in the comments. Let me know if there’s questions that you still have and I’ll try my best to answer them. Be sure to like and subscribe and check out the hundreds of other psychology and statistics tutorials that I have on the channel.
Thanks for watching!

