In this video I explain probability density functions and how these are used to describe the distribution of a population and estimate the probabilities for different ranges of scores within that distribution. I also explain why the probability for a specific value of a variable is always 0, even though we are still able to estimate probabilities using the area under the probability density curve.
Video Transcript
Hi, I’m Michael Corayer and this is Psych Exam Review. In most of the previous videos in this series we’ve focused on things like descriptive statistics, how we can describe a sample set of data, and we’ve seen how we calculate things like the mean, the variance, the skew, or the kurtosis, and we’ve also seen how we could visually present a sample using something like a histogram.
In this video we’re going to see why we did all these background steps and what they were leading us towards. And what they’re going to allow us to do is make a shift in our thinking from thinking about the specific details of our sample to thinking about the population that that sample might have come from. And this is going to be really important for some later analyses because what it’s going to allow us to do is start thinking about probabilities.
And so for a continuous variable, a common way that we’re going to think about probability is to use a probability density function. Now there are also probability mass functions for discrete variables and cumulative distribution functions, but we’ll cover those when they become more relevant for some other analyses later. But generally if we have a continuous variable we’re often going to refer to a probability density function which is going to allow us to estimate probabilities for a population.
In order to understand what a probability density function is we can start by looking at a histogram for a sample. And so in a histogram we have our range of values for our variable X on the x-axis and then on the y-axis we have the frequency of those scores in our sample. So we could say for this range of values for X we had 10 scores or for this range of values for X we had 7 scores in our sample. But the problem is it’s too specific; it’s only about our sample and so we have to recognize that these frequency values could change dramatically if we had a different sample size. So if we had a larger sample maybe we would have had 20 scores here or 30 or 100.
And so in order to think about our population and what scores we might expect with other sample sizes, we could change from thinking about the frequency on the y-axis to thinking about the relative frequency. So we could take our values, like saying we had 10 scores for this particular range of X, and think well if that’s 10 scores out of a sample size of 100 we can think of that as being 10% of our sample. And so we could switch to thinking about the relative frequency of different values for x.
And so in a relative frequency histogram we still have our range of values for our variable x on the x-axis but on the y-axis we’re showing the relative frequency. We’re thinking about the percentage of scores in our sample that fell into that range and so instead of saying we had 10 scores for this range of values for x now we’re saying we had 10 scores out of a sample size of 100 and so we had 10% of our scores for this range of values for x. And so the way that we’d find that for each of these bars is we’d take our frequency and we divide it by the sample size and that would tell us the percentage of scores that we had in our sample for that particular range of values for x. And this is a little better for separating ourselves from our sample size because we can think well if I had a different sample size, maybe I’d still find about 10% of scores in that particular range of values for x. And that helps us to start thinking about the population. Maybe we would expect to see about 10% of scores in the population falling into that range.
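As a rough sketch of this step in Python (using a made-up sample of 100 scores with an assumed mean of 100 and standard deviation of 15, purely for illustration), we can count the scores in each bin and then divide each count by the sample size to get the relative frequencies:

```python
import numpy as np

# Hypothetical sample of 100 scores (values assumed for illustration only)
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=100)

# Frequency histogram: how many scores fall into each bin
counts, bin_edges = np.histogram(sample, bins=10)

# Relative frequency: divide each bin's frequency by the sample size
relative_freq = counts / sample.size
for rf, left, right in zip(relative_freq, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} to {right:6.1f}: {rf:.0%} of the sample")
```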
But we do have some limitations here and one of these is that our sample might not be able to represent all of the different ranges of values for x because some scores just won’t show up in our sample. So here we had a point where we didn’t have any scores in our sample for this range of values of X. But just because we had a relative frequency of 0% doesn’t mean that we don’t think those scores exist at all in the population. And so we realize that that applies to any of these bars. Maybe we had 10% in our sample but in the population it’s actually 8% or 12%, right? So any of these has some uncertainty to them and we might think about how we could get a better approximation of the population.
Now one way we could visualize this is to think about drawing a line connecting all of these bars in our histogram and this might give us somewhat of a sense of what the population might look like but we’ll notice that when we do this we have these jagged edges, right? We have these parts sticking out from the line and then these other parts where we’re sort of missing some data and the reason is that when we jump from one bar to the next we’re sort of assuming where the line might fall in between those but we didn’t actually measure it. And so we might realize that we could measure it if we had a narrower bin size. If we made our bins thinner and maybe we collected some more data we could figure out what exactly happens between these two values here, right? Is the line here or is it maybe a little higher or a little lower?
We could also think about maybe this central bar here, where we said we had 10% of scores; well, what if we measured that more precisely? What if we divided this into two bins, right? They wouldn’t both have a relative frequency of 10%. Maybe they’d be 5% and 5% or maybe they’d be 4% and 6%, but we’d have a more precise understanding of what the line looks like between these bars. And so if we made our bins smaller then we’d have a more precise sense of what this line looks like. And as our bin size gets smaller and smaller and smaller then these jagged edges would also get smaller and smaller and smaller.
So if we think about changing our bin size, that’s going to change our percentages here for our relative frequency and it’s going to give us a more precise idea of what this line for the population might look like. And so if we had data with smaller bin sizes here then each of these bars would be a little bit narrower and that means the gaps between them would be smaller, and so when we thought about drawing a line connecting these what we’d see is now the line would be smoother, right? These jagged edges would be smaller. And if we thought about making our bin size even smaller then this would smooth things out even more. And then we could remember that if we have a continuous variable, it can be divided into an infinite number of possible values.
So if we thought about having an infinite number of these infinitely thin bins here then we’d realize that these jagged edges would go away, because each bin would be infinitely thin and they’d be infinitely close together. And so if we connected all of them we’d end up with a perfectly smooth line and this perfectly smooth line could then be described by an equation, such as the equation for a normal distribution. And so now we’re thinking about a probability density function.
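For reference, that perfectly smooth line for a normal distribution is described by the normal probability density function, where μ is the population mean and σ is the population standard deviation:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}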
And so you’ll notice this change in the y-axis here. In this case we were thinking about relative frequency because each of these bars here has a certain width to it, and so we can say how many scores fell into this range of x: this percentage of scores fell into this bin here, and that percentage of scores fell into that bin there. But when we move to this line we can no longer do that because each bin is infinitely thin. And so you can’t really think about scores falling into an infinitely narrow bin. And so instead we’re labeling this as the probability density. So what does that mean?
The probability density refers to the rate of change of probability as we move across the x-axis, or as we move across different units of X. And so you could think about this with something like mass density in physics. So if I have some substance and I know its mass density and I want to determine how much mass I have, I also need to know the volume; how much of that substance do I have? And then I could figure out the mass. And the same is true for thinking about probability here. If I know the probability density across a given range of X I also have to know how wide is that range of X, what’s the sort of unit that I’m using here along the x-axis? And then I can determine the probability.
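In symbols, for a narrow slice of the x-axis with width Δx around some value x, the probability of landing in that slice is approximately the density times the width, just as mass is approximately density times volume:

P \approx f(x)\,\Delta x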
And in the case of a probability density function the probability is represented by the area that falls under the curve for that range of X. And so the units on this y-axis here are going to vary depending on exactly what our distribution looks like; what normal distribution we have and what its standard deviation is and where its mean falls, and that’s going to determine the labels on the y-axis, so that the total area under the curve ends up equaling 1. And that would mean that we have a probability of 1 that a randomly selected score will fall somewhere on the possible values for X for this distribution. So if we say that this covers the entire range of possible scores that exist in the population, well if I randomly select one, the chance that it’s somewhere on that line has to be 100%. And so the total area under the curve for a probability density function will be 1.
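In symbols, the probability for a range from a to b is the area under the density curve f(x) over that range, and the area over all possible values of X is 1:

P(a \le X \le b) = \int_{a}^{b} f(x)\,dx, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1

As a quick sanity check, here’s a minimal Python sketch (assuming a normal distribution with mean 100 and standard deviation 15, values chosen only for illustration) that integrates the density over the whole real line:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Hypothetical normal distribution (assumed mean 100, standard deviation 15)
mu, sigma = 100, 15

# Integrate the density over the whole real line; the total area should be 1
total_area, _ = quad(lambda x: norm.pdf(x, loc=mu, scale=sigma), -np.inf, np.inf)
print(total_area)  # approximately 1.0
```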
And then we can think about how that probability will change as we look at different sections of X and that’s what the height of the line is telling us. And so to go back to our physics analogy you could imagine that we have a block of some substance but let’s imagine that it has different density at different points. And so now if you wanted to determine the mass you’d have to know the different densities across different parts of it, you’d have to know the volume that you’ve selected, and you’d also have to know, where does that volume fall on this block of the material, right? Because the same volume in one section might have a different total mass than the volume somewhere else, because the density is changing as you move along this block. The same is true for the x-axis. The probability is changing as we move across different scores for X. So I could take the same width of X on this x-axis, let’s say five points for whatever my measurement is, and a 5 point range here may have a very low probability whereas the same 5 point range in the middle of the distribution might have a much higher probability. And so that’s really what the line is telling us.
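To make that concrete, here’s a short sketch (again assuming a mean of 100 and a standard deviation of 15, for illustration only) comparing the same 5-point width out in the tail and near the middle of the distribution:

```python
from scipy.stats import norm

# Hypothetical distribution (assumed mean 100, standard deviation 15)
mu, sigma = 100, 15

# The same 5-point width carries a different probability depending on where it falls
tail   = norm.cdf(135, mu, sigma) - norm.cdf(130, mu, sigma)      # out in the tail
middle = norm.cdf(102.5, mu, sigma) - norm.cdf(97.5, mu, sigma)   # around the mean
print(tail, middle)  # the slice near the mean has a much higher probability
```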
And the important point here is that even though it looks like you could look at a single value of x and see where the line is on the y-axis and know the probability, unfortunately you can’t do that. The probability for any particular value of x is going to be 0, right? This is just like we saw with the mass density example. If you have an infinitely small volume then your mass is going to be 0 regardless of what the mass density is. The same is true here. If you pick an infinitely small value for x then the probability that you’ll get that exact infinitely small value is going to be 0, regardless of what the probability density is at that section of X, in that area, because if you have an infinitely narrow bar here then it doesn’t have any area under the curve and so the probability associated with that would have to be 0. So we said that the total area under the curve will be equal to 1, and then we could think about choosing different sections of X and finding the probabilities associated with them.
So a simple way to do this would be to think about let’s say I have my population mean here and I want to know what’s the probability that I randomly select a score from the population and it ends up being equal to or greater than the population mean. And what I would do is say, well, here’s my population mean here so the probability that my score is equal to this point or greater would be anywhere on this area under the curve. And the area under the curve from that point onward would be 0.5 because this normal distribution is perfectly symmetrical, and so half of the curve is going to be starting at this point at the population mean and moving upward. And so if I randomly select a score, there’s a 50% chance it will be equal to or greater than the mean, and of course there’s a 50% chance that it would be equal to or less than the mean.
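With the same assumed parameters, this is one line of Python using scipy’s survival function (the upper-tail area, 1 minus the cumulative distribution function):

```python
from scipy.stats import norm

# Hypothetical distribution (assumed mean 100, standard deviation 15)
mu, sigma = 100, 15

# Probability that a randomly selected score is at or above the population mean
print(norm.sf(mu, mu, sigma))   # 0.5, the upper half of the curve
print(norm.cdf(mu, mu, sigma))  # 0.5 for the lower half as well
```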
And then, of course, I could think about other ranges. I could pick some more extreme point here and say what’s the probability that I would get a score at least this extreme, so this value or greater? And so in that case that’s going to be this area under the curve here, right? Whatever this probability is. Now when it comes to actually calculating these areas under the curve, you know, we could use calculus to do this. What we’re going to see is generally we’ll use software to calculate this or we can use a standardized version of this where instead of using particular values for X like whatever we actually measured, we’re going to convert that to standard deviation units and that’s what we’re going to look at in the next video. But for now we can just say we could find the probability for any range of values on the x-axis.
So similarly we can say what’s the probability of getting a score here or lower? And so that would be this area under the curve here. And so whatever the area under this curve is that would be the probability of getting a score at least that extreme, right, that low or lower. And then of course if we wanted we could also pick, you know, two points in the middle of the distribution. We can say what’s the probability of getting a score between, you know, this point and the mean right? Between A and B here? And so we’d think about what’s this area under the curve here and we’d calculate that area and that would tell us the probability of getting a score within that range of values for X.
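All three of these range probabilities can be sketched the same way with the cumulative distribution function (parameters again assumed purely for illustration):

```python
from scipy.stats import norm

# Hypothetical distribution (assumed mean 100, standard deviation 15)
mu, sigma = 100, 15

# Upper tail: probability of a score of 130 or higher
print(norm.sf(130, mu, sigma))

# Lower tail: probability of a score of 70 or lower
print(norm.cdf(70, mu, sigma))

# Between two points A and B (here, the mean and one standard deviation above it)
a, b = mu, mu + sigma
print(norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))  # about 0.34
```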
Now as I said, many students get stuck on this idea that the probability for a precise value of x is always going to be 0, right? This idea that we have an infinitely thin line, therefore it has no area under the curve, therefore it has a probability of 0. And so they struggle with this idea because they think well, how can the probability of every single point be 0 and yet we still can calculate probabilities for a range of values? And so for this I’m going to turn to an analogy that will hopefully help you to think about this, because it’s hard to think about an infinitely thin bar here.
So as an analogy let’s imagine that this is a dart board here and I’m going to randomly shoot a dart at this board and it has to land somewhere on the board. So this area here represents a total probability of 1; if we shoot a dart it has to land somewhere in this area. It can’t land outside of it, but where it lands within this area is completely random. And so the probability of the dart landing somewhere in the area would be 100%.
And then we could think about dividing the dart board up into different sections and thinking about the probabilities associated with different sections. And so if we just divided the dart board in half then we’d say what’s the probability that the dart lands on this side of the board or this side of the board? And assuming that these are equal sizes here, that I’ve divided it perfectly, then of course the probability for each of these would be 50% or 1 over 2. And so we have a probability of each section as 50% or 1 over 2, and then we have two of those sections, and so our total probability is still equal to 1.
And then we could think what would happen if we divided this dartboard up into more equal-sized sections? And so if I divided the dart board up into eight equal sections then I could say, what’s the probability of the dart landing on any of these eight different sections? We’d say, well they’re all the same size and there’s 8 of them, so the probability of it landing on each of these would be 1 over 8. And so we have a probability for each as 1 over 8, and we have eight of these sections and so we still get a total probability of 1. It’s going to land somewhere on the board and each of these equal-sized sections has a 1/8 probability of the dart landing on it. But then we could think, what if we tried to divide this up into an infinite number of sections?
Now obviously I can’t draw this, divide it up into an infinite number of sections, but you could imagine that I kept doing this process of cutting it up into infinitely thin strips here. What we’d find in this case is that each of these little sections, there’d be an infinite number of them, and so just like we did before we said well if there’s two sections that are equal then the probability for each is 1 over 2, if there were 8 then it’s 1 over 8, and so the probability for any given section here would have to be 1 over infinity, because there’s an infinite number of them. And then you say well how do we find the total probability? Well, there’s an infinite number of these infinitely small sections and so in order to find the total probability we’d have to multiply by infinity and what we see is the infinities would cancel and we’d still get a total probability of 1. And so we can say the probability of each infinitely small section of the board is 1 over infinity, which we can essentially say is a probability of zero.
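A tiny numerical illustration of that cancelling, with larger and larger finite numbers standing in for infinity:

```python
# As the number of equal sections grows, each section's probability shrinks toward 0,
# but the total (number of sections times probability per section) stays exactly 1
for n in [2, 8, 1_000, 1_000_000, 1_000_000_000]:
    per_section = 1 / n
    print(n, per_section, n * per_section)
```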
If the value were any greater than zero then this little equation here wouldn’t work. If we said well actually the probability is 0.00000001, and we could do this with a thousand 0s before the 1, but as soon as we put a one there then we’re going to have a problem. Because if we take that and say well that’s the probability of an infinitely small section, well then to get the total you have to multiply by infinity, and what would happen is now we get a total result of infinity, and that’s not possible, right? So we see that the probability for an infinitely small section of the board can’t be any greater than zero, right? It has to be 0. Or we could say 1 over infinity although, you know, infinity isn’t a real number so we can’t really say that, but we say it basically has to be 0. If it’s any greater than zero then our total probability no longer adds to 1, right? And so that becomes impossible; we can’t have an infinite probability, right? We can only have a total probability of 1 for the dart landing somewhere on the board. And so if the sections are infinitely small, the probability of each has to be represented as 1 over infinity.
And so this same logic applies to our probability density function. If we tried to pick a precise value of x, we have an infinitely thin bar here, and so the probability for an infinitely thin section of the curve will always be equal to 0, or we could say 1 over infinity, right? In order to calculate a probability we need to have a range of values and so if we say we have a point A and a point B then we can say that the probability is this area under the curve. So we can find the probability that x falls between two points, A and B, and these could be very close together but they still have to be defined. So we could say, you know, the probability that x falls between, you know, points that are 0.00001 apart, we could actually calculate that probability. We’d have a width and therefore we’d have an area under the curve. But any time we try to pick a single point we’ll get a probability of 0.
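Here’s that contrast as a short sketch (assumed parameters again): an exact single point gives zero probability, while even a range only 0.00001 wide gives a small but nonzero probability:

```python
from scipy.stats import norm

# Hypothetical distribution (assumed mean 100, standard deviation 15)
mu, sigma = 100, 15

# A single exact value has zero width, so zero area and zero probability
print(norm.cdf(100, mu, sigma) - norm.cdf(100, mu, sigma))  # 0.0

# A very narrow but defined range still has some width, so a small nonzero probability
a, b = 100, 100.00001
print(norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))      # tiny, but greater than 0
```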
An important reminder here is that once we’ve made this switch to thinking about a probability density function for the population, we’re not that concerned with the specific details of our sample anymore. So we’re using this perfectly smooth idealized line in order to estimate our probabilities. We’re not using the details of our sample. So for a particular range of values for X maybe we had 15% of scores in our sample but when we look at the equation for a normal distribution and we use this probability density function we might find that maybe we’d only expect about 13% of scores there in the population. What that means is that for future analyses we’re going to assume it’s probably 13%; we’re not going to use the 15% that we had in our specific sample.
Now this also means that we have to remember there’s some uncertainty in making the switch from our sample to our population. So we used the sample to estimate things like the population mean or the standard deviation for the population, but those are just estimates and so there’s always going to be a bit of uncertainty there. And this is something we’re going to come back to when we think about things like standard error.
Now we’re also going to see in the next video that we’re going to make a shift from thinking about the specific way that we measured our variable, so the units that we used for our x-axis here, and instead of using the specific measurement that we had, we’re going to shift that to using standard deviation units. And what that’s going to allow us to do is to estimate probabilities for any normal distribution. So once we have our estimates of the parameters and we think it’s a normal distribution then we can shift to standard deviation units in order to calculate probabilities more easily.
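As a preview, here’s a minimal sketch of that conversion (with hypothetical values: a mean of 100, a standard deviation of 15, and a score of 130):

```python
from scipy.stats import norm

# Hypothetical distribution and score (values assumed for illustration)
mu, sigma = 100, 15
x = 130

# Convert the raw score into standard deviation units (a z-score)
z = (x - mu) / sigma   # 2.0: two standard deviations above the mean

# The same tail probability can be found from the original units
# or from the standard normal distribution using the z-score
print(norm.sf(x, mu, sigma))  # original units
print(norm.sf(z))             # standard deviation units; same value
```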
So hopefully this gave you a better sense of what a probability density function is and why it’s so useful for estimating probabilities. Let me know in the comments if this was helpful for you, be sure to like the video if it was, and share it with others you think it might help. Feel free to ask any questions that you still have and join the tens of thousands of subscribers to this channel in order to see future videos on other topics in psychology and statistics. As always, thanks for watching!