Exploratory Data Analysis: Stem & Leaf Displays

In this video I explain stem-and-leaf displays or stemplots, which are part of exploratory data analysis developed by John W. Tukey. I explain the basic design of stem-and-leaf displays as well as several variations including back-to-back stem-and-leaf displays and stretched stem-and-leaf displays. I also discuss the role of rounding and cutting data when dealing with large numbers or decimals, and the usefulness of more complex stem-and-leaf displays for storing data.

Video Transcript:

Hi, I’m Michael Corayer and this is Psych Exam Review. In this video we’re going to look at stem and leaf displays, which are an example of exploratory data analysis from the work of John Tukey. Tukey thought of exploratory data analysis as doing a kind of “detective work”: we’re just looking for clues in our data that might aid our understanding. And in the case of these stem and leaf displays we actually don’t have to do any math or any calculations. All we’re doing is looking at how we can arrange the data in a way that might give us some clues about the distribution of scores.

The stem and leaf is sort of a hybrid of a table and a figure. It involves taking each response and splitting it into two parts, a stem and a leaf. These are then organized into a table that also gives us a visual representation of the distribution of scores. So if I were to give you these 20 student exam scores it would be hard to immediately see patterns in the data. You could find them if you looked closely but they’re not immediately apparent. What I could do is put these into a frequency distribution table. The problem is that there’s a broad range of scores, so I’d need a lot of rows if I wanted to keep the original values. If I wanted to make the table shorter I could put them into class intervals, but then I would lose the original values. The stem and leaf display is a clever solution to this problem. It allows us to keep all of the original values while also getting a visual representation of what the data looks like.

The way this works is that we’re going to split each score into a stem and a leaf. The stems in this case will represent the tens unit for each score while the leaf will provide the ones unit for each score. Our scores range from the 50s to the 90s so our stems will be 5, 6, 7, 8, and 9. Each leaf gives us an individual score by combining the tens unit in the stem with the ones unit shown in the leaf. So a score of 54 would be a stem of 5 and a leaf of 4, and 59 will be right next to it, with the same stem of 5 and a leaf of 9, and so on for all the scores.

Now we have a concise table of all the individual data that also acts as a figure because we can see how the scores are distributed across the stems. We can see that most scores were in the 70s and they tapered off in frequency above and below that stem. And unlike a grouped frequency distribution table we have the original values preserved. If we want to add more information to our stem and leaf plot we can include a cumulative frequency column summing how many scores have been accounted for up to that point in the table. This can also help you to check that all of your scores have been included in the table as the final cumulative total should equal n for your sample.
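As a rough sketch of the splitting described above, here’s how a basic stem and leaf display with a cumulative frequency column could be built in Python. The 20 scores are made up for illustration (they aren’t the ones shown in the video), and the function names `stem_and_leaf` and `render` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Group each score's ones digit (leaf) under its tens value (stem)."""
    rows = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)  # 54 -> stem 5, leaf 4
        rows[stem].append(leaf)
    # Keep empty stems too, so gaps in the distribution stay visible.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


def render(rows):
    """Return the display as text, with a running (cumulative) count per row."""
    lines, running = [], 0
    for stem, leaves in rows.items():
        running += len(leaves)
        leaf_text = "".join(str(leaf) for leaf in leaves)
        lines.append(f"{stem:>2} | {leaf_text:<10} ({running})")
    return "\n".join(lines)


# Hypothetical exam scores, NOT the ones from the video.
scores = [54, 59, 62, 65, 68, 71, 72, 73, 75, 76,
          77, 78, 79, 81, 83, 84, 86, 88, 92, 95]
rows = stem_and_leaf(scores)
print(render(rows))
```

The number in parentheses on the last row should equal n (20 here), which is the check on the table mentioned above.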

In addition to allowing us to see the distribution of a single set of scores, we can use a version of the stem and leaf display to directly compare two sets of scores. So if I wanted to compare the performance of two different classes on the same exam, what I could do, rather than creating two separate stem and leaf plots, is combine leaves into what’s called a back to back stem and leaf display. And so in this case the stem is going to be shared in the middle and then we’ll have one set of leaves on the left and another set of leaves on the right. Let’s see what this would look like with two sets of scores. We’ll use our first set of scores from before and then compare them to a second set of scores.

Just a quick glance at this table allows us to see that the second class performed better with more scores in the 80s and 90s compared to the first set with higher frequencies in the 60s and 70s. We aren’t yet doing any analysis of these differences but we can immediately see that the distributions differ. Note that the order of leaves is reversed on the left side set of leaves with the lower numbers closer to the stem and the higher numbers farther away, just as it is on the right. And if you want to add a cumulative frequency column for these sets you would add it to the left of the first set and to the right of the second set.
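Here’s one way a back to back display could be sketched in Python. Both classes’ scores are hypothetical, and `leaves_by_stem` and `back_to_back` are names made up for this example. The key detail is that the left-hand leaves are reversed so the lowest digit sits closest to the shared stem.

```python
def leaves_by_stem(scores):
    """Map each tens value (stem) to its sorted list of ones digits (leaves)."""
    rows = {}
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        rows.setdefault(stem, []).append(leaf)
    return rows


def back_to_back(left_scores, right_scores):
    """Return text rows with a shared stem in the middle."""
    left, right = leaves_by_stem(left_scores), leaves_by_stem(right_scores)
    first = min(min(left), min(right))
    last = max(max(left), max(right))
    width = max(len(leaves) for leaves in left.values())
    lines = []
    for stem in range(first, last + 1):
        # Reverse the left leaves so small digits sit next to the stem.
        left_text = "".join(str(d) for d in reversed(left.get(stem, [])))
        right_text = "".join(str(d) for d in right.get(stem, []))
        lines.append(f"{left_text:>{width}} | {stem} | {right_text}")
    return lines


# Hypothetical scores for two classes, NOT the ones from the video.
class_a = [54, 59, 62, 65, 68, 71, 72, 73, 75, 76,
           77, 78, 79, 81, 83, 84, 86, 88, 92, 95]
class_b = [58, 63, 67, 72, 74, 78, 80, 81, 83, 84,
           85, 87, 88, 89, 90, 92, 93, 95, 96, 98]
lines = back_to_back(class_a, class_b)
print("\n".join(lines))
```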

Sometimes a regular stem and leaf display doesn’t clearly show us the distribution of scores because too many scores are clustered in certain stems. And in this case we might want to use what’s called a stretched stem and leaf display. Here’s a set of scores where a basic stem and leaf isn’t very revealing. This stem and leaf tells us that most scores are in the 70s but since there are so many scores in that stem and so few in the others we don’t have a clear sense of the spread of scores within that 70 stem. To make a stretched stem and leaf display, each stem is stretched across two rows; one for leaves 0 to 4 and the second, marked with an asterisk, for leaves 5 through 9.
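The two-rows-per-stem idea can be sketched like this in Python. The scores are hypothetical, the function name `stretched_stem_and_leaf` is made up for this example, and the row labels (a plain stem for leaves 0 to 4 and a stem with an asterisk for leaves 5 to 9) follow the description above.

```python
def stretched_stem_and_leaf(scores):
    """Split each stem into two rows: leaves 0-4, then leaves 5-9 (marked *)."""
    stems = [score // 10 for score in scores]
    rows = {}
    # Create both rows for every stem up front so empty rows still appear.
    for stem in range(min(stems), max(stems) + 1):
        rows[f"{stem}"] = []
        rows[f"{stem}*"] = []
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        label = f"{stem}*" if leaf >= 5 else f"{stem}"
        rows[label].append(leaf)
    return rows


# Hypothetical scores clustered in the 70s, NOT the ones from the video.
scores = [58, 64, 69, 70, 72, 73, 75, 75, 76, 76,
          77, 77, 78, 78, 79, 79, 79, 81, 86, 93]
rows = stretched_stem_and_leaf(scores)
for label, leaves in rows.items():
    print(f"{label:>3} | {''.join(str(leaf) for leaf in leaves)}")
```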

This stretched stem and leaf plot gives us a clearer view of how our scores are distributed. Instead of having to look carefully at the many scores in the 70s stem, we can easily see that the majority of these were between 75 and 79 because now they have their own row. In addition to using an asterisk for the upper half of each stem you may sometimes see the lower half indicated with a period or a full stop. This isn’t always done, but if you see it just keep in mind that it’s not indicating a decimal place; it’s just indicating that it’s the lower half of that stem.

At this point you might be wondering if stem and leaf displays are only useful for two-digit responses, so that we have one digit in the stem and one in the leaf. And while we generally are only going to keep one digit in the leaf, we can have more than one digit in the stem. So if we look at this set of IQ scores we can see we have a mix of two and three digit numbers. The stem will still represent the tens unit but we can use stems with two digits to indicate scores greater than 100. So a stem of 10 represents 10 tens or 100, a stem of 11 represents 11 tens or 110, and so on.
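To see that two-digit stems need nothing new, here’s a small Python sketch. The IQ scores are made up for illustration; the same split into tens and ones gives a stem of 10 for a score like 104.

```python
# Hypothetical IQ scores, NOT the ones from the video.
iq_scores = [88, 94, 97, 101, 104, 104, 108, 112, 115, 123]

rows = {}
for score in sorted(iq_scores):
    stem, leaf = divmod(score, 10)  # 104 -> stem 10 (ten tens), leaf 4
    rows.setdefault(stem, []).append(leaf)

for stem, leaves in rows.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```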

This does raise a limitation of the stem and leaf plot, which is that even though we can have stems with more than one digit, the leaf is generally a single digit and we can’t have gaps in the units between our stems and our leaves. That means it’s difficult to display a very broad range of numbers. If we had a very large range of scores, say from 613 to 1,347, it would become impractical to use units of tens for the stems because we’d need so many rows. If we use units of 100 for the stem then we have to use units of 10 for the leaves, and that means we have to round our values. So 613 could be rounded to 610 and then represented by a stem of 6 hundreds and a leaf of 1 for 10. And 1,238 could be rounded to 1,240 and then represented by a stem of 12 hundreds and a leaf of 4 for 40. Now we lose some precision and we’d have to decide if this is appropriate for the data that we have.
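The rounding step can be sketched in Python as follows. This assumes non-negative whole numbers and rounds halves upward, which is my assumption rather than anything specified above (Python’s built-in `round` rounds halves to the nearest even number instead, so it isn’t used here). The names `round_half_up` and `split_rounded` are hypothetical.

```python
def round_half_up(value, unit):
    """Round a non-negative integer to the nearest multiple of unit (halves go up)."""
    return (value + unit // 2) // unit * unit


def split_rounded(value, leaf_unit=10):
    """Round to the leaf unit, then split into (stem, leaf)."""
    rounded = round_half_up(value, leaf_unit)
    # With leaves in tens, the stem counts hundreds.
    return divmod(rounded // leaf_unit, 10)


print(split_rounded(613))   # 613 -> 610 -> stem 6, leaf 1
print(split_rounded(1238))  # 1238 -> 1240 -> stem 12, leaf 4
```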

This also brings up that when we read stem and leaf displays we have to look carefully at the units, and if we’re making a stem and leaf plot we need to make sure that we clearly label things to avoid any confusion. If we have data with many digits but we don’t actually need to be that precise, then we could engage in cutting rather than rounding. Cutting refers to simply leaving off some of the digits from each response. So if we had really large numbers like populations of countries we might decide to cut some of the final digits rather than trying to include them in our stem and leaf display. If we had a value like 13,432,851 we might simply cut the last five digits and present this with a stem of 13 representing millions and a leaf of 4 representing hundred-thousands. In this case we’re only keeping detail to the level of hundreds of thousands but for something like country populations more precision may not be necessary.
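Cutting is just integer division, as this small Python sketch shows. The function name `split_cut` is made up for this example, and the second value is a hypothetical number added only to show where cutting and rounding would disagree.

```python
def split_cut(value, leaf_unit):
    """Drop every digit below the leaf unit, then split into (stem, leaf)."""
    return divmod(value // leaf_unit, 10)


# The value from the text: cut the last five digits.
print(split_cut(13_432_851, 100_000))  # stem 13 (millions), leaf 4

# Hypothetical value: cutting keeps the leaf at 4,
# whereas rounding to the nearest hundred thousand would give 5.
print(split_cut(13_482_851, 100_000))
```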

Tukey wrote that our goal is to “move to easily understandable units that are more helpfully related to the numbers at hand”. We don’t want to round or cut values if it means losing important information, but if that level of precision is unnecessary then rounding or cutting might be clearer. The same would be true for decimal places. Tukey considered decimals a liability; they make data harder to look at and they increase the chance of making errors. So he suggested using units that eliminate the need for decimals whenever possible. Large numbers with decimals like 2.14 million could be represented with a stem of 21 hundred-thousands and a leaf of 4 ten-thousands. Or instead of having a table labeled with thousandths of an inch and using 1.27, this could be labeled as hundredths of thousandths of an inch, allowing us to use 127 hundredths of thousandths of an inch, although my struggling to say hundredths of thousandths of an inch several times is a pretty strong argument for just adopting the metric system.
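Changing units to get rid of decimals can be sketched in Python like this. It uses the standard library’s `decimal` module so that values like 1.27 are scaled exactly; the function name `to_whole_units` and the variable names are hypothetical choices for this example.

```python
from decimal import Decimal


def to_whole_units(value_text, factor):
    """Re-express a decimal value in a smaller unit so it becomes a whole number."""
    scaled = Decimal(value_text) * factor
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value_text} still has a fractional part after scaling by {factor}")
    return int(scaled)


# 1.27 thousandths of an inch -> 127 hundredths of thousandths of an inch
hundredths_of_thousandths = to_whole_units("1.27", 100)

# 2.14 million -> 2,140,000 -> stem in hundred-thousands, leaf in ten-thousands
count = to_whole_units("2.14", 1_000_000)
stem, leaf = divmod(count // 10_000, 10)

print(hundredths_of_thousandths, stem, leaf)  # prints: 127 21 4
```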

There are other variations of stem and leaf displays that allow us to keep the original values for larger numbers by doing things like having more than one digit in the leaf and separating these with commas or by having what are called mixed stems where different units of stems are used in the same table. But Tukey writes that these are mostly for storing data. They’re not so useful visually; they no longer give us that sense of the distribution.

Now today these might seem a bit obsolete because we have computers and we have spreadsheets that can store information much more clearly than one of these more complex stem and leaf displays but we have to remember that when these were developed, you know, researchers didn’t have computers directly collecting data from participants and they didn’t have spreadsheets to easily organize and store all their data. And so they might have to go through by hand with a stack of individual responses and transcribe those into a table. And so in this case using a more complex stem and leaf display would give them a way to do that and it would also allow them to see something about the distribution of the data while they’re doing it.

So it was a bit more efficient then, although it’s somewhat unnecessary today. I recommend doing some of these simpler stem and leaf displays with some data. So take a set of data, or make one up, and just try out some different things. See what happens if you have very large numbers or very small numbers, different units that you might use for the stems and the leaves, or how you might do different types of rounding or cutting, and just see what it does to your sense of the data, right? What you’re trying to develop is what Tukey called a general “feeling for the numbers as a whole”.

So that’s the basic idea of stem and leaf displays. I hope you found this helpful. If so, like and subscribe for more educational content related to psychology and statistics, let me know if you have any questions in the comments, and make sure to check out the hundreds of other psychology tutorial videos that I have on the channel. Thanks for watching!
