In this video I explain the difference between validity and reliability and then describe several ways of assessing reliability including split-half reliability, test-retest reliability, equivalent-form reliability, and the related concept of standardization.
Don’t forget to subscribe to the channel to see future videos! Have questions or topics you’d like to see covered in a future video? Let me know by commenting or sending me an email!
Check out my book, Master Introductory Psychology, an alternative to a traditional textbook: http://amzn.to/2eTqm5s
Video Transcript
Hi, I’m Michael Corayer and this is Psych Exam Review. In the previous video I talked about different types of validity for assessing an assessment, that is, how we determine whether or not an assessment is valid. But we also want to determine whether or not a test is reliable.
So what does reliability refer to? This is the idea that if we measure the same object, or in the case of an IQ test, the same person, and we use the same measure or the same assessment then we should get the same result. So if you take an IQ test and you score 130 and then six months later you retake the same IQ test you should score 130 again. If you took the test again and you scored 70 the second time then that would indicate that the test is not reliable, assuming that you hadn’t had some sort of traumatic brain injury that would explain this large drop in your IQ. We’d probably question the reliability of the assessment.
Now it’s important to recognize that reliability and validity are different things. It’s possible for a test to be reliable even if it isn’t valid. In the previous video I mentioned this idea of measuring intelligence by asking you your favorite color. I said that wouldn’t be a valid way to assess intelligence, and that’s true, but it could still be a reliable assessment, and by that I mean each time I ask you “what’s your favorite color?” you might always tell me blue. So I get the same result each time I use this assessment, but that doesn’t mean that it’s valid. So it’s important to remember that reliability and validity refer to different things. Ok, so how do we go about assessing reliability?
One way we can do it is what’s called split-half reliability. What we do in split-half reliability is we split the test in half. So let’s say you’ve taken an IQ assessment and it has 200 questions. Well, what I might do is randomly divide all of those questions into two groups and then I calculate two IQ scores for you.
And so what should be the case is that you should get roughly the same score on each half of the test. If you get drastically different scores on the two halves of the test then I might need to look at this test more closely. I might need to reconsider some of the questions. Why is it that you would get a much higher or much lower score on one half? Which questions were in there? Maybe I should find a way to balance that out. Maybe I need to remove some questions, or maybe add some other types of questions to my assessment, right?
So that’s split-half reliability and it works even better if you have a very large assessment. So if I had an IQ assessment with a thousand questions on it and I was splitting it up into two groups of 500 then it should be the case that you should get pretty much the same score on each half of the test. And if not, then we might question the reliability of the test. Ok, so that’s one way to assess reliability.
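To make this concrete, here's a minimal sketch in Python of how a split-half estimate might be computed (the data and function names are illustrative, not from any real IQ test): randomly split the items into two halves, score each half per person, correlate the half-scores, and apply the Spearman-Brown correction, which adjusts for the fact that each half is only half the length of the full test.

```python
import random

def pearson(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(responses, seed=0):
    """Split-half reliability for a list of test-takers' item scores
    (each person is a list of 0/1 correct-incorrect values)."""
    n_items = len(responses[0])
    items = list(range(n_items))
    random.Random(seed).shuffle(items)            # random split of the items
    half_a, half_b = items[:n_items // 2], items[n_items // 2:]
    scores_a = [sum(person[i] for i in half_a) for person in responses]
    scores_b = [sum(person[i] for i in half_b) for person in responses]
    r = pearson(scores_a, scores_b)               # correlation of half-scores
    # Spearman-Brown correction: step the half-test correlation
    # up to an estimate for the full-length test.
    return 2 * r / (1 + r)
```

If the two halves give wildly different scores, the correlation (and so the corrected estimate) drops, which is the signal that some questions may need rebalancing.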
Another way is called test-retest reliability, and this is actually how I introduced the idea of reliability: you take an IQ test and then six months or so later you take it again. So each participant simply retakes the same test, and the idea is that if it’s a reliable test they should get the same result each time. Now the thing about test-retest reliability is that it also helps us to look at the administration of the assessment. An IQ test is going to be administered in person by somebody who’s trained to do this type of assessment, and if we look at the test-retest reliability we might be able to determine whether or not there’s bias among certain examiners or administrators of a test.
So what I mean is, let’s say a hundred people took an IQ test and then they all took the same IQ test again with a different examiner and now if all of their scores mysteriously dropped by 20 points the second time then we might wonder about the person who’s doing the examining, the person who’s assessing them. We might think that there’s a possibility of bias there. So this also allows us to eliminate potential bias and the idea is regardless of who gives you your IQ assessment you should get the same result and if that’s not the case then we can question either the test or maybe something has happened to you that would change your IQ. Or there’s some possibility that there’s bias in the administration of the test.
Now you might wonder: if people are taking the test over again, maybe they’ll just get better anyway. Maybe they’ve had some more practice, and that means they’re going to score higher the second time. Maybe they remember some of the questions from before, or they’ve had more time to think about those same exact questions, or maybe they’ve looked some of them up online and found answers to a question they just couldn’t seem to solve during the assessment, and the second time they’ll get it right. So how do we get around this problem?
Well, what we try to have is what’s called equivalent-form reliability, and the idea of equivalent-form reliability is that there are multiple versions of a test, but they’re all equivalent: the score that you get will be the same regardless of which version of the test you take. Now you might wonder how we’re able to do this. How do we make sure that the questions are equally difficult each time? This brings us to the SAT, which is a great example of equivalent-form reliability: the SAT is offered multiple times per year, thousands and thousands of students take it each time, and their scores are compared without considering which test they took.
So we don’t say “oh well you took the SAT in October and that test was a little bit easier compared to the June test from last year so we have to think about your score differently”. We don’t want to have to deal with that. We want to assume that your SAT score is going to be pretty much the same regardless of when you take the SAT. And because you can take the SAT multiple times, we need to have different versions of the test. We don’t want students to know the answers from last time and we want all of them to be equivalent.
So how does the College Board go about doing this for the SAT, or for other exams like AP exams? How do we make sure that this happens? Well, this brings us to the idea of standardization, and you’ve probably heard of these tests called standardized tests. So what does that mean exactly? When we say that a test is standardized, we’re saying that there are clear rules for the administration and scoring of the test.
So if you’ve taken the SAT before or you’ve taken an AP exam, you know that there are rules that the proctors have to follow for administering the test. If those rules aren’t followed then the exams don’t get scored. So if an examiner decided “hey, you can all have ten more minutes to work on this section” then that would be a big problem, because that would violate the rules and the test would no longer be standardized. Now, this doesn’t really address the question of how to make sure that the questions are equally difficult. If you take an SAT from October and compare it to one from June, you might wonder “well, how do we know that these two different reading passages and questions are equally difficult? How do we ensure that that’s the case?”.
And this brings us to the idea of a standardization sample: we want to test the questions out with a group of people, a sample, that allows us to standardize the test. And that means we want to have a representative sample. We want people who represent those who are actually going to be taking this test, and what we want to do is establish norms. That means we want to figure out how difficult this particular section of the test should be. What percentage of students do we expect to be able to answer this question correctly?
So how does the SAT establish norms for its examinations? Well, in the case of the SAT they have a very representative sample: actual high school students taking the SAT. What I mean by that is, when you take the SAT there are some questions on there that aren’t actually going to count towards your score. They’re experimental questions, but they’re still going to be graded.
So the examiners are going to look and see what percentage of students were able to answer this question, what percentage of students were able to answer these questions on this reading passage. Maybe this isn’t an appropriate reading passage. How does the difficulty compare to all of the other SATs that we’ve given? So each time that students take the SAT it’s also a way of establishing norms for the test. Some of the questions are being sort of tested and assessed for whether they’ll be included on a future SAT as real questions that will count towards your score.
And the same will be true for intelligence assessments, right? We’d have certain questions that we’re sort of testing out and we’re seeing how well should people be able to do on this particular question. We test it out with lots of people and then that tells us something about how difficult it is and we can be confident that it’s of equal difficulty to some other question that’s on a similar version of the test. And that’s another way that we get at this equivalent-form reliability.
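The norming process described above can be sketched with a toy example, assuming simple 0/1 item scores (the data are invented for illustration): classical item difficulty is just the proportion of the standardization sample answering an item correctly, and a norm-referenced score is a percentile rank against that sample’s score distribution.

```python
def item_difficulty(responses):
    """Classical item difficulty: the proportion of the standardization
    sample answering each item correctly (each person is a list of
    0/1 item scores)."""
    n_people = len(responses)
    n_items = len(responses[0])
    return [sum(person[i] for person in responses) / n_people
            for i in range(n_items)]

def percentile_rank(score, norm_scores):
    """Percentage of the standardization sample scoring below `score`,
    i.e. a norm-referenced interpretation of a raw score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)
```

Comparing the difficulty values of tryout items across forms is, roughly, how one could check that two versions of a test are pitched at the same level before counting those items toward anyone’s score.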
Ok, so those are the different types of reliability and how we can assess them. I hope you found this helpful, if so, please like the video and subscribe to the channel for more. Thanks for watching!