Chapter 2: Research Methods — Master Introductory Psychology

Research Methods

How psychologists collect reliable data, avoid bias, and draw valid conclusions — the scientific toolkit that separates psychology from guesswork.


A Challenging Task

As we saw in Chapter 1, we cannot always trust our own experience to determine what is true. With this in mind, psychology has adopted methods for collecting data, analyzing results, and reporting findings that aim to eliminate bias and get more objective information.

There are a few reasons for the greater difficulty that psychology faces compared to other sciences. Human behavior is especially complex. It can be difficult or impossible to narrow something like happiness or aggression down to a single measurement. In addition, psychologists often aim to measure states that are variable and fleeting. Your subjective emotional response to a particular question or task may be markedly different from those of the participants before and after you. A happiness rating may not represent how you will feel 10 minutes later. Strong feelings of aggression may last only a few moments. Imagine trying to measure the length of a desk that was constantly changing. Psychological research must try to account for the variation between individuals as well as the variation that occurs within a single individual, whose level of happiness, aggression, or depression may be in constant flux.

Psychology must also deal with the fact that measuring people can actually change those people. A desk doesn't shrink in fear or expand with pride when you attempt to measure it, but you may find that human subjects do react to your curious prodding. They may exaggerate their happiness to put you at ease, or downplay their joy to reduce your potential jealousy. They may constrain their aggression to fit social expectations, or they may lash out to display their dominance. Circumstances which create the possibility that participants may be adjusting their responses are known as demand characteristics: cues in a study that lead participants to guess its purpose and adjust their responses, biasing the results. This reactivity poses a number of challenges to psychological research. One particular example of a demand characteristic is social desirability, which is when a participant gives a response that they believe to be more socially acceptable or “correct” even if that response does not accurately reflect the truth.

Imagine that I were to conduct a study on illegal drug use among all of my students. I arrange to have each student come to a personal interview in my office, along with their parents, and supervised by a school administrator. After collecting my data I proudly declare that not a single one of my students has ever used or even considered using illegal drugs. Do you see the problem with my approach? Of course! No one is going to admit to that behavior in that situation. The demand characteristics are overwhelming, and as a result I will end up with bad data. This example may seem obvious, but we must remember that demand characteristics can sneak into any scenario. Depending on the context, people may want to seem more depressed or less depressed, more anxious or less anxious, more responsible or less responsible, and it is the researcher's responsibility to design situations which reduce these influences.

Sometimes even the very act of measuring can cause changes in participants, and it can be difficult to assess whether those effects are the result of the measurement process itself or of the experimental change in the environment. From 1924 to 1933, a number of experiments on worker productivity were conducted at the Hawthorne Works, a Western Electric factory outside Chicago. When investigating workplace conditions, researchers found that increases in illumination improved productivity. Strangely, they also found that dimming the workplace lighting increased productivity. Both increases faded after the study ended, however, which suggests that the productivity gains were due to being observed, or perhaps feeling attended to, rather than directly related to lighting levels. Similar experiments involving modifications to work areas, cleanliness, and workspace re-locations also found temporary changes in productivity which lasted only a few weeks and then returned to baseline. This tendency for behavior to change simply because people know they are being observed (regardless of what the environmental changes are) is referred to as the Hawthorne Effect, and it demonstrates the difficulties of measuring complex variables like work performance.

With all of these potential biases and effects, collecting psychological data is tricky. Yet we still want to know things, so we try to do the research anyway, eyes open to the difficulties we will inevitably face.

✎  Quick check — Section 1
What are demand characteristics?

What Can We Believe?

The most important aspect of conducting or considering psychological research is adopting an attitude of skepticism: questioning all claims and requiring evidence before accepting them as true. We cannot simply believe whatever we read or hear. In science, all claims are immediately met with questioning.

I find it helpful to understand the skeptical mindset by looking at claims that we may naturally be more skeptical of. Claims of psychics, paranormal activity, and ESP represent cases where many people may naturally be more skeptical (though perhaps not as many people as we might hope). For example, when we hear someone claim the ability to bend metal with his mind, we probably find ourselves adopting a skeptical posture. We shouldn't settle for just hearing the claim; we should want to see a thorough investigation, followed by evidence either supporting or refuting the claim. James Randi, a magician and promoter of greater skepticism in education (through the James Randi Educational Foundation), has spent decades exposing psychic frauds (like the metal-bending Uri Geller) by designing carefully controlled studies which prevent the use of deception and trickery. Since 1964 Randi has offered a cash prize (which has grown to $1,000,000) to any person whose paranormal claims can meet the terms of scientific testing. No psychics, mentalists, or mediums have ever claimed the prize, and most won't even agree to testing, knowing full well that they would quickly be exposed as charlatans.

While it may be easier for us to be skeptical of paranormal claims, we should take the same approach when considering the more “normal” types of claims we see on a daily basis. We are constantly faced with claims that psychologists might want to scientifically investigate, from brain enhancement and sleep improvement to happiness boosters and mate-attraction strategies. What claims are being made, and who is making them? How can we assess the validity of these claims? Who is collecting the data, how are they collecting it, and how are they analyzing and presenting the results? Developing a skeptical mindset and insisting on evidence will not only help us to avoid scams, it will help us to understand the world.

Even when we have carefully reviewed the evidence and come to a belief about a claim, we must remain humble. We must avoid overconfidence and be willing, in fact eager, to accept new knowledge that overturns what we've come to believe. The goal in science is not to always be right but to always seek better explanations. Should a result support our hypothesis, that's great: we have some evidence that the hypothesis applies to this particular situation. Should the result fail to support our hypothesis, that's great too: we now have evidence that a new explanation may be needed. By remaining open to contradiction and critique we remain open to new ideas and new discoveries.

In our humility, we also recognize that we are human, and just like our participants, we are prone to bias and error. In 1963, Robert Rosenthal and Kermit Fode published the results of a study demonstrating that researchers themselves are prone to error. In this study, introductory psychology students were given rats to care for and run in some experiments. Half of the students were told they had “maze bright” rats who were bred and selected for excellent maze abilities. The other half were told they had “maze dull” rats, selected for poor maze skills. The students then ran a number of maze tests with their rats over the course of 5 days. Not surprisingly, they found that the “maze bright” rats outperformed their “maze dull” counterparts by a significant margin in both correct completions and speed. While the student experimenters weren't surprised by this difference, they should have been. What they didn't know was that all of the rats were randomly selected from the same group and therefore should have all performed comparably. There were no bright or dull rats and it was the pre-existing beliefs of the students that were responsible for the different measurements.

How could this be? Perhaps the result was due to a number of small unconscious differences in the treatment of the rats that ended up having a cumulative effect. Anything from gentler handling, a subtle push at the start of the maze, encouragement along the way, or a faster trigger-finger on the stopwatch may have influenced the results. These tiny unconscious influences created a significant difference between the groups, making it appear that there really were bright and dull rats. We must realize that our expectations can influence our observations. This is an example of observer bias.

One way of reducing both demand characteristics and observer bias is a double-blind study, in which neither the participants nor the observers who are collecting the data (perhaps a computer, research assistants, etc.) know about the true purpose of the study or which group a participant is in. In Rosenthal and Fode's case, if the students hadn't known whether their rat was “bright” or “dull”, they probably would not have found a difference between the groups of rats.
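The assignment logic of a double-blind design can be sketched in code: each participant receives a neutral condition code, and the key mapping codes to real conditions is withheld from everyone who interacts with participants until data collection ends. This is an illustrative sketch under simplified assumptions; the function name is invented for the example.

```python
import random

def blind_assign(participant_ids, codes=("A", "B"), seed=None):
    """Randomly assign participants to neutrally coded conditions.

    The experimenter running sessions sees only the codes; the key
    that maps "A" and "B" to the real conditions is held by a
    coordinator until all data are collected.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Alternate down the shuffled list so group sizes stay equal.
    return {pid: codes[i % len(codes)] for i, pid in enumerate(ids)}

assignments = blind_assign(range(10), seed=42)
```

Because the shuffle happens before the alternating assignment, every participant has an equal chance of landing in either group, and the groups come out the same size.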

✎  Quick check — Section 2
What is the most important attitude a researcher should adopt toward scientific claims?

Defining Properties

With any psychological question, what we are really trying to establish is better understanding of a property. We want to know about properties like happiness, depression, aggression, intelligence, etc. The problem with wanting to know about psychological properties is that they can be especially difficult to measure.

In other sciences, it can often seem easy to determine how to define a particular measurement. For example, the general property of length can be defined in terms of centimeters. It's important to remember, however, that centimeters themselves are really just one way of talking about length; they are not the property itself. In other words, measures do not detect properties directly; they simply provide more convenient representations for us to talk about.

It's simply been agreed upon that centimeters are a good way to talk about length, and ideally your centimeter is the same as my centimeter (if not, the measure would become useless). When I decide to measure in centimeters, this is my operational definition for the property of length: a specific, measurable procedure that defines how the property will be observed or measured. We may say that the property of length has been operationalized as number of centimeters. Someone else may choose miles, nanometers, or furlongs, which would just be different operational definitions for the same property (length). Depending on the situation, some operational definitions may be more appropriate than others. So if we're at IKEA, it's easier for me to tell you that a desk is 80 centimeters in length rather than tell you that the desk is 0.000497097 miles. Even though both of these may be accurate, one is better for helping you decide if the desk will fit in your bedroom.
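As a quick check of the desk example, here is the same length expressed under the two operational definitions, using the standard conversion of 1 mile = 160,934.4 cm:

```python
CM_PER_MILE = 160_934.4  # 1 mile = 1.609344 km = 160,934.4 cm

desk_cm = 80
desk_miles = desk_cm / CM_PER_MILE
print(round(desk_miles, 9))  # 0.000497097
```

Both numbers describe the identical desk; only the representation changes.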

The same concept is true in psychological research. We may refer to particular measures like depression scales, personality inventories, or satisfaction-with-life assessments, and these are ways of attempting to agree on measurements of psychological properties. As you might guess, however, agreeing on how to define a property like happiness is far more difficult than agreeing to define length in centimeters. I may choose to measure happiness by counting smiles, recording laugh time, completing a survey from -5 to +5, or by looking at patterns of brain activity.

No matter which approach I use to measure happiness, I'm hoping to have construct validity. Construct validity means that the operational definition that I've chosen has a clear relationship with the property in question. For example, we might say that using smiles as a measure of happiness has construct validity because we know that there is a relationship between being happy and smiling. In fact, this is an operational definition we probably use quite often when assessing happiness in daily life. We see someone smiling and immediately assume he is happy. Or we try to make someone happy and gauge how successful we are based on whether or not they smile. This doesn't mean that smiling always represents happiness (as there are fake smiles and happy non-smilers), but in general the relationship seems clear.

If instead of counting smiles I chose to count cars as my measure of happiness, the relationship may not seem as clear. While I may argue that owning cars makes some people happy and therefore more cars equal more happiness, chances are that many people would disagree with me, citing happy bikers or depressed millionaires with garages full of vehicles. Or they may even cite the car dealer who is happiest each time a car leaves his possession. In this case, my operational definition (measuring happiness as # of cars owned) would be criticized as having low construct validity, meaning that it does not represent the property I'm hoping to study.

✎  Quick check — Section 3
An operational definition is best described as:

Types of Studies

Just as there are many ways of defining properties, there are many ways to go about collecting our data. One approach is to focus on unique situations or special individuals and then study these cases in-depth. This is known as a case study: an in-depth investigation of a single individual, group, or event, providing rich detail but limited generalizability. The advantage here is that we can study a participant in detail and use many different operational definitions. For example, if I study one person who is extremely happy, I can consider their happiness in terms of many behaviors, thoughts, and feelings and get a fuller picture of what happiness means for that person.

The problem, however, is that a single individual will only tell me about happiness for that one person. I cannot assume that this is representative of happiness for all people, who may experience it differently. Unfortunately, case studies only tell us about individuals, and we can't necessarily use that information to know about the “average” person. Knowing that an approach works for one person may be nice, but we also want to know whether the approach will work for others. Still, case studies are important when we want to understand remarkable traits (like high IQ or rare ability) or when we want to study events or situations that we couldn't ethically impose on someone (like brain damage, extreme stress, or rare diseases).

✎  Quick check — Section 4
A case study provides detailed data about one individual but is limited because:

The Survey Method

Instead of a case study, I might decide to collect survey data from a much larger number of people. This will give me a better idea of the average, though I probably won't be able to get nearly as much detail from each individual in my study. I need to bear in mind that demand characteristics may influence my survey and how people respond. How I choose to word my questions may have a subtle but important effect on the responses I receive. For example, asking whether an organization should “limit access to inappropriate websites” may sound more agreeable than asking if they should “censor web content”. Even though both questions refer to the same idea, the different framing of the question may shift the results in a particular direction. This shift may mean results do not accurately represent how people actually think.

I should also consider that there may be something different about the people who actually took the time to respond to my survey. This is known as reporting bias. For example, if most depressed people fail to return my survey on happiness, but happy people all report back, I may end up with an overly optimistic view of average happiness. This might bring me to consider more carefully who is participating in my research. Researchers want to know about a particular population of people and the term population refers to anyone in that group who might be studied. This target population might be teenagers, the elderly, college students, patients suffering from depression, or soldiers returning from combat. Ideally we could study every individual in a population, but in most cases, the populations we want to know about are far too large. So while the population includes everyone we might study, the group of people that we actually manage to study is known as the sample, and this sample is used to represent the entire population.

The best way to create our sample would be to choose individuals from the population at random. This random sampling technique would ensure that every member of the population had an equal chance of being selected. Theoretically this is ideal, but in practice this approach is often unrealistic or impossible to implement. Instead we may use an opportunity sample or convenience sample, based on those members of the target population that are readily available to us. For example, if I want to know about college students I may choose a nonrandom sample from one local college, rather than attempting to randomly select from all college students everywhere.

Researchers may also want to use stratified sampling to ensure that sub-populations within the population are well-represented in the sample. For example, I may want an equal number of males and females in my study or proportional representation of different ethnic groups, and a truly random sample would not guarantee this.
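The difference between simple random sampling and stratified sampling can be sketched in a few lines of code. This is an illustrative sketch: the function names and the toy population are invented for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member of the population has an equal chance of selection."""
    return random.Random(seed).sample(population, n)

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Draw a fixed number of members from each sub-group (stratum)."""
    rng = random.Random(seed)
    groups = {}
    for person in population:
        groups.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

# Toy population: 30 "M" and 60 "F" members.
population = [{"id": i, "gender": "M" if i % 3 == 0 else "F"} for i in range(90)]

# A simple random sample would, on average, reflect the 1:2 imbalance;
# the stratified sample guarantees 10 of each.
sample = stratified_sample(population, lambda p: p["gender"], per_stratum=10, seed=0)
```

Note that the stratified sample deliberately over-represents the smaller group relative to the population, which is exactly the point when a sub-group would otherwise be too small to analyze.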

So while random samples are the ideal, most studies don't use random sampling. In fact, many studies are criticized for not even having a representative sample because the participants tend to be college students, which means they are often younger, better-educated, wealthier, and whiter than the population they are meant to represent.

But rather than throwing up our hands in frustration because we can't get a sample that will represent all people from New York to New Guinea, we just do the research, non-representative sample and all. In many cases, non-representative data is better than no data. We can't expect every study to find a representative sample, and we might still learn something from the study. Finding that a particular effect happens in just one limited sample may be interesting on its own. Or we might be able to gauge the representativeness by comparing a large number of non-representative samples from many different studies. So if I only study white, middle-class suburban teenagers and someone else only studies recent immigrants living in a large metropolis, we can compare our findings to see if the results generalize from one sample to another. Finally, sometimes we may not care if a sample is representative because we can assume that the results apply anyway. For example, if a new drug killed all the people who took it you would be wise not to try it, even if everyone in that sample was a different gender, age, or ethnic group than you.

✎  Quick check — Section 5
If ice cream sales and drowning rates both increase in summer, this is an example of:

Correlational Studies

Our minds are constantly looking to connect ideas. We want to know how things are related, so we naturally look for patterns in the world around us. Most of the time, this pattern-seeking works well and allows us to make intuitive predictions about how the world works. Of course, our minds don't work perfectly all the time, and as a result, we may occasionally make errors in connecting two ideas, seeing a pattern where no pattern actually exists. This is known as an illusory correlation. For example, I may notice that when I aced my psychology exam I was wearing a blue shirt. I realize I was also wearing a blue shirt when I did well on a math exam. I may come to believe that there is a relationship between wearing blue and performing well. While this is a silly example, it does happen in real life (think lucky charms and gambling superstitions), and illusory correlations can become far more troublesome when they involve incorrect assumptions about gender or race. I may notice a particular behavior being performed by a person of a particular ethnic group on more than one occasion and then come to believe there is a relationship between the behavior and the entire ethnic group.

Perfect Positive Correlation (r = +1.0) — every data point falls exactly on a straight line. Knowing one variable lets you perfectly predict the other.

This can be dangerous, because once we come to believe an illusory correlation we may pay more attention to examples that fit our belief and disregard or ignore any contradictory examples. This is known as confirmation bias, and it can strengthen our belief in patterns that aren't actually there. In order to safeguard ourselves from these false patterns, we may need to actually collect and analyze data to better understand the relationship between two variables. When doing so, we are performing a correlational study.

Moderate Negative Correlation (r = -0.6) — as one variable increases, the other tends to decrease, but with variability around the trend line.

In a correlational study, researchers simply measure two variables and then look for a pattern of variation. One way of looking at the measurements from correlational research is to create a scatterplot, which simply plots the scores from each variable, one on the x-axis and one on the y-axis. Just by looking at a scatterplot, we can see if there is a general pattern between the variables. Perhaps as one variable increases we see that the other variable increases or decreases in a predictable manner. Of course, seeing isn't quite enough, and we want to have a way to quantify the strength of the relationship between the variables. To do this we calculate the correlation coefficient (represented by the letter r). The r value can range from -1 to +1. Positive r values indicate positive correlations. In a positive correlation, as one variable increases the other also increases. In a negative correlation, as one variable increases the other decreases.

No Correlation (r = 0) — the data points are scattered randomly. Knowing one variable tells you nothing about the other.

We can think of the r value as telling us how accurately we could predict one of the variables if we knew the other. The closer to 1 or -1, the greater the accuracy of our predictions. As we move closer to 0, the ability to make predictions becomes less accurate. An r value of +1 indicates a perfect positive correlation, and -1 indicates a perfect negative correlation, both of which mean that one variable predicts the other variable perfectly accurately (the difference between positive and negative is just whether the other variable increases or decreases).

In perfect correlations, each data point falls neatly on a line, with no exceptions. Naturally, this type of relationship is rare and exceptions are common. As exceptions build up, the points become more and more spread out, until they no longer resemble a line at all. An r value of 0 indicates no relationship between the two variables, meaning that the data are completely random. As we can see from the scatterplots above, the closer our data are to a line (closer to +1 or -1), the stronger the relationship, and the more spread out the points, the weaker the correlation. (Note that the second diagram shows a negative correlation, and thus the line slants downward: as X increases, Y decreases.)
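The r value described here is Pearson's correlation coefficient, computed as the covariance of the two variables divided by the product of their standard deviations. A minimal sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the variables vary together...
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...scaled by how much each varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# y is a straight-line function of x: a perfect positive correlation.
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 10))  # 1.0
```

Feeding in data where one variable falls as the other rises yields a negative r, and unrelated data yields a value near 0.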

Even when we find a strong correlation, we have to remember that this is not evidence for causation. While correlations tell us that two variables are related, they can't tell us the type of relationship. If I were to measure my students' study time in hours and then measure their exam scores, I may find a positive correlation, meaning that longer study time is related to higher scores.

But simply knowing that long study time and high scores are correlated doesn't tell me anything about causation. There are three ways that causation might be working:

1. Longer study time does in fact cause higher scores. (X causes Y)

2. Earning higher scores encourages students to spend more time studying. (Y causes X)

3. Some other variable that wasn't in my study influences both study time and scores. (Z causes X and Y)

Correlational data have no way of indicating which of these causal relationships is occurring. While we may sometimes be able to eliminate one of the first two directions of causation, the third option always remains a possibility. This other variable influencing both is referred to as a third variable.

✎  Quick check — Section 6
What is the defining feature of a true experiment?

The Third-Variable Problem

The possibility of a third variable makes things difficult, because the third variable could be anything. This is known as the third-variable problem, because there's always a possibility that some variable we never thought of is responsible for the correlation. For example, in the case above, it could be parenting style causing both high study time and higher scores. Or it could be interest in the class material. Or it could be teaching style, or peer competition, or caffeine intake, or something else in the diet, or hormones, or just about anything we could possibly imagine, and then anything we couldn't possibly imagine. We can never eliminate the third-variable problem because the number of possible third variables is infinite. If we haven't actually measured the third variable in question, we can't eliminate it from consideration, even if it seems ridiculous.

If we suspect a particular third variable, we can try to rule it out by measuring it and comparing participants who are matched for that variable. For example, if I'm concerned that caffeine intake might be increasing both study time and scores, I would want to compare students with equal amounts of caffeine intake. If my sample reports drinking an average of 3 cups of coffee per day, I might want to collect another sample with the same 3-cup average to compare. If the relationship between study time and scores still appears across groups with identical caffeine intake, this indicates that caffeine is not responsible for the variation. This approach would be called a matched sample.

If I want to get more specific with my matching, I might compare individuals directly. So if I have a participant who drinks one coffee per day, I want to compare him to another individual who also drinks one coffee per day. If another participant drinks 8 coffees a day, I would want to compare her results to someone who also drinks 8 coffees per day. Each individual in my study would have a partner with the same coffee intake. This would be known as matched pairs.
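The matched-pairs idea can be sketched in a few lines of code. This is an illustrative sketch with made-up participants; the `matched_pairs` helper is invented for the example, not a standard library routine.

```python
def matched_pairs(group_a, group_b, match_on):
    """Pair each member of group_a with an unused member of group_b
    that has the same value on the matching variable."""
    available = list(group_b)
    pairs = []
    for a in group_a:
        for b in available:
            if match_on(a) == match_on(b):
                pairs.append((a, b))
                available.remove(b)  # each control is used only once
                break
    return pairs

# Hypothetical participants matched on daily coffee intake.
students = [{"name": "P1", "coffee": 1}, {"name": "P2", "coffee": 8}]
controls = [{"name": "C1", "coffee": 8}, {"name": "C2", "coffee": 1}]
pairs = matched_pairs(students, controls, lambda p: p["coffee"])
```

Each resulting pair holds the matching variable constant, so any difference within a pair cannot be attributed to coffee intake.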

While matched samples and matched pairs help us rule out particular third variables, they can't help us eliminate them all, because third variables are infinite, so we will never be able to match for all of them. Instead, we turn to another solution to the third-variable problem: experimentation.

✎  Quick check — Section 7
The mean of the scores 2, 4, 6, 8, 10 is:

Experimentation

The best method we have for trying to get around the third-variable problem is to perform controlled experiments. An experiment minimizes the third-variable problem by manipulating one (and only one) of the variables in the study, rather than simply measuring it, and then seeing if this manipulation causes a change in another variable. To use our example above, if I forced some students to study longer but I didn't change anything else, and then I found that their scores increased, I could be more confident that it was the study time that was causing the higher scores. In this case, I don't need to wonder whether parenting style, motivation, or coffee was causing them to study more, because I know that the experiment was what was causing them to study more. This reduces the possible influence of all other third variables.

Before we go through all the detailed steps of designing an experiment, let's first consider how we come up with ideas for experiments in the first place. In order to conduct an experiment, I first need to have some idea of the type of relationship I'm looking for. I don't simply change any variable at random and then measure another random variable and hope for a pattern to emerge. Instead, I begin with a theory. A theory is a general explanation of how something works or how a phenomenon occurs. For instance, I may theorize that exercise improves problem-solving ability.

A theory is about general properties, so I can't test it directly. I can, however, come up with specific definitions based on my theory that will be testable. This is how a hypothesis is generated, and it allows us to make a specific prediction about how something will happen, then see if that prediction was accurate.

For example, if I start with the general theory that “exercise” improves “problem solving ability”, I need to come up with testable definitions for each of these properties. I may decide that one session of 15 minutes of walking at 65% of maximum heart-rate will be my operational definition of “exercise” and that “problem-solving ability” will be measured by how many Sudoku puzzles a participant can correctly solve in 30 minutes. These certainly aren't the only possible definitions, but they allow me to create a specific prediction. Now I have a hypothesis based on a general theory.

Theory: Exercise improves problem-solving ability.

Hypothesis: Participants who engage in the specified exercise will solve significantly more Sudoku puzzles than participants who do not engage in the specified exercise.

With a testable hypothesis, I'm now able to collect quantitative data which will either support or contradict my prediction. Depending on whether my hypothesis is supported or refuted, I may decide to refine, revise, or reject my general theory. I may decide that the theory is still appropriate, but that my operational definitions were poorly chosen and therefore I should try again with new operational definitions. Maybe I need to change the exercise session, or maybe I need to change the problem-solving task. Or I may decide that the operational definitions were valid and the theory itself should be questioned.

An important point is that a theory can never be proven. When we find data which support our hypothesis, we have evidence that the theory is accurate, but no amount of evidence can conclusively determine that a theory will always be true. Even after collecting mountains of data it's always possible that tomorrow we will find subjects for whom exercise has no effect on problem-solving, or for whom it even has the reverse effect.

Until we have tested every possible subject in every possible way (which will never happen) we can't be 100% confident that a theory will always be true. While this may seem like scientists are all being sticklers, we should think of it as an exciting example of infinite possibilities. Each single study's evidence offers a world of opportunity for greater study and new knowledge. If 15 minutes of walking does have an effect, what about 14 minutes? How about jogging for 10 minutes or sprinting for 30 seconds? If Sudoku-solving is improved, what else might be? Is reaction time improved? How about overall IQ? Do these results continue to grow with long-term commitment to exercise? Could this type of exercise be used to help patients with cognitive impairments? We'll never reach the end of these possible questions, and that should serve to excite us and keep us endlessly fascinated with even the simplest of theories.

But before we get too hung up on the types of conclusions we can draw from research, let's take a closer look at exactly how we go about testing a hypothesis and how we need to design an experiment in order to get accurate data.


Manipulation of the Independent Variable

The most important aspect of experimental design is the manipulation of one of the variables. In the case above, the manipulation would be the exercise session. By having some participants perform the exercise while other participants don't, I can compare their performance and look for a difference. This is the essence of an experiment.

The participants who are assigned to receive the treatment (in this case, exercise) are referred to as the experimental groupexperimental groupThe group in an experiment that receives the treatment or independent variable manipulation., while the participants who do not receive the treatment (no exercise) are referred to as the control groupcontrol groupThe comparison group in an experiment that does not receive the treatment, used as a baseline.. In drug studies, the control group often receives a placebo pill: an inert substance with no medicinal properties that they believe is a real treatment. This is because simply receiving a treatment (such as taking a pill) can cause patients to feel better (known as the placebo effectplacebo effectImprovement in participants who receive an inert treatment, due to their expectation of improvement.), and we want to be sure that the real treatment's effects are even stronger than the placebo effect. In our exercise/problem-solving study, participants will know if they exercised or not, so we might give the control group some other neutral task (instead of exercise) so they don't realize they are the control group.

The variable which is manipulated is known as the independent variableindependent variableThe variable that is manipulated by the experimenter to observe its effect on the dependent variabledependent variableThe variable that is measured to see whether it changes as a result of the independent variable., because whether someone receives the treatment or not is independent from anything else. It doesn't matter whether they would have chosen it on their own, whether they enjoy it, whether they drank coffee or tea that morning, or any other possible variable.

In order to be sure that the manipulated variable is truly independent of all other factors, I must use random assignmentrandom assignmentRandomly placing participants into experimental or control groups, distributing individual differences equally across groups., which is to say that placement of a participant in the experimental group or control group is determined randomly. I need to avoid self-selection, where some existing difference between the participants determines which group they are in (such as allowing them to choose whether they want to exercise). I can't even trust my own judgment in assigning groups: perhaps I would unconsciously assign the more fit individuals to the exercise group, the more attractive individuals to the no-exercise group, or the people who arrived first to the exercise group. Any of these differences might influence later cognitive performance, and I want to be sure that this isn't the case.

So, random assignment (also called random allocation) is a must. Even with random assignment, however, I must remember that demand characteristics may still be influencing the results (perhaps participants in the exercise group suspect the purpose of my study and then devote more effort to their problem-solving in order to be “good” participants).
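Random assignment is simple enough to sketch in a few lines of code. Here is a short Python illustration (the participant labels, group sizes, and seed are hypothetical, chosen purely for demonstration):

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participant list, then split it in half:
    first half -> experimental group, second half -> control group."""
    rng = random.Random(seed)
    shuffled = participants[:]        # copy so the original order is untouched
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# 8 hypothetical participants
people = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
experimental, control = randomly_assign(people, seed=42)
print("Experimental:", experimental)
print("Control:     ", control)
```

Because the shuffle, not the researcher, decides who exercises, pre-existing differences such as fitness or arrival time have no systematic way to pile up in one group.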

Once I've randomly assigned my participants to the experimental and control groups and then manipulated my independent variable, I need to assess possible effects on the dependent variable. The dependent variable is what is measured following the manipulation. If the treatment has had an effect, then I should see a pattern of variation in the dependent variable. If the treatment has not had an effect, then I won't see a difference in the dependent variable. Recall Rosenthal and Fode's rat experiment and you'll remember that I also need to be careful that my own expectations and biases don't affect my measurement of the dependent variable.

I may wonder if certain differences between my participants are affecting the dependent variable. For instance, what if members of the control group are exercising outside the lab? The idea behind randomization is that if people are exercising outside the lab, some of them will end up in the experimental group and some will end up in the control group, so the effect should balance out, provided that I have a large enough sample (see note below). But for some variables that might be affecting my results (known as confounding variables), I may also want to create some controls. I may request that participants refrain from outside exercise for the duration of the study. This would help control for outside exercise. If I'm concerned that different meal times are influencing problem-solving ability, I could control this by asking all participants to fast for 8 hours prior to the experiment. While I won't be able to control all possible confounding variables, I may implement controls for the ones I believe are most likely to affect my data, and assume that randomization will balance out the impact of others. This combination of controls and randomization can help ensure that any effect on the dependent variable is actually being caused by the manipulation of the independent variable. (Note: it is also possible to use matching techniques in an experimental design rather than using randomization, especially in smaller samples, but for the sake of keeping this guide simpler we will be ignoring these designs.)

Descriptive Statistics

Once I've collected my data, I need to figure out what it means. In order to do this, I'll probably start with some descriptive statistics. Descriptive statistics simply describe the data and give us an idea of how scores are distributed in each group.

First, I'll probably want to know about my experimental and control groups in general. For this, I will look at measures of central tendency. There are 3 main measures of central tendency, each with their own strengths and weaknesses, depending on the situation. First is the mean (or the average). This is calculated by finding the sum of all the scores, then dividing this by the number of scores.

Mean = (sum of scores) / (number of scores)

The mean can be useful, but it doesn't tell the whole story. Imagine that I give an exam to a class of 10 students. 9 of the students score a 90 on the exam, and the remaining student scores a 0. In this case, the average score will be 81 (810/10). When I tell my students this, 9 students will think their performance was “above average” even though they each actually just performed better than 1 student. What has happened is that the one extreme score has had a strong effect on the mean. This is the main problem with the mean; it is sensitive to extreme scores or outliers. Just one extreme score can heavily distort the average, especially in small samples.
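To see this numerically, here is the exam example worked out in a few lines of Python:

```python
scores = [90] * 9 + [0]          # nine students score 90, one scores 0

mean = sum(scores) / len(scores)
print(mean)                      # 81.0, dragged well below 90 by a single outlier
```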

When we have extreme scores, we may not want to use the mean and instead look to another measure of central tendency: the medianmedianThe middle score in a ranked distribution — half of scores fall above it, half below.. The median is calculated by simply lining up the scores in order, then finding the middle score. For example, in the following distribution, the median would be 5.

2, 4, 5, 7, 8

In a distribution with an even number of scores, the median will be the average of the two middle scores. So in the following:

2, 4, 5, 6, 7, 8

the median would be (5 + 6) / 2 = 5.5

Back to my exam example above, the scores would line up as follows:

0, 90, 90, 90, 90, 90, 90, 90, 90, 90

And the median in this case would be 90. The distorting effect of the extreme score (the 0) has been reduced. In this case the median gives the students a more accurate picture of how their performance compares with their classmates.

The final measure of central tendency is the modemodeThe most frequently occurring score in a distribution.. The mode looks at the frequency of each score and then tells us which score occurs most frequently. In the example above, the mode would be 90. It should be noted that it's possible to have more than one mode, known as a multimodal distribution, in which two or more scores tie for the highest frequency. It's also possible that we have a uniform distribution, meaning that all scores are equally frequent and there is no most common score.
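Python's standard library can compute these measures directly, which makes it easy to check the examples above:

```python
from statistics import median, mode, multimode

print(median([2, 4, 5, 7, 8]))      # 5, the middle score
print(median([2, 4, 5, 6, 7, 8]))   # 5.5, the average of the two middle scores

exam = [0] + [90] * 9               # the exam example: one 0 and nine 90s
print(median(exam))                 # 90.0, the outlier no longer distorts the result
print(mode(exam))                   # 90, the most frequent score

print(multimode([1, 1, 2, 2, 3]))   # [1, 2], a multimodal distribution
```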

Measures of Variability

In addition to wanting to know about the central tendency of my distribution, I also want to know how the scores compare to one another. To consider this, I need to look at measures of variability. The simplest measure of variability is the range: the distance between the highest score and the lowest score. This tells me how spread out my distribution is overall, but it doesn't tell me much else. It also suffers from being sensitive to extreme scores, since the range is determined entirely by the two extremes. Just one very high or very low score can have a dramatic impact on the range.

So we also want to look at another measure of variability. The standard deviationstandard deviationA measure of variability indicating the average distance of scores from the mean. tells us how far each score falls from the mean. In general, are scores gathered closely around the mean, or are they spread widely from it? The standard deviation is calculated by finding each score's deviation from the mean, squaring those deviations, averaging the squares, and then taking the square root of that average.

The standard deviation gives us a much better idea of how all the scores relate to one another. We know if the distribution is close together or spread out. The larger the standard deviation, the more the scores are spread out from the mean. A small standard deviation means that scores are all clustered together near the mean.
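As a quick numerical illustration, consider two hypothetical score sets with the same mean of 70 but very different spreads, computed here with Python's population standard deviation:

```python
from statistics import pstdev   # population standard deviation

tight = [68, 69, 70, 71, 72]    # scores clustered around the mean of 70
spread = [50, 60, 70, 80, 90]   # same mean of 70, much more spread out

print(pstdev(tight))            # about 1.41: scores hug the mean
print(pstdev(spread))           # about 14.14: scores range far from the mean
```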

The Normal Distribution

In graphing our data, we may create a frequency distribution: showing possible scores on the X-axis and the frequency of each score on the Y-axis. In measuring some traits, we will find that the frequency distribution creates what is known as a normal distributionnormal distributionA symmetrical bell-shaped curve where most scores cluster near the mean and fewer appear at the extremes. or a bell curve. This means that the mean, median, and mode are all in the center of the distribution, and frequency drops off symmetrically in both directions, with half of the scores on each side of the mean.

The normal distribution (bell curve) — scores cluster symmetrically around the mean, with approximately 68% falling within one standard deviation.

Knowing that we have a normal distribution also allows us to quickly estimate the percentage of scores that will fall within a given range: about 68% of scores will fall within one standard deviation from the mean (in both directions), about 95% of scores will be within 2 standard deviations of the mean, and about 99.7% of scores will be within 3 standard deviations of the mean (we'll see this in more detail when we look at intelligence testing).
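We can check the 68-95-99.7 rule ourselves by simulation. This Python sketch draws 100,000 values from a hypothetical normal distribution (mean 100, SD 15, an IQ-like scale chosen just for illustration):

```python
import random

random.seed(0)
mean, sd, n = 100, 15, 100_000
scores = [random.gauss(mean, sd) for _ in range(n)]

# Fraction of scores falling within 1, 2, and 3 standard deviations of the mean
fractions = {k: sum(1 for s in scores if abs(s - mean) <= k * sd) / n
             for k in (1, 2, 3)}

for k, frac in fractions.items():
    print(f"within {k} SD: {frac:.1%}")   # roughly 68%, 95%, and 99.7%
```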

Significance

You may be wondering how large the effect of a manipulation needs to be in order to be meaningful. To return to our exercise/problem-solving example, how much better does the exercise group have to perform for us to take notice? How much is enough to conclude that exercise is having an effect? In order to determine this, we need to use inferential statistics. Inferential statistics allows us to make judgments about the data we have collected and make inferences about what conclusions might be appropriate.

Significance can be calculated in a number of different ways depending on the type of data we have collected, and calculations are based on the number of participants in our sample, as well as the effect size, or how large the difference was between our experimental group and our control group. For example, if I claimed to have developed a smart drug, then I randomly gave one student the drug and one student a placebo, then told you that the student who took the drug scored 95 on an exam while the placebo student scored 80, you might be intrigued, but you would also realize that with only two students, the odds are too high that the better student just happened to receive the drug. For this reason, we generally want to have as many participants as possible in order to reduce these kinds of coincidences and be more confident in our conclusions. This principle is related to the Law of Large Numbers: as the number of observations grows, sample averages settle closer and closer to the true average, so chance coincidences tend to wash out.

Similarly, if I randomly assigned 100 students to take the drug, and 100 students to take a placebo, then I found that the experimental group's average exam score was 87 while the control group's average was 86, you should still be skeptical, even though technically I found a difference. The problem here is that the effect size is too small. It's probably just a coincidence, because if we take the average score of 100 random students and compare it to the average of another random 100 students, we won't get exactly the same average every time. The fact that the difference is only 1 point means it's not convincing evidence that the drug is having an effect.
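The point about small differences is easy to demonstrate with a simulation. In the sketch below, both "groups" are drawn from the very same population, so any gap between their averages is pure chance (the score scale of mean 85 and SD 10 is hypothetical):

```python
import random

random.seed(1)

def group_difference(n_per_group, mean, sd):
    """Draw two groups from the SAME population (no real effect)
    and return the gap between their average scores."""
    a = [random.gauss(mean, sd) for _ in range(n_per_group)]
    b = [random.gauss(mean, sd) for _ in range(n_per_group)]
    return abs(sum(a) / n_per_group - sum(b) / n_per_group)

# Even with no drug at all, two groups of 100 rarely have identical averages:
diffs = [group_difference(100, 85, 10) for _ in range(1000)]
print(max(diffs))                         # chance alone produces gaps of a few points
print(sum(d >= 1 for d in diffs) / 1000)  # fraction of runs with a 1-point gap or more
```

With group averages wobbling by a point or two entirely on their own, a 1-point "drug effect" looks exactly like no-effect noise.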

In calculating significance we come up with a p-value. You can think of a p-value as telling you how likely your data would be to occur by chance alone. We want to collect data that is unlikely to “just happen” on its own. For example, imagine I told you that I could mentally control a fair coin so that it always lands on heads, and you want to test this. In testing me, you wouldn't be satisfied with a single coin flip landing on heads, because you know that a single heads is fairly likely to happen anyway, so the p-value would be high. If you flipped the coin 1,000 times and every flip was heads, this would be very unlikely to occur on its own, so you might start thinking that this wasn't just chance, and in this case the p-value would be low.

Data that is unlikely to have occurred by random chance suggests that we probably have a real effect, and so a low p-value is a good thing. Usually we want a p-value below 0.05; when the p-value falls below this threshold we say that the results are statistically significant. Because we can never completely eliminate the possibility of our data being a chance occurrence (even 1,000 identical flips could happen by chance), we will never have a p-value of 0.

If our p-value is 0.05, this means that if there were actually no real effect, data like ours would still show up by chance 5% of the time. To be clear, the p-value doesn't tell us the probability that our hypothesis is correct; it tells us how likely it is that we would observe data like ours by chance alone. In the above example, a p-value wouldn't tell us how or why this event is occurring, but it would tell us that it's a very, very unlikely event.
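The coin-flip example can be made exact with a little arithmetic, since the probability of any particular run of heads from a fair coin is easy to compute:

```python
from math import comb

def p_value_all_heads(n):
    """Probability that a fair coin lands heads on all n flips."""
    return 0.5 ** n

def p_value_at_least(k, n):
    """Probability of getting k or more heads in n fair flips
    (an exact one-tailed binomial calculation)."""
    return sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n

print(p_value_all_heads(1))       # 0.5: a single heads is unremarkable
print(p_value_all_heads(5))       # 0.03125: five in a row is already below 0.05
print(p_value_at_least(60, 100))  # about 0.028 for 60+ heads in 100 flips
```

Five heads in a row is already "statistically significant" by the usual 0.05 cutoff, a good reminder that a p-value measures how surprising the data would be under chance, not how important the effect is.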

Drawing Conclusions from Research

Before drawing conclusions from a study, we need to check that everything has been done correctly. This is known as assessing the internal validity of a study. Think of internal validity as a checklist that we can mentally go through in order to decide whether the conclusions of a study are appropriate. Here are the questions that we should ask when considering the internal validity of a study:

Was the independent variable effectively manipulated?

Were participants randomly assigned to experimental/control groups?

Was the dependent variable measured in an unbiased way?

Was a reliable pattern found between the manipulation of the independent variable and the measurement of the dependent variable?

As you should guess from reading all of the preceding material in this chapter, these questions may not always be answered with a simple yes or no. These can be sources of contentious debate, and this is a good thing. The presentation of data should be followed by spirited argument, alternative explanations, and critique of the process by which the data was collected and analyzed. The truth is that there is no perfect study. There is no one way to define a property, manipulate a variable, or measure an effect. We must humbly recognize the potential for flaws in all studies, and remain skeptical of all conclusions as we gather more data, consider new operational definitions, novel manipulations, and better measurements.

We need to remember that every study will always be limited in the conclusions that can be drawn. We cannot conclude causality based on a single result, or even a pile of results. Each study's conclusions must be limited to the variables as defined in the study and also limited to the sample studied.

If I defined happiness as smiling behavior, my conclusions must only be about smiling. If I only studied teenagers at a public school, I must limit my conclusions to teenagers. I can't draw conclusions about happiness in general or how the data might reflect all people. While the journalistic accounts of research may do this with flashy stories and click-bait headlines, scientists must restrict their writing to the specific evidence collected.

Since every study has its limitations, we shouldn't settle for a single study as conclusive evidence. Instead, we may use triangulation, a technique of examining a subject from multiple angles in order to get a more complete picture.

Another important part of the process of drawing conclusions is replicationreplicationRepeating a study to determine whether its results can be reproduced — a cornerstone of scientific credibility., which involves repeating an experimental design. Generally this would be done using the same operational definitions and manipulation, but with a new sample of participants. Using the same techniques allows researchers to compare results directly. This is also why precise operational definitions are so important for collecting data. We want to be able to repeat the process as closely as possible, which will also allow us to assess the reliabilityreliabilityThe consistency of a measure — a reliable measure produces similar results across repeated administrations. of a particular claim.

Even when we accept the internal validity of a study, we may wonder whether the results are actually applicable to real life. This is known as external validity or ecological validity. This considers whether the variables or manipulation in a study represent normal or typical ways we might see them in everyday life. Quite often we can criticize a study for artificiality, or a lack of external validity, because the lab environment doesn't recreate a realistic situation. Naturalistic observationNaturalistic observationObserving and recording behavior in its natural setting without intervention or manipulation. (surreptitiously observing and measuring real-life behaviors) and field experiments (manipulating a variable and observing responses in a real-life setting) can be great approaches for high external validity, but both of these reduce the researcher's ability to control possible confounding variables. The laboratory environment can be more carefully controlled, but this control may reduce the realistic nature of the tasks involved.

The Ethics of Research

In addition to all of our concerns about reliability and validity, we also need to consider ethical guidelines that determine how we collect our data. The APA (American Psychological Association) has established ethical guidelines that must be followed for all studies. Before being conducted, proposed studies must go through an Institutional Review BoardInstitutional Review BoardA committee that reviews research proposals to ensure ethical treatment of human and animal participants. (IRB), which will assess any ethical considerations before approval. It's worth noting, however, that a great deal of private research, especially consumer research, is not overseen by ethics committees. So these guidelines may not apply to focus groups, a survey you found on Facebook, or other research conducted outside of the university environment.

In general, the following guidelines must be followed:

Informed ConsentInformed ConsentThe ethical requirement that participants be told enough about a study to voluntarily decide whether to participate.

Freedom from Coercion

Protection from Harm

Risk/Benefit Analysis

Anonymity and/or Confidentiality

Informed consent means that potential participants are given enough information about the study to determine whether or not they want to participate. Informed consent does not, however, mean that the participant must remain in the study. Studies should be voluntary at all times and participants should have the freedom to stop at any point. Studies using children as participants must get informed consent from their legal guardians.

Freedom from coercion means that researchers cannot force participants to be in a study. While this obviously would apply to physical coercion, it also applies to other types of coercion. Offering higher grades, large sums of money, or other rewards for participation may cause participants to agree to tasks that they would otherwise reject. While researchers can offer to pay participants for their time, they cannot offer rewards or payments that might be considered coercive or else participation wouldn't be truly voluntary.

Protection from harm applies to both physical harm and mental harm. Participants should not be placed in situations of extreme stress or have to deal with situations that may have lasting psychological effects. Researchers may ask participants to take small risks, such as answering potentially embarrassing questions, performing cognitively demanding tasks, or even braving minor pain (such as a mild electric shock or submerging a hand in ice water). In order for these risks to be approved, however, there must be a clear benefit. Researchers can't just run around giving out electric shocks for no particular reason. The risks must be balanced with the benefits and improved knowledge the results can potentially uncover.

Ideally, the data collected from participants will be anonymous, meaning that responses will not be connected to the participant's name or identity. This can easily be done for some tasks, especially those involving computer data collection and surveys. In some situations, however, such as face-to-face interviews, it's not possible to have true anonymity. In these cases, it's important for the data to be confidential, meaning that researchers will not share a participant's responses with others.

Finally, researchers must give a debriefingdebriefingExplaining the true purpose and procedures of a study to participants after it is completed. to all participants. A debriefing is a summary of the purpose of the study and it should reveal any deception that was used. It should also attempt to undo any changes in the participant (if a study intentionally did something to make you frustrated, the debriefing should attempt to restore you to your previous emotional state).

A Note About Animal Studies

When it comes to collecting data from animal studies, not all of these guidelines can be followed. Naturally, animals cannot give consent to participate, and as a result, they can be forced to participate in research. Protection from harm, however, is still an important feature. The risk of harm allowed in animal studies is greater than in human studies; however, this risk must have a clear possible benefit. So while it may be acceptable for a study to risk animal harm or even death, the researchers must first demonstrate that the results of the study will have a clear benefit to humans. In addition, the harm must be the minimal amount necessary to receive the benefit and there are still strict standards of care for the animals which must be followed.

While it may be unpleasant to consider these types of animal studies, it's important to remember that these represent a very small number of studies. We should also remember the great advances and benefits of this type of research. Knowledge gained from animal studies has helped to save the lives of millions of people and has also served to improve the lives of other animals. Better understanding of disease, injury, and genetics can help to reduce the overall suffering in the world but sometimes this understanding comes at a cost. We should also be careful to avoid hypocrisy, as most of us implicitly accept the notion that some other organisms may occasionally suffer for human benefit whenever we eat meat, destroy habitats for human use (including farmland), or simply apply antibacterial gel to our hands.

Chapter Summary

Key takeaways — Chapter 2
  • Conducting research in psychology requires careful methods for collecting and analyzing data due to the complexity, reactivity, and variability of responses from human subjects.
  • Psychologists must create clear operational definitions in order to investigate properties which cannot be directly measured. Care must be taken in designing research to avoid potential bias and demand characteristics.
  • Data can be collected in a variety of ways including correlational studies, surveys, case studies, and experiments.
  • Experiments can provide evidence for cause-and-effect relationships by using random assignment, manipulating an independent variable, and measuring effects on a dependent variable.
  • Data collected from studies can be analyzed using descriptive statistics such as measures of central tendency and standard deviation, and inferential statistics such as tests for significance which help us to draw conclusions about the data collected.
  • Psychology has also adopted standards of ethics and approval is required before studies can be conducted. These standards include informed consent, freedom from coercion, protection from harm, awareness of risks, anonymity/confidentiality, and debriefing.
