In Psychology And Other Social Sciences, Many Studies Fail The Reproducibility Test

Aug 27, 2018
Originally published on August 27, 2018 6:07 pm

The world of social science got a rude awakening a few years ago, when researchers concluded that many studies in this area appeared to be deeply flawed. Two-thirds could not be replicated in other labs.

Some of those same researchers now report those problems still frequently crop up, even in the most prestigious scientific journals.

But their study, published Monday in Nature Human Behaviour, also finds that social scientists can actually sniff out the dubious results with remarkable skill.

First, the findings. Brian Nosek, a psychology researcher at the University of Virginia and the executive director of the Center for Open Science, decided to focus on social science studies published in the most prominent journals, Science and Nature.

"Some people have hypothesized that, because they're the most prominent outlets they'd have the highest rigor," Nosek says. "Others have hypothesized that the most prestigious outlets are also the ones that are most likely to select for very 'sexy' findings, and so may be actually less reproducible."

To find out, he worked with scientists around the world to see if they could reproduce the results of key experiments from 21 studies in Science and Nature, typically psychology experiments involving students as subjects. The new studies on average recruited five times as many volunteers, in order to come up with results that were less likely due to chance.
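
To get a feel for why bigger samples matter, here is a minimal sketch in Python using the statsmodels library. The effect size and group sizes are illustrative assumptions, not figures from the paper; the point is simply that, for a fixed true effect, quintupling the sample sharply raises the chance of detecting it, so a positive result is less likely to be a fluke.

```python
# A minimal sketch of why larger samples matter, using statsmodels.
# The effect size (0.4) and original group size (50) are illustrative
# assumptions, not figures from the Nature Human Behaviour paper.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.4               # hypothetical standardized effect (Cohen's d)
n_original = 50                 # hypothetical participants per group originally
n_replication = 5 * n_original  # replications recruited ~5x as many volunteers

# Solve for statistical power (probability of detecting a real effect)
power_original = analysis.solve_power(effect_size=effect_size,
                                      nobs1=n_original, alpha=0.05)
power_replication = analysis.solve_power(effect_size=effect_size,
                                         nobs1=n_replication, alpha=0.05)

print(f"Power with n={n_original} per group:  {power_original:.2f}")
print(f"Power with n={n_replication} per group: {power_replication:.2f}")
```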

The results were better than the average of a previous review of the psychology literature, but still far from perfect. Of the 21 studies, the experimenters were able to reproduce 13. And the effects they saw were on average only about half as strong as had been trumpeted in the original studies.

The remaining eight were not reproduced.

"A substantial portion of the literature is reproducible," Nosek concludes. "We are getting evidence that someone can independently replicate [these findings]. And there is a surprising number [of studies] that fail to replicate."

One of the eight studies that failed this test came from the lab of Will Gervais, when he was getting his PhD at the University of British Columbia. He and a colleague had run a series of experiments to see whether people who are more analytical are less likely to hold religious beliefs. In one test, undergraduates looked at pictures of statues.

"Half of our participants looked at a picture of the sculpture, 'The Thinker,' where here's this guy engaged in deep reflective thought," Gervais says. "And in our control condition, they'd look at the famous stature of a guy throwing a discus."

People who saw The Thinker, a sculpture by Auguste Rodin, expressed more religious disbelief, Gervais reported in Science. And given all the evidence from his lab and others, he says there's still reasonable evidence that the underlying conclusion is true. But he recognizes the sculpture experiment was really quite weak.

"Our study, in hindsight, was outright silly," says Gervais, who is now an assistant professor at the University of Kentucky.

A previous study also failed to replicate his experimental findings, so the new analysis is hardly a surprise.

But what interests him the most in the new reproducibility study is that scientists had predicted that his study, along with the seven others that failed to replicate, was unlikely to stand up to the challenge.

As part of the reproducibility study, about 200 social scientists were asked to predict which results would stand up to the re-test and which would not. They filled out a survey in which they picked the winners and losers, and they also took part in a "prediction market," where they could buy or sell tokens that represented their views.
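
The study's exact market rules aren't described here, but the general idea can be sketched with a standard automated market maker, the logarithmic market scoring rule (LMSR). Everything in the snippet below, from the liquidity parameter to the trade size, is hypothetical; it only illustrates how buying and selling tokens nudges an implied probability that a study will replicate.

```python
import math

# Rough illustration of a prediction market via a logarithmic market
# scoring rule (LMSR). The actual market design in the replication study
# may differ; the liquidity parameter and trades below are hypothetical.

B = 10.0  # liquidity parameter (hypothetical)

def cost(q_yes: float, q_no: float) -> float:
    """LMSR cost function over outstanding 'replicates' / 'does not' shares."""
    return B * math.log(math.exp(q_yes / B) + math.exp(q_no / B))

def price_yes(q_yes: float, q_no: float) -> float:
    """Implied probability that the study replicates."""
    e_yes = math.exp(q_yes / B)
    return e_yes / (e_yes + math.exp(q_no / B))

q_yes = q_no = 0.0  # market opens at 50/50
print(f"Opening price: {price_yes(q_yes, q_no):.2f}")

# A trader who doubts the finding buys 5 'does not replicate' shares.
trade = 5.0
paid = cost(q_yes, q_no + trade) - cost(q_yes, q_no)
q_no += trade
print(f"Trader pays {paid:.2f} tokens; "
      f"new implied probability of replication: {price_yes(q_yes, q_no):.2f}")
```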

"They're taking bets with each other, against us," says Anna Dreber, an economics professor at the Stockholm School of Economics, and coauthor of the new study.

It turns out, "these researchers were very good at predicting which studies would replicate," she says. "I think that's great news for science."

These forecasts could help accelerate the process of science. If you can get panels of experts to weigh in on exciting new results, the field might be able to spend less time chasing errant results known as false positives.

"A false positive result can make other researchers, and the original researcher, spend lots of time and energy and money on results that turn out not to hold," she says. "And that's kind of wasteful for resources and inefficient, so the sooner we find out that a result doesn't hold, the better."

But if social scientists were really good at identifying flawed studies, why did the editors and peer reviewers at Science and Nature let these eight questionable studies through their review process?

"The likelihood that a finding will replicate or not is one part of what a reviewer would consider," says Nosek. "But other things might influence the decision to publish. It may be that this finding isn't likely to be true, but if it is true, it is super important, so we do want to publish it because we want to get it into the conversation."

Nosek recognizes that, even though the new studies were more rigorous than the ones they attempted to replicate, that doesn't guarantee that the old studies are wrong and the new studies are right. No single scientific study gives a definitive answer.

Forecasting could be a powerful tool in accelerating that quest for the truth.

That may not work, however, in one area where the stakes are very high: medical research, where answers can have life-or-death consequences.

Jonathan Kimmelman at McGill University, who was not involved in the new study, says when he's asked medical researchers to make predictions about studies, the forecasts have generally flopped.

"That's probably not a skill that's widespread in medicine," he says. It's possible that the social scientists selected to make the forecasts in the latest study have deep skills in analyzing data and statistics, and their knowledge of the psychological subject matter is less important.

And forecasting is just one tool that could be used to improve the rigor of social science.

"The social-behavioral sciences are in the midst of a reformation," says Nosek. Scientists are increasingly taking steps to increase transparency, so that potential problems surface quickly. Scientists are increasingly announcing in advance the hypothesis they are testing; they are making their data and computer code available so their peers can evaluate and check their results.

Perhaps most important, some scientists are coming to realize that they are better off doing fewer studies, but with more experimental subjects, to reduce the possibility of a chance finding.

"The way to get ahead and get a job and get tenure is to publish lots and lots of papers," says Gervais. "And it's hard to do that if you are able run fewer studies, but in the end I think that's the way to go — to slow down our science and be more rigorous up front."

Gervais says when he started his first faculty job, at the University of Kentucky, he sat down with his department chair and said he was going to follow this path of publishing fewer, but higher quality studies. He says he got the nod to do that. He sees it as part of a broader cultural change in social science that's aiming to make the field more robust.

You can reach Richard Harris at rharris@npr.org.

Copyright 2018 NPR. To see more, visit http://www.npr.org/.

ARI SHAPIRO, HOST:

The world of social science got a rude awakening a few years ago when researchers concluded that many of the studies in this area appear to have deep flaws. Some of those same researchers now report that it's a problem even for the most prestigious scientific journals. But their study also finds that some scientists are surprisingly good at anticipating which studies are likely to stand the test of time. NPR's Richard Harris reports.

RICHARD HARRIS, BYLINE: Science is a process of exploring the unknown. So Brian Nosek at the Center for Open Science and the University of Virginia says we should not expect every result to be repeatable in someone else's lab. The challenge is separating the good from the bad.

BRIAN NOSEK: A substantial portion of the literature is reproducible. We are getting evidence that someone can independently replicate. And there is a surprising number that fail to replicate.

HARRIS: Nosek wanted to see how that plays out in the journals where scientists often take their flashiest and most provocative findings, Science and Nature. Nosek and his far-flung colleagues now report that of the 21 social science papers published in those journals over a recent five-year span, 13 checked out, and eight apparently did not.

One of the eight studies that failed this test came from the lab of Will Gervais. He and a colleague at the University of British Columbia ran a series of experiments to see whether people who are more analytical are less likely to hold religious beliefs. In one test, undergraduates looked at pictures of statues.

WILL GERVAIS: So half of our participants looked at a picture of the sculpture "The Thinker" where you know, here's this guy engaged in deep, reflective thought. And in our control condition, they'd look at the famous statue of a guy throwing a discus.

HARRIS: People who saw The Thinker expressed more religious disbelief. But Gervais, now on the faculty of the University of Kentucky, recognizes that his experiment was really quite weak.

GERVAIS: Our study in hindsight was outright silly.

HARRIS: But what interested him most in the new study involved a twist. Several hundred social scientists were asked in advance to predict which studies would pan out and which ones wouldn't.

ANNA DREBER: They're taking bets with each other against us.

HARRIS: Anna Dreber is at the Stockholm School of Economics and a co-author of the new analysis, which is published in Nature Human Behaviour. She says those forecasts were spot on.

DREBER: So these researchers were very good at predicting which studies would replicate. So I think that's great news for science.

HARRIS: Dreber says if you can get panels of experts to weigh in on exciting new results, the field might be able to spend less time chasing faulty conclusions known as false positives.

DREBER: A false positive result could make other researchers and the original researcher spend lots of time and energy and money on results that turn out to not hold. And that's kind of wasteful for resources and inefficient. So the sooner we find out that a result doesn't hold, the better.

HARRIS: This is a very intriguing idea, but Jonathan Kimmelman at McGill University says it may be limited to a group of scientists with particular skills. When he's asked medical researchers to make predictions about studies, the forecasts have generally flopped.

JONATHAN KIMMELMAN: That's probably not a skill that is widespread in medicine.

HARRIS: Better than detecting problems in already completed research, though, is preventing them. Scientists like Will Gervais are thinking hard about the incentives that encourage them to do weak, small studies in the first place.

GERVAIS: The way to get ahead and get a job and get tenure is by publishing lots and lots of papers, and it's hard to do that if you're able to run fewer studies. But in the end, I think that's the better way to go - is to kind of slow down our science and be more rigorous up front.

HARRIS: He says that's the approach he is taking. And he sees it as part of a broader cultural change in social science that's aiming to make the field more robust. Richard Harris, NPR News. Transcript provided by NPR, Copyright NPR.
