The Reformation: Can Social Scientists Save Themselves?

Academic hoaxes have a way of crystallizing, and then shattering, the intellectual pretensions of an era. It was almost 20 years ago, for instance, that a physicist named Alan Sokal laid siege to postmodern theory with a Trojan horse. You may remember the details: Sokal wrote a deliberately preposterous academic paper called “Transgressing the Boundaries: Toward a Transformative Hermeneutics of Quantum Gravity.” He filled it with the then trendy jargon of “critical theory,” and submitted it to a prominent journal of cultural studies called Social Text. Amid worshipful citations of postmodern theorists and half-baked references to complex scientific work, the paper advanced a succession of glib, sweeping assertions (“Physical ‘reality,’ no less than social ‘reality,’ is at bottom a social and linguistic construct”). Social Text published it without demanding any significant editorial changes.

When Sokal revealed that his paper was a practical joke, the media went wild—or as wild, at least, as the media has ever gone over an academic prank. By successfully aping the methods and conventions of postmodern cultural analysis, and using them to serve intentionally ridiculous ends, Sokal had, for many in the public, exposed once and for all how unsound those methods and conventions were.

Two decades later, abstruse postmodern theory is passé, thanks in no small part to the embarrassment that Sokal’s hoax inflicted on it. Today’s intellectual fashions tend instead toward the empirical, the data-driven, and the breezily counterintuitive. Psychology, probably more than any other field, has risen to prominence in this era—expanding its purview, within academia, into other disciplines like economics, philosophy, and law; influencing policymakers; and spawning countless bestsellers like Blink, Nudge, and The Power of Habit. Speaking with the imprimatur of objective science, experimental psychologists have even begun to assume, in the popular imagination, the sort of introspective tasks that are usually assigned to the humanities: The work of explaining what it means to be human.

But experimental science, it turns out, is no less susceptible to a good, thorough hoaxing than postmodern blather was.

The prank announced itself at the outset: In 2011, a psychologist named Joseph P. Simmons and two colleagues set out to use real experimental data to prove an impossible hypothesis. Not merely improbable or surprising, but downright ridiculous. The hypothesis: that listening to The Beatles’ “When I’m Sixty-Four” makes people younger. The method: Recruit a small sample of undergraduates to listen to either The Beatles song or one of two other tracks, then administer a questionnaire asking for a number of random and irrelevant facts and opinions—their parents’ ages, their restaurant preferences, the name of a Canadian football quarterback, and so on. The result: By strategically arranging their data and carefully wording their findings, the psychologists “proved” that randomly selected people who hear “When I’m Sixty-Four” are, in fact, younger than people who don’t.

The statistical sleight of hand involved in arriving at this result is a little complicated (more on this later), but the authors’ point was relatively simple. They wanted to draw attention to a glaring problem with modern scientific protocol: Between the laboratory and the published study lies a gap that must be bridged by the laborious process of data analysis. As Simmons and his co-authors showed, this process is a virtual black box that, as currently constructed, “allows presenting anything as significant.” And if you can prove anything you want from your data, what, if anything, do you really know?

The Simmons paper, which appeared in Psychological Science (a journal produced by SAGE publications, which also—full disclosure—supports this magazine), was not an isolated warning sign of trouble in the field.

For the last several years, a crisis of self-confidence has been brewing in the world of experimental social science, and in psychology especially. Amid a flurry of retracted papers, prominent researchers have resigned their posts, including Marc Hauser, a star evolutionary psychologist at Harvard and acclaimed author, and Diederik Stapel, a Dutch psychologist who admitted that many of his eye-catching results were based on data he made up. And in 2012, the credibility of a number of high-profile findings in the hot area of “priming”—a phenomenon in which exposure to verbal or visual clues unconsciously affects behavior—was called into question when researchers were unable to replicate them. These failures prompted Daniel Kahneman, a Nobel Prize–winning psychologist at Princeton, founding father of behavioral economics, and best-selling author of Thinking, Fast and Slow, to warn in an email to colleagues of an impending “train wreck” in social psychology.

Another recent incident was unsettling precisely because it could not simply be dismissed as deviant. Around the same time that Simmons published his tour de force, a paper by the respected Cornell psychologist Daryl Bem claimed to have found evidence that some people can react to events that are about to occur in the near future—a finding as ludicrous-sounding as Simmons’, but one that has been presented by its author as completely legitimate. Bem’s paper set off a frenzy of efforts within the field to debunk his findings. His colleagues’ concern wasn’t just that his paper seemed unbelievable, but that it threatened the whole enterprise of social psychology. After all, if you can follow all the methods and protocols of science and end up with an impossible result, perhaps there is something wrong with those methods and protocols in the first place.

Something unprecedented has occurred in the last couple of decades in the social sciences. Overlaid on the usual academic incentives of tenure, advancement, grants, and prizes are the glittering rewards of celebrity, best-selling books, magazine profiles, TED talks, and TV appearances. A whole industry has grown up around marketing the surprising-yet-oddly-intuitive findings of social psychology, behavioral economics, and related fields. The success of authors who popularize academic work—Malcolm Gladwell, the Freakonomics guys, and the now-disgraced Jonah Lehrer—has stoked an enormous appetite for usable wisdom from the social sciences. And the whole ecosystem feeds on new, dramatic findings from the lab. “We are living in an age that glorifies the single study,” says Nina Strohminger, a Duke post-doc in social psychology. “It’s a folly perpetuated not just by scientists, but by academic journals, the media, granting agencies—we’re all complicit in this hunger for fast, definitive answers.”

But there’s an important difference between Sokal’s ambush of postmodernism and Simmons’ prank on psychology. Unlike Sokal’s attack, the current critique of experimental social science is coming mainly from the inside. Strohminger, Simmons, and a handful of other mostly young researchers are at the heart of a kind of reform movement in their field. Together with a loose confederation of crusading journal editors and whistle-blowing bloggers, they have begun policing the world of experimental research, assiduously rooting out fraud and error, as if to rescue the scientific method from embarrassment—and from its own success.

THERE’S NOTHING NEW ABOUT scientists who fudge, or even fabricate, their results. Whole books have been written about whether Gregor Mendel tweaked the measurements of his plants to make them better fit with his theory. When attempting to fit the irregular polygon of Nature into the square hole of Theory, all researchers face a strong temptation to lop off the messy corners. Imagine that you’re going along, accumulating data points that fall into a beautiful line across the graph, and all of a sudden some dog stands there like a dummy, refusing to salivate. You ring the bell again, louder—nothing. Is he mentally defective? Is he deaf? What will you trust: the theory you have spent years developing, or the dog? (This is not to cast aspersions on one of the great pioneers of experimental psychology. As far as we know, Pavlov’s dogs really did do what he said they did.)

Then again, maybe there is no dog at all. For scientists in a real hurry to establish themselves, the quickest way to go from arresting hypothesis to eye-catching publication is to skip the research altogether and just make up results.

A few years ago, the psychologist Diederik Stapel set out to prove that people in a messy environment are more likely to engage in racial stereotyping than people in a tidy one. To test this notion, Stapel proposed to corral commuters in a Utrecht train station under two different experimental conditions—when the station was unkempt and when it was clean. He would gather the commuters in a waiting area under the pretext of wanting to interview them, and surreptitiously take note of how near or far they chose to sit from a confederate of either the same or a different race.

That was the method he reported, anyway, in a notable paper titled “Coping With Chaos,” which appeared in 2011 in Science, arguably the most prestigious journal in the world. The paper found strong evidence for Stapel’s hunch—but not because of any commuters in Utrecht. It’s not clear that Stapel ever performed his experiment in the train station. An investigation that same year by a panel of Dutch professors concluded that the paper was tainted by “fabrication of data.”

“Coping With Chaos” was one of dozens of papers by Stapel that have been retracted or withdrawn in the three years since he was confronted by colleagues suspicious of his uncanny string of successes. In his mea culpa, Stapel confessed that he had made up many of his results. He told the New York Times he was led astray by “a quest for aesthetics, for beauty—instead of the truth,” and also by ambition. To actually run the experiment would have been to run the risk that a disordered environment has no effect on racial attitudes—a “null finding” that would mean he’d wasted his time, because journals aren’t usually interested in experiments that don’t prove something new.

After Stapel was outed, psychologists realized that the clues to his deceptions had been there all along. With the benefit of hindsight, researchers found copious evidence of fabrication in the fine print of his charts and tables.

This is, in fact, heartening: It confirms that inventing plausible results isn’t easy. In effect, to do so you have to reverse-engineer a bunch of raw data that conform to what you want to find, while introducing enough variation to seem convincing.

Uri Simonsohn, one of the co-authors of the “When I’m Sixty-Four” paper, has spent the past few years figuring out how this is done, and how it can be detected. One day in 2011, Simonsohn came across a paper by Lawrence Sanna, then a psychologist at the University of North Carolina. Sanna was doing the kind of pop research that seems ready-made for a TED talk: In a series of lab experiments, he found that human subjects who were physically elevated above their surroundings evinced more “pro-social” behavior than those at a lower level, literally embodying the idea of “high-mindedness.”

Sanna’s method was distantly descended from that of the famous experiments by Stanley Milgram, who instructed subjects to administer fake electric shocks to confederates as a way of showing how obedience to authority can lead ordinary people to commit cruelty. In lieu of electric jolts, Sanna measured his subjects’ willingness to torment a confederate with hot sauce. Specifically, he asked subjects to ladle out portions of a painful concoction (five parts Heinz chili sauce to three parts Tapatío salsa picante) for a second person to consume; the amount they doled out was considered a measure of either aggression or sympathy. The experiment took place in a theater, and subjects were assigned randomly to either the stage or the orchestra pit. As Sanna’s hypothesis predicted, the subjects onstage doled out less than half as much hot sauce as those in the pit.

What caught Simonsohn’s eye was “a troubling anomaly” in one of Sanna’s tables: The standard deviation, a measure of variability, was almost identical in the results from the two groups. In other words, although the dosages of hot sauce administered by the “high” and “low” groups differed greatly on average, the way the results within each group were distributed was extraordinarily similar—almost as if someone had been making them up. After Simonsohn had an exchange with Sanna, the paper, and at least six others by him, were retracted. Sanna, who had since moved to the University of Michigan, resigned.

OUTRIGHT FAKERY IS CLEARLY more common in psychology and other sciences than we’d like to believe. But it may not be the biggest threat to their credibility. As the journalist Michael Kinsley once said of wrongdoing in Washington, so too in the lab: “The scandal is what’s legal.” The kind of manipulation that went into the “When I’m Sixty-Four” paper, for instance, is “nearly universally common,” Simonsohn says. It is called “p-hacking,” or, more colorfully, “torturing the data until it confesses.”

P is a central concept in statistics: It’s the mathematical factor that mediates between what happens in the laboratory and what happens in the real world. The most common form of statistical analysis proceeds by a kind of backwards logic: Technically, the researcher is trying to disprove the “null hypothesis,” the assumption that the condition under investigation actually makes no difference. In Sanna’s experiment, the null hypothesis is that elevation has no effect on how much hot sauce people dole out. If that is actually what the data shows, then the experiment is over, the null hypothesis wins, and the researcher can forget about going on The Daily Show. But in practice things usually aren’t quite as clear-cut. Due to random fluctuation, there is bound to be some difference between groups of subjects exposed to different experimental conditions. Intuitively, a large difference is more significant than a small one, but how large is large enough? Roughly speaking, the p value measures the probability that nature has thrown you a curve, yielding by chance a meaningless result. (Technically, it measures the probability that, assuming the null hypothesis is true, an effect at least as large as the one you claim to have detected could have been the result of chance.)

By convention, the standard for statistical significance in psychology—enforced by the editors of professional journals in deciding whether to accept a paper—is a p value of less than five percent (p < 0.05). In other words, if there were truly no effect and you ran the experiment 100 times, you would expect a difference this large to turn up by chance in fewer than five of them. There is some mathematical justification for choosing this value, but it is, for the most part, arbitrary. (Many of the life sciences adhere to a similar standard. But some other disciplines, dealing with entities less variable than human beings or biological systems, would consider 0.05 a very weak standard of significance; the experiment that detected the Higgs boson, for example, claimed a p value of around one in three million.)
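
To make the idea of a p value concrete, here is a minimal sketch in Python of a permutation test, one simple way (not necessarily the one any of these researchers used) to ask how often chance alone would produce a gap as large as the one observed. The function name and the hot-sauce numbers are invented for illustration; they are not data from any study.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p value for a difference in group means.

    Assuming the null hypothesis (group labels make no difference), we
    shuffle the labels many times and count how often the shuffled gap
    is at least as large as the gap actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if gap >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical grams of hot sauce doled out; made up for this example.
stage = [20, 25, 22, 30, 18, 24, 27, 21]
pit = [48, 55, 40, 62, 51, 45, 58, 50]

if __name__ == "__main__":
    print("estimated p value:", permutation_p_value(stage, pit))
```

With groups this cleanly separated, almost no shuffle matches the observed gap, so the estimate falls far below 0.05. Feed the function two identical groups and every shuffle matches, so it returns 1.0.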

As Simmons showed, psychologists who deploy enough statistical sleight of hand can find “significance” in almost any data set. How often do researchers give in to this temptation? One way to roughly answer that question would be to study the distribution of p values over a large sample of papers. If researchers are fiddling with the math to get their results just under the 0.05 threshold, then you might expect to see a cluster of values just below 0.05, rather than the smoother spread of values that chance alone would produce. In an analysis of a year’s worth of papers in three leading psychology journals, two researchers found “a peculiar prevalence of p values just below 0.05.”

How does p-hacking work? One common approach is called “data peeking,” a technique that involves taking advantage of the real-time chance fluctuations that nature throws your way as you’re conducting an experiment. For instance, rather than determining in advance how many subjects you will test, you might pause after every five or 10 or 20 to analyze your data up to that point, and stop when you’ve gotten the results you want. Or maybe the effect you’re looking for shows up in women but not in men, so you only report women’s results. Or you correct for subjects’ height—maybe tall people are less affected by standing on a stage, since they’re used to looking down on everyone else anyway. Or maybe the hot sauce wasn’t hot enough, so you start over with a new batch and declare it a new experiment. In all cases, the old results go in a cabinet that gets emptied into a dumpster when you move offices, illustrating what scientists call “publication bias”—the selective reporting of positive results—or, more colloquially, the “file-drawer effect.”
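
The cost of data peeking can be seen in a small simulation. The sketch below, in Python, runs many imaginary experiments in which the null hypothesis is true by construction, then compares a researcher who tests once at the end with one who checks after every 10 subjects per group and stops at the first "significant" result. The schedule of looks, the function name, and the simplifying use of a z test with a known variance of 1 are all assumptions made for illustration.

```python
import math
import random


def simulate_peeking(n_experiments=2000, looks=(10, 20, 30, 40, 50), seed=1):
    """Return (fixed_rate, peeking_rate) of false positives under the null.

    Both groups are drawn from the same normal distribution, so every
    "significant" difference is a false alarm.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        n = 0
        significant_at_any_look = False
        significant_at_end = False
        for target in looks:
            # Collect more subjects until we reach the next planned look.
            while n < target:
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
                n += 1
            # Two-sample z statistic with known unit variance in each group.
            z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
            significant = abs(z) > 1.96  # two-sided p < 0.05
            significant_at_any_look = significant_at_any_look or significant
            significant_at_end = significant  # last look overwrites earlier ones
        fixed_hits += significant_at_end
        peeking_hits += significant_at_any_look
    return fixed_hits / n_experiments, peeking_hits / n_experiments


if __name__ == "__main__":
    fixed_rate, peeking_rate = simulate_peeking()
    print("test once at the end:", fixed_rate)
    print("stop at first significant look:", peeking_rate)
```

Because the peeking researcher counts a hit at any look, including the last, that false-positive rate can never be lower than the honest one, and in runs like this it comes out well above the nominal five percent.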

This is how Simmons, Simonsohn, and their co-author Leif Nelson were able to show—p < 0.05—that listening to “When I’m Sixty-Four” made a sample of undergraduates younger: To get the results they wanted, they divided their sample of 34 undergraduates into three groups and played for them either the Beatles track, an instrumental called “Kalimba,” or “Hot Potato,” a children’s song by The Wiggles. Thus they could compare results in four different ways: each song matched against one other or all three together. Then the scientists looked at all the answers they collected from their questionnaire. With hundreds of possible comparisons, it was, Simmons says, “highly likely” that they would find at least one pairing that showed just the sort of statistically significant correlation they were fishing for. This turned out to be the group that heard “When I’m Sixty-Four” versus the “Kalimba” group, “adjusted” for the ages of the subjects’ fathers.
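
The arithmetic behind "highly likely" is easy to sketch. If each comparison has a five percent chance of a false positive and the comparisons are independent (a simplifying assumption; the comparisons in the actual paper overlapped and were not independent), the chance that at least one of them comes up "significant" grows quickly with the number of comparisons. The function name below is made up for this example.

```python
def chance_of_false_positive(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent tests is a false positive.

    Each test stays quiet with probability (1 - alpha); all of them stay
    quiet with probability (1 - alpha) ** n, so at least one fires with
    the complement of that.
    """
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    for n in (1, 4, 14, 100):
        print(n, "comparisons:", round(chance_of_false_positive(n), 3))
```

Under these assumptions the odds of a spurious hit pass even money at 14 comparisons and are all but certain at 100.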

Typically, scientists adjust their data to correct for random fluctuations among variables. Suppose, for instance, that across a whole sample of subjects, blondes are, by chance, older than brunettes on average; you “correct” for this by adding an appropriate factor to the ages of the brunettes, and then compare the groups on that basis. This statistical technique, when properly used, can make experimental results more reliable and fair. But improperly used, it can tip results toward a desired conclusion.

Simmons and his colleagues duly reported their adjustment in their mock paper—but they left out any mention of all the other factors that they had tried and discarded. This was all, Simmons emphasizes, within the bounds of what is considered fair play in most psychology departments and journals. Generally the people who do this kind of thing don’t think of themselves as unethical; they may even have faith in their results. As Simmons says: “If I want something to be true, my threshold for believing it is low.”

WHILE IT IS POSSIBLE to detect suspicious patterns in scientific data from a distance, the surest way to find out whether a study’s findings are sound is to do the study all over again. The idea that experiments should be replicable, producing the same results when run under the same conditions, was identified as a defining feature of science by Roger Bacon back in the 13th century. But the replication of previously published results has rarely been a high priority for scientists, who tend to regard it as grunt work. Journal editors yawn at replications. Honors and advancement in science go to those who publish new, startling results, not to those who confirm—or disconfirm—old ones.

As a result, a lot of ideas of possibly questionable reliability have found their way into the scientific canon, even without the malign influence of misconduct or p-hacking. When Daniel Kahneman issued his warning that a “train wreck” was imminent in social psychology, he was referring to signs of trouble in one of the hottest areas of research in recent years, on the influence of unconscious cues on behavior, known as “priming.” The ur-experiment in this area, published in 1996 by Yale psychologist John Bargh, involved giving subjects lists of words to rearrange, then surreptitiously measuring the speed at which they left the room. When the lists contained words, like “bingo” or “Florida,” that primed them to think about aging, they walked more slowly.

It was a phenomenon so simple, intuitive, and provocative that science editors found it irresistible—and suddenly everyone wanted to study it. Yet even as psychologists rushed to find all kinds of related effects (subjects who think about a college professor as they take a quiz perform better than those who imagine a soccer hooligan!), the specific experiment run by Bargh proved maddeningly difficult to replicate. Some attempts worked; others didn’t.

Priming experiments, although superficially simple, are hard to do well. The effects are subtle. Subjects sometimes figure out what’s going on and correct for it. Differences in setting, or the selection of subjects, can confound results. Which in a sense just deepens the epistemological quagmire. “It’s only a scientific truth if it happens in the real world,” Simmons says. “Tell me what I have to do to replicate your findings, write me a recipe, and if you can’t do that—if you can only show this in Princeton undergraduates—then I don’t care anymore.”

This challenge is the idea behind the Reproducibility Project, a scientific collaboration directed by Brian Nosek, 41, a forceful and charismatic research psychologist at the University of Virginia. Nosek’s interest in replications partly began with what he describes as a beautiful experiment. He was testing whether political extremists—often described metaphorically as thinking in black and white—really do see the world that way. He recruited nearly 2,000 volunteers online, sorted them by political stance, and asked them to match two shades of gray. In what he calls a “stunning” result, political moderates were significantly better at this than those on the far right or left. The 60 Minutes segment practically writes itself. For a young researcher like Nosek, and his grad student, Matt Motyl, it was a coup.

Then they made what many would consider a fatal mistake. Even though they had no reason to doubt the result, Nosek and Motyl repeated the experiment—and nature, in its infinite capriciousness, took away what it had just given. “Our immediate reaction was ‘why the #&@! did we do a direct replication?’” they later wrote.

Nosek’s ordeal got him wondering: How would other researchers’ experiments stack up in a similar test? Thus was born the Reproducibility Project. The purpose of the effort is not to identify the flaws in other researchers’ work, Nosek emphasizes, but to arrive at a baseline estimate of how much of what we think we know about human psychology meets Bacon’s elementary test. In one sample of 13 experimental results chosen for replication, 10 were fully reproducible, one had ambiguous results, and two failed to replicate. Both of those were priming studies, one purporting to show that viewing the American flag made voters more inclined to support Republicans, the other that money-related words or symbols made one less sympathetic to the poor.

What happens when a finding isn’t reproducible? Nosek emphasizes that a failed replication doesn’t necessarily invalidate the original result or impugn the honesty of the researchers; it may ultimately refine their point. But if a replication effort—or, more likely, some other form of academic detective work—turns up a failure that is especially glaring, then the paper may be retracted, either by the researcher or the journal that published it: withdrawn, declared inoperative, and in theory purged from the scientific literature.

LEST YOU THINK THAT these problems of fraud, statistical analysis, and replication are merely endemic to the “soft sciences,” think again. Over the past few years, the skepticism surrounding high-profile psychological findings has bled over into other fields, raising awareness generally of scientific misconduct and error. The entire field of biomedical research, for instance, was shaken recently when researchers at the pharmaceutical firm Amgen reported that, in search of new drugs, they had selected 53 promising basic-research papers from leading medical journals and attempted to reproduce the original findings with the same experiments. They failed approximately nine times out of 10.

Simply keeping tabs on mistakes that have already been identified is an enormous undertaking in science. Until recently there was no central repository of scientific error, no systematic way of tracking retractions and keeping mistakes from contaminating future research. Since 2010, the Retraction Watch website has been cataloging and reviewing retractions as they are published across thousands of journals, undertaking the Sisyphean task of digging good science out from under the avalanche of error in papers that pour forth at the rate of more than one million a year. The site’s co-founders—Ivan Oransky, the editorial director of MedPage Today, and Adam Marcus, the editor of Anesthesiology News—praise prompt, full, and honest disclosures, and heap scorn on weaselly retractions that seem meant to protect researchers’ (and editors’) reputations.

Retraction Watch can make for disconcerting reading. It discloses a panoply of errors ranging from the obscure and technical (“Female C57BL/6 mice were at 4-6 weeks, not 6-8 weeks of age”), to enhanced and misleading images, to outright faking of data. Plagiarism is surprisingly common, considering how easy it now is to catch. Authors fake the names of collaborators to make their submissions more impressive. They fail to disclose conflicts of interest. They neglect to clear their experiments with their university’s Institutional Review Board, which rules on the ethics of experiments involving human or animal subjects. This, says Oransky, is sometimes a clue that no such experiment actually took place.

In highly specialized fields, journal editors sometimes ask authors to suggest their own peer reviewers. This, as Retraction Watch somewhat gleefully reported in 2012, left the door open for one enterprising author to review his own papers, using the names of real scientists—but corresponding with his editors via invented email addresses that he controlled. An article in a mathematics journal by one M. Sivasubramanian was retracted in 2011 for making “unsubstantiated claims regarding Euclid’s parallel postulate.” The paper is gibberish, but purports to overturn 2,500 years of mathematics in just two pages, including 13 footnotes referencing exclusively the previous work of, yes, M. Sivasubramanian. In their retraction, the editors called this “a severe abuse of the scientific publishing system”—one that they, of course, had helped perpetrate.

In defense of journals, Oransky notes that most of them rely heavily on the volunteer labor of editors and reviewers taking time away from their own research. “There are 1.4 million papers published worldwide” each year, he points out. “Who is going to police all these authors?” He has little sympathy to spare, however, for the perpetrators. “Until recently,” Oransky says, “we operated in the belief that most retractions were the result of inadvertent errors, not intentional misconduct.” He now considers this an instance of misplaced complacency. But just as remarkable is the apparent complacency of the perpetrators, who seem to have proceeded, in many cases, as if they were reasonably sure that no one was watching. Authors bent the rules in full awareness of what they were doing, and with reasonable confidence they would never be caught. But now that’s changing—thanks to Oransky, Simonsohn, and others.

AS A WAY TO hold scientists accountable, Retraction Watch is such an obvious idea it’s a wonder it took until 2010 for it to come into existence. Simmons, the lead author of the “When I’m Sixty-Four” paper, has also proposed a number of reforms designed to make research more transparent and straightforward. Experimenters should decide in advance and report in the paper how many subjects they plan to test, and stop at that point, rather than keep going in the hope that the data will improve. They should describe all the statistical manipulations they performed on the data, and what the results would have been without them. And they should disclose the rationale behind their sample sizes. (Simmons at one point recommended requiring a minimum sample size of 20, but has since walked back from that notion.)

Sample size is a touchy topic in psychology, because undergraduate subjects often expect to be compensated, and researchers must pay them from grants that are overseen by the tight-fisted guardians of research funding. But because larger sample sizes increase the statistical power of an experiment, Simmons now tries for at least 50 subjects in his own research. “The brutal truth is that reality is indifferent to your difficulty in finding enough subjects,” he says. “It’s like astronomy: To study things that are small and distant in the sky you need a huge telescope. If you only have access to a few subjects, you need to study bigger effects, and maybe that wouldn’t be such a bad thing.”
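
The telescope analogy can be put in numbers. The Python sketch below computes the chance that a two-group experiment detects a real effect, under assumptions chosen purely for illustration: a two-sided z test at p < 0.05, a known standard deviation of 1 in each group, and a hypothetical true difference of half a standard deviation. Treating 20 and 50 as subjects per group is also an assumption of this sketch, not something the article specifies.

```python
import math
from statistics import NormalDist


def power_two_group_z_test(n_per_group, effect_size, alpha=0.05):
    """Chance of a significant result when the true effect is effect_size.

    effect_size is the true difference in means, in standard deviations.
    The test statistic is normal with mean effect_size * sqrt(n / 2).
    """
    normal = NormalDist()
    critical = normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value.
    return (1 - normal.cdf(critical - shift)) + normal.cdf(-critical - shift)


if __name__ == "__main__":
    for n in (20, 50):
        print(n, "per group:", round(power_two_group_z_test(n, 0.5), 2))
```

Under these assumptions, 20 subjects per group would miss a real half-standard-deviation effect more often than not, while 50 per group would catch it in roughly seven experiments out of 10.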

Many of the same recommendations have been endorsed by Nosek and his colleague Jeffrey Spies at the University of Virginia, under the rubric of the Center for Open Science, which is the parent organization for the Reproducibility Project. Another of their goals is to make raw data available online, even before the related paper is written and published. That would solve the file-drawer problem, since failed experiments would no longer be buried away, and if your results are later called into question, you couldn’t claim the raw data was lost in a hard-drive crash or inadvertently thrown away.

This amounts to a whole new approach to experimental social science, emphasizing cooperation over competition and privileging the slow accretion of understanding over dramatic, counterintuitive results. “It’s completely changed the way I do research,” Simmons says. “Whenever I see an effect, the first thing I do is see if I can get it again. And if I can’t, I’ll say ‘forget it.’ I’ve abandoned whole projects because of this.”

Of course, pretty much all scientific findings are tentative; the late John Maddox, legendary editor of the journal Nature, when asked how much of what appeared under its unimpeachable rubric might be wrong, answered “all of it.” What distinguishes science from religion, and from postmodern theory for that matter, is that it is built from testable—and falsifiable—hypotheses.

“Most of these studies will ultimately end up in the dustbin of history,” says Nina Strohminger, the Duke post-doc, “not because of misconduct, but because that’s how the scientific process works. The problem isn’t that many studies fail to replicate. It’s that we believe in them before they’ve been thoroughly vetted.” Strohminger’s own field of interest, related to priming, is on the effect of emotion on moral judgments. But she has also spent the last couple of years replicating experiments for the Reproducibility Project, out of a broader concern for the future of empirical research in psychology. “I want to spend my life doing this,” she says. “And I want people to believe and to trust it.”

The growing movement to police errors in psychology recently gained a noteworthy recruit. In 2011, a middle-aged graduate student at the University of East London grew suspicious of an influential 2005 paper in the trendy subfield of positive psychology, written by Barbara Fredrickson and Marcial Losada. The paper, which had been cited some 350 times, purported to define the precise ratio of positive to negative emotions necessary for human flourishing (2.9013 to one, in case you were wondering). The graduate student, Nick Brown, discovered that the equations in the paper derived from the entirely unrelated field of fluid dynamics, and were essentially meaningless as applied to psychology. But he needed the help of more eminent scientists to bring those flaws to the attention of the psychological establishment. In July 2013, Brown and two co-authors published a paper in American Psychologist, debunking the famous positivity ratio. Brown’s most outspoken champion, and co-author, was a physicist by the name of Alan Sokal.*

This post originally appeared in the May/June 2014 issue of Pacific Standard as “The Reformation.”

*UPDATE — April 29, 2014: We originally misspelled Barbara Fredrickson’s name.
