Genes Are Us


(Photo: bloomua/Shutterstock)

Why Statistically Significant Studies Aren’t Necessarily Significant

• June 06, 2014 • 9:52 AM

Modern statistics have made it easier than ever for us to fool ourselves.

Scientific results often defy common sense. Sometimes this is because science deals with phenomena that occur on scales we don’t experience directly, like evolution over billions of years or molecules that span billionths of meters. Even when it comes to things that happen on scales we’re familiar with, scientists often draw counter-intuitive conclusions from subtle patterns in the data. Because these patterns are not obvious, researchers rely on statistics to distinguish the signal from the noise. Without the aid of statistics, it would be difficult to convincingly show that smoking causes cancer, that drugged bees can still find their way home, that hurricanes with female names are deadlier than ones with male names, or that some people have a precognitive sense for porn.

OK, very few scientists accept the existence of precognition. But Cornell psychologist Daryl Bem’s widely reported porn precognition study illustrates the thorny relationship between science, statistics, and common sense. While many criticisms were leveled against Bem’s study, in the end it became clear that the study did not suffer from an obvious killer flaw. If it hadn’t dealt with the paranormal, it’s unlikely that Bem’s work would have drawn much criticism. As one psychologist put it after explaining how the study went wrong, “I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid.”

That you can lie with statistics is well known; what is less commonly noted is how much scientists still struggle to define proper statistical procedures for handling the noisy data we collect in the real world. In an exchange published last month in the Proceedings of the National Academy of Sciences, statisticians argued over how to address the problem of false positive results, statistically significant findings that on further investigation don’t hold up. Non-reproducible results in science are a growing concern; so do researchers need to change their approach to statistics?

Valen Johnson, at Texas A&M University, argued that the commonly used threshold for statistical significance isn’t as stringent as scientists think it is, and that researchers should therefore adopt a tighter threshold to better filter out spurious results. In reply, statisticians Andrew Gelman and Christian Robert argued that tighter thresholds won’t solve the problem; they simply “dodge the essential nature of any such rule, which is that it expresses a tradeoff between the risks of publishing misleading results and of important results being left unpublished.” In their view, the acceptable level of statistical significance should vary with the nature of the study. Another team of statisticians raised a similar point, arguing that a more stringent significance threshold would exacerbate the worrying publication bias against negative results. Ultimately, good statistical decision making “depends on the magnitude of effects, the plausibility of scientific explanations of the mechanism, and the reproducibility of the findings by others.”
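To make the tradeoff concrete, here is a minimal Python sketch of what a tighter threshold costs in missed real effects. It rests on assumptions that are not taken from the exchange itself: a simple two-sided z-test, an illustrative standardized effect size of 0.3, a sample of 50, and 0.005 as a stand-in for a stricter cutoff. The function name `power_two_sided_z` is hypothetical.

```python
from statistics import NormalDist


def power_two_sided_z(effect_size: float, n: int, alpha: float) -> float:
    """Chance that a two-sided z-test flags an effect of the given size.

    Assumes a known unit variance, so the test statistic is normal
    with mean effect_size * sqrt(n) and standard deviation 1.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # cutoff for significance
    shift = effect_size * n ** 0.5              # how far a real effect moves the statistic
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Illustrative numbers only: a modest real effect, a modest sample.
for alpha in (0.05, 0.005):
    detected = power_two_sided_z(effect_size=0.3, n=50, alpha=alpha)
    print(f"alpha={alpha}: real effect detected about {detected:.0%} of the time")
```

Under these assumptions the stricter cutoff admits fewer flukes, but it also catches the real effect far less often. That is the tradeoff Gelman and Robert describe.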

However, arguments over statistics usually occur because it is not always obvious how to make good statistical decisions. Some bad decisions are clear. As xkcd’s Randall Munroe illustrated in his comic on the spurious link between green jelly beans and acne, most people understand that if you keep testing slightly different versions of a hypothesis on the same set of data, sooner or later you’re likely to get a statistically significant result just by chance. This kind of statistical malpractice is called fishing or p-hacking, and most scientists know how to avoid it.
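The arithmetic behind fishing is short enough to sketch in a few lines of Python. This assumes the tests are independent, that no real effect exists, and that the usual 0.05 threshold applies; the count of 20 tests is an illustrative choice, and `familywise_error_rate` is a hypothetical helper name.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

With 20 looks at the same question, a "significant" result by chance alone is more likely than not.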

But there are more subtle forms of the problem that pervade the scientific literature. In an unpublished paper, statisticians Andrew Gelman, at Columbia University, and Eric Loken, at Penn State, argue that researchers who deliberately avoid p-hacking still unknowingly engage in a similar practice. The problem is that one scientific hypothesis can be translated into many different statistical hypotheses, with many chances for a spuriously significant result. After looking at their data, researchers decide which statistical hypothesis to test, but that decision is skewed by the data itself.

To see how this might happen, imagine a study designed to test the idea that green jellybeans cause acne. There are many ways the results could come out statistically significant in favor of the researchers’ hypothesis. Green jellybeans could cause acne in men, but not in women, or in women but not men. The results may be statistically significant if the jellybeans you call “green” include Lemon Lime, Kiwi, and Margarita but not Sour Apple. Gelman and Loken write that “researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances.” In the end, the researchers may explicitly test only one or a few statistical hypotheses, but their decision-making process has already biased them toward the hypotheses most likely to be supported by their data. The result is “a sort of machine for producing and publicizing random patterns.”
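A small simulation can show how a single, data-dependent test inflates false positives. This is a rough sketch under stated assumptions, not Gelman and Loken's own analysis: the outcomes are pure noise with known unit variance, there are only two subgroups, and the researcher tests whichever subgroup looks stronger after seeing the data. The names `z_stat` and `simulate` are hypothetical.

```python
import random
import statistics

Z_CRIT = 1.96  # two-sided 5% cutoff for a z-test


def z_stat(treated, control):
    """z statistic for a difference in means, assuming known unit variance
    and equal group sizes."""
    n = len(treated)
    diff = statistics.fmean(treated) - statistics.fmean(control)
    return diff / (2.0 / n) ** 0.5


def simulate(trials=4000, n=30, seed=0):
    """Return (prespecified, data_dependent) false positive rates."""
    rng = random.Random(seed)
    prespecified_hits = 0
    data_dependent_hits = 0
    for _ in range(trials):
        # No true effect anywhere: jellybeans do nothing in either subgroup.
        z_by_group = []
        for _group in ("men", "women"):
            treated = [rng.gauss(0, 1) for _ in range(n)]
            control = [rng.gauss(0, 1) for _ in range(n)]
            z_by_group.append(z_stat(treated, control))
        # Plan fixed in advance: always test the first subgroup.
        if abs(z_by_group[0]) > Z_CRIT:
            prespecified_hits += 1
        # Plan chosen after looking: run one test, on the stronger-looking subgroup.
        if max(abs(z) for z in z_by_group) > Z_CRIT:
            data_dependent_hits += 1
    return prespecified_hits / trials, data_dependent_hits / trials


prespecified, data_dependent = simulate()
print(f"prespecified: {prespecified:.3f}, data-dependent: {data_dependent:.3f}")
```

In both plans the researcher runs exactly one test per study. The prespecified rate should land near the nominal 5 percent, while the data-dependent rate should sit near 10 percent, because the choice of test has already been steered by the noise.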

Gelman and Loken are not alone in their concern. Last year Daniele Fanelli, at the University of Edinburgh, and John Ioannidis, at Stanford University, reported that many U.S. studies, particularly in the social sciences, may overestimate the effect sizes of their results. As they wrote, “All scientists have to make choices throughout a research project, from formulating the question to submitting results for publication.” These choices can be swayed “consciously or unconsciously, by scientists’ own beliefs, expectations, and wishes, and the most basic scientific desire is that of producing an important research finding.”

What is the solution? Part of the answer is to not let measures of statistical significance override our common sense—not our naïve common sense, but our scientifically-informed common sense. We shouldn’t put much stock in one statistically significant precognition result that defies everything we know about the physical world. Studies with small, unrepresentative samples can be valuable, but we should treat them cautiously before they are replicated with other samples. As Gelman and Loken put it, without modern statistics most people would not believe a remarkable claim about general human behavior “based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.”

Michael White
Michael White is a systems biologist at the Department of Genetics and the Center for Genome Sciences and Systems Biology at the Washington University School of Medicine in St. Louis, where he studies how DNA encodes information for gene regulation. He co-founded the online science pub The Finch and Pea. Follow him on Twitter @genologos.

Copyright © 2014 by Pacific Standard and The Miller-McCune Center for Research, Media, and Public Policy. All Rights Reserved.