Is There a Statistically Significant/real Difference Between the Means? How Do You Know This?

For too long, many scientists' careers have been built around the pursuit of a single statistic: p<.05.

In many scientific disciplines, that's the threshold beyond which study results can be declared "statistically significant," which is often interpreted to mean it's unlikely the results were a fluke, a result of random chance.

But that isn't what it actually means in practice. "Statistical significance" is too often misunderstood, and misused. That's why a trio of scientists writing in Nature this week are calling "for the entire concept of statistical significance to be abandoned."

Their biggest argument: "statistically significant" or "not statistically significant" is too easily misinterpreted to mean either "the study worked" or "the study did not work." A "true" effect can sometimes yield a p-value of greater than .05. And we know from recent years that science is rife with false-positive studies that achieved values of less than .05 (read my explainer on the replication crisis in social science for more).

The Nature commentary authors argue that the math is not the problem. Instead, it's human psychology. Bucketing results into "statistically significant" and "statistically non-significant," they write, leads to a too black-and-white approach to scrutinizing science.

More than 800 other scientists and statisticians across the world have signed on to this manifesto. For now, it seems more like a provocative argument than the start of a real sea change. Nature, for one, "is not seeking to change how it considers statistical analysis in evaluation of papers at this time," the journal noted.

But the tides may be rising against "statistical significance." This isn't the first time scientists and statisticians have challenged the status quo. In 2016, I wrote about how a large group of them called for raising the threshold to .005, making it much harder to call a result "statistically significant." (Concurrently with the Nature commentary, the journal The American Statistician devoted an entire issue to the problem of "statistical significance.") There's a broad recognition that p-values can be problematic.

I suspect this proposal will be heavily debated (as is everything in science). At the least, this latest call for radical change does highlight an important fact plaguing science: Statistical significance is widely misunderstood. Let me walk you through it. I think it will help you understand this debate better, and help you see that there are a lot more ways to judge the merits of a scientific finding than p-values.

Wait, what is a p-value? What's statistical significance?


Even the simplest definitions of p-values tend to get complicated, so bear with me as I break it down.

When researchers calculate a p-value, they're putting to the test what's known as the null hypothesis. First thing to know: This is not a test of the question the experimenter most desperately wants to answer.

Let's say the experimenter really wants to know if eating one bar of chocolate a day leads to weight loss. To test that, they assign 50 participants to eat one bar of chocolate a day. Another 50 are instructed to abstain from the delicious stuff. Both groups are weighed before the experiment and then after, and their average weight change is compared.

The null hypothesis is the devil's advocate argument. It states there is no difference in the weight loss of the chocolate eaters versus the chocolate abstainers.

Rejecting the null is a major hurdle scientists need to clear to prove their hypothesis. If the null stands, it means they haven't eliminated a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?

So how do they rule out the null? They calculate some statistics.

The researcher basically asks: How ridiculous would it be to believe the null hypothesis is the true answer, given the results we're seeing?

Rejecting the null is kind of like the "innocent until proven guilty" principle in court cases, Regina Nuzzo, a mathematics professor at Gallaudet University, explained. In court, you start off with the assumption that the defendant is innocent. Then you start looking at the evidence: the bloody knife with his fingerprints on it, his history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naive. At a certain point, jurors get the feeling, beyond a reasonable doubt, that the defendant is not innocent.

Null hypothesis testing follows a similar logic: If there are huge and consistent weight differences between the chocolate eaters and chocolate abstainers, the null hypothesis (that there are no weight differences) starts to look silly and you can reject it.
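To make that concrete, here is a minimal sketch in Python of what such a null hypothesis test might look like for the chocolate example, using a standard two-sample t-test. The weight-change numbers, group means, and spreads below are invented purely for illustration; they are not from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated weight change in kilograms (negative = weight lost).
# The means, spread, and sample sizes are made-up illustrative values.
chocolate_eaters = rng.normal(loc=-1.0, scale=2.0, size=50)
abstainers = rng.normal(loc=0.0, scale=2.0, size=50)

# Null hypothesis: the two groups' average weight change is the same.
t_stat, p_value = stats.ttest_ind(chocolate_eaters, abstainers)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A small p says results like these would be rare if the null were true.
# It says nothing about *why* the groups differ or how well the study was run.
```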

You might be thinking: Isn't this a pretty roundabout way to prove an experiment worked?

You are correct!

Rejecting the null hypothesis is indirect evidence of an experimental hypothesis. It says nothing about whether your scientific conclusion is correct.

Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty eating candy every day, and they knew they were going to be weighed by strangers wearing lab coats (weird!), so they skimped on other meals.

Rejecting the null doesn't tell you anything about the mechanism by which chocolate causes weight loss. It doesn't tell you if the experiment is well designed, or well controlled for, or if the results have been cherry-picked.

It just helps you understand how rare the results are.

But, and this is a tricky, tricky point, it's not how rare the results of your experiment are. It's how rare the results would be in the world where the null hypothesis is true. That is, it's how rare the results would be if nothing in your experiment worked and the difference in weight was due to random chance alone.

Here's where the p-value comes in: The p-value quantifies this rareness. It tells you how often you'd see the numerical results of an experiment (or even more extreme results) if the null hypothesis is true and there's no difference between the groups.

If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. So when the p is small, researchers start to think the null hypothesis looks improbable. And they take a leap to conclude "their [experimental] data are pretty unlikely to be due to random chance," Nuzzo explains.

Here's another tricky point: Researchers can never completely rule out the null (just like jurors are not firsthand witnesses to a crime). So scientists instead pick a threshold where they feel pretty confident that they can reject the null. For many disciplines, that's now set at less than .05.

Ideally, a p of .05 means that if you ran the experiment 100 times (again, assuming the null hypothesis is true), you'd see these same numbers, or more extreme results, five times.
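You can see that definition in action with a short simulation sketch. In the made-up setup below, the null hypothesis is true by construction (both groups are drawn from the same distribution), and roughly 5 percent of the simulated experiments still cross the .05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_per_group = 10_000, 50

# Draw both groups from the same distribution, so the null really is true.
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n_per_group),
                    rng.normal(0, 1, n_per_group)).pvalue
    for _ in range(n_experiments)
])

# By construction, roughly 5 percent of these p-values land below .05.
print(f"Share of p-values below .05: {np.mean(p_values < 0.05):.3f}")
```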

And one final, super-thorny concept that almost everybody gets wrong: A p<.05 does not mean there's less than a 5 percent chance your experimental results are due to random chance. It does not mean there's only a 5 percent chance you've landed on a false positive. Nope. Not at all.

Again: A p-value of less than .05 means that there is less than a 5 percent chance of seeing these results (or more extreme results) in the world where the null hypothesis is true. This sounds nitpicky, but it's critical. It's the misunderstanding that leads people to be unduly confident in p-values. The false-positive rate for experiments at p=.05 can be much higher than 5 percent.
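Here is a rough back-of-the-envelope sketch of why. The specific numbers (what share of tested hypotheses are actually true, how powerful the studies are) are illustrative assumptions of mine, not figures from the article, but they show how the share of false positives among "significant" results can balloon well past 5 percent.

```python
# All three numbers below are assumptions chosen for illustration.
prior_true = 0.10   # assume 10% of hypotheses scientists test are actually true
power = 0.80        # assume an 80% chance of detecting a true effect
alpha = 0.05        # the usual significance threshold

true_positives = prior_true * power            # 0.08
false_positives = (1 - prior_true) * alpha     # 0.045

# Among all "significant" results, what share are false positives?
false_discovery_rate = false_positives / (true_positives + false_positives)
print(f"Share of significant results that are false positives: {false_discovery_rate:.0%}")
# Prints roughly 36%, far above the 5 percent people intuitively expect.
```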

Let's repeat it: P-values don't necessarily tell you if an experiment "worked" or not

Psychology PhD student Kristoffer Magnusson has designed a pretty cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.

Let'due south say there'due south a written report where the bodily difference between two groups is equal to one-half a standard deviation. (Yes, this is a nerdy way of putting it. Merely call up of it like this: It means 69 pct of those in the experimental group show results college than the hateful of the control group. Researchers call this a "medium-size" effect.) And let's say there are l people each in the experimental group and the control group.

In this scenario, you should only be able to obtain a p-value between .03 and .05 around 7.62 percent of the time.

If you ran this experiment over and over again, you'd actually expect to see a lot more p-values with a much lower number. That's what the following chart shows. The x-axis is the specific p-values, and the y-axis is the frequency with which you'd find them when repeating this experiment. Look how many p-values you'd find below .001.

This is why many scientists grow wary when they see too many results cluster around .05. It shouldn't happen that often, and it raises red flags that the results have been cherry-picked, or, in science-speak, "p-hacked." In science, it can be much too easy to game and tweak statistics to achieve significance.

And from this chart, you'll see: Yes, you can obtain a p-value of greater than .05 when an experimental hypothesis is true. It just shouldn't happen as often. In this case, around 9.84 percent of all p-values should fall between .05 and .1.
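You can roughly reproduce this scenario yourself with a short Monte Carlo simulation rather than the calculator. The sketch below simulates the same setup, a true effect of half a standard deviation with 50 people per group; the shares it prints should land near the percentages quoted above, though they will wobble a bit from run to run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_per_group, effect = 20_000, 50, 0.5   # effect = half a standard deviation

p_values = np.array([
    stats.ttest_ind(rng.normal(effect, 1, n_per_group),
                    rng.normal(0.0, 1, n_per_group)).pvalue
    for _ in range(n_sims)
])

print(f"p in (.03, .05]: {np.mean((p_values > 0.03) & (p_values <= 0.05)):.1%}")  # roughly 7-8%
print(f"p in (.05, .10]: {np.mean((p_values > 0.05) & (p_values <= 0.10)):.1%}")  # roughly 10%
print(f"p below .001:    {np.mean(p_values < 0.001):.1%}")                        # a sizable share
```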

There are better, more nuanced approaches to evaluating science

Many scientists recognize there are more robust ways to evaluate a scientific finding. And they already engage in them. But they, somehow, don't currently hold as much power as "statistical significance." They are:

  • Concentrating on effect sizes (how big of a difference does an intervention make, and is it practically meaningful? See the sketch after this list.)
  • Confidence intervals (what's the range of uncertainty built into any given answer?)
  • Whether a result is a novel study or a replication (put some more weight into a theory many labs have looked into)
  • Whether a study's design was preregistered (so that authors can't manipulate their results post-test), and that the underlying data are freely accessible (so anyone can check the math)
  • There are also alternative statistical techniques, like Bayesian analysis, that in some ways more directly evaluate a study's results. (P-values ask the question "how rare are my results?" Bayes factors ask the question "what is the probability my hypothesis is the best explanation for the results we found?" Both approaches have trade-offs.)
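As promised above, here is a minimal sketch of the first two items, an effect size (Cohen's d) and a 95 percent confidence interval for the difference in group means. The data are simulated stand-ins for real experimental measurements, and the formulas are the standard pooled-variance versions, not anything prescribed by the commentary itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated stand-ins for real experimental data (50 people per group).
treatment = rng.normal(0.5, 1.0, 50)
control = rng.normal(0.0, 1.0, 50)

diff = treatment.mean() - control.mean()

# Effect size (Cohen's d): the mean difference in units of the pooled standard deviation.
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# 95% confidence interval for the difference in means (pooled-variance t interval).
se = pooled_sd * np.sqrt(1 / len(treatment) + 1 / len(control))
margin = stats.t.ppf(0.975, len(treatment) + len(control) - 2) * se

print(f"Cohen's d = {cohens_d:.2f}")
print(f"Mean difference = {diff:.2f}, 95% CI = [{diff - margin:.2f}, {diff + margin:.2f}]")
```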

The real problem isn't with statistical significance; it's with the culture of science

The authors of the latest Nature commentary aren't calling for the end of p-values. They'd still like scientists to report them where appropriate, but not necessarily label them "significant" or not.

There's likely to be argument around this strategy. Some might think it's useful to have simple rules of thumb, or thresholds, to evaluate science. And we still need to have phrases in our language to describe scientific results. Erasing "statistical significance" might just confuse things.

In any case, changing the definition of statistical significance, or nixing it entirely, doesn't address the real problem. And the real problem is the culture of science.

In 2016, Vox sent out a survey to more than 200 scientists asking, "If you could change one thing about how science works today, what would it be and why?" One of the clear themes in the responses: The institutions of science need to get better at rewarding failure.

One young scientist told us, "I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter."

The biggest problem in science isn't statistical significance; it's the culture. She felt torn because young scientists need publications to get jobs. Under the status quo, in order to get publications, you need statistically significant results. Statistical significance alone didn't lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.


Source: https://www.vox.com/latest-news/2019/3/22/18275913/statistical-significance-p-values-explained
