Below is the first part of an article in the NYT that looks at why we use tests of statistical significance in scientific research. The issue is highly relevant to medical research, as virtually every medical journal article reports tests of statistical significance for its findings.

The discussion below is quite right in reporting that many statisticians think we are too lenient in what we accept as statistically significant. I think so too. Accepting a 5% chance that a correlation is due to chance alone seems risky in medical research, which should surely observe high standards of proof before conclusions are drawn.

What the article overlooks, however, is that statistically aware researchers have never been much impressed by the 5% standard. It is quite common for results significant by that standard to be seen as preliminary or tentative. For researchers to report their results with any confidence that the effect they describe is real, a 1% standard has long been the informal criterion. And to the extent that this is so, the criticisms below rather fall by the wayside: there is already a stronger standard in informal use.
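To make the difference between the two standards concrete, here is a minimal sketch in Python (the summary statistics are invented for illustration): a modest group difference that clears the conventional 5% hurdle but falls short of the stricter 1% one.

```python
from scipy import stats

# Hypothetical summary statistics: two groups of 50, with a modest
# difference of 0.45 standard deviations between their means.
result = stats.ttest_ind_from_stats(
    mean1=0.45, std1=1.0, nobs1=50,
    mean2=0.00, std2=1.0, nobs2=50,
)

print(f"p-value: {result.pvalue:.3f}")             # about 0.027
print("significant at 5%:", result.pvalue < 0.05)  # True
print("significant at 1%:", result.pvalue < 0.01)  # False
```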

I taught statistical analysis at a major Australian university for a number of years, so I have had some opportunity to reflect on why we use tests of statistical significance. The major point that is rather overlooked below is that an effect can be significant in a statistical sense but not in any other sense. All a significance test does is exclude randomness. Even a tiny correlation can be shown to be non-random if the sample size is large enough, yet a tiny correlation may be of no practical importance or use at all.
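A small sketch makes the point. The numbers below are invented for illustration: a true correlation of around .01 is trivial by any practical standard, yet in a large enough sample it comes out as overwhelmingly "significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented data: a true correlation of roughly 0.01 in a sample of one million.
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.4f}")   # around 0.01 -- it explains about 0.01% of the variance
print(f"p = {p:.1e}")   # yet far below the 5% (or even 1%) threshold
```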

So the function of significance testing is simply to act as a filter. Such a test enables us to say of some correlation: "This correlation is NOT EVEN statistically significant." In that case, its significance in any other sense is unlikely to be worth bothering about. It is, in other words, just a very preliminary filter that helps us sort out which correlations may be worthy of our attention.
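Used that way, the test is just a screening device. The sketch below (again with invented data; the predictor names are arbitrary) shows the filter in action: the pure-noise variables will usually, though not always, fail it, and only the survivors merit any further scrutiny.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n = 200
outcome = rng.normal(size=n)

# Invented candidate predictors: four are pure noise, one ("e") is weakly related.
predictors = {name: rng.normal(size=n) for name in "abcd"}
predictors["e"] = 0.3 * outcome + rng.normal(size=n)

for name, values in predictors.items():
    r, p = stats.pearsonr(values, outcome)
    verdict = "worth a closer look" if p < 0.05 else "not even statistically significant"
    print(f"{name}: r = {r:+.2f}, p = {p:.3f} -> {verdict}")
```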

In recent weeks, editors at a respected psychology journal have been taking heat from fellow scientists for deciding to accept a research report that claims to show the existence of extrasensory perception.

The report, to be published this year in The Journal of Personality and Social Psychology, is not likely to change many minds. And the scientific critiques of the research methods and data analysis of its author, Daryl J. Bem (and the peer reviewers who urged that his paper be accepted), are not winning over many hearts.

Yet the episode has inflamed one of the longest-running debates in science. For decades, some statisticians have argued that the standard technique used to analyze data in much of social science and medicine overstates many study findings — often by a lot. As a result, these experts say, the literature is littered with positive findings that do not pan out: “effective” therapies that are no better than a placebo; slight biases that do not affect behavior; brain-imaging correlations that are meaningless.

By incorporating statistical techniques that are now widely used in other sciences — genetics, economic modeling, even wildlife monitoring — social scientists can correct for such problems, saving themselves (and, ahem, science reporters) time, effort and embarrassment.

“I was delighted that this ESP paper was accepted in a mainstream science journal, because it brought this whole subject up again,” said James Berger, a statistician at Duke University. “I was on a mini-crusade about this 20 years ago and realized that I could devote my entire life to it and never make a dent in the problem.”

The statistical approach that has dominated the social sciences for almost a century is called significance testing. The idea is straightforward. A finding from any well-designed study — say, a correlation between a personality trait and the risk of depression — is considered “significant” if its probability of occurring by chance is less than 5 percent.

This arbitrary cutoff makes sense when the effect being studied is a large one — for example, when measuring the so-called Stroop effect. This effect predicts that naming the color of a word is faster and more accurate when the word and color match (“red” in red letters) than when they do not (“red” in blue letters), and is very strong in almost everyone.

“But if the true effect of what you are measuring is small,” said Andrew Gelman, a professor of statistics and political science at Columbia University, “then by necessity anything you discover is going to be an overestimate” of that effect.
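Gelman's point can be checked with a quick simulation (a rough sketch with assumed numbers, not a reconstruction of any particular study): if the true effect is small and only the "significant" results get reported, the reported estimates will, on average, exaggerate the true effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

true_effect = 0.1      # small true effect, in standard-deviation units
n_per_group = 50       # a modest, fairly typical sample size
n_studies = 10_000     # many hypothetical replications of the same study

significant_estimates = []
for _ in range(n_studies):
    treated = rng.normal(true_effect, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05 and t > 0:  # only "significant" positive results get reported
        significant_estimates.append(treated.mean() - control.mean())

print("true effect:              ", true_effect)
print(f"average reported estimate: {np.mean(significant_estimates):.2f}")
# The estimates that survive the filter average several times the true effect.
```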

More here

Posted by John J. Ray (M.A.; Ph.D.).
