Dealing with missing data by random guessing

October 03, 2024

It's knowledge time!

Ouch! Knowledge hurts

As we all know, data is often incomplete. It's a stinking plague in the biological and social sciences. We need stinking statistics to deal with it. Quanta Magazine writes:

When Data Is Missing, Scientists Guess.
Then Guess Again.

Data is almost always incomplete. Patients drop out of clinical trials and survey respondents skip questions; schools fail to report scores, and governments ignore elements of their economies. When data goes missing, standard statistical tools, like taking averages, are no longer useful.

“We cannot calculate with missing data, just as we can’t divide by zero,” said Stef van Buuren, the professor of statistical analysis of incomplete data at the University of Utrecht.

Suppose you are testing a new drug to reduce blood pressure. You measure the blood pressure of your study participants every week, but a few get impatient: Their blood pressure hasn’t improved much, so they stop showing up.

You could leave those patients out, keeping only the data of those who completed the study, a method known as complete case analysis. That may seem intuitive, even obvious. It’s also cheating. If you leave out the people who didn’t complete the study, you’re excluding the cases where your drug did the worst, making the treatment look better than it actually is. You’ve biased your results.
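To see why complete case analysis flatters the drug, here is a small simulation sketch. Everything in it is an assumption made for illustration, not something from the article: the normal distribution of blood pressure changes, the dropout probabilities, and the function name `simulate_dropout_bias`.

```python
import random
import statistics

def simulate_dropout_bias(n=10_000, seed=0):
    """Compare the true mean change with the mean among trial completers."""
    rng = random.Random(seed)
    # Hypothetical change in blood pressure (mmHg); negative means improvement.
    changes = [rng.gauss(-5.0, 10.0) for _ in range(n)]

    def drops_out(change):
        # Assumed rule: patients who did not improve drop out 70% of the
        # time, everyone else only 10% of the time.
        p_drop = 0.7 if change >= 0 else 0.1
        return rng.random() < p_drop

    completed = [c for c in changes if not drops_out(c)]
    return statistics.mean(changes), statistics.mean(completed)

true_mean, complete_case_mean = simulate_dropout_bias()
print(f"true mean change:          {true_mean:.2f} mmHg")
print(f"complete-case mean change: {complete_case_mean:.2f} mmHg")
```

Under these assumed numbers, the completers show a larger average drop in blood pressure than the full group, which is the bias the article describes.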

But in the 1970s, a statistician named Donald Rubin proposed a general technique, albeit one that strained the computing power of the day. His idea was essentially to make a bunch of guesses about what the missing data could be, and then to use those guesses. This method met with resistance at first, but over the past few decades, it has become the most common way to deal with missing data in everything from population studies to drug trials. Recent advances in machine learning might make it even more widespread.

Outside of statistics, to “impute” means to assign responsibility or blame. Statisticians instead assign data. If you forget to fill out your height on a questionnaire, for instance, they might assign you a plausible height, like the average height for your gender.

That kind of guess is known as single imputation. A statistical technique that dates back to 1930, single imputation works better than just ignoring missing data. By the 1960s, it was often statisticians’ method of choice. Rubin would change that.
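The height example can be sketched in a few lines. The questionnaire rows and the helper name `impute_group_mean` are made up for illustration; this is one simple way to do single imputation, not a reference implementation.

```python
import statistics

# Hypothetical questionnaire rows; None marks a skipped height (cm).
responses = [
    {"gender": "F", "height": 162.0},
    {"gender": "F", "height": 168.0},
    {"gender": "F", "height": None},
    {"gender": "M", "height": 176.0},
    {"gender": "M", "height": 182.0},
    {"gender": "M", "height": None},
]

def impute_group_mean(rows, group_key, value_key):
    """Return copies of rows with missing values replaced by the group mean."""
    groups = {}
    for row in rows:
        if row[value_key] is not None:
            groups.setdefault(row[group_key], []).append(row[value_key])
    means = {group: statistics.mean(vals) for group, vals in groups.items()}
    return [
        {**row, value_key: means[row[group_key]]}
        if row[value_key] is None
        else dict(row)
        for row in rows
    ]

imputed = impute_group_mean(responses, "gender", "height")
for row in imputed:
    print(row)
```

Note that this sketch assumes every group has at least one observed value. A group with no observed heights would need a fallback, such as the overall average.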

Though single imputation avoided the bias of complete case analysis, Rubin saw that it had its own flaw: overconfidence. No matter how accurate a guess might seem, statisticians can never be completely sure it’s correct.
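The overconfidence shows up directly in the numbers. In the sketch below (simulated heights, with 40% deleted completely at random, both assumptions for illustration), filling every gap with the observed average leaves the mean unchanged but shrinks the standard error, as if the guessed values had really been measured.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical heights in cm; about 40% are then deleted at random.
heights = [rng.gauss(170.0, 10.0) for _ in range(1000)]
observed = [h for h in heights if rng.random() > 0.4]
n_missing = len(heights) - len(observed)

# Single (mean) imputation: every missing value gets the observed average.
filled = observed + [statistics.mean(observed)] * n_missing

def standard_error(values):
    """Standard error of the mean, treating every value as a real measurement."""
    return statistics.stdev(values) / len(values) ** 0.5

se_observed = standard_error(observed)
se_filled = standard_error(filled)
print(f"standard error, observed values only: {se_observed:.3f}")
print(f"standard error, after mean imputation: {se_filled:.3f}")
```

The second standard error is smaller even though no new information was collected. The imputed values add sample size without adding any spread.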

In 1971, a year after completing his doctorate, Rubin started working for the Educational Testing Service in Princeton, New Jersey. When a government agency asked ETS to analyze a survey with missing data, Rubin proposed an unconventional but surprisingly simple solution: Don’t just impute once. Impute multiple times.

Multiple imputation turned out to be both rigorous and versatile. While there are other methods that avoid the drawbacks of single imputation, multiple imputation is the most general: It works any time you might have otherwise tried to use single imputation.
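Here is a minimal sketch of the "impute multiple times" idea for estimating a mean. The pooling step uses the standard combining formulas known as Rubin's rules: average the estimates, then add the within-imputation variance to (1 + 1/m) times the between-imputation variance. The imputation model itself (resample the observed values, then draw from a normal distribution) and the name `multiply_impute_mean` are simplifying assumptions for illustration. Real software uses richer models.

```python
import random
import statistics

def multiply_impute_mean(observed, n_missing, m=20, seed=0):
    """Estimate a mean from m imputed data sets, pooled with Rubin's rules."""
    rng = random.Random(seed)
    n = len(observed) + n_missing
    estimates, variances = [], []
    for _ in range(m):
        # Resample the observed values so each round uses slightly different
        # parameters, reflecting uncertainty about the mean and the spread.
        boot = rng.choices(observed, k=len(observed))
        mu, sigma = statistics.mean(boot), statistics.stdev(boot)
        # Random guesses for the missing values (assumed normal model).
        completed = observed + [rng.gauss(mu, sigma) for _ in range(n_missing)]
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / n)
    pooled = statistics.mean(estimates)        # average of the m estimates
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # disagreement between guesses
    total = within + (1 + 1 / m) * between     # Rubin's total variance
    return pooled, total ** 0.5

rng = random.Random(2)
# Hypothetical heights in cm; about 40% are then deleted at random.
heights = [rng.gauss(170.0, 10.0) for _ in range(1000)]
observed = [h for h in heights if rng.random() > 0.4]
n_missing = len(heights) - len(observed)

pooled_mean, pooled_se = multiply_impute_mean(observed, n_missing)
filled = observed + [statistics.mean(observed)] * n_missing
single_se = statistics.stdev(filled) / len(filled) ** 0.5
print(f"pooled mean: {pooled_mean:.2f}, pooled standard error: {pooled_se:.3f}")
print(f"single-imputation standard error: {single_se:.3f}")
```

The pooled standard error comes out larger than the single-imputation one, because the between-imputation term accounts for how much the guesses disagree with each other.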

Software for multiple imputation still struggles with the largest and most complicated data sets. But new multiple-imputation software that uses machine learning has been able to impute more complicated data. This, in turn, has introduced multiple imputation to fields like engineering, where ad hoc methods have been more common. That said, some researchers still worry about the mathematical rigor of these new techniques and are more hesitant to adopt them.

Whether scientists are testing a new drug or analyzing voting patterns, random guesses are helping them stay honest about what they know.

Vast amounts of data

By Germaine: He's got a handle on randomness

. . . . and watch out for statisticians too!
