When Analyzing Big Data Goes Wrong

Nick Emptage

If you follow health policy these days, you may be finding yourself inundated with the hype over Big Data. Those of us with longer memories might caution that health care has leapt from fad to fad almost constantly over the past few decades, always with disappointing results. Yet hardly a day seems to go by without reading that massive, interlinked sets of patient data – medical claims, registries, EHRs, social networks, and mHealth apps – are the answer to one (or all) of modern health care’s troubles: high costs, uneven quality, unnecessary variation in care practices, and so forth. With the right data, we’re told, wasteful or sub-optimal care, rationing, medical errors, and poor outcomes can all be consigned to the uncivilized past.

When reading about the utopian promise of Big Data, I can’t help thinking of a good friend of mine, an audio engineer who’s been nominated for multiple Emmy awards. I think of all his hard work, all the networking, all the time spent learning and honing his skills. And while a studio full of fancy Mac computers has helped him a lot, you’d hardly say they were essential; without all the hard work and expertise, Macs alone can’t make you a great engineer. I also think about things like the creation of Facebook. Mark Zuckerberg had a prescient idea of the way people would use the Internet to socialize, and he had the determination and focus to execute it properly. He also knew a programming language called PHP in which to develop the site, but obviously, knowing PHP was not the primary reason he was successful.

Here’s what Macs and PHP have in common with Big Data: all three of them are tools. They’re all useful in their way, but by themselves they aren’t very interesting. Importantly, they only help you achieve bigger goals if you use them correctly. The glaring flaw in the Big Data hype is that, all too often, the data are presented as a solution in and of themselves, rather than as a means to an end. And there doesn’t seem to be much thought going into the question of how to use them effectively. To anyone coming from the world of health services research, this shouldn’t really be surprising. Whether you’re talking about the cottage industry of claims data research or the mining of survey data, health care researchers are notorious for conducting fishing expeditions into data sources that weren’t designed to study the questions they’re asking, and for over-interpreting the correlations they find.

The simple reality is that Big Data need to be used in a careful way if they’re to fulfill their promise of transforming medicine and not simply serve as a platform for researchers trying to pad their CVs. Like Macs or PHP, Big Data are most helpful if they’re used in a particular way. Specifically, you need to keep three things in mind before you start frantically running regressions:

1. You need a clear idea of what you’re using the data for. One unavoidable feature of large datasets calls to mind a saying from a project manager I used to work for: “Data will tell you anything if you torture them long enough.” In other words, Big Data will show you pretty much any relationship between two variables that you care to look for, as long as you manipulate your model the right way. To produce valid analyses with these data, you need to start with a clearly defined research question, based on a sound, coherent theory, and an analytic model that addresses it. Otherwise, your ability to produce results that are anything other than worthless and misleading will basically come down to random dumb luck. Put simply, you can’t fall back on the habit of letting the data do your thinking for you.

2. You need to understand that Big Data have inherent flaws. The first of these is the risk of Type I error (a false positive); remember, just because something is statistically significant and seems superficially plausible doesn’t mean it’s real. In a dataset so large that nearly every relationship you study appears to be valid, you’ll need to think hard and carefully about whether a correlation you’ve observed is genuine or spurious. You’ll need to understand the underlying clinical phenomenon you’re studying, and you’ll need to ask yourself tough questions about what the data aren’t telling you. Otherwise you risk adding to the echo chamber of bad health services research. And we have quite enough of that already.

3. The second big flaw of Big Data relates to the hype. These data are often discussed in terms of predicting behavior and outcomes, and clearly when you talk about implementing any change in policy your concern is with the future and not the past. Now (take a deep breath with me), unless I’m not up on the latest advances in the basic ontology of the universe, it’s fundamentally impossible to use historical data to predict the future. And no, “but this is lots and lots of data!” is not really an answer to this. Any analysis of statistical correlation makes the implicit assumption that the underlying relationship holds outside of the specific sample and the specific points in time in which the data were collected. Yet in any social science, where unmeasured factors are always changing and the subjects are capable of learning and adapting, this assumption is absurd.
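
To make the first two points concrete, here is a small simulation of a fishing expedition. It is only a sketch under assumptions of my own: the 500 "patients," the 200 candidate predictors, and the function names are all hypothetical, and every variable is pure random noise, so no real relationship exists to be found. It also uses a normal approximation (a cutoff of 1.96) for the significance test rather than an exact t distribution.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_hits(n_patients=500, n_predictors=200, seed=0):
    """Count noise predictors that look 'significant' against a noise outcome.

    All numbers here are hypothetical; nothing is drawn from real patient data.
    """
    rng = random.Random(seed)
    # The outcome is random noise, unrelated to anything.
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    hits = 0
    for _ in range(n_predictors):
        # Each candidate predictor is independent random noise as well.
        predictor = [rng.gauss(0, 1) for _ in range(n_patients)]
        r = pearson_r(predictor, outcome)
        # t statistic for a correlation; with 500 rows the normal
        # approximation (|t| > 1.96 for p < 0.05) is close enough here.
        t = r * math.sqrt((n_patients - 2) / (1 - r * r))
        if abs(t) > 1.96:
            hits += 1
    return hits


hits = count_spurious_hits()
print(f"{hits} of 200 pure-noise predictors look 'significant' at p < 0.05")
```

On average, about 5 percent of the noise predictors will clear the significance bar by chance alone, and each one could be written up with a plausible-sounding story. The more variables you test without a prior hypothesis, the more of these false positives you collect.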

It would overstate things to say that Big Data have no value. But the risks of taking them too far are real, and underappreciated. When used in the context of a sound research design, and interpreted with great caution, data on historical care delivery can teach us a great deal about what we do right and what we do wrong. What they can’t be is a magic bullet. They can’t help us predict the future, and they can’t do our thinking for us. Most importantly, they can’t offer what payers and analysts seem to want most: a way out of the responsibility for making difficult decisions about providing and paying for care.