If you’re reading this, chances are you’ve heard about the Oregon Medicaid study recently published in the New England Journal of Medicine. In case you’ve been on vacation, however, the results were, at best, a disappointment to advocates of Medicaid: a large, multi-year randomized controlled trial failed to show that coverage through the program produced meaningful health benefits, relative to remaining uninsured. As landmark publications of policy interventions are pretty rare, and insofar as the ACA health reform is counting on a dramatic expansion of Medicaid to achieve its goal of universal coverage, the paper has predictably set off a spirited debate within the health policy community. Those on the right are pointing to the findings in calling for a halt to the ACA Medicaid expansion. Among mainstream analysts, we’ve seen a call for caution in interpreting these results (though similar restraint was not advocated by the same crowd when more positive interim results were published in 2011). As in the case of other studies published in areas of contentious debate (see also: screening mammography), having a high-quality trial testing the question of interest, and a p-value associated with that test, is not enough to clarify the matter. I’ve written at length about the Oregon study already, but I’ll be honest: I don’t much care about the study’s findings, for the simple reason that there’s no way a single study can offer a conclusive answer to a research question.
One reason for this is what you might call the random-effects problem. If you ran the Oregon study again, random error alone means you’d end up with another set of estimates; run it again in a different sample of people, and you’d get still another set of findings. If you replicated the study a bunch of times, eventually you’d start to see the average of the results converge on the “true” effect. As such, findings from single studies are best thought of as single draws from a distribution of possible results. This, of course, implies that it’s really hard to conclude anything from empirical studies, and that you need an enormous amount of data to say anything with confidence, but … that’s actually completely the case. When you estimate a treatment effect from a single trial, there’s no real way of knowing whether that estimate is close to the mean of that theoretical distribution of results, or whether it’s an outlier. (Note: if you’re familiar with the work of John Ioannidis, this won’t surprise you much.)
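To make the "single draw from a distribution" idea concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the true effect of 0.2, the 200 subjects per arm, the normally distributed outcomes, and the function names. None of these numbers come from the Oregon study.

```python
import random
import statistics

def run_trial(true_effect, n_per_arm, noise_sd, rng):
    """Simulate one two-arm trial and return its estimated treatment effect."""
    # Hypothetical outcomes: control centered at 0, treated shifted by true_effect.
    control = [rng.gauss(0.0, noise_sd) for _ in range(n_per_arm)]
    treated = [rng.gauss(true_effect, noise_sd) for _ in range(n_per_arm)]
    return statistics.mean(treated) - statistics.mean(control)

def replicate(n_trials, true_effect=0.2, n_per_arm=200, noise_sd=1.0, seed=0):
    """Re-run the same hypothetical trial n_trials times with fresh samples."""
    rng = random.Random(seed)
    return [run_trial(true_effect, n_per_arm, noise_sd, rng)
            for _ in range(n_trials)]

estimates = replicate(1000)

# Any one trial is a single draw; only the collection reveals the distribution.
print("first single trial:   ", round(estimates[0], 3))
print("smallest estimate:    ", round(min(estimates), 3))
print("largest estimate:     ", round(max(estimates), 3))
print("mean of all estimates:", round(statistics.mean(estimates), 3))
print("sd of the estimates:  ", round(statistics.stdev(estimates), 3))
```

In a setup like this, individual estimates scatter widely around the assumed true effect while their average sits close to it. An analyst holding only one of those estimates has no way of telling where in that scatter it fell.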
Many policy researchers, of course, don’t really care about this; they care less about whether the effect estimate is perfectly accurate than about whether it’s “significant”. The debate over whether or not Medicaid has downstream effects on health has had an either/or flavor, and we’ve been told over and over that the ACA’s Medicare pilots and comparative effectiveness research will tell us “what works” in medical care. In practice, concluding that something “works” from empirical research involves looking at the p-value and seeing whether the effect is “significant” (i.e., whether the p-value is small enough to conclude that the observed effect is unlikely to be due to random influences). But does a “significant” test result really tell you that something works?
I know this runs counter to everything smart people tell you about statistics, but bear with me. The process of inferring “significance” starts with a null hypothesis H, which states that the groups don’t differ. If the null hypothesis is true, then the probability of observing no difference between the groups is high (with “equality between groups” E defined via reference to a statistical distribution suggesting the threshold at which an observed difference is unlikely to occur by chance). Then we do our study, and observe a difference between groups that’s larger than our “likely” threshold (i.e., not-E). The conclusion most researchers will draw is: “then H is probably not true”.
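For readers who want to see the mechanics, the procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a large-sample normal (z) approximation for a difference in means, a conventional two-sided threshold of 0.05, and made-up outcome data. The function name two_sample_z_test is hypothetical, and this is not the analysis used in the Oregon study.

```python
import math
import random
import statistics

def two_sample_z_test(group_a, group_b, alpha=0.05):
    """Return (z, p_value, reject_null) for a difference in group means.

    Assumes samples large enough for a normal approximation. The null
    hypothesis H is "the groups do not differ"; E is "the observed
    difference falls inside the range expected from sampling error".
    """
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    std_err = math.sqrt(statistics.variance(group_a) / len(group_a)
                        + statistics.variance(group_b) / len(group_b))
    z = diff / std_err
    # Two-sided p-value: chance of a |z| at least this large if H were true.
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    # Observing not-E (p below alpha) is conventionally read as "reject H".
    return z, p_value, p_value < alpha

# Made-up outcome scores for two hypothetical groups of 500 people each.
rng = random.Random(1)
insured = [rng.gauss(50.5, 10.0) for _ in range(500)]
uninsured = [rng.gauss(50.0, 10.0) for _ in range(500)]

z, p_value, reject_null = two_sample_z_test(insured, uninsured)
print("z statistic:", round(z, 2))
print("p-value:", round(p_value, 3))
print("conventional verdict, reject H:", reject_null)
```

Note what the function returns: a statistic, a tail probability computed on the assumption that H is true, and a yes/no verdict that follows purely from a convention about the threshold. Nothing in it computes the probability that H itself is true or false.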
One obvious problem here is that H isn’t a random variable (hypotheses are either true or they aren’t), so discussing it in terms of probabilities is nonsense. But even if we toughen up the inference from our hypothesis test (i.e., “if we see not-E, we assume that H is false”), our conclusion still doesn’t really follow. In many ways, hypothesis testing is a sort of game that researchers agree to play, wherein observing a difference believed to be unlikely based on sampling error alone leads to the conclusion that the null is false. Yet proving a conjecture requires a lot more than this. The conjecture either needs to be impossible to dispute without contradicting yourself, or you need to be able to demonstrate it inductively by observing it over samples of homogeneous subjects. Clearly, the first approach rarely works in health services research, since few claims about services or policy are necessarily true. To prove a contention via the second approach, you’d need to go through all the subjects in each treatment group and show that the difference between groups is maintained throughout the sample. Even that is only a first step: generalizing to different subject samples is one problem, as is the fact that people adapt to policy interventions over time (meaning the underlying relationships aren’t constant). But the bigger point is that a conclusion drawn from a comparison of mean differences between groups doesn’t come close to either proving or refuting the hypothesis. Put simply, the information from a hypothesis test is not sufficient for proving the truth or falsity of H. And as such, we still don’t know whether or not Medicaid has any effect on health status.
When I write things like the above, I’m sometimes accused of being anti-research or anti-science. Nothing could be further from the truth. What I’m opposed to is the misuse of science. Had the Oregon study found significant effects on health status, I’d feel the same way about the trial. Insofar as physicians, patients and policymakers all carry biases into their work, and most people tend to generalize wildly from their own experiences, repeated observation over time can perform a valuable service in helping us to understand health care delivery in a more objective way. But this is very different from using limited sets of observations to leap to broad conclusions, and from asserting the truth or falsehood of theories without doing anything resembling a rigorous proof. If health services researchers want the responsibility of their work being used to guide medical practice, they really need to start stepping up their game.