"Research malpractice" or best practice? Debating "rigor" in criminal justice reform
Arnold Ventures' criminal justice portfolio lead accused Vera Institute researchers of, well, screwing it up bigtime. I take a look under the hood.
Hey folks! I’ve got a new piece up at the Scatterplot blog which digs into a juicy showdown between Arnold Ventures and the Vera Institute, two big players in criminal justice reform. You could think of it as a follow-up to my piece published in Inquest last year on the politics of causal inference methodology. This new piece is wonky, but I did my best to make it accessible – please give it a read and let me know what you think!
If you get through that piece and think to yourself, “I wish it was three times longer and didn’t have an editor” – well, the rest of this post is for you. This is less about the political stakes and more about the technical debate. So if you don’t care about the nitty gritty, just read the Scatterplot piece and tap out. But if you do… read the Scatterplot piece, then come back here and finish the job.
Opiates for the matches
The best use of matching for causal inference is to make quasi-randomly assigned treatment and control groups more similar across observed characteristics. The figure below shows what matching does – if the treatment and control groups differ on some observed trait, matching retroactively mitigates that problem. By the time researchers are considering matching, they’ve already identified some kind of problem with covariate balance; with genuinely random treatment assignment, we likely wouldn’t see these kinds of differences in the first place. Another way to think about matching is that it artificially constructs treatment and control groups from a sample. Ideally, though, you already have some other treatment assignment mechanism, and matching is just there to tweak the balance of the resulting groups.
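If it helps to see the mechanics, here’s a minimal sketch of what I mean – entirely made-up data and garden-variety nearest-neighbor propensity score matching, not the Vera study’s code or variables. Selection into “treatment” depends on two observed covariates, and matching shrinks the standardized mean differences on those covariates toward zero. Crucially, it can only do that for traits we actually measure.

```python
# Illustrative sketch only: simulated data, not the Vera study's design or code.
# Shows how nearest-neighbor propensity score matching improves balance on
# *observed* covariates -- and says nothing about unobserved ones.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 5000

# Two observed covariates (say, age and prior record) that also drive selection.
age = rng.normal(30, 8, n)
priors = rng.poisson(2, n)

# Selection into "treatment" (program participation) depends on observables.
logit = -2 + 0.04 * age + 0.3 * priors
treat = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, priors])

def std_mean_diff(x, t):
    """Standardized mean difference between treated and control groups."""
    pooled_sd = np.sqrt((x[t == 1].var() + x[t == 0].var()) / 2)
    return (x[t == 1].mean() - x[t == 0].mean()) / pooled_sd

print("Before matching:",
      [round(std_mean_diff(X[:, j], treat), 2) for j in range(2)])

# Estimate propensity scores and match each treated unit to its nearest control.
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]
controls = np.where(treat == 0)[0]
treated = np.where(treat == 1)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[controls].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_controls = controls[idx.ravel()]

# Balance in the matched sample (treated units plus their matched controls).
keep = np.concatenate([treated, matched_controls])
t_keep = np.concatenate([np.ones(len(treated)),
                         np.zeros(len(matched_controls))]).astype(int)
print("After matching: ",
      [round(std_mean_diff(X[keep][:, j], t_keep), 2) for j in range(2)])
```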
One reason why some researchers are so dogmatic about randomization (“randomistas”) is that it also gets at unobserved characteristics. The figure above isn’t from the Vera study, but imagine that there are a few more rows on that coefficient plot labeled “motivation” and “focus,” and there are no blue or red dots for those rows at all, because the relevant characteristics aren’t measured. With randomization, most expert readers are happy to assume that no matter what the unmeasured characteristics are, they’re balanced (or close to it) between treatment and control groups. But that requires an experiment. Sort of.
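A toy simulation of that point (my own, hypothetical numbers): give everyone an unmeasured “motivation” score, then compare the treatment-control gap in that score under self-selection versus a coin flip. Randomization balances the trait even though it never enters any model; self-selection doesn’t.

```python
# Sketch under assumptions: "motivation" is a trait we never get to measure.
# Random assignment balances it anyway; self-selection does not.
import numpy as np

rng = np.random.default_rng(1)
n = 10000
motivation = rng.normal(0, 1, n)  # unobserved in any real dataset

# Self-selection: more motivated people opt into the program.
opt_in = rng.binomial(1, 1 / (1 + np.exp(-1.5 * motivation)))

# Randomization: a coin flip that ignores motivation entirely.
randomized = rng.binomial(1, 0.5, n)

for label, t in [("self-selected", opt_in), ("randomized", randomized)]:
    gap = motivation[t == 1].mean() - motivation[t == 0].mean()
    print(f"{label:>13}: treated-control gap in motivation = {gap:+.2f}")
```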
Ethics of randomization
Doleac took issue with Vera researchers’ assertion that conducting an experimental study would have been unethical. In a randomized controlled trial (RCT), researchers randomly assign participants to treatment and control groups, which allows them to sidestep selection bias problems. RCTs are often called the “gold standard” of causal inference methods, although this slogan and accompanying status may have more to do with the professional interests of development economists.
Vera argues that RCTs would have been the wrong approach to answer this research question. I’m still a little puzzled about why RCTs loom so large in their statement, but I think this is a response to Doleac saying (paraphrase): we don’t have much quality evidence about prison education, but there is an RCT in progress which will be informative. Vera researchers argue that running an experiment to answer this research question would have deprived otherwise willing participants of the opportunity to earn an education; they also note that RCTs are costly, small-scale, and slow to implement. This is a concern about “equipoise,” the principle that randomization is only ethical when researchers are genuinely uncertain about which condition is better – knowingly assigning human subjects to a worse condition violates it. Vera researchers are not the first to raise this issue; for example, advocates rang alarm bells in 2021 when researchers randomized certain components of local pretrial detention and legal aid processes. The idea there was – we probably know that giving defendants legal advice is going to improve outcomes, so why deprive the control group of that obvious material benefit for a small improvement to the rigor of the study?
Random vs. “as-if random”
The thing is, Doleac was also dinging Vera for not using a “natural experiment” approach. The methods she’s referring to are instrumental variables (IV), difference-in-differences (DiD), and regression discontinuity (RD). These involve more analytical trade-offs than RCTs and are more complicated to set up and interpret, but they don’t raise the same ethical problems. Quoting myself in Inquest:
Enter “natural experiments,” where sources of randomness or sudden change in the social world are exploited by statistical models to isolate the process of interest. These approaches can be more complicated than RCTs. For example, we might believe that the “treatment” is not randomly assigned in the real world; this was the case in my recent study of traffic stops and voter turnout, where Black drivers were stopped at higher rates. We deal with this kind of concern by trying to drill down into narrower groups of people in the data where we can credibly claim an approximation of random treatment assignment.
Vera didn’t touch this argument, which I think was a missed opportunity. A lot of economists sounded off in Doleac’s replies suggesting ways that the researchers could have addressed selection bias without incurring ethical problems: randomizing the timing of participation, comparing outcomes around cutoff points, etc. I got into it a little bit with Doleac’s friend and coauthor Anna Harvey, who was frustrated with Vera for presenting a binary between matching and RCTs. I said: well, by “RCTs,” Vera was probably referring to “RCTs+DiD+IV+RD.” Harvey’s reply? “Disagree.” A day later, Arnold released an RFP specifying it was only interested in proposals using precisely these methods. So I think I got it right in that case. But it was fair for Harvey to respond to the text of the Vera statement, which was quite narrowly focused on RCT ethics.
Brainstorming research designs gets us into speculative territory – I have no idea how many of these ideas the researchers tried. Maybe they genuinely didn’t try enough clever identification strategies; I don’t know. On the other hand, it is possible to make credible causal claims in quantitative research without using any of these designs. Under certain assumptions, regression can be a credible identification strategy – and matching is basically just linear regression with a pared-down sample and some extra steps. In one project I’m collaborating on, we tried a natural experiment approach and ultimately didn’t think the model made much sense. So we ended up presenting a simpler model, because we think that, based on our substantive knowledge and model specification, the amount of unobserved confounding is pretty low. (It helps that we observe a lot of stuff that readers would normally point out as a source of selection bias.) But we also make an extensive argument for why that’s the case. Doleac would probably say that our findings for that analysis are also worthless; we disagree, and other researchers probably disagree, too.
The point is, the assumptions are the load-bearing component of causal inference. It's not that matching is NEVER viable. It's just that if you took 100 studies using matching to make causal claims and compared them to 100 studies using difference-in-differences, I’d bet you $5 that the number of credible studies in the second group is larger – primarily because difference-in-differences is set up specifically to deal with selection bias from time-invariant confounders. You can actually combine DiD with matching, too – a study I coauthored combines DiD, matching, and a discontinuity in time – so whenever I see a model with only matching, I think, “Hmm, you must be confident about time-invariant confounders!” In English, these are characteristics of studied individuals which do not change over time. We can safely ignore them under certain assumptions underlying certain designs. Random treatment assignment is generally peak confidence on this front, and IV/DiD/RD generally try to emulate it. (IV is when treatment assignment is plausibly semi-random for certain observations thanks to some external source of variation, and RD is when a cutoff on some running variable – a test score, an age threshold, sometimes a date – performs the lottery function.) We say: “as-if random.” I’m just more skeptical that a design with only matching gets us there.
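A stylized two-period sketch of why DiD earns that confidence – simulated data, not any of the studies discussed here. Each person carries an unobserved, time-invariant trait that raises both their outcome and their odds of being treated. The naive post-period comparison absorbs that confounder; differencing each person’s own pre/post change removes it, because the trait is constant within person. (The sketch builds in parallel trends by construction – if the unobservable’s influence changed over time, DiD would break too.)

```python
# Stylized two-period DiD with simulated data (not from any study discussed here).
# An unobserved, time-invariant trait ("grit") confounds the naive comparison
# but cancels out when we difference each unit's own pre/post outcomes.
import numpy as np

rng = np.random.default_rng(2)
n = 20000
grit = rng.normal(0, 1, n)                        # unobserved, fixed over time
treat = rng.binomial(1, 1 / (1 + np.exp(-grit)))  # selection on the unobservable
true_effect = 2.0

y_pre = 5 + 3 * grit + rng.normal(0, 1, n)                               # period 0
y_post = 5 + 3 * grit + 1.0 + true_effect * treat + rng.normal(0, 1, n)  # period 1
# (the +1.0 is a common time trend shared by both groups -- the parallel-trends part)

naive = y_post[treat == 1].mean() - y_post[treat == 0].mean()
did = ((y_post - y_pre)[treat == 1].mean()
       - (y_post - y_pre)[treat == 0].mean())

print(f"true effect: {true_effect:.2f}")
print(f"naive post-period comparison: {naive:.2f}   (polluted by grit)")
print(f"difference-in-differences:    {did:.2f}   (grit differences cancel)")
```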
This brings us back to the Vera prison education study. One counterargument against Doleac that was raised in my conversations runs like this (paraphrase): actually, participation was limited less by motivation and more by the approval process. If the groups did not drastically differ according to motivation (i.e., most of the control group would have liked to participate but weren’t able to for some arbitrary bureaucratic reason), Doleac’s objection would be irrelevant. I mean, it’s true that this would be closer to a lottery, but if this were the case, it kind of sounds like running an experiment wouldn’t have changed much from the status quo, ethically speaking. That just brings us to the argument (posed by Doleac’s peers) that randomly assigning participation or non-participation could potentially be fairer than whatever bureaucratic red tape prevented participation in the program as it actually ran.
There's kind of an undercurrent in this debate regarding substantive knowledge. The way we would adjudicate between arguments over the extent of confounding is substantive knowledge about the program and the context in which it's implemented. Part of the issue is that Doleac completely ignores anything not written by economists. But on the other hand, I think Nick Turner could have defended against this kind of objection by actually bringing some of this substantive knowledge to bear on the question instead of asserting it's a non-issue. Doleac’s specific argument about propensity score matching (PSM) is reasonable and shared among many non-economists, including my own colleagues. I think Doleac would have gone nuclear here no matter what, since for her, if it’s not RCT/DiD/RD/IV, it’s shit. But there are a lot of folks like me who aren’t as dogmatic as Doleac but still don’t look kindly upon the use of PSM in the presence of unobserved confounding, and who may have been convinced by an argument drawing on substantive knowledge.
tl;dr: We need to know about the assumptions that enable causal identification, we want to poke at each of those assumptions until we're confident they're credible, and the way that we gauge this credibility is by drawing on substantive knowledge about the studied case (plus some math stuff regarding model specification).
Rising tides
Another point I want to raise about all of this is that some economists probably have too much confidence in quasi-experimental research designs. (This is actually a major fault line within economics.) In a working paper that I don’t want to publicly distribute in full just yet, a colleague and I argue that these ways of knowing sometimes seem like a house of cards on the verge of collapse. The paradigm that Doleac is working in is about two decades old, and the goalposts are shifting so fast that it’s hard to keep up. Quoting from our paper:
Consider, for example, a recent article arguing that the F-statistic threshold assessing the strength of instruments in single-instrument instrumental variables (IV) analyses should exceed 104.7 (Lee et al. 2022)... applying their newly developed adjustment to single-instrument IV analyses recently published by the American Economic Review “would cause about one-fourth of the specifications to be statistically insignificant at the 5 percent level.” Similarly, a survey of causal analyses in top political science journals using the regression discontinuity (RD) design finds that over-reliance on the method “cause[s] concern that many published findings using the RD design are exaggerated, if not entirely spurious” (Stommes et al. 2023).
In other words, a design that Doleac might herald as top-quality in 2023 might look pretty weak by 2026. I have started to imagine this phenomenon as a rising tide flooding all of the little log cabins we’ve built. If the only kind of evidence we can consider regarding a specific policy has to rely on the same four or five research designs – and the acceptable boundaries of those research designs keep changing at the present pace – we will basically never know anything about any policy’s effectiveness, at least not for very long. Nancy Cartwright has a beautiful analogy wherein social scientists are Jacana birds building their floating nests. Methods like RCTs are short, rigid twigs, which are useful sometimes – but a nest needs all sorts of materials to float. (I guess in this case the rigid twigs have some sort of planned obsolescence thing going on.)
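To put a number on the goalpost-shifting: with a single instrument, the first-stage F is just the squared t-statistic on the instrument, so a design that comfortably clears the old F > 10 rule of thumb can fall far short of the 104.7 threshold quoted above. A toy sketch with simulated data (my own numbers, not from any published study):

```python
# Sketch under assumptions: a single-instrument IV first stage on simulated data,
# with the first-stage F-statistic compared against the old rule-of-thumb cutoff
# (F > 10) and the Lee et al. (2022) threshold quoted above (F > 104.7).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(0, 1, n)                    # instrument
u = rng.normal(0, 1, n)                    # unobserved confounder
x = 0.15 * z + u + rng.normal(0, 1, n)     # endogenous treatment; weak-ish instrument
y = 1.0 * x + u + rng.normal(0, 1, n)      # outcome (not needed for the first stage)

first_stage = sm.OLS(x, sm.add_constant(z)).fit()
f_stat = first_stage.tvalues[1] ** 2       # with one instrument, F = t^2
print(f"first-stage F = {f_stat:.1f}")
print("passes old F > 10 rule of thumb:       ", f_stat > 10)
print("passes Lee et al. (2022) 104.7 threshold:", f_stat > 104.7)
```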
I don’t actually think that the Vera study’s matching design evades this kind of criticism. It self-consciously adopts the exact same kind of “rigid” analytical framework and, frankly, is not as convincing as it could have been. But Doleac only even clocks this study because it contests the causal inference terrain. She probably doesn’t read research that uses qualitative methods, for example. Yet this kind of research actually holds up much better in the long run! And it might even be better for policy evaluation – after all, detailed descriptions of how a program worked in a historical case can help us formulate predictions about which features of other cases might make the program more or less likely to succeed, assuming similar implementation.
Prediction and policy evaluation
Speaking of policy evaluation, I touched on the notion that evaluation and prediction are actually not very compatible. Causal research questions are primarily historical, and they can either work backwards (What were the most important factors catalyzing social revolutions?) or forwards (We gave money to group A and not to group B; did group A do better on some outcome?). By contrast, the question we actually want to answer about most policies is fundamentally predictive: If we implement prison education programs more widely, will people make more money and stay out of prison when they get out? Causal inference methods can tell us basically nothing about this question, because they function by “zooming in.” Observations for which identification assumptions are not credible are shaved away from the analysis, leaving a “clean” subgroup for which we can credibly make inferences. Economists call this a “local average treatment effect” or LATE, which was humorously invoked in a paper title as part of a related debate (“better LATE than nothing”). But the point here is that the LATE applies to a small group of people, not the entire population from the case in question or the entire population of interest more broadly. Quoting again from my Inquest piece:
We might believe that the “treatment” is not randomly assigned in the real world; this was the case in my recent study of traffic stops and voter turnout, where Black drivers were stopped at higher rates. We deal with this kind of concern by trying to drill down into narrower groups of people in the data where we can credibly claim an approximation of random treatment assignment. This means that credible causal claims in social scientific literature today are extremely narrow.
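Here’s a toy version of that narrowness – simulated data, nothing to do with prison education. The instrument only moves treatment status for “compliers,” so the IV estimate recovers their effect, which can sit far from the average effect in the whole population whenever effects are heterogeneous:

```python
# Toy illustration (simulated, my own construction): with heterogeneous effects,
# IV recovers the effect for "compliers" -- people whose treatment status is
# moved by the instrument -- not the average effect for everyone.
import numpy as np

rng = np.random.default_rng(4)
n = 200000

# Three latent types: always-takers, never-takers, compliers.
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.5, 0.3])
z = rng.binomial(1, 0.5, n)                    # randomized encouragement
d = np.where(types == "always", 1,
     np.where(types == "never", 0, z))         # compliers follow the instrument

# Treatment effects differ by type (compliers benefit less in this toy setup).
effect = np.where(types == "complier", 1.0, 3.0)
y = effect * d + rng.normal(0, 1, n)

ate = effect.mean()                             # population average effect
late = ((y[z == 1].mean() - y[z == 0].mean())   # Wald / IV estimate
        / (d[z == 1].mean() - d[z == 0].mean()))

print(f"population ATE: {ate:.2f}")
print(f"IV (LATE) estimate: {late:.2f}   <- applies to compliers only")
```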
Don’t get me wrong, I think that drilling down into internal validity is certainly an improvement upon a lot of older quantitative work. We can aggregate across historical cases evaluated using causal inference methods, and that improves our idea of “external validity” or “generalizability” to the same extent. But as David Thacher argues in his discussion of Arnold-funded methods standards, research agendas emphasizing “programs” construe individual cases as general in a way that we might not actually believe. I’ll quote him at length:
Often, the strategies cannot be distinguished from the institutional environment in which they are delivered—at their best, they adapt to and reshape the context that surrounds them—so it becomes hard to isolate a meaningful “strategy” to evaluate. A central goal of interventions like community policing, third-party policing, and problem-oriented policing is to strengthen police capabilities by leveraging cooperation with outside organizations and reshaping the way they operate, so the value of these forms of proactive policing must depend on the character and flexibility of a particular community’s organizational landscape (p. 223). Even a narrow administrative tool like predictive policing is not a self contained intervention but a set of guidelines about how to deploy an organization’s existing stock of data and capacity for crime analysis, which may vary from place to place [...] When the very nature of a strategy is this deeply entangled with so many different layers of its environment, evaluation results are not well understood as the impact of a disembodied intervention but of a complex and highly localized interplay between intervention and environment. The question to ask about what the findings from one place mean for others is not really a question about the external validity of the “treatment effect” but about what the “treatment” actually is.
Predicting stuff in the social world requires social scientists to construe different cases as instances of the same thing. (Again, qualitative research which produces thick descriptions of program evaluations is probably the best way to figure out whether two instantiations of a program are genuinely comparable.) Policy research necessitates some uncomfortable flattening of cases which are conceivably similar in a couple of ways but different in a lot of other ways. I’m not actually sure why causal inference methodologists have such a stranglehold over policy evaluation (and that’s a research question in itself), since prediction and causation so rarely overlap. I prefer to use causal inference in purely academic settings as part of a broader research agenda which aims to explain stuff that already happened. (And causal inference has pretty low explanatory value; it’s used to confirm whether the effect is present, not to explain it.) For policy, it would not necessarily be the first tool I pull out of the toolbox. But again, I also don’t really do policy research. My research agenda emphasizes learning and explaining, not so much predicting and controlling.
Social scientists don’t often talk about this stuff in public settings because, frankly, it undermines our professional authority. It means that when we release a study about a specific program’s effectiveness in a particular place and a particular time, we actually don’t know much about whether a similar program would work somewhere else, later. As Thacher puts it, “it shines a small light in a large, dark room.” If you’re talking to a reporter or a legislator, this kind of statement would make them question why they are even talking to you about your study in the first place! This is also probably why folks like Doleac are so aggressive about establishing the boundaries of their professional authority. No authority, no proximity (or is it the other way around?). In this case, though, Doleac seems to have forgotten that she quit her economist job. It’s no longer her discipline to police – she determines policy evaluation funding now, and she holds all the power. But I guess if you’re head honcho, you can just treat Twitter like one giant peer review, and invite all your friends to shit on whoever is unfortunate enough to capture your attention. The risk of this approach is that Glenn Martin will call you “Research Karen” and tell you to flush your research down the toilet.
I hope that the several thousand words I’ve written about research methods and criminal justice reform have been, at the very least, not a snooze!