Closer than they appear: A Bayesian perspective on individual-level heterogeneity in risk assessment
Kristian Lum
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA.
David B. Dunson
Department of Statistical Science, Duke University, Durham, USA.
James Johndrow
Wharton School, University of Pennsylvania, Philadelphia, USA.
E-mail: [email protected]
Summary. Risk assessment instruments are used across the criminal justice system to estimate the probability of some future behavior given covariates. The estimated probabilities are then used in making decisions at the individual level. In the past, there has been controversy about whether the probabilities derived from group-level calculations can meaningfully be applied to individuals. Using Bayesian hierarchical models applied to a large longitudinal dataset from the court system in the state of Kentucky, we analyze variation in individual-level probabilities of failing to appear for court and the extent to which it is captured by covariates. We find that individuals within the same risk group vary widely in their probability of the outcome. In practice, this means that allocating individuals to risk groups based on standard approaches to risk assessment, in large part, results in creating distinctions among individuals who are not meaningfully different in terms of their likelihood of the outcome. This is because uncertainty about the probability that any particular individual will fail to appear is large relative to the difference in average probabilities among any reasonable set of risk groups.
1. Introduction
Actuarial risk assessment instruments (RAIs) are used throughout the criminal justice system to issue data-driven estimates of the likelihood that a contextually relevant outcome will occur. RAIs have been used in criminal justice for decades (Harcourt, 2008; Solow-Niederman et al., 2019), and for pre-trial decision making since at least the Vera Institute of Justice's "point scale" was developed in the 1960s (Paulsen, 1966). In recent years they have played a key role in reform efforts designed to reduce pre-trial detention and eliminate cash bail in the United States. For example, a central component of New Jersey's suite of reforms put in place to overhaul pre-trial decision making was the use of a popular RAI, the Public Safety Assessment (PSA) (Anderson et al., 2019). Similarly, California's Senate Bill 10, legislation passed in 2018 with the intention of reforming California's pre-trial system, mandated the use of risk-based pre-trial decision making across the state (Solow-Niederman et al., 2019).

Despite their adjacency to more progressive criminal justice reform efforts, the fairness of these models has been questioned. For example, in an address to the National Association of Criminal Defense Lawyers that was otherwise optimistic about the uses of RAIs, former United States Attorney General Eric Holder cautioned that "[RAIs] may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system" (Holder, 2014). Legal scholars and civil society groups have raised concerns about the civil rights implications of their use (Robinson and Koepke, 2019). Perhaps most famously, Angwin et al. (2016) of the investigative journalism organization ProPublica published a report claiming that one particular RAI was "biased against blacks," largely based on the finding that the false positive rate of the model was higher for Black people than for white people. This (dis)parity-based notion of fairness has been the focus of much of the technical literature on fairness in risk assessment over the last several years (Flores et al., 2016; Dieterich et al., 2016; Chouldechova, 2017).

Other technical concerns related to the fairness of RAIs center on statistical uncertainty. For example, Hart et al. (2007) argued that standard measures of statistical uncertainty for RAIs pertained to groups, not individuals. Moreover, they argued that the existence of large statistical uncertainty at the individual level makes such models ill-suited to inform decisions about individuals. Extending this idea, we believe one related standard of "fairness" is that individuals who are not statistically distinguishable by the model should not be (or rarely be) treated differently on the basis of the model's predictions. Assessing whether RAIs satisfy these standards of fairness requires fitting models explicitly designed to account for individual-specific probabilities and their associated uncertainty intervals. Without repeated observations of the same individuals over time, this is not possible (Imrey and Dawid, 2015). Unfortunately, perhaps due to lack of availability of data of this type, to our knowledge there has been no work in this domain that uses appropriate statistical models to meaningfully estimate individual probabilities and associated uncertainty about them, leaving the question of whether existing RAIs meet these standards of fairness largely unanswered.

In this paper, we attempt to fill this gap.
Using data on over 460,000 arrests from mid-2014 to the end of 2018 in the state of Kentucky, we build a Bayesian hierarchical model with individual-level random effects. The outcome we model is failure to appear (FTA) for any court appointment prior to the conclusion of the case. We condition on covariates available at decision time, including the factors used in a popular pre-trial risk assessment instrument, Arnold Ventures' PSA. Taking advantage of our large, longitudinal dataset on FTA, in which many individuals appear on multiple occasions, we are able to estimate this model and, importantly, learn the distribution of individual probabilities across the population. The main assumption of our model is that, while probabilities of failing to appear do vary across individuals in a way that is not explained by covariates, they only vary across time for a given individual $i$ in a manner explained by covariates. We later verify that our model probabilities are well-calibrated with respect to several strategies for creation of risk groups. Analysis of the distribution of individual-level probabilities of FTA across the population, in combination with the data available to us, gives rise to some surprising conclusions about the extent to which individuals in the data are distinguishable with respect to their individual-level propensities toward the FTA outcome, even when we have been able to observe the individual's behavior on several previous occasions. To wit, we find that most individuals are effectively indistinguishable from one another. We conclude by explaining that this result is not an inevitable consequence of our model construction: were the population-level distribution of individual probabilities more concentrated, or the available covariates much stronger predictors of the outcome, we could have reached the opposite conclusion, i.e., that many individuals can be distinguished on the basis of their individual propensity toward the outcome.
2. What are risk assessment instruments?
At their core, actuarial RAIs are statistical models that correlate a set of covariates (often, factors that are grounded in criminological theory) with the relevant criminal justice-related outcome (Desmarais and Lowder, 2019). In this paper, we focus on prediction of FTA. However, the methods and concepts we use could be applied to risk assessment for virtually any outcome variable at any stage of the criminal justice process. In the canonical setting, actuarial RAIs estimate the probability, $P$, of a binary outcome, $Y$, conditional on observable covariates, $X \in \mathcal{X}$, a framework common to many prediction problems across fields. For an individual member $i$ of a population $\mathcal{I}$ with covariates $X_i$, this can be expressed by the model $P_i = f(X_i)$ for some function $f : \mathcal{X} \to [0, 1]$ (see, e.g., Goodman (1992); Pretrial Justice Institute and JFA Institute (2012); Lowenkamp (2009); VanNostrand and Keebler (2009); Levin (2012a)). To facilitate easy manual computation of the scores, sometimes the regression coefficients are transformed or restricted to integer values (Jung et al., 2020; Zeng et al., 2017).

Before being displayed to criminal justice decision-makers, the predicted probabilities, $\hat{P}$, under the estimated model are often translated to coarser risk groups, $g \in \mathcal{G}$, by binning the estimated probabilities (see, e.g., Levin (2012b); Pretrial Justice Institute and JFA Institute (2012)). For example, the lowest risk group may be defined as including all individuals for whom $0 \le \hat{P}_i < c_1$ for some policy-maker-determined value $c_1$. Similarly, the $k$th group could be defined as all individuals for whom $c_{k-1} \le \hat{P}_i < c_k$, for some number of risk groups, $k = 1, \ldots, K$. Though equivalent, in practice the bins may be defined with respect to some monotone transformation of the $\hat{P}_i$s (e.g., the un-transformed linear predictor in a logistic regression with integer coefficients). We use the notation $\tilde{g}(\cdot\,; c)$ to denote the function that bins predictions to risk groups. When the mapping is done using $\hat{P}_i$, we write $\tilde{g}(\hat{P}_i; c) = g$ to mean that individual $i$'s point estimate $\hat{P}_i$ places them into bin $g$ according to thresholds $c$ (a minimal sketch of this mapping follows Figure 1).

A cartoon version is shown in Figure 1, in which the distribution of the predicted probabilities $\hat{P}$ from some hypothetical model is shown. Individuals are divided into risk groups indicated by different colored regions. Risk groups may be numbered as shown (see, e.g., COMPAS decile scores discussed in the validation study of Farabee et al. (2010)). They may also be given qualitative labels. For example, individuals whose predicted probabilities fall within the light red region could be labeled as "high" risk. Those in the light orange, yellow, and green regions may be labeled "medium-high", "medium-low", and "low" risk, respectively.

Fig. 1. Hypothetical example of how individuals are partitioned into risk groups based on a model's predicted value or score.
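To make the binning function $\tilde{g}(\cdot\,; c)$ concrete, the sketch below maps point estimates to $K = 6$ risk groups. The thresholds and example values are hypothetical, chosen only for illustration; they are not the cutoffs of any deployed instrument.

```python
import numpy as np

# Hypothetical thresholds c_1 < ... < c_{K-1} partitioning [0, 1] into K = 6
# risk groups; real instruments set these by policy, not by these values.
c = np.array([0.10, 0.15, 0.22, 0.30, 0.38])

def risk_group(p_hat, cutoffs=c):
    """Map point estimates p_hat in [0, 1] to risk groups 1..K.

    Implements g_tilde(p_hat; c): group k collects individuals with
    c_{k-1} <= p_hat < c_k (with c_0 = 0 and c_K = 1).
    """
    return np.digitize(p_hat, cutoffs) + 1

p_hat = np.array([0.04, 0.12, 0.27, 0.55])
print(risk_group(p_hat))  # -> [1 2 4 6]
```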
The risk groups may be used simply to inform the decision-maker about the estimated probability that an individual will have the outcome $Y$ or, in some cases, the risk groups are used to prescribe policy or treatment recommendations. For example, assignment to the highest risk group in one RAI used in New York City makes defendants ineligible for the supervised release program (Lum and Shah, 2019). Elsewhere, the PSA, which is commonly used for pre-trial decision making, often maps risk groups for three separate outcomes (failure to appear for court dates, re-arrest, and violent re-arrest) to recommendations for conditions of release using a Release Conditions Matrix (Advancing Pre-Trial Policy and Research, 2020). Similarly, the Virginia Pretrial Risk Assessment Instrument uses the "Praxis" to map recidivism predictions to release recommendations. Both the Release Conditions Matrix and the Praxis are structured decision-making tools that combine the risk group assignment and charge information to recommend various levels and types of pre-trial supervision.† Notably, in Kentucky, the state from which we obtained our data, although the PSA is used, risk scores or groups are not associated with particular policy recommendations via the Release Conditions Matrix.
3. Quantifying uncertainty in risk assessment models
Given that the model’s predictions and group classifications affect highly consequentialdecisions determining an individual’s liberty, questions surrounding the uncertainty ofthe predictions and classifications naturally arise. However, how one should formulatethe problem of uncertainty quantification in this context— uncertainty about what , so tospeak — has been a point of contentious debate (see, for example, Hanson et al. (2017);Hart et al. (2007); Imrey and Dawid (2015); Dawid (2017)). Here, we discuss three typesof uncertainty quantification relevant to risk assessment in increasing granularity: model-level, group-level, and individual-level.
At the model level, attention rests on evaluating the predictive utility of the model in the aggregate. At this level, emphasis is on establishing that the model's predictions are positively correlated with the outcome of interest beyond what would be expected to occur due to chance alone. Interest also lies in establishing that the factors or covariates included in the model are statistically significant marginal correlates of the outcome. This is sensible from the point of view that one might reasonably demand that any factor that is used to justify a person's assignment to a particular risk group or score be based on scientific evidence that the factor is actually related to the outcome in question. Further, one might equally ask whether the use of all of the factors in combination is actually, to a reasonable degree of statistical certainty, correlated with the outcome. This type of analysis is often performed via traditional statistical testing or interval estimation. For example, several manuals on the development of risk assessment tools report common predictive performance metrics, such as area under the receiver operating characteristic curve (AUC), with associated uncertainty intervals or measures of statistical significance for the model's regression coefficients (e.g., see VanNostrand (2003); Lowenkamp (2009); Levin (2012b); VanNostrand and Keebler (2009)). Sometimes, bivariate tests of dependence between each covariate and the outcome variable are also reported (see, e.g., Levin (2012b)). These metrics speak to the model's overall predictive utility and whether each of the included factors is related to the outcome of interest. However, this sort of analysis does not preclude the possibility that the rate of failure is statistically indistinguishable between risk groups defined by the model's predictions.
Uncertainty quantification at the group level pertains to uncertainty about sub-groups of individuals. Typically those sub-groups are defined by risk groups $\mathcal{G}$. Risk assessment tools can be viewed as a data-driven method for defining policies that partition individuals into groups, where the same treatment is recommended for all individuals within the same group. Recent policy simulations in Kleinberg et al. (2017) and Picard et al. (2019) have suggested that if risk assessment were used in place of human judgment to determine who would be released (i.e., all people within the same group received the same treatment), more individuals could be released while maintaining current rates of pre-trial failure. Perhaps from the point of view of a policy maker whose focus is on the impact of a proposed policy, the relevant question might then pertain to $P_g$, the rate at which the outcome occurs in each group $g$. That is, if the tool were the instantiation of a policy for making release decisions, how uncertain would we be about the expected rate of the outcome among those who would be released under the policy?

† See the VPRAI instruction manual here.

Fig. 2. Example group-wise analysis. Bar lengths indicate point estimates of the FTA rate across all individuals classified to each group. Gray error bars show expected length of classical 95% confidence intervals for data generated with 100 observations per group; black error bars show the equivalent for 1000 observations per group.

Indeed, most group-level uncertainty analysis in risk assessment has focused on uncertainty intervals for $P_g$. For example, Glover et al. (2017) report 95% confidence intervals for the proportion of individuals who were re-arrested within each risk group as defined by the tool, evidently calculated using a normal approximation to the binomial distribution. Hanson et al. (2017) report recidivism rates and confidence intervals for each risk group for the STATIC-99R, a risk assessment tool used to classify sex offenders into risk groups. DeMichele et al. (2018) report bootstrapped 95% confidence intervals for the rate of FTA, new arrest, and new violent arrest for each of the risk groups in the PSA. Scurich and John (2012) purport to take a Bayesian approach to obtain individual-level credible intervals, though in doing so implicitly assume that all individuals within the same risk group have the same $P_i$. Notably, these sorts of uncertainty intervals shrink as the sample size grows, as observing more individuals leads to more certainty about the average outcome within the groups. This is shown in Figure 2, which shows a prototypical group-wise analysis. Bar lengths show the observed rate of the outcome in the data for people classified to each group. Error bars show expected classical 95% confidence intervals for data simulated with 100 individuals per group (gray) and 1000 individuals per group (black). Thus, while these intervals are informative about the average rate of the outcome within each group, they have more to do with the sample size used in the validation study than they do with defining a range of plausible values for each individual in the group. That is, these sorts of analyses say little about uncertainty regarding each individual's propensity toward the outcome, only the uncertainty about the average across individuals within each group.

This framing raises the question of individual-level variability within risk groups and uncertainty quantification for individual-specific probabilities. Hart et al. (2007) and Hart and Cooke (2013) have argued that, for risk assessments to be meaningfully applied to individual judicial decisions, uncertainty quantification at the individual level is of the utmost importance.
Fig. 3. Two possible scenarios. The colored bars show the mean probability of the outcome for each risk group, while the gray densities indicate the distribution of $P_i$ for individuals within each risk group. The left panel shows a case where everyone within each risk group has a very similar probability of the outcome; the right panel shows individual-level probabilities highly dispersed within each risk group.

While the statistical arguments employed by Hart et al. (2007) to support this assertion were ill-formulated (Imrey and Dawid, 2015), the normative argument about the importance of estimates of, and uncertainty quantification for, $P_i$ remains relevant.

To formalize a bit, one might ask: given that an individual has been assigned to group $g$, how different might their individual probability $P_i$ be from the group probability $P_g$ attributed to them? How certain are we about their $P_i$? Given the information available, what is the range of plausible values their individual $P_i$ could take? For $g(i)$ the group to which individual $i$ is assigned, is the implicit assumption that $P_i = \hat{P}_{g(i)}$, a simplification termed individualized risk (Dawid, 2017), approximately true, or is it the case that some individuals within the same group have very different probabilities of the outcome than others? These two different scenarios are shown in Figure 3. In the left panel, the distribution of $P_i$ is tight, and individuals within the same risk group all have essentially the same probability of the outcome. In the right panel, the distribution of $P_i$ is fairly diffuse around the group-wise mean. In this scenario, individuals in the same group can have very different individual-level probabilities of the outcome. Their personal $P_i$ can be closer to the group-wise rate of the outcome for other risk groups than it is to the one to which they were assigned.

In both of these scenarios, the marginal rate of the outcome and the observed data are exactly the same. However, in an ideal world in which knowledge of these true underlying distributions were available, these scenarios may support substantively different decisions. For example, suppose the small variance scenario in the left panel of Figure 3 is true and a judge is presented with a case in which the person who has been assessed falls into risk group $g$ where $P_g = 0.35$ (the highest risk group in Figure 3). If the judge also knows that all individuals in group $g$ have individual probability $P_i \approx P_g = 0.35$ of experiencing the outcome (the case in the left panel), they might be willing to deny release for all individuals. They might believe that every person in this group is deserving of detention by virtue of the fact that they would fail to appear on 35% of all similar occasions at which they were released. Suppose instead the large variance scenario in the right panel of Figure 3 is true. Although the group-wise average remains $P_g = 0.35$, it is also known that about a quarter of the people in this group have individual probability less than 0.2, and another quarter have probability greater than 0.5. The individual probabilities are highly dispersed about the mean. The judge then might not be willing to detain individuals in this group on the basis of this risk score. The judge might believe that, although they don't know which individuals are those with probability less than 0.20, individuals who would only fail at most 20% of the time are not deserving of detention, even if releasing them also means releasing some individuals who would fail 50% of the time. The judge might consider a 25% chance of detaining someone whose personal probability of the outcome is only 20% to be unacceptable.

Unfortunately, the world is not ideal and we do not directly observe the underlying distribution of $P_i$s. Instead, we observe binary $Y_i$s. When only the $Y_i$s are observed, traditional group-wise analysis cannot differentiate the two scenarios. Whether the underlying truth is the "large" or "small" variance scenario, group-wise rates and associated classical confidence intervals under both scenarios are exactly the same for the same sample size.

What, then, can be said about individual probabilities? On a purely mechanical level, it is possible to derive uncertainty intervals for $\hat{P}_i$ from model-level analysis. For example, consider a standard generalized linear model as in equation (1):

$$Y_i \sim \text{Bernoulli}(P_i), \quad P_i = f(X_i \beta). \tag{1}$$

Here, $f$ is a function mapping $X_i \beta$ to $[0, 1]$. Uncertainty intervals for $\hat{P}_i$ can be obtained as a function of the sampling distribution of $\hat{\beta}$. Similar to standard estimates of uncertainty for $\hat{P}_g$, intervals for $\hat{P}_i$ under equation (1) shrink with increasing sample size, as the sampling distribution of $\hat{\beta}$ gets tighter with increasing $N$, the total number of observations. Such intervals are unsatisfying as measures of uncertainty at an individual level, as implicit in the model is the assumption that any two individuals with the same measured covariates necessarily have the same individual probability of the outcome. That is, $P_i$ as defined in this model does not correspond to a notion of individual probability that recognizes that individuals may differ along dimensions that are not captured by covariates. Any systematic variation across individuals that is not captured by any of the measured covariates is assumed away by the model's structure. Estimates of $P_i$ under this model, and uncertainty quantification thereof, then pertain to a quantity that by definition is not a meaningful mathematical embodiment of one's personal, individual-specific probability of the outcome.

This raises the question of what even is an individual-specific probability of an outcome that only occurs once, and how does one write down a model to estimate such a thing? This quickly gets philosophical. Our approach embraces a "personalistic" notion of probability described by Dawid (2017, pg. 3471), which is essentially a Bayesian notion of probability. Fortunately, while the philosophical issues surrounding the notion of personalized probabilities are esoteric, contentious, and difficult, the implementation of such an approach is, methodologically, very standard. Taking a personalized approach to the notion of individual probabilities essentially means doing Bayesian analysis. In practice, Bayesian analysis involves specifying a probabilistic model that incorporates all of the information at our disposal about the process that generates the data. This includes not only the choice of prior hyperparameters, but also distributional assumptions and choices about how to hierarchically structure the model in order to induce patterns of conditional independence that are reasonable given the process being modeled. While frequentist approaches to analogous models are possible, the Bayesian framework we propose eases computation and interpretation of uncertainty.
In the next section, we construct such a model for FTA that is a natural choice for the longitudinal data that we seek to analyze. Importantly, our model incorporates parameters that can fairly be described as pertaining to an individual's personal probability, in that they can vary in individual-specific ways not described by covariates.
4. A Bayesian hierarchical random effects model
For the reasons outlined in the introduction, standard regression approaches to criminal justice risk assessment are not structured to estimate $P_i$ in a way that enables interpretation as pertaining to individuals. Hierarchical random effects models, however, are suited to addressing this problem by explicitly modeling individual-level heterogeneity. Hierarchical models are a central feature of applied Bayesian statistics, and are covered in detail in the well-known text (Gelman et al., 2013, Chapter 5).

Random effects models incorporate individual-specific parameters to allow for variability in expected outcomes across individuals that is not explained by covariates (Gelman et al., 2013). For the parameters of the random effects distribution $f_\eta$ to be identifiable with binary outcome data, it is necessary to have repeated observations on at least some subjects $i$ in the data. Using this framework, we expand the canonical model in (1) by adding individual-specific intercepts $\theta_i$ that are drawn from a common distribution $f_\eta$:

$$Y_{ij} \sim \text{Bernoulli}(P_{ij}), \quad P_{ij} = f(X_{ij}\beta + \theta_i), \quad i = 1, \ldots, m, \; j = 1, \ldots, n_i, \qquad \theta_i \overset{iid}{\sim} f_\eta, \tag{2}$$

where $f_\eta$ is some distribution parameterized by $\eta$. Here, $j$ indexes an individual's observations, with individual $i$ having been observed on $n_i$ occasions. Then, $Y_{ij}$ for $j = 1, \ldots, n_i$ is the outcome for the $i$th individual on the $j$th occasion they were observed, and $X_{ij}$ are the covariates of person $i$ on the $j$th occasion on which they were observed. Under this model, $\theta_i$ represents individual $i$'s individualized propensity toward the outcome. Though their estimated probability may change across occasions as a function of covariates, $\theta_i$ represents their systematic deviation from what would be expected based on the measured covariates alone. If the distribution of $\theta$ has small variance, this indicates that individuals' probabilities do not deviate much from their expected value given covariates. If $\theta$ has large variance, this suggests that there is substantial between-person variability beyond what is explained by the covariates in the model.

To complete a fully Bayesian model, we assign priors to the regression coefficients $\beta$ and the parameters $\eta$ of the random effect distribution:

$$\beta \sim N(\beta_0, \Sigma_0), \quad \eta \sim \pi_\eta. \tag{3}$$

The prior distribution on $\beta$ reflects our prior beliefs about the relationship between the covariates in $X$ and the outcome. The $\beta$ parameters are common across all individuals in the data. We select prior hyperparameters $\beta_0 = 0$ and $\Sigma_0 = 9I$, so each $\beta$ parameter has an independent N(0, 9) prior. We select prior variances of nine because we standardize $X$ before computation, and an effect size of six is very large on the probit scale. We do not expect to see effects any larger than that.

By placing a prior on the random effect parameters $\eta$, we learn the distribution of $\theta_i$ across individuals in the population from the data. To ensure that our qualitative conclusions are robust to the distributional assumptions about the random effects, we consider two different forms for $f_\eta$. The first is a single normal distribution, $\theta_i \sim N(\mu, \tau^2)$. The second is a discrete mixture, $f_\eta(\theta_i) = \sum_{k=1}^K w_k \delta(\theta_i - \mu_k)$, where $\mu_k \sim N(0, 1)$ and $(w_1, \ldots, w_K) \sim \text{Dirichlet}(1/K, \ldots, 1/K)$. We emphasize the results for the single normal distribution in the main text but supply full analysis of the discrete mixture in the appendix. The results vary little between the two models.
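To illustrate the structure of model (2)-(3), the following sketch simulates data from the Gaussian random effects specification with a probit link. All sizes and parameter values here (m, the n_i, beta, mu, tau) are invented for illustration; they are not estimates from the Kentucky data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative sizes and parameters (not estimates from the paper's data).
m = 5000                                  # number of individuals
n_i = rng.integers(1, 5, size=m)          # 1-4 observations per individual
p = 3                                     # number of covariates
beta = np.array([0.3, -0.2, 0.1])         # shared coefficients
mu, tau = -0.9, 0.5                       # mean and scale of random effects

theta = rng.normal(0.0, 1.0, size=m)      # theta_i ~ N(0, 1), uncentered form

person = np.repeat(np.arange(m), n_i)     # row -> individual index
X = rng.normal(size=(person.size, p))     # standardized covariates

# P_ij = Phi(mu + tau * theta_i + x_ij beta); Y_ij ~ Bernoulli(P_ij)
P = norm.cdf(mu + tau * theta[person] + X @ beta)
Y = rng.binomial(1, P)

print(Y.mean())  # marginal outcome rate implied by these parameters
```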
5. Computation
For computational convenience, we use the probit link to specify $f$, and express the model in the uncentered parametrization. When the data are only weakly informative about the $\theta_i$, the uncentered parametrization leads to more rapid convergence of Markov chain Monte Carlo (MCMC) algorithms of the type that we employ in this paper (Papaspiliopoulos et al., 2007). We consider the case where $f_\eta$ is the density of a normal distribution and the parameter $\eta = (\mu, \tau)$ is a vector of the mean and scale of the random effects distribution.

Our complete model in the uncentered parametrization is

$$y_{ij} \sim \text{Bernoulli}(\Phi(\mu + \tau\theta_i + x_{ij}\beta)), \quad \theta_i \overset{iid}{\sim} N(0, 1), \quad \mu \sim N(0, \sigma^2_\mu), \quad \tau \sim N_+(0, \sigma^2_\tau), \quad \beta \sim N(0, 9I),$$

where $N_+$ denotes a normal distribution truncated to the positive half-line and our prior for the standard deviation $\tau$ of the random effects distribution is weakly informative. We use a data augmentation blocked Gibbs sampler for computation. In our data, the total number of observations $N := \sum_{i=1}^m n_i$ is in the hundreds of thousands, while the number of individuals $m$ is also in the hundreds of thousands, since most individuals have $n_i = 1$ or $n_i = 2$. In this setting (many observations but few observations per individual) we have found general-purpose probabilistic programming tools such as Stan to have very high computational cost per iteration, and that blocked data augmentation Gibbs sampling in the uncentered parametrization offers by far the best performance of commonly used algorithms when accounting for both computational cost per step and convergence rate of the MCMC algorithm. Furthermore, this approach can be used with some modifications for both the Gaussian and discrete mixture models for $f_\eta$. The updates for the Gaussian specification of $f_\eta$ are as follows:

(a) Update the $\omega_{ij}$, all conditionally independent, from
$$\omega_{ij} \sim N_{y_{ij}}(\mu + \tau\theta_i + x_{ij}\beta, 1),$$
where $N_y$ is a normal distribution truncated to $(0, \infty)$ if $y$ is positive and truncated to $(-\infty, 0]$ if $y$ is nonpositive.

(b) Update $\tau$ marginal of $\mu$ from $\tau \sim N_+(\alpha_\tau, s_\tau)$, where
$$s_\tau = \left( \sum_{i=1}^m n_i \theta_i^2 - \left( \sum_{i=1}^m n_i \theta_i \right)^2 \left( \sigma_\mu^{-2} + N \right)^{-1} + \sigma_\tau^{-2} \right)^{-1} \tag{4}$$
$$\alpha_\tau = s_\tau \left( \sum_{i=1}^m \sum_{j=1}^{n_i} \theta_i (\omega_{ij} - x_{ij}\beta) - \left( \sum_{i=1}^m n_i \theta_i \right) \left( \sigma_\mu^{-2} + N \right)^{-1} \sum_{i=1}^m \sum_{j=1}^{n_i} (\omega_{ij} - x_{ij}\beta) \right).$$
Derivation of this full conditional is given in the Supplementary Materials.

(c) Update $\mu$ given everything from $\mu \sim N(\alpha_\mu, s_\mu)$, where
$$s_\mu = \left( \sigma_\mu^{-2} + N \right)^{-1}, \quad \alpha_\mu = s_\mu \sum_{i,j} (\omega_{ij} - \tau\theta_i - x_{ij}\beta).$$

(d) Update $\theta$ given everything from $\theta_i \sim N(\alpha_i, s_i)$, where
$$s_i = (\tau^2 n_i + 1)^{-1}, \quad \alpha_i = s_i \sum_{j=1}^{n_i} \tau (\omega_{ij} - \mu - x_{ij}\beta).$$

(e) Update $\beta$ given everything from $\beta \sim N(\alpha_\beta, S_\beta)$, where
$$S_\beta = (X'X + (1/9) I)^{-1}, \quad \alpha_\beta = S_\beta X' (\omega - \mu 1_N - \tau W\theta),$$
where $1_N$ is the $N$-vector of ones and $W$ is the $N \times m$ binary matrix mapping each observation to its individual.

We run the algorithm for 20,000 iterations, discarding the first 5,000 iterations as burn-in. Computation details for the discrete mixture distribution are given in Section B.1 of the Supplementary Materials.
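A minimal NumPy implementation of steps (a)-(e) is sketched below. The prior variances `s2_mu` and `s2_tau` are placeholder assumptions (the paper's exact weakly informative choices are not specified here), and the sketch is meant to illustrate the blocked data-augmentation scheme rather than reproduce the authors' production code.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit_re(y, X, person, n_iter=20000, burn=5000,
                    s2_mu=1.0, s2_tau=1.0, s2_beta=9.0, seed=0):
    """Blocked data-augmentation Gibbs sampler for the uncentered probit
    random effects model, following steps (a)-(e) above.

    y:      (N,) binary outcomes
    X:      (N, p) standardized covariates
    person: (N,) integers in 0..m-1 mapping each row to an individual
    """
    rng = np.random.default_rng(seed)
    N, p = X.shape
    m = person.max() + 1
    n_i = np.bincount(person, minlength=m)            # occasions per person

    beta, theta = np.zeros(p), np.zeros(m)
    mu, tau = 0.0, 1.0
    S_beta = np.linalg.inv(X.T @ X + np.eye(p) / s2_beta)  # fixed posterior cov
    prec_mu = 1.0 / s2_mu + N
    draws = []

    for it in range(n_iter):
        # (a) latent utilities omega_ij: N(eta, 1) truncated by the sign of y
        eta = mu + tau * theta[person] + X @ beta
        lo = np.where(y == 1, -eta, -np.inf)
        hi = np.where(y == 1, np.inf, -eta)
        omega = eta + truncnorm.rvs(lo, hi, size=N, random_state=rng)

        r = omega - X @ beta                          # residuals net of X beta

        # (b) tau | rest, marginal of mu, truncated to the positive half-line
        sum_nth = n_i @ theta
        s_tau = 1.0 / (n_i @ theta**2 - sum_nth**2 / prec_mu + 1.0 / s2_tau)
        a_tau = s_tau * (theta @ np.bincount(person, weights=r, minlength=m)
                         - sum_nth * r.sum() / prec_mu)
        sd = np.sqrt(s_tau)
        tau = truncnorm.rvs(-a_tau / sd, np.inf, loc=a_tau, scale=sd,
                            random_state=rng)

        # (c) mu | rest
        s_mu = 1.0 / prec_mu
        mu = rng.normal(s_mu * (r - tau * theta[person]).sum(), np.sqrt(s_mu))

        # (d) theta_i | rest, conditionally independent across individuals
        s_i = 1.0 / (tau**2 * n_i + 1.0)
        a_i = s_i * tau * np.bincount(person, weights=r - mu, minlength=m)
        theta = rng.normal(a_i, np.sqrt(s_i))

        # (e) beta | rest
        a_beta = S_beta @ (X.T @ (omega - mu - tau * theta[person]))
        beta = rng.multivariate_normal(a_beta, S_beta)

        if it >= burn:
            draws.append((mu, tau, beta.copy()))
    return draws
```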
6. Data
We obtained data from Kentucky pre-trial services covering all people who were arrested and booked in the state from July 1, 2009 through the end of 2018. This consists of nearly 1.5 million arrests. Associated with each arrest is an "interview", at which time information about the person who was arrested was collected, including their age, criminal history, and history of failing to appear for court appointments. One interview can be associated with multiple cases. Here, the unit of analysis is an arrest (or, equivalently, interview). In June of 2014, the information recorded at each interview changed when a new risk assessment model was introduced. In order to build a coherent regression model using a consistent set of covariates, we restrict the dataset to only include interviews that took place after June 30, 2014. This leaves approximately 611,000 arrests for our analysis. We restrict the data to only those interviews for which the individual was at any time released pre-trial, since there is no opportunity for FTA if the individual is never released. This encompasses around 463,000 arrests and associated pre-trial outcomes. We drop 72 observations due to missingness. We reserve all arrests that took place in 2018, the last year of our data, as a holdout set on which to test our model. The holdout set makes up 13% of our data.

Figure 4 shows the distribution of the number of releases per person in the training dataset. The plurality of individuals (around 142,000) had only one arrest with release during the time period covered by our training data.
Fig. 4.
The distribution of the number of times each individual appears in the training dataset.
Associated with each arrest are several measures recorded at the time of the pre-trial interview. The factors we use in our model are: whether the current charge is a domestic violence charge, whether the current charge is for a violent offense, whether the current charge is for a violent offense and the arrested person is under 21 years of age, the arrested person's age at the time of the interview, whether there is currently another case pending, whether the individual has had any prior misdemeanor arrests, whether they have had any prior felony arrests, the number of failures to appear for court appointments within the last two years, whether they had any failures to appear prior to two years ago, the number of prior arrests with violent charges, whether the individual has a prior incarceration, and whether at the time of the current arrest there was an outstanding warrant for an FTA. The measure of past failures to appear encompasses failures to appear for non-arrest-related court appointments, such as traffic violations, as well as failures to appear that occurred in other jurisdictions. This justifies its inclusion as a covariate, since the outcome variable in our model only records arrest-related FTAs.

These factors make up the columns of $X$ and are a superset of the factors used in the FTA risk model currently used to evaluate people who have been arrested and inform pre-trial decisions in Kentucky. Associated with each record is also an FTA risk score that was generated at the time of the interview. These risk scores take values one through six, where larger values indicate a higher estimated likelihood of failure to appear. Each numeric score is considered a different risk group. As of November 2017, Kentucky uses a point scale that runs from zero to seven instead of the six groups considered here. Because the six-point scale is commonly used and was used for the majority of the time period covered by our data, we use the six-point scale throughout this analysis. We do not use these risk groups as covariates in our model, but we do use them for later evaluation.

The outcome variable used in our analysis is a binary indicator of whether the individual failed to appear for any court appointments for any case associated with their interview/arrest prior to their case disposition.‡ This outcome variable has some conceptual limitations. Though there may have been several required appointments, a single missed appointment results in the failure to appear indicator. This measure of pre-trial failure does not disambiguate between individuals who nearly always appear as required and those who have absconded or habitually miss appointments. However, this definition of the FTA outcome variable is frequently used for failure to appear prediction and model evaluation (see, e.g., DeMichele et al. (2018), for a recent example), and the data required to calculate this alternative measure of FTA (all of the court dates to which the individual did appear as scheduled) are not available. The rate of FTA in these data is roughly 20%.
7. Results
Here we present results for each of the "levels" of analysis in turn. Results under the normal and discrete random effects distributions were qualitatively similar, with the discrete distribution exhibiting a slightly longer right tail for the distribution of individual probabilities. Because the results were qualitatively similar, we focus here only on the results for the normally distributed random effects. The lower-dimensional parameterization of this model allows for ease of exposition, as the $\tau$ parameter offers a succinct and convenient way to discuss individual-level variability. Results for the discrete mixture distribution are given in Appendix A.

‡ It may seem redundant to include this measure of FTA as the outcome and also include past FTAs as a covariate in this longitudinal setting. There are significant differences between the two measures: the outcome measure of FTA only pertains to FTAs for cases in this jurisdiction associated with arrests. The measure of FTA used as a covariate (which is calculated at the time of the interview) may encompass FTAs from other jurisdictions, including FTAs for violations. However, we fit the model excluding past FTAs as a covariate and had qualitatively similar findings. The results without FTA as a covariate paint a much less optimistic picture of the potential for risk assessment to create meaningful individual-level differences, as the signal in the covariates is substantially decreased.
Table 1. Posterior summary statistics of regression coefficients.

Variable         | Mean  | 95% CI         | Description
DV Charge        | -0.07 | (-0.08, -0.04) | Current domestic violence charge
Current Violent  | -0.27 | (-0.28, -0.25) | Current violent charge
Violent x Age    | -0.08 | (-0.09, -0.03) | Current violent charge and person under 21 years old
Age 21 or 22     |  0.07 | (0.07, 0.10)   | Aged 21-22 at time of interview (baseline under 21)
Age 23+          |  0.11 | (0.11, 0.14)   | Aged 23+ at time of interview (baseline under 21)
Pending Cases    |  0.19 | (0.18, 0.20)   | Any pending cases
Misdemeanors     |  0.13 | (0.12, 0.14)   | Any prior misdemeanor convictions
Felonies         |  0.07 | (0.07, 0.09)   | Any prior felony convictions
FTA 1-2          |  0.18 | (0.17, 0.19)   | 1 or 2 FTAs in the last two years (baseline 0)
FTA 3+           |  0.28 | (0.28, 0.30)   | 3+ FTAs in last two years (baseline 0)
Old FTAs         |  0.14 | (0.13, 0.15)   | Any FTAs older than two years
Violent 1-2      |  0.00 | (-0.01, 0.01)  | 1 or 2 prior violent convictions (baseline 0)
Violent 3+       | -0.02 | (-0.03, 0.01)  | 3+ prior violent convictions (baseline 0)
Incarcerations   |  0.12 | (0.12, 0.13)   | Any prior incarcerations
Current FTA Case |  0.42 | (0.41, 0.43)   | Current FTA warrant associated with arrest
We begin by summarizing posterior inference for the global model parameters, $\beta$ and $\tau$. Table 1 gives the posterior mean and 95% posterior credible intervals for each of the regression coefficients used in our model. The posterior intervals for most coefficients do not include zero, though most coefficients have small effect sizes. Due to the sample size available for fitting this model, the posterior credible intervals are all quite tight around the posterior mean.

The left panel of Figure 5 shows a histogram of posterior samples of $\tau$ from the normal random effects model. This posterior distribution is fairly tight around its mean. By the law of total variance, posterior uncertainty about $P$ can be decomposed into $\text{var}(P) = E(\text{var}(P \mid \beta, \tau)) + \text{var}(E(P \mid \beta, \tau))$, where the first term is the expected amount of 'natural' variability in $P$ across the population after conditioning on model parameters, and the second term reflects uncertainty about the model parameters. We belabor this point because our posterior uncertainty about $P \mid X$ can be large for two reasons. We can have large uncertainty about $P$ because individuals' probabilities vary considerably about their expectation given covariates, i.e., $\tau$ is large. This scenario is morally equivalent to a model with large residual variance. One could also have large uncertainty for $P$ because the available data are insufficient to estimate the model parameters precisely. The latter case is in some ways less interesting, as learning that we simply have not collected enough data to estimate model parameters with any precision presents a less compelling case than learning that individuals are much more variable than we've previously recognized. The latter case could be remedied by collecting more of the same data, while remedying the former would require discovering heretofore unknown covariates that explain a significant portion of the unexplained variability. Or, it may be the case that $\tau$ is truly just large in the sense that after conditioning on all information that is legal or feasible to collect, people naturally vary widely in their propensity and ability to appear.

To decompose the posterior uncertainty of $P$ into the effects of between-individual variability and posterior uncertainty for global parameters, consider the hypothetical "median individual" whose covariate values each marginally take the median value, $X_{med}$. We assess posterior predictive uncertainty for this hypothetical individual on the first time they are observed using two methods. In the first (called "full"), we sample from the full posterior predictive distribution. That is, we sample from $p_{full}(P_{med} \mid \text{data}) = \int p(P_{med} \mid X_{med}, \beta, \tau)\, p(\beta, \tau \mid \text{data})\, d\beta\, d\tau$. This distribution incorporates our posterior uncertainty about model parameters. In the second (called "fixed"), we ignore posterior uncertainty in the model parameters and sample from $p_{fixed}(P_{med} \mid \text{data}) = p(P_{med} \mid X_{med}, \hat{\beta}, \hat{\tau})$, fixing model parameters at their posterior mean. A comparison of these distributions is shown in the right panel of Figure 5, where we see that there is virtually no difference in the posterior predictive distribution of $P_{med}$ between the two cases. Thus, the uncertainty in estimating model parameters has very little impact on our uncertainty in $P$.

Fig. 5. (left) Posterior samples of τ in the normal random effects model; (right) density plots of samples of P_med under the posterior predictive distribution ("full") and the distribution assuming fixed global parameters ("fixed").
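Given MCMC draws of $(\mu, \tau, \beta)$, the "full" and "fixed" predictive distributions for the hypothetical median individual can be approximated by Monte Carlo, as in the sketch below. The inputs `draws` and `x_med` are assumptions (e.g., output of the Gibbs sketch in Section 5 and the median covariate vector).

```python
import numpy as np
from scipy.stats import norm

def p_med_draws(draws, x_med, rng, fixed=False):
    """Monte Carlo draws of P_med for a first-time-observed individual.

    draws: list of (mu, tau, beta) posterior samples; x_med: covariate
    vector of the hypothetical median individual.
    "full" integrates over posterior draws of (mu, tau, beta); "fixed"
    plugs in posterior means, so remaining spread reflects only
    between-individual variability (theta ~ N(0, 1)).
    """
    mus = np.array([d[0] for d in draws])
    taus = np.array([d[1] for d in draws])
    betas = np.array([d[2] for d in draws])
    if fixed:
        mus = np.full_like(mus, mus.mean())
        taus = np.full_like(taus, taus.mean())
        betas = np.tile(betas.mean(axis=0), (len(draws), 1))
    theta = rng.normal(size=len(draws))   # new individual's random effect
    return norm.cdf(mus + taus * theta + betas @ x_med)

rng = np.random.default_rng(2)
# full = p_med_draws(draws, x_med, rng)                # p_full(P_med | data)
# fixed = p_med_draws(draws, x_med, rng, fixed=True)   # p_fixed(P_med | data)
```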
In our group-level analysis, we assess the variability of $P_{ij}$ within risk groups in our holdout set. In order to create risk groups, we consider the predictive quantity $P^*_{ij}$, which we define in (5). To match as closely as possible how our model would be used in criminal justice risk assessment, we condition the global parameters $\beta, \tau, \mu$ only on the training data, but the individual-level parameters $\theta_i$ are updated every time new data arrive for individual $i$. Specifically, for any $\zeta \in [0, 1]$, let

$$A_{ij}(\zeta) = \{(\theta_i, \tau, \mu, \beta) : \Phi(\mu + \tau\theta_i + X_{ij}\beta) \le \zeta\},$$

and let $y_{train}, X_{train}$ be the training data (all data through December 31, 2017). The CDFs $H^*_{ij}(\zeta) \equiv P[P^*_{ij} \le \zeta]$ and $H_{ij}(\zeta) \equiv P[P_{ij} \le \zeta]$ are given by

$$H^*_{ij}(\zeta) = \int_{A_{ij}(\zeta)} p(\theta_i \mid y_{i,1:(j-1)}, X_{i,1:j}, \tau, \mu, \beta)\, p(\tau, \mu, \beta \mid X_{train}, y_{train})\, d\theta_i\, d\tau\, d\mu\, d\beta \tag{5}$$
$$H_{ij}(\zeta) = \int_{A_{ij}(\zeta)} p(\theta_i \mid y_{i,1:j}, X_{i,1:j}, \tau, \mu, \beta)\, p(\tau, \mu, \beta \mid X_{train}, y_{train})\, d\theta_i\, d\tau\, d\mu\, d\beta.$$

In other words, when computing the distribution of either $P^*_{ij}$ or $P_{ij}$, we condition the global parameters on the training data, but update the conditional posterior of $\theta_i$ as more information becomes available in the test data. The quantity $P^*_{ij}$ does not condition on having yet observed the outcome for individual $i$ on occasion $j$. Thus, $P^*_{ij}$ is the predictive quantity that would be used in risk assessment for individual $i$ on occasion $j$. On the other hand, $P_{ij}$ does condition on $y_{ij}$, and is essentially the posterior distribution of the probability of FTA for individual $i$ after observing occasions 1 through $j$. It differs from the exact posterior in that the global parameters are conditioned only on the data up through the end of 2017. Because the conditional posterior $p(\tau, \mu, \beta \mid X_{train}, y_{train})$ is already quite concentrated (see, e.g., the right panel of Figure 5), these quantities are a close approximation to what would result if we updated the full posterior distribution of all model parameters every time a new observation is obtained. We are unaware of any criminal justice risk assessment instruments that are trained in an online fashion, and thus these "partial information" posterior predictive and posterior quantities reflect our best approximation of how a model like ours would be used in practice.
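One way to approximate the partial-information quantities in (5) is importance sampling: for each posterior draw of the global parameters, propose $\theta_i$ from its $N(0, 1)$ prior and weight by the likelihood of individual $i$'s previously observed outcomes. The sketch below follows that logic; it is an illustrative reading of (5), not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def p_star_draws(draws, X_prev, y_prev, x_next, rng, n_theta=200):
    """Approximate draws from P*_{ij}: the predictive probability of FTA
    on individual i's next occasion, conditioning theta_i on occasions
    1..j-1 but global parameters only on the training data.

    draws: posterior samples of (mu, tau, beta) fit to training data;
    X_prev, y_prev: individual i's covariates/outcomes on past occasions
    (may be empty arrays); x_next: covariates on the upcoming occasion.
    """
    out = []
    for mu, tau, beta in draws:
        theta = rng.normal(size=n_theta)                   # N(0, 1) proposals
        p_prev = norm.cdf(mu + tau * theta[:, None] + X_prev @ beta)
        # importance weights: likelihood of the observed past outcomes
        w = np.prod(np.where(y_prev == 1, p_prev, 1 - p_prev), axis=1)
        w = w / w.sum() if w.sum() > 0 else np.full(n_theta, 1 / n_theta)
        theta_i = rng.choice(theta, p=w)                   # resample theta_i
        out.append(norm.cdf(mu + tau * theta_i + x_next @ beta))
    return np.array(out)
```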
Table 2. The average posterior mean of $\hat{P}^*_{ij}$, $\hat{P}_{ij}$, and the "true", empirical mean of $Y_{ij}$ within risk groups under three different grouping strategies.

Group | PSA Risk Groups             | PSA-Sized Groups            | Clustered Groups
      | P*    P     rate            | P*    P     rate            | P*    P     rate
1     | 0.10  0.10  0.08 (0.08,0.09) | 0.08  0.08  0.07 (0.06,0.08) | 0.09  0.09  0.07 (0.07,0.08)
2     | 0.13  0.13  0.12 (0.12,0.13) | 0.12  0.12  0.11 (0.11,0.12) | 0.12  0.12  0.11 (0.10,0.11)
3     | 0.18  0.18  0.18 (0.17,0.18) | 0.17  0.17  0.17 (0.16,0.18) | 0.15  0.15  0.14 (0.14,0.15)
4     | 0.26  0.26  0.26 (0.25,0.27) | 0.24  0.24  0.25 (0.24,0.26) | 0.19  0.19  0.20 (0.19,0.21)
5     | 0.36  0.36  0.36 (0.35,0.36) | 0.37  0.37  0.35 (0.35,0.36) | 0.26  0.26  0.27 (0.26,0.28)
6     | 0.41  0.40  0.36 (0.35,0.38) | 0.54  0.54  0.50 (0.48,0.52) | 0.42  0.42  0.40 (0.39,0.41)
There are several ways in which risk groups could be defined. We consider three. The first is to use the PSA's FTA risk groups that were calculated at the time of the interview. These are the groups that were used to make pre-trial recommendations. These risk groups do not perfectly correspond to any set of thresholds, $c$, applied to our model's output, as these risk scores were produced by binning different probabilities than those obtained by our model. However, for evaluation, we will need to define approximate thresholds, $c_{PSA}$. The values of $c_{PSA}$ are set to be the mid-point between the empirical rates observed in each of the PSA's risk groups.

We could also create our own groups using our model. Let $\hat{P}^*_{ij} = E[P^*_{ij}]$ be the mean of $P^*_{ij}$. We bin the $\hat{P}^*_{ij}$s to create our own risk groups. We determine the bin thresholds using only the training data. For consistency with the PSA's typical scoring, we create six bins. We consider one binning scheme in which we create bins so that the number of people in the $k$th bin is the same as the number of people in the $k$th PSA risk group. We call this "PSA-sized groups." We also create groups by applying k-means clustering with $K = 6$ clusters, which allows the data, rather than the precedent set by the PSA, to determine the size of each of the bins (a sketch of this procedure follows below). We refer to the thresholds used to create each of these groupings as $c_{PSA\text{-}sized}$ and $c_{clustered}$. In summary, we consider three groupings. The first, $g_{PSA}(i, j)$, returns the value of the $i$th individual's FTA risk score as given in the data on the $j$th occasion on which they were observed. The second, $g_{PSA\text{-}sized}(i, j) = \tilde{g}(\hat{P}^*_{ij}; c_{PSA\text{-}sized})$, returns the $i$th individual's risk group applying thresholds $c_{PSA\text{-}sized}$ to each individual's $\hat{P}^*_{ij}$ under our model. Finally, $g_{clustered}(i, j) = \tilde{g}(\hat{P}^*_{ij}; c_{clustered})$ returns the $i$th person's risk group from applying thresholds $c_{clustered}$.

Table 2 shows the average $\hat{P}^*_{ij}$ and $\hat{P}_{ij}$ for each risk group in our holdout set. For comparison, we also show the empirical rate of the outcome within each group. We construct frequentist 95% confidence intervals for the proportion in each risk group using the normal approximation to the binomial distribution (shown in parentheses by the group-wise rates). Applying this standard group-wise analysis to the three sets of risk groups shows that the risk groups exhibit different average rates of FTA with only one exception. Notably, for the PSA Risk Groups, there is virtually no difference in the empirical rate of FTA between the two highest risk groups, five and six. This is also reflected by the fact that the confidence intervals for these groups overlap.

We also see that the model is generally well-calibrated: estimates of $P^*$ and $P$ are very similar to the true rate of the outcome within each group. We find the calibration to be quite good notwithstanding that in late 2017 (the end of the period from which we draw our training data), Kentucky created a specialized unit to conduct the risk assessment, which resulted in more thorough assessments of past pre-trial failure than had been conducted previously. Despite the difference this caused in the way that the training and test data were recorded, our model remained satisfactorily calibrated.
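The clustered grouping can be reproduced in outline by running one-dimensional k-means on the training-set point estimates and converting adjacent cluster centers into thresholds. The midpoint rule in the sketch below is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_thresholds(p_hat_train, K=6, seed=0):
    """Derive binning thresholds c_clustered from 1-d k-means on the
    training-set point estimates (a sketch of the clustered grouping)."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed)
    km.fit(p_hat_train.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_.ravel())
    # boundary between adjacent clusters: midpoint of their centers
    return (centers[:-1] + centers[1:]) / 2

# groups on the holdout set, reusing the earlier binning function:
# c_clustered = kmeans_thresholds(p_hat_train)
# groups = np.digitize(p_hat_test, c_clustered) + 1
```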
Fig. 6. Posterior density of $P_{ij}$ and $\hat{P}^*_{ij}$ for each risk group (subscripts omitted in legend).

Figure 6 shows density estimates in black of the posterior distribution of $P_{ij}$ for each group, $g$, and grouping scheme, $s$: $p(P_{ij} \mid \text{data}, g_s(i, j) = g)$. The gray line shows a density estimate of $p(\hat{P}^*_{ij} \mid \text{data}, g_s(i, j) = g)$. In the former case, the density is the average over each observation in $g$'s posterior distribution of $P_{ij}$. In the latter case, each observation in group $g$ contributes one value, the point estimate of $P^*_{ij}$, to the density. For consistency with previous examples and displays, the empirical rate of FTA is shown by each of the colored bars. We find that the unexplained variation in the population within groups is substantial. Reality more closely resembles the "large" variance scenario than the "small" variance scenario given in Figure 3. The group-wise posterior distribution of $P_{ij}$ is broad, placing non-negligible mass in regions far from the group-wise mean in all cases. Returning to the discussion of the model-level analysis, the large posterior intervals within risk groups are attributable to having estimated with high certainty that there exists large unexplained between-individual variability, even after conditioning on available covariates.

This motivates the question of how often an individual is assigned to the "right" or "wrong" risk group. To address this, we look at the distribution of $\tilde{g}(P_{ij}; c_s) \mid g_s(i, j)$ for each $s$: the probability that a $P_{ij}$ falls into each bin, $g'$, conditional on having been assigned to group $g$ on the basis of their predictive point estimate. This is shown in Figure 7. Each row indicates the assigned risk group for each of the three grouping schemes. The columns indicate each of the groups to which an individual could be assigned. The color in the $k$th row and the $k'$th column represents the posterior probability that a $P_{ij}$ falls within bin $k'$ given that the predictive point estimate assigns them to group $k$. The first and last groups (columns) have the highest posterior probability because the range of values of $\hat{P}^*_{ij}$ corresponding to each of those bins is large relative to the middle groups. For example, group one maps to a much wider range of values of $\hat{P}^*_{ij}$ than any of the middle groups.§

From this we see that assignment to group one is the most meaningful. For example, individuals assigned to group one under the PSA's risk groups have probability 0.61 of their $P_{ij}$ being within the range attributed to group one and probability 0.09 of it being in the range ascribed to groups four, five, or six. Similarly, though less pronounced, people assigned to group six also had probability 0.56 of their $P_{ij}$ being in the range attributed to group six and probability 0.17 of $P_{ij}$ being in the range associated with groups one, two, or three. Using group assignments derived from a model fitted to these data resulted in slightly better separation of individual-level probabilities into groups, though across all cases, there was substantial posterior probability of falling into some other group than the one assigned.

§ The precise thresholds vary depending on grouping scheme; these serve only as an example.
Fig. 7. Posterior probability that $P_{ij}$ falls within the region of group $g$ for each actual group assignment "Assigned Group".

Specifically, the probability of $P_{ij}$ falling into the range attributed to the group to which it was assigned based on its predictive point estimate was 0.25, 0.32, and 0.39 for each of the binning strategies, respectively. This is because the proportion of arrests resulting in a group one or six assignment (those about which we are most certain) is fairly low. On most occasions, an individual receives a group two through five assignment. The probability of such assignments having been correct tends to be low.

Finally, we turn to explicitly evaluating individual probabilities and associated uncertainty by considering metrics based on individual posterior credible intervals for $P_{ij}$. The left panel of Figure 8 shows a random sample of 10,000 95% posterior credible intervals of $P_{ij}$. For each interval, we calculate the posterior median. For those intervals whose posterior median is in the bottom half of all intervals' medians, intervals are sorted by the upper end of the interval. For those whose posterior median falls in the upper half, intervals are sorted by the lower end of the interval. This allows us to approximate the extent to which individuals can be divided into groups that are statistically distinct from one another by determining whether their posterior credible intervals overlap. We choose a threshold of 0.25 and color intervals in blue if they do not cross that threshold. This creates two groups for which all of the individuals at the lower end of the distribution have intervals that do not overlap with those at the higher end. Under this strategy, one could flag 7% of the arrests as belonging to one of the groups on the two ends of the spectrum. All of those that fall into the middle could be labeled as undetermined, as we cannot statistically distinguish them from the two extreme groups. This suggests that if we were to try to create groups so that all individuals in one group were statistically distinguishable from all members of another group, the groups we create would be composed of few people. This is only one way to create groups while accounting for statistical uncertainty at the individual level. Alternative methods, such as considering pairwise probabilities that one individual's likelihood of the outcome is higher than another's, may result in larger separable groups.

The right panel of Figure 8 shows the average 95% posterior credible interval length within each of the PSA's FTA risk groups as a function of the number of times each individual was previously observed in the dataset. As expected, the intervals tend to get smaller as the number of observations grows. However, we find that the average interval length, even for individuals who have been observed several times, remains large, and is largest for the people in the highest risk group.

Fig. 8. (left) 10,000 randomly selected 95% posterior credible intervals for $P_{ij}$. Intervals shown in blue do not cross the 0.25 threshold. (right) Average length of 95% posterior credible interval by number of times observed and PSA risk group.

Fig. 9. (left) Average length of 95% credible interval for $P_{ij}$ over 1000 replicates for each value of $n$. (right) 95% posterior credible intervals for $P_{ij}$ for low/high signal and low/high noise scenarios. Intervals are colored by the value of the outcome variable.
This raises the question of how many times an individual would need to be released, and their outcome observed, before one could say with a reasonable level of certainty what their individual P_ij is. We consider an individual whose true probability falls right around the marginal rate in our data, and whose covariates perfectly predict that probability: for this individual, 0.20 = P_i = Φ(Xβ), i.e. µ_i = 0, though we do not assume µ_i to be known. We assume β and τ are fixed. We simulate 1000 replicates of Y_ij iid∼ Bernoulli(0.2), j = 1, ..., n, for a range of values of n, and for each value of n and each of the thousand replicates we calculate the posterior distribution of the individual's P_ij conditional on the n observations of Y_ij. The length of the 95% credible interval, averaged over replicates, is shown in the left panel of Figure 9 for several values of τ. At the approximate value of τ estimated in our data (shown in red), posterior intervals for P_ij remain fairly broad even for unrealistically large values of n, the number of times the individual is released. At around n = 100, for example, the average interval length drops below 0.10. However, if τ were smaller, i.e. if we learned that individuals do not deviate much from the value indicated by their covariates, reasonably small posterior intervals could be obtained with much more modest and realistic sample sizes. For example, after one observation when τ = 0.1, posterior credible intervals average around length 0.06. Thus, it is not an inherent property of our model's structure that individuals have wide intervals: were τ smaller, even with few observations, we would have high posterior certainty about the range of values P_ij could take.
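This experiment can be reproduced in miniature as follows. This is a sketch under stated assumptions: we use a grid approximation to the posterior over the random effect rather than the paper's MCMC, we parametrize P = Φ(Φ⁻¹(0.2) + τθ) with θ ~ N(0, 1), and the τ values shown are illustrative rather than the values estimated in the paper.

```python
import numpy as np
from scipy.stats import norm

def mean_ci_length(n, tau, p_true=0.2, reps=1000, level=0.95, seed=0):
    """Average credible-interval length for an individual's P given n
    Bernoulli(p_true) observations, with random effect theta ~ N(0, 1)
    and P = Phi(Phi^{-1}(p_true) + tau * theta). Grid-based posterior."""
    rng = np.random.default_rng(seed)
    base = norm.ppf(p_true)
    theta = np.linspace(-6.0, 6.0, 2001)       # grid over the random effect
    p_grid = norm.cdf(base + tau * theta)      # increasing in theta
    log_prior = norm.logpdf(theta)
    lengths = np.empty(reps)
    for r in range(reps):
        k = rng.binomial(n, p_true)            # sufficient statistic: # of FTAs
        log_post = log_prior + k * np.log(p_grid) + (n - k) * np.log1p(-p_grid)
        w = np.exp(log_post - log_post.max())
        cdf = np.cumsum(w / w.sum())           # already sorted by P
        lo = p_grid[np.searchsorted(cdf, (1 - level) / 2)]
        hi = p_grid[np.searchsorted(cdf, 1 - (1 - level) / 2)]
        lengths[r] = hi - lo
    return lengths.mean()

for tau in (0.1, 0.5):
    print(tau, [round(mean_ci_length(n, tau), 3) for n in (1, 10, 100)])
```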
This alone would not solve the problem of our inability to statistically distinguish individuals from one another. The right panel of Figure 9 shows posterior intervals for P_ij for "low signal" and "high signal" scenarios. In the low signal scenario, Φ(Xβ) takes three discrete values: 0.05, 0.15, and 0.25. In the high signal scenario, Φ(Xβ) takes the values 0.01, 0.25, and 0.50. We also consider a "high noise" and a "low noise" scenario for τ, with the low noise scenario setting τ = 0.10. The 95% posterior credible intervals are colored by the value of the outcome variable, Y. Only in the case where there is both high signal and small τ is it possible to create groups of individuals whose intervals do not overlap.

Rather than flagging every individual whose point estimate exceeds some cutoff c, one might instead flag all individuals for whom the posterior probability that P_ij exceeds c is at least h. For example, we might flag for additional review all people for whom we can say with probability h ≥ 0.95 under our model that their likelihood of FTA on the next occasion exceeds c = 0.25. Figure 10 shows the proportion of individuals in this data who would be flagged for additional review for four values of certainty (h) across a range of cutoff values c. If we select h = 0.99, i.e. we demand to know with at least 99% certainty that the individual's probability of failure is greater than c, then we find that for all but the smallest values of c (0.05 and 0.10), only a small proportion of individuals would be flagged. There is no occasion on which we would be able to say with 99% certainty that FTA is more likely than not. If we were to relax the certainty standard, and demand only that we know with 90% certainty that an individual will fail to appear with probability at least c on the next occasion, then for all thresholds greater than 0.4 we would flag almost no one. Even at the least demanding level of certainty shown, only a tiny proportion of individuals could be flagged as being more likely than not to fail to appear.

Fig. 10. Proportion of the population in our data who would be flagged under a decision rule in which an individual is flagged if we can say with certainty h that their individual probability of FTA is at least c.
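A sketch of the corresponding computation from posterior draws follows; the function name is ours, and the Beta draws are a synthetic stand-in for the model's output.

```python
import numpy as np

def flagged_fraction(draws, c, h):
    """Fraction of individuals flagged under the rule Pr(P_ij > c | data) >= h.
    draws: (S, n) posterior samples of P_ij, one column per individual."""
    exceed = (draws > c).mean(axis=0)   # per-individual Pr(P_ij > c | data)
    return float((exceed >= h).mean())

rng = np.random.default_rng(2)
draws = rng.beta(2.0, 8.0, size=(4000, 5000))
for h in (0.75, 0.90, 0.95, 0.99):
    print(h, [round(flagged_fraction(draws, c, h), 3)
              for c in (0.05, 0.15, 0.25, 0.35, 0.45)])
```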
8. Limitations
One of the main assumptions of random effects models is the distributional family of the random effects. However, when we replace the Gaussian random effects distribution with a discrete mixture (a very different form for the random effect distribution), virtually none of our conclusions change (see the Appendix). This suggests that the choice of family is not a major issue in our case. Chief among the remaining assumptions is surely the linear dependence on covariates. However, we note that the predictive performance of risk assessment models used in criminal justice is similar even when different predictive factors or more complex model structures are used (Desmarais et al., 2016; Johndrow et al., 2019). Moreover, generalized linear models are among the most common methods used in building RAIs, and thus our specification is as close to existing practice as possible while also including individual-level random effects. This parallel to current practice was intentional, so that the results we present can be attributed to accounting for individual-level variability rather than to larger-scale modeling differences. This leaves the assumption that individual-level random effects are constant across time as the main limitation of our analysis. While we believe it is worthwhile to try relaxing this assumption, given the few observations per person available, any additional flexibility in the model will only increase the level of uncertainty about individual probabilities. So, to the extent that one believes our results are unrealistic due to the rigidity of our modeling assumptions, any additional complexity introduced into this model would likely only serve to increase the inferred level of variability across individuals.
9. Conclusion
Ultimately, RAIs are used to assign labels to real human beings, and those labels affect the trajectory of their lives. This is a paper about statistical confidence in those labels. Though standard methods of evaluating statistical confidence have focused on model-level or group-level measures, here we focus on statistical confidence that the label applied to each individual is an accurate reflection of their individual-specific probability of the outcome.

We find that, when individual propensity toward the outcome that is not explained by covariates is explicitly modeled, there is significant uncertainty about the probability of the outcome for most individuals. As a result, there is also significant uncertainty about the risk group to which most individuals belong: the between-individual variability swamps what little signal there is in the covariates. Perhaps more concerning, we find that even if we had an unrealistically large number of observations per person (if each person were arrested and released, say, 50 times), the random effect distribution we have learned from the data would still result in substantial uncertainty about group membership and individual probabilities. As a consequence, it seems that the only way to create risk groups that result in high confidence about most individual group assignments is to put the vast majority of individuals in the same group. Since the real-world application of risk assessment is founded on different treatment for individuals assigned to different groups, allocating most individuals to one group would prevent actionable risk assessment.

This analysis has uncovered an important difference between group-level and individual-level conclusions. While past discussion of group versus individual probabilities in this domain has been mostly theoretical, due to an inability to meaningfully model individual probabilities using previously available data, our model moves the debate from theoretical to empirical. At the group level, we can say with high certainty that the average rates within groups are different (for all but the highest two risk groups using the PSA risk groups). These sorts of results have been available in the past and, in fact, have been a standard component of model validations. Our model makes it possible to assess our uncertainty about an individual's probability of the outcome. In doing so, we can now also say that for any individual within the groups, we do not in general have high confidence that their probability of the outcome is similar to that ascribed to them due to their inclusion in a certain risk group. Their probability of the outcome may be very different from that of the group to which they are assigned.

Both notions of uncertainty are relevant. When evaluating policies at the population level, one might be more concerned with the former. When faced with making decisions about an individual on the basis of the scores, the latter may be more relevant. Indeed, similar legal tensions between group- and individual-level inferences have arisen in other contexts, such as expert testimony (Faigman et al., 2014). How these notions of uncertainty ought to be weighed in determining how and whether RAIs are used is largely a matter of policy and law, not statistics.
However, we do venture one policy recommendation: the fact that we typically have low confidence about the risk group assignment of any individual should be conveyed to decision makers, both to those deciding whether an RAI is appropriate in their jurisdiction and to those making pre-trial decisions.
10. Acknowledgments
This work was supported by a grant from Arnold Ventures. We thank the Kentucky Court of Justice Pre-Trial Services for providing the data used in this analysis, as well as for context and useful conversations. Useful comments on this manuscript were provided by Logan Koepke, Mingwei Hsu, Ira Globus-Harris, Karen Levy, and Sarah Riley. Karen Levy suggested the title.
References
Advancing Pre-Trial Policy and Research (2020) Guide to the pretrial decision framework. Tech. rep.

Anderson, C., Redcross, C., Valentine, E. and Miratrix, L. (2019) Evaluation of pretrial justice system reforms that use the Public Safety Assessment: Effects of New Jersey's criminal justice reform. Tech. rep.

Angwin, J., Larson, J., Mattu, S. and Kirchner, L. (2016) Machine bias. ProPublica, May.

Chouldechova, A. (2017) Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5, 153–163.

Dawid, P. (2017) On individual risk. Synthese, 194, 3445–3474.

DeMichele, M., Baumgartner, P., Wenger, M., Barrick, K., Comfort, M. and Misra, S. (2018) The Public Safety Assessment: A re-validation and assessment of predictive utility and differential prediction by race and gender in Kentucky. Available at SSRN 3168452.

Desmarais, S. and Lowder, E. (2019) Pre-trial risk assessment tools: A primer for judges, prosecutors, and defense attorneys. MacArthur Foundation Safety and Justice Challenge.

Desmarais, S. L., Johnson, K. L. and Singh, J. P. (2016) Performance of recidivism risk assessment instruments in US correctional settings. Psychological Services, 13, 206.

Dieterich, W., Mendoza, C. and Brennan, T. (2016) COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc.

Faigman, D. L., Monahan, J. and Slobogin, C. (2014) Group to individual (G2i) inference in scientific expert testimony. The University of Chicago Law Review, 81, 417–480.

Farabee, D., Zhang, S., Roberts, R. E. and Yang, J. (2010) COMPAS validation study. Tech. rep.

Flores, A. W., Bechtel, K. and Lowenkamp, C. T. (2016) False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks." Fed. Probation, 80, 38.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2013) Bayesian Data Analysis. CRC Press.

Glover, A. J., Churcher, F. P., Gray, A. L., Mills, J. F. and Nicholson, D. E. (2017) A cross-validation of the Violence Risk Appraisal Guide–Revised (VRAG-R) within a correctional sample. Law and Human Behavior, 41, 507.

Goodman, R. (1992) Hennepin County Bureau of Community Corrections pretrial release study. Tech. rep.

Hanson, R. K., Babchishin, K. M., Helmus, L. M., Thornton, D. and Phenix, A. (2017) Communicating the results of criterion referenced prediction measures: Risk categories for the Static-99R and Static-2002R sexual offender risk assessment tools. Psychological Assessment, 29, 582.

Harcourt, B. E. (2008) Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age, chap. 3. University of Chicago Press.

Hart, S. D. and Cooke, D. J. (2013) Another look at the (im-)precision of individual risk estimates made using actuarial risk assessment instruments. Behavioral Sciences & the Law, 31, 81–102.

Hart, S. D., Michie, C. and Cooke, D. J. (2007) Precision of actuarial risk assessment instruments: Evaluating the 'margins of error' of group v. individual predictions of violence. The British Journal of Psychiatry, 190, s60–s65.

Holder, E. (2014) Remarks as prepared for delivery at the National Association of Criminal Defense Lawyers 57th annual meeting and 13th State Criminal Justice Network conference.

Imrey, P. B. and Dawid, A. P. (2015) A commentary on statistical assessment of violence recidivism risk. Statistics and Public Policy, 2, 1–18.

Johndrow, J. E., Lum, K. et al. (2019) An algorithm for removing sensitive information: Application to race-independent recidivism prediction. The Annals of Applied Statistics, 13, 189–220.

Jung, J., Concannon, C., Shroff, R., Goel, S. and Goldstein, D. G. (2020) Simple rules to guide expert classifications. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J. and Mullainathan, S. (2017) Human decisions and machine predictions. The Quarterly Journal of Economics, 133, 237–293.

Levin, D. (2012a) Santa Clara County, California pretrial risk assessment instrument. Pretrial Justice Institute.

Levin, D. J. (2012b) Development of a validated pretrial risk assessment tool for Lee County, Florida. Pretrial Justice Institute.

Lowenkamp, C. T. (2009) The development of an actuarial risk assessment instrument for US Pretrial Services. Federal Probation, 73, 33.

Lum, K. and Shah, T. (2019) Measures of fairness for New York City's supervised release tool.

Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2007) A general framework for the parametrization of hierarchical models. Statistical Science, 22, 59–73.

Paulsen, M. G. (1966) Pre-trial release in the United States. Columbia Law Review, 66, 109–125.

Picard, S., Watkins, M., Rempel, M. and Kerodal, A. (2019) Beyond the algorithm. Tech. rep.

Pretrial Justice Institute and JFA Institute (2012) The Colorado Pretrial Assessment Tool (CPAT): A joint partnership among ten Colorado counties, the Pretrial Justice Institute, and the JFA Institute. Tech. rep.

Robinson, D. G. and Koepke, L. (2019) Civil rights and pretrial risk assessment instruments.

Rousseau, J. and Mengersen, K. (2011) Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 689–710.

Scurich, N. and John, R. S. (2012) A Bayesian approach to the group versus individual prediction controversy in actuarial risk assessment. Law and Human Behavior, 36, 237.

Solow-Niederman, A., Choi, Y. and Van den Broeck, G. (2019) The institutional life of algorithmic risk assessment. Berkeley Technology Law Journal, 34, 705.

VanNostrand, M. (2003) Assessing Risk Among Pretrial Defendants in Virginia: The Virginia Pretrial Risk Assessment Instrument. Virginia Department of Criminal Justice Services.

VanNostrand, M. and Keebler, G. (2009) Pretrial risk assessment in the federal court. Federal Probation, 73, 3.

Zeng, J., Ustun, B. and Rudin, C. (2017) Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180, 689–722.

A. Results for Discrete Mixture Random Effect Distribution
We begin by showing a comparison of the distribution of P_ij under the normal and discrete random effect distributions. We calculate several summary statistics of the distribution of P_ij for each individual, including the posterior median, the 2.5 and 97.5 percentiles, and the length of the 95% posterior credible interval. Figure 11 shows QQ-plots of these summary statistics, illustrating the correspondence between the distributions of P_ij under the two random effect distributions. The discrete distribution tends to give a heavier right tail. Thus, results under this distribution paint a somewhat less optimistic view of the ability of the risk assessment model to disambiguate individuals.

Table 3 shows summaries of the estimated rate of FTA by group under three different binning methods applied to the output of the discrete random effects model. This also shows good calibration. Figure 12 shows the estimated densities of P_ij and P̂_ij by risk group. Similar to the analysis with the normal random effects distribution, we find substantial individual-level heterogeneity about the group-wise mean. Finally, Figure 13 shows the posterior probability that one's P_ij belongs in each bin, conditional on grouping by point estimates. This also shows the highest degree of "correct" grouping for those in the lowest risk group.

Fig. 11. QQ-plots of the median and 0.025 and 0.975 quantiles of the distribution of P_ij under the normal and discrete random effect distributions. We also show a QQ-plot of the length of the 95% posterior credible interval.

Table 3. The average posterior mean under the discrete random effects distribution of P̂*_ij, P_ij, and Y_ij (the "true" empirical rate) within risk groups under three different grouping strategies.

Group | PSA Risk Groups               | PSA-Sized Groups              | Clustered Groups
      | P*    P     rate              | P*    P     rate              | P*    P     rate
1     | 0.10  0.10  0.08 (0.08, 0.09) | 0.08  0.08  0.07 (0.06, 0.08) | 0.08  0.08  0.07 (0.07, 0.08)
2     | 0.13  0.13  0.12 (0.12, 0.13) | 0.12  0.12  0.11 (0.11, 0.12) | 0.12  0.12  0.11 (0.10, 0.11)
3     | 0.18  0.18  0.18 (0.17, 0.18) | 0.17  0.17  0.17 (0.17, 0.18) | 0.15  0.15  0.15 (0.14, 0.15)
4     | 0.26  0.26  0.26 (0.25, 0.27) | 0.24  0.24  0.25 (0.24, 0.26) | 0.19  0.19  0.20 (0.20, 0.21)
5     | 0.36  0.36  0.36 (0.35, 0.36) | 0.37  0.37  0.36 (0.35, 0.36) | 0.26  0.26  0.27 (0.26, 0.28)
6     | 0.41  0.40  0.36 (0.35, 0.38) | 0.54  0.54  0.50 (0.48, 0.52) | 0.42  0.42  0.40 (0.39, 0.41)
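The QQ comparisons in Figure 11 can be computed directly from posterior draws of P_ij under the two models. A minimal sketch follows; the function and variable names are ours, and the commented-out inputs stand for the two samplers' outputs.

```python
import numpy as np

def qq_pairs(draws_normal, draws_discrete, stat):
    """Sorted per-individual summaries under the two random-effect models,
    ready to plot against each other. draws_*: (S, n) posterior samples."""
    return np.sort(stat(draws_normal)), np.sort(stat(draws_discrete))

posterior_median = lambda d: np.median(d, axis=0)
interval_length = lambda d: (np.quantile(d, 0.975, axis=0)
                             - np.quantile(d, 0.025, axis=0))

# x, y = qq_pairs(normal_draws, discrete_draws, interval_length)
# points above the 45-degree line indicate the heavier right tail noted above
```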
Fig. 12. Estimated densities of P_ij and P̂_ij under the discrete random effects distribution.

Fig. 13. Posterior probability that P_ij falls within the region of group g (columns) for each actual group assignment "Assigned Group" (rows). The color of the (k, j)th cell corresponds to the posterior probability that g̃_s(P_ij; c_s) = j given that g_s(i) = k under the discrete random effect model.

Fig. 14. (left) Sample of 95% posterior credible intervals under the discrete random effect model. (right) Average interval length as a function of number of observations per person and risk group at last release.

Fig. 15. Proportion of population flagged for additional review under a decision rule accounting for uncertainty.
B. Supplementary Materials
B.1. Computation for the Discrete Random Effect Distribution
Now we consider the case where f_η is a discrete distribution with K classes, and we choose Gaussian priors on the atoms of the distribution. We take the approach implied by Rousseau and Mengersen (2011), who showed that in overfitted mixture models, i.e. where the number of classes K is greater than the true number of classes, the posterior will concentrate around the true number of classes when a Dirichlet(1/K, ..., 1/K) prior is chosen on the mixture weights. The practical implication is that one can fit a finite mixture model with K classes, and if many of the classes are unoccupied a posteriori, then the selected value of K can safely be considered large enough. This approach is a convenient alternative to specifying a prior on K that nonetheless enjoys similar theoretical properties and leads to simpler computation. We choose K = 30 and estimate the following model:
\[
y_{ij} \mid z_i = k \sim \mathrm{Bernoulli}\big(\Phi(\theta_k + x_{ij}\beta)\big), \qquad \theta_k \stackrel{ind}{\sim} N(0, 1), \qquad \beta \sim N(0, I),
\]
\[
z_i \sim \mathrm{Categorical}(\nu), \qquad \nu \sim \mathrm{Dirichlet}(1/K, \ldots, 1/K).
\]
Let Q be the number of occupied components in the mixture model. We found that with K = 30, the posterior probability that Q comes anywhere near K is negligible, so K = 30 is sufficiently large. For computation, we again use a blocked data augmentation Gibbs sampler. However, the updates differ somewhat because of the different specification for f_η. Our Gibbs sampler has the following update rules.

(a) Update z_i, marginal of ω, from Categorical(ζ_i) with ζ_i = (ζ_i^(1), ζ_i^(2), ..., ζ_i^(K)), where
\[
\log \zeta_i^{(k)} \propto \log(\nu_k) + \sum_{j=1}^{n_i} \Big[ y_{ij} \log \Phi(\theta_k + x_{ij}\beta) + (1 - y_{ij}) \log \Phi(-\theta_k - x_{ij}\beta) \Big].
\]

(b) Update ω given everything else, all conditionally independent, from ω_ij ~ N_{y_ij}(θ_{z_i} + x_ij β, 1), where N_y is a normal truncated to (0, ∞) if y is positive and truncated to (−∞, 0] if y is nonpositive.

(c) Update β given everything else from β ~ N(α_β, S_β), with
\[
S_\beta = (X'X + I)^{-1}, \qquad \alpha_\beta = S_\beta X'(\omega - \theta),
\]
where θ here denotes the vector that has θ_{z_i} for all entries corresponding to person i.

(d) Update θ given everything else, all conditionally independent, from θ_k ~ N(α_k, s_k), with
\[
s_k = \Big( \sum_{i : z_i = k} n_i + 1 \Big)^{-1}, \qquad \alpha_k = s_k \sum_{i : z_i = k} \sum_{j=1}^{n_i} (\omega_{ij} - x_{ij}\beta).
\]

(e) Update ν given everything else from
\[
\nu \sim \mathrm{Dirichlet}\Big( 1/K + \sum_{i : z_i = 1} n_i, \; \ldots, \; 1/K + \sum_{i : z_i = K} n_i \Big). \tag{6}
\]

We run the algorithm for 80,000 iterations, discarding the first 40,000 iterations as burn-in.
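To make updates (a)–(e) concrete, here is a minimal NumPy sketch of the blocked sampler. It is our illustration rather than the authors' code: the flat-array data layout, the inverse-CDF truncated-normal draws in step (b), and all names are our choices, and the iteration counts in the toy example are scaled down from the 80,000 used in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gibbs_discrete_re(y, x, person, K=30, iters=2000, burn=1000):
    """Blocked data-augmentation Gibbs sampler for the discrete random effect
    probit model. y: (N,) binary; x: (N, p); person: (N,) ints in [0, m)."""
    N, p = x.shape
    m = int(person.max()) + 1
    beta = np.zeros(p)
    theta = rng.normal(size=K)            # mixture atoms, prior N(0, 1)
    z = rng.integers(K, size=m)           # class label per person
    nu = np.full(K, 1.0 / K)              # mixture weights
    keep = []
    for it in range(iters):
        eta = x @ beta
        # (a) class labels z_i, marginal of omega
        logp = np.empty((m, K))
        for k in range(K):
            ll = (y * norm.logcdf(theta[k] + eta)
                  + (1 - y) * norm.logcdf(-theta[k] - eta))
            logp[:, k] = np.log(nu[k]) + np.bincount(person, weights=ll,
                                                     minlength=m)
        logp -= logp.max(axis=1, keepdims=True)
        probs = np.exp(logp)
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=pr) for pr in probs])
        # (b) latent utilities omega ~ N(mu, 1), truncated by the sign
        # convention of N_y, drawn by inverse-CDF sampling
        mu = theta[z[person]] + eta
        lo = norm.cdf(-mu)                       # P(omega <= 0)
        u = rng.uniform(size=N)
        v = np.where(y == 1, lo + u * (1.0 - lo), u * lo)
        omega = mu + norm.ppf(np.clip(v, 1e-12, 1 - 1e-12))
        # (c) beta, conjugate normal with prior N(0, I)
        S = np.linalg.inv(x.T @ x + np.eye(p))
        beta = rng.multivariate_normal(S @ (x.T @ (omega - theta[z[person]])), S)
        # (d) atoms theta_k, conjugate normal with prior N(0, 1)
        resid = omega - x @ beta
        for k in range(K):
            idx = z[person] == k
            s = 1.0 / (idx.sum() + 1.0)
            theta[k] = rng.normal(s * resid[idx].sum(), np.sqrt(s))
        # (e) weights nu; as in (6), classes weighted by observation counts
        nu = rng.dirichlet(1.0 / K + np.bincount(z[person], minlength=K))
        if it >= burn:
            keep.append((beta.copy(), theta.copy(), z.copy()))
    return keep

# toy usage on synthetic data: 200 people with 1-4 observations each
m_people = 200
n_i = rng.integers(1, 5, size=m_people)
person = np.repeat(np.arange(m_people), n_i)
x = rng.normal(size=(person.size, 3))
y = (rng.uniform(size=person.size) < 0.2).astype(float)
samples = gibbs_discrete_re(y, x, person, K=10, iters=200, burn=100)
```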
B.2. Derivation of full conditional for τ

To derive this, first write the data-augmented model as
\[
\omega = \mu 1_N + \tau W \theta + X\beta + \epsilon,
\]
where N = \sum_i n_i is the total number of observations of y_ij, 1_N is a column vector of N ones, ω is the vectorization of ω_ij, θ is the vectorization of θ_i, and W is the binary matrix that has a 1 in its kth column when the corresponding row of ω corresponds to individual i = k (i.e. it picks off the correct entry of θ). Now since µ ~ N(0, c²), we have 1_N µ = µ 1_N ~ N(0, c² 1_N 1_N'). So with µ marginalized out,
\[
\omega \sim N\big(\tau W\theta + X\beta, \; I + c^2 1_N 1_N'\big).
\]
By the Woodbury matrix identity,
\[
(I + c^2 1_N 1_N')^{-1} = I - 1_N (1_N' 1_N + c^{-2})^{-1} 1_N' = I - 1_N (N + c^{-2})^{-1} 1_N'.
\]
So then, after some simple algebra, the log of the conditional density for τ given θ, ω, β (but marginal of µ) is, up to a constant,
\[
\log p(\tau \mid \theta, \omega, \beta) \propto -\frac{1}{2}\Big( \tau^2\, \theta'\big(W'W - W'1_N (N + c^{-2})^{-1} 1_N' W\big)\theta - 2\tau\, \theta' W' \big(I - 1_N (N + c^{-2})^{-1} 1_N'\big)(\omega - X\beta) \Big) - \frac{\tau^2}{2}.
\]
Now, noticing that W'W = diag(n_i) and that W'1_N is the vectorization of the n_i, and dropping the overall factor of −1/2, substituting c² = 9, and writing \sum_{i,j} for \sum_{i=1}^m \sum_{j=1}^{n_i}, the bracketed expression becomes
\[
\tau^2 \Big( \sum_{i=1}^m \theta_i^2 n_i - \Big(\sum_{i=1}^m \theta_i n_i\Big)^2 (N + 1/9)^{-1} + 1 \Big) - 2\tau \Big( \sum_{i,j} \theta_i (\omega_{ij} - x_{ij}\beta) - \sum_{i=1}^m n_i \theta_i \sum_{i,j} (\omega_{ij} - x_{ij}\beta)\,(N + 1/9)^{-1} \Big).
\]
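Since this log-density is quadratic in τ, the full conditional for τ is Gaussian (truncated to (0, ∞) if τ is constrained to be positive). A short completion of the square, in our notation rather than the paper's, with a and b denoting the coefficients of τ² and 2τ in the display above:
\[
\log p(\tau \mid \theta, \omega, \beta) = -\tfrac{1}{2}\big(a\tau^2 - 2b\tau\big) + \mathrm{const}
\;\;\Longrightarrow\;\;
\tau \mid \theta, \omega, \beta \sim N\big(b/a, \; a^{-1}\big).
\]
By the Cauchy–Schwarz inequality, \((\sum_i \theta_i n_i)^2 \le N \sum_i \theta_i^2 n_i < (N + 1/9) \sum_i \theta_i^2 n_i\), so a ≥ 1 > 0 and the variance is well defined.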