A note on post-treatment selection in studying racial discrimination in policing

Abstract

We discuss some causal estimands used to study racial discrimination in policing. A central challenge is that not all police-civilian encounters are recorded in administrative datasets and available to researchers. One possible solution is to consider the average causal effect of race conditional on the civilian already being detained by the police. We find that such an estimand can be quite different from the more familiar ones in causal inference and needs to be interpreted with caution. We propose using an estimand new for this context -- the causal risk ratio, which has more transparent interpretation and requires weaker identification assumptions. We demonstrate this through a reanalysis of the NYPD Stop-and-Frisk dataset. Our reanalysis shows that the naive estimator that ignores the post-treatment selection in administrative records may severely underestimate the disparity in police violence between minorities and whites in these and similar data.


Qingyuan Zhao, Luke J Keele, Dylan S Small, and Marshall M Joffe

Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge
Department of Surgery, Perelman School of Medicine, University of Pennsylvania
Department of Statistics, Wharton School, University of Pennsylvania
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania

September 11, 2020


This research note is motivated by an interesting and timely article by Knox, Lowe, and Mummolo (2020, hereafter KLM) about learning racial disparities in policing from administrative data. One key point made by KLM is that such investigations have an intrinsic selection bias, because administrative records only contain those encounters in which civilians are detained. If there is racial discrimination in police detainment in the first place, any naive analysis using the administrative data may then suffer from potentially severe selection bias.

Here, we present a research note on this important topic with two purposes. First, KLM considered several local causal estimands that are being used in empirical studies. Some of the estimands are not routinely used in causal inference, so we would like to clarify their interpretations. In particular, we demonstrate that the local estimands can poorly indicate the more global causal effects. Second, we introduce a causal risk ratio estimand which is straightforward to interpret and allows us to avoid some hard-to-justify assumptions. The key idea is to use Bayes' formula to avoid the hard problem of estimating the probability of detainment in police-civilian encounters. We conclude this research note with a reanalysis of the New York City Police Department (NYPD) Stop-and-Frisk dataset and some further discussion.

We begin with a brief review of the key quantities in KLM. Following their work, the unit of analysis is an encounter between a civilian and the police. In administrative data, there are $n$ encounters indexed by $i = 1, \dots, n$. We denote the outcome with $Y_i$, where $Y_i = 1$ indicates the use of force by the police in encounter $i$. Next, $D_i$ is a binary variable where $D_i = 1$ records the race of the civilian as a minority.
While the race of the civilian is not manipulable, we adopt the approach in KLM where the counterfactual is the replacement of the civilian in an encounter with a separate, comparable civilian engaged in comparable behavior, but differing on race. See Knox, Lowe, and Mummolo (2020, p. 621) for a more complete discussion. Finally, we use $M_i$ to indicate the presence of a police detainment or stop. Critically, $M_i = 1$ for all $i$ in the administrative data. KLM also invoked the potential outcomes framework, such that we have the potential mediator $M_i(d)$, which represents whether encounter $i$ would have resulted in a stop if civilian race is $d$. Next, $Y_i(d, m)$ is the potential outcome for the use of force if race is $d$ and the mediating variable is set to $m$; similarly, $Y_i(d)$ is the potential outcome if race is $d$. Throughout this note we make the consistency or stable unit treatment value assumption (SUTVA), so $M_i(D_i) = M_i$ and $Y_i(D_i, M_i) = Y_i(D_i) = Y_i$. Finally, $X_i$ represents a collection of covariates that describe aspects of the encounter. These include measures for time of day, location, age, sex, and the like. Hereafter, we drop the $i$ subscript.

Figure 1: KLM’s directed acyclic graph (DAG) model for racial discrimination in policing with an unmeasured mediator-outcome confounder $U$. Administrative records only contain observations with $M = 1$.

KLM studied the following “naive” treatment effect estimand:

$$\Delta = E[Y \mid D = 1, M = 1] - E[Y \mid D = 0, M = 1], \tag{1}$$

where $E$ denotes expectation over a random police-civilian encounter. Intuitively, $\Delta$ compares the average rates of force between different racial groups who are detained by police. KLM showed that, if there is racial discrimination in detainment and an unmeasured confounder between detainment and use of force (see Figure 1), $\Delta$ can be quite misleading when used to represent the causal effect of race on police violence.

The key issue is that the structure of the data implies all estimates are conditional on $M$, a post-treatment variable. It is well known that conditioning on post-treatment variables often leads to biased estimators of the causal effect (Rosenbaum 1984). In the causal graphical model literature, bias of this type is an instance of the collider bias that arises when we condition on the common child of two variables in a causal diagram (Greenland, Pearl, and Robins 1999). Bias of this type occurs in many applied problems in social science (Elwert and Winship 2014; Montgomery, Nyhan, and Torres 2018) and medicine (Paternoster, Tilling, and Davey Smith 2017).

Despite this collider bias, using the principal stratification framework of Frangakis and Rubin (2002), KLM proposed two identification strategies. They showed that it is still possible to either identify or partially identify certain forms of average treatment effects using a set of tailored causal assumptions. These assumptions include mandatory reporting, mediator monotonicity, and treatment ignorability.

Here, we briefly comment on the mandatory reporting assumption. KLM write this assumption as:

Assumption 1 (Mandatory reporting). (i) $Y(0, 0) = Y(1, 0) = 0$ and (ii) the administrative dataset contains all police-civilian encounters.

It is important to note that this assumption is both a restriction on potential outcomes and a feature of the data collection. The first part of the assumption says that the potential outcome $Y(d, m)$ is equal to 0 whenever $m = 0$. This assumption is reasonable because, besides inadvertent collateral damage, there should be virtually no police violence if the civilian is not stopped by the police in the first place. The second part of the assumption is needed so that we can use the administrative dataset to get the conditional distribution of $(D, Y, X)$ given $M = 1$. For a given administrative data source, it is possible that some police stops are unrecorded. If true, any analysis relying on Assumption 1 needs to be interpreted with care.

Using these assumptions, KLM derived nonparametric bounds for

$$\mathrm{ATE}_{M=1} = E[Y(1) - Y(0) \mid M = 1].$$

They also derived a point identification strategy for

$$\mathrm{ATT}_{M=1} = E[Y(1) - Y(0) \mid D = 1, M = 1],$$

which depends on an external estimate of the proportion of racial detainments among reported minority encounters, i.e. $P(M(0) = 0 \mid D = 1, M = 1)$. See KLM (p. 631) for discussion on estimating this quantity. Their second solution is an identification result for the average treatment effect $\mathrm{ATE} = E[Y(1) - Y(0)]$ given external estimates of $P(M = 1 \mid D = d)$ for $d = 0, 1$.

Average treatment effects conditional on the mediator

In many causal analyses, investigators are focused on the sample average treatment effect (ATE), which is the average difference in potential outcomes averaged over the study population. At times, researchers define the ATE over specific subpopulations, which makes the ATE more local; e.g. the average treatment effect might be defined for the subpopulation exposed to the treatment, giving the average treatment effect on the treated (ATT). In general, the “global” ATE is the goal in most studies and is preferred over more local effects (Gerber and Green 2012, ch. 2). For example, instrumental variable (IV) studies have been strongly critiqued for identifying a local average treatment effect (LATE) instead of the global ATE (Deaton 2010; Swanson and Hernán 2014). Moreover, even defenders of IV studies view the LATE as a “second choice” estimand compared to the global ATE (Imbens 2014).

As KLM outline, the global ATE has not generally been the target causal estimand in this literature. Instead, researchers have focused on $\mathrm{ATE}_{M=1}$ and $\mathrm{ATT}_{M=1}$, which are both conditional on the mediator $M$. As such, these estimands are not only more local than the global ATE but also condition on a post-treatment quantity. However, they are not the first estimands in causal inference that condition on post-treatment quantities. Other examples include the survivor average treatment effect in Frangakis and Rubin (2002) (though conceptually the always-survivor principal stratum can be thought of as a pretreatment variable), effect modification by a post-treatment quantity (Stephens, Keele, and Joffe 2016; Ertefaie et al. 2018), and the probability of causation $P[Y(0) = 0 \mid D = 1, Y = 1]$ (Pearl 1999; Dawid, Musio, and Murtas 2017). An inexperienced researcher might think these estimands are informative about the global ATE or even an estimand such as the controlled direct effect $E[Y(1, 1) - Y(0, 1)]$.
Here, we build upon the population stratification framework in KLM and clarify the difference between the conditional estimands in KLM and estimands like the global ATE.

To simplify the illustration, we will consider the case where there is no mediator-outcome confounder (i.e. no variable $U$ in the diagram in Figure 1). The issues we describe below will still occur if there is mediator-outcome confounding. In mediation analysis, a standard way to decompose the average treatment effect is

$$\mathrm{ATE} = E[Y(1) - Y(0)] = E[Y(1, M(1)) - Y(1, M(0))] + E[Y(1, M(0)) - Y(0, M(0))].$$

The two terms on the right hand side are called the pure indirect effect (PIE) and pure direct effect (PDE) (Robins and Greenland 1992). Under the Non-Parametric Structural Equation Model with Independent Errors (NPSEM-IE) (Pearl 2009; Richardson and Robins 2013) and Assumption 1, they can be expressed as

$$\mathrm{PIE} = \beta_M \cdot E[Y(1, 1)], \qquad \mathrm{PDE} = \beta_Y \cdot E[M(0)],$$

where $\beta_M = E[M(1) - M(0)]$ is the average effect of race on detainment and $\beta_Y = E[Y(1, 1) - Y(0, 1)]$ is the controlled direct effect of race on police violence (see Appendix A). An immediate consequence of the above expressions is that

$$\mathrm{ATE} \ge 0 \ \text{if} \ \beta_M, \beta_Y \ge 0 \quad \text{and} \quad \mathrm{ATE} \le 0 \ \text{if} \ \beta_M, \beta_Y \le 0. \tag{2}$$

In words, the global ATE is nonnegative whenever both $\beta_M$ and $\beta_Y$ are nonnegative, and nonpositive whenever both are nonpositive. This same property also holds for the average treatment effect on the treated (ATT) because in the simple setting here the treatment $D$ is completely randomized.

In Appendix A, we use principal stratification to show that neither $\mathrm{ATE}_{M=1}$ nor $\mathrm{ATT}_{M=1}$ is guaranteed to inherit the sign of $\beta_M$ and $\beta_Y$ and satisfy the property in Equation (2).
Specifically, we outline concrete examples in which:

(i) $\beta_M > 0$ and $\beta_Y > 0$, but $\mathrm{ATE}_{M=1} < 0$;
(ii) $\beta_M < 0$ and $\beta_Y < 0$, but $\mathrm{ATE}_{M=1} > 0$;
(iii) $\beta_M < 0$ and $\beta_Y < 0$, but $\mathrm{ATT}_{M=1} > 0$.

That is, when there is racial discrimination of the same direction in both police detainment and the use of force, it is still possible for $\mathrm{ATE}_{M=1}$ and $\mathrm{ATT}_{M=1}$ to have the opposite sign.

Heuristically, this is due to the fact that all of the causal estimands above, including $\beta_M$, $\beta_Y$, ATE, $\mathrm{ATE}_{M=1}$, and $\mathrm{ATT}_{M=1}$, only measure some weighted average treatment effect for police detainment and/or use of force. Conditioning on the post-treatment $M$ may correspond to unintuitive weights. The possibility that $\mathrm{ATE}_{M=1}$ and ATE can have different signs can be understood from the following iterated expectation:

$$\mathrm{ATE} = \mathrm{ATE}_{M=1} \, P(M = 1) + E[Y(1) - Y(0) \mid M = 0] \, P(M = 0).$$

In this decomposition, the second term may be nonzero and have the opposite sign of $\mathrm{ATE}_{M=1}$. An inexperienced researcher might be tempted to drop the second term because of Assumption 1, as $Y(0, 0) = Y(1, 0) = 0$ with probability 1. However, conditioning on $M = 0$ is not the same as the intervention that sets $M = 0$. This means that we cannot deduce $E[Y(d) \mid M = 0] = 0$ from $Y(d, 0) = 0$, because $E[Y(d) \mid M = 0] = E[Y(d, M(d)) \mid M = 0]$ is not necessarily equal to $E[Y(d, 0) \mid M = 0]$. What the counterexamples in the Appendix further demonstrate is that the discordance between $\mathrm{ATE}_{M=1}$ and ATT can occur even when the pure direct and indirect effects have the same sign.

In sum, $\mathrm{ATE}_{M=1}$ and $\mathrm{ATT}_{M=1}$ are generally different from the estimands that are routinely the target in causal analyses. As such, we urge applied researchers to use caution when using these local estimands to infer anything about more common global estimands.

KLM also derived an identification formula for $\mathrm{ATE}_{M=1}$ using external estimates of $P(M = 1 \mid D = d)$ for $d = 0, 1$. As noted in their paper, however, it is often difficult to quantify the frequency of stops among all police-civilian encounters. In other words, it can be difficult to determine the magnitude of $P(M = 1 \mid D = d)$. Here, we show that by formulating the estimand on a relative scale, we can also achieve point identification and avoid the difficulties of estimating $P(M = 1 \mid D = d)$. More specifically, we consider the causal risk ratio for covariate level $x$:

$$\mathrm{RR}(x) = \frac{E[Y(1) \mid X = x]}{E[Y(0) \mid X = x]}.$$

This estimand measures the relative risk of police violence and has been used in related contexts (Edwards, Lee, and Esposito 2019). When this ratio is equal to one, the risk of police violence does not vary with the race of the civilian. When it is greater than one, the risk of violence is higher for minorities.

Using treatment ignorability (i.e. the DAG model in Figure 1 conditional on $X$) and Assumption 1, we show in Appendix B that the causal effect of race can be based on the decomposition

$$E[Y(d) \mid X = x] = E[Y \mid M = 1, D = d, X = x] \cdot P(M = 1 \mid D = d, X = x), \quad \text{for } d = 0, 1.$$

The same result is derived in KLM and forms the basis of their identification of the ATE.
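
One way to see why this decomposition holds is the short chain below. It is only a sketch, not the Appendix B proof: it assumes that treatment ignorability is read as $Y(d) \perp D \mid X$, and it uses consistency together with part (i) of Assumption 1, under which $Y = Y(D, 0) = 0$ whenever $M = 0$.

```latex
\begin{align*}
E[Y(d) \mid X = x]
  &= E[Y(d) \mid D = d, X = x]
     && \text{(ignorability, assumed as } Y(d) \perp D \mid X\text{)} \\
  &= E[Y \mid D = d, X = x]
     && \text{(consistency)} \\
  &= \sum_{m \in \{0, 1\}} E[Y \mid M = m, D = d, X = x] \, P(M = m \mid D = d, X = x)
     && \text{(total expectation)} \\
  &= E[Y \mid M = 1, D = d, X = x] \, P(M = 1 \mid D = d, X = x)
     && \text{(} Y = 0 \text{ when } M = 0 \text{)}.
\end{align*}
```

The last step conditions on the observed $M = 0$, where consistency gives $Y = Y(D, 0)$; it does not require any statement about $E[Y(d) \mid M = 0]$.
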
We simplify their proof in Appendix B and show that it does not require the mediator monotonicity and relative nonseverity of racial stops (Assumptions 2 and 3 in KLM) as stated in their Proposition 2. Using Bayes' formula for the last term on the right hand side, we obtain

$$\mathrm{RR}(x) = \underbrace{\frac{E[Y \mid D = 1, M = 1, X = x]}{E[Y \mid D = 0, M = 1, X = x]}}_{\text{naive estimand}} \cdot \underbrace{\left\{ \frac{P(D = 1 \mid M = 1, X = x)}{P(D = 0 \mid M = 1, X = x)} \right\} \Big/ \left\{ \frac{P(D = 1 \mid X = x)}{P(D = 0 \mid X = x)} \right\}}_{\text{bias factor}}. \tag{3}$$

The first term on the right hand side of (3) is the naive risk ratio estimand conditional on baseline covariates. It is the risk ratio counterpart to the naive risk difference in (1), and both of them ignore the possible bias from the selection process into the administrative data. The second term, inside the curly brackets, is a ratio of probability ratios. The first ratio of probabilities measures the relative probability (odds) of an encounter being with a minority conditional on $X$ in the administrative data. The second ratio also measures the relative probability (odds) of an encounter being with a minority conditional on $X$, but these probabilities need to be estimated from a second data source. The ratio between these last two terms is thus an odds ratio that characterizes the bias of the naive estimator; for this reason we call it the “bias factor.” That is, if minorities are over-represented in the administrative data, the bias factor corrects that over-representation and so increases the magnitude of the risk ratio. For example, if the probability of an encounter being with a minority is 0.8 in the administrative data and 0.25 in a random police-civilian encounter, the bias factor would be $(0.8 / 0.2) / (0.25 / 0.75) = 12$, which would increase the magnitude of the naive risk ratio when it is larger than 1. See the appendix for a full derivation. All these terms can be estimated using generalized linear models, or one could use more flexible models. Confidence intervals can be estimated using the bootstrap or a simple delta method estimator.

Why use this risk ratio estimand? First of all, relative risk can be a powerful rhetorical tool in discussing racial disparities. More importantly, by targeting the causal risk ratio, we are able, through cancellation, to avoid the difficulties associated with estimating $P(M = 1)$. As such, by focusing on relative risks, we avoid key assumptions. Note that if we are willing to assume stochastic mediator monotonicity, $E[M(1) \mid X = x] \ge E[M(0) \mid X = x]$ (that is, there is racial bias against the minority in detainment), the bias factor can indeed be lower bounded by 1. In this case, the naive risk ratio estimator (first term on the right hand side of (3)) provides a lower bound for the causal risk ratio $\mathrm{RR}(x)$.

Notice that using the risk ratio estimand does not free us from the complications that tend to arise from the use of two data sources. As we noted above, the administrative data only contain those encounters with $M = 1$, so all of the identification results in KLM and in this note require some external estimates. In particular, the administrative dataset can only be used to estimate the first two terms on the right hand side of (3). Estimation of the third term requires a second data source that may not be congruent with the administrative data. For example, the administrative data in KLM is an NYPD database of police stops. For a second data source, we could use the Current Population Survey (CPS), which contains measures for race and also has geographic information that allows us to restrict the data to the metro area in the state of New York (which is larger than the five boroughs of New York City).
However, the CPS does not contain any more fine-grained geographic identifiers or any measures of police encounters. Another possible data source is the Police-Public Contact Survey (PPCS) collected by the U.S. Department of Justice, which contains detailed measures on police contacts. However, PPCS is a national survey and geographic identifiers are not available to researchers. As such, if we use the PPCS, we can do little to measure the prevalence of police-minority encounters in New York City. In other settings such as traffic stops, one may use the “veil of darkness” test (Grogger and Ridgeway 2006) and use night-time police stops in the same dataset to estimate the bias factor, as police are less likely to know the race of a motorist. However, this still requires the assumption that the racial distribution of motorists is the same during the day and at night.

Nonetheless, as we show next, the results using the risk ratio with different data sources can still be useful. That is, we can use multiple data sources to illuminate the probable bias in the naive estimator.

We used the identification formula (3) to estimate the causal risk ratio using the NYPD “Stop-and-Frisk” dataset analyzed in Fryer (2019) and KLM. Specifically, we use the replication data from KLM. As such, we followed KLM’s preprocessing of the dataset, with the one exception that we removed all races other than black and white. We also focused on all forms of force rather than estimating the effects for different types of force. We used CPS and PPCS data to estimate the third term in (3). Because PPCS does not contain a geographic identifier, we also used the racial distributions for different subsets of the PPCS data.
Specifically, we used subgroups for those in the survey that experienced a motor vehicle stop, any other kind of police stop, and those in a large metro area.

Table 1 reports the estimated risk ratios using different estimators and external datasets. Using the naive estimator, the first term in (3), we find a modest causal effect: black people have a 29% higher risk of the police using force than white people. Recall that we can view this as a lower bound on the true causal risk ratio if we are willing to assume stochastic mediator monotonicity (i.e. there is discrimination against black civilians in police detainments on average).

External dataset           Estimated risk ratio    95% confidence interval
Naive estimator: first term in (3)
None                       1.29                    1.28–1.30
Adjusted for selection bias by using (3)
CPS                        13.6                    12.8–14.3
PPCS                       32.3                    31.3–33.3
PPCS (MV Stop)             29.5                    26.9–32.7
PPCS (Stop in Public)      29.2                    23.5–36.5
PPCS (Large Metro)         16.7                    15.4–18.4

Table 1: Estimates of the causal effect of minority race (black) on police violence. CPS is the Current Population Survey. PPCS is the Police-Public Contact Survey. MV Stop is the subset of survey respondents that has been the passenger in a motor vehicle that was stopped by the police. Large Metro is the subset that lives in a region with more than 1 million population. Confidence intervals were computed using the nonparametric bootstrap.

The estimator (3) that adjusts for the selection bias shows a very different picture. No matter which external dataset we used, the estimated risk ratio for black versus white is always greater than 10.

The estimates in Table 1 did not condition on any covariate that confounds the effect of race on police use of force. A potentially important confounder is the location of the police-civilian encounter. The NYPD currently has 77 precincts that are responsible for law enforcement within a designated geographic area. Using census blocks and the 2010 census data, Keefe (2020) constructed a population breakdown for each NYPD precinct.
This allows us to compare the proportion of black residents (among black and white residents) with the proportion of detainments of black civilians in each precinct (Figure 2). It is evident from this figure that in most of the precincts, black civilians make up less than half of the population but more than half of the detainment records. This shows that the bias factor in (3) can be quite large in this problem.

By using the census data to estimate the last term in (3), Figure 3 compares the naive risk ratio estimator and selection-adjusted risk ratio estimator for each precinct. The selection-adjusted estimates are almost always much larger except for three outliers: precincts 67 and 113, where Blacks account for more than 90% of the population, and precinct 22 (Central Park), where only 25 residents were recorded and the majority of police-civilian encounters were likely with non-residents. It is likely that in these precincts, the residential distribution in the census data poorly approximates the racial distribution in police-civilian encounters, because the civilians could be visitors from other precincts or anywhere else in the world. Most of the precincts with the highest estimated risk ratios are wealthy neighborhoods in Manhattan and Brooklyn. In several precincts, our method estimated that the risk of police use of force for Blacks is more than 30 times higher than the risk for whites. This may be due in part to increased suspicion of minorities in areas where their presence is not common. Finally, Figure 4 shows a strong negative correlation between the estimated risk ratios and the percentage of black residents in the precinct. This indicates that the racial discrimination in police use of force may be strongly moderated by characteristics of the geographic location such as the racial composition, affluence, and average crime rate of the neighborhood.
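
As a minimal sketch of how the pieces of the identification formula (3) combine, the Python snippet below multiplies a naive risk ratio by the bias factor. The function names are illustrative and are not taken from KLM's replication code; the force rates passed to `adjusted_risk_ratio` in any real use would have to be estimated from the administrative data, and the second probability from an external source such as the census.

```python
def odds(p):
    """Odds p / (1 - p) of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return p / (1.0 - p)


def bias_factor(p_minority_admin, p_minority_encounters):
    """Odds ratio in (3): odds of D = 1 among recorded stops (M = 1),
    divided by the odds of D = 1 among all police-civilian encounters."""
    return odds(p_minority_admin) / odds(p_minority_encounters)


def adjusted_risk_ratio(force_rate_minority, force_rate_white,
                        p_minority_admin, p_minority_encounters):
    """Selection-adjusted risk ratio: naive risk ratio times the bias factor."""
    naive = force_rate_minority / force_rate_white
    return naive * bias_factor(p_minority_admin, p_minority_encounters)


# Worked example from the text: 0.8 in the administrative data, 0.25 overall.
print(round(bias_factor(0.8, 0.25), 6))  # 12.0
```

Keeping the bias factor as a separate function mirrors the structure of (3): the naive term uses only the administrative records, while the bias factor is the only place where the external data source enters.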

In this research note, we studied some causal estimands in the context of racial discrimination in policing. We found that ATEs that are conditional on the mediator (police detainment) are generally different from the unconditional ATE and other routinely used causal estimands, so extra caution is needed when using these estimands and interpreting the results. We also proposed a new estimator for the causal risk ratio, which is straightforward to interpret and avoids the difficult task of discerning the percentage of stops in all police-civilian encounters. In a reanalysis of the NYPD Stop-and-Frisk dataset with the causal risk ratio as the estimand, we found that for blacks the risk of experiencing force is much higher than for whites.

When interpreting the results of our reanalysis, the reader should keep in mind its limitations. First, it is difficult to find a good external dataset to estimate the bias factor. The datasets we used should only be viewed as crude approximations to the racial distribution in police-civilian encounters in New York City. Second, our identification of the causal risk ratio requires treatment ignorability by conditioning on confounders such as time, location, and other relevant characteristics of the police-civilian encounter. However, such covariates are not always available in external datasets. Finally, since New York is a metropolis in which people move around a great deal on a daily basis, the racial distribution of the residents in a precinct might poorly represent the racial distribution in police-civilian encounters, especially when the residential distribution is extreme. Therefore, Figure 4 may have exaggerated the effect modification by the proportion of black residents. A further analysis on carefully selected precincts (e.g. residential areas with different racial composition) is needed to better quantify the effect modification.

Nevertheless, our empirical results show that a naive analysis of police administrative datasets that ignores the selection bias can severely underestimate the risk of police force for minorities. Further careful analyses are needed to better quantify the racial discrimination in policing and understand the socioeconomic factors that moderate racial discrimination.

Acknowledgement

We thank Dean Knox, Joshua Loftus, and Jonathan Mummolo for helpful suggestions.

Figure 2: Racial distributions (indicated by the filled color) in each NYPD precinct. (a) Proportion of black residents in the census data. (b) Proportion of detainments of black civilians in the NYPD stop-and-frisk data.

Figure 3: Risk ratio estimates for every NYPD precinct (axes: risk ratio estimator and precinct; series: adjusted for selection bias and naive). Error bars correspond to 95% confidence intervals computed by the bootstrap. We did not resample the census data because that is already the residential distribution (instead of a statistical estimate). Blue estimates are obtained using the naive estimator (first term in (3)); red estimates further take into account the bias factor due to sample selection in (3).
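
The resampling scheme described in this caption, in which stop records are resampled while the census proportion is held fixed, might look like the following minimal sketch. It is an assumption-laden illustration rather than the authors' code: the record layout `(d, y)`, the function names, and the percentile interval are all choices made here, and the toy records in the usage note are made up.

```python
import random


def naive_risk_ratio(records):
    """records: list of (d, y) pairs for recorded stops;
    d = 1 for a minority civilian, y = 1 if force was used."""
    n1 = sum(1 for d, _ in records if d == 1)
    n0 = len(records) - n1
    y1 = sum(y for d, y in records if d == 1)
    y0 = sum(y for d, y in records if d == 0)
    return (y1 / n1) / (y0 / n0)


def adjusted_risk_ratio(records, p_minority_residents):
    """Formula (3), with the census share treated as a fixed, known number."""
    p_admin = sum(d for d, _ in records) / len(records)
    admin_odds = p_admin / (1 - p_admin)
    census_odds = p_minority_residents / (1 - p_minority_residents)
    return naive_risk_ratio(records) * admin_odds / census_odds


def bootstrap_ci(records, p_minority_residents, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval; only the stop records are resampled."""
    rng = random.Random(seed)
    n = len(records)
    stats = []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]
        try:
            stats.append(adjusted_risk_ratio(sample, p_minority_residents))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with an empty cell
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[int((1 + level) / 2 * len(stats)) - 1]
    return lower, upper
```

With made-up records such as `[(1, 1)] * 200 + [(1, 0)] * 600 + [(0, 1)] * 40 + [(0, 0)] * 160` and a census share of 0.25, `adjusted_risk_ratio` gives a point estimate of 15 (a naive ratio of 1.25 times a bias factor of 12), and `bootstrap_ci` returns an interval around it.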

Figure 4: Relationship between the risk ratio estimates and the proportion of black residents across NYPD precincts (axes: proportion of black residents and estimated risk ratio). Notice that to use the identification formula (3) for the risk ratio correctly, we need to estimate the racial distribution in police-civilian encounters using an external dataset. The residential distribution is used as an approximation but it can be biased and exaggerate the effect modification. See Section 6 for further discussion.

References

Dawid, A Philip, Monica Musio, and Rossella Murtas. 2017. “The probability of causation.” Law, Probability and Risk.

Deaton. 2010. Journal of Economic Literature.

Edwards, Lee, and Esposito. 2019. Proceedings of the National Academy of Sciences.

Elwert and Winship. 2014. Annual Review of Sociology 40: 31–53.

Ertefaie, Ashkan, Jesse Y Hsu, Lindsay C Page, Dylan S Small et al. 2018. “Discovering treatment effect heterogeneity through post-treatment variables with application to the effect of class size on mathematics scores.” Journal of the Royal Statistical Society Series C.

Frangakis and Rubin. 2002. Biometrics.

Fryer. 2019. Journal of Political Economy.

Gerber and Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York, NY: Norton.

Greenland, Sander, Judea Pearl, and James M Robins. 1999. “Causal diagrams for epidemiologic research.” Epidemiology.

Grogger and Ridgeway. 2006. Journal of the American Statistical Association.

Imbens. 2014. Statistical Science.

Keefe. 2020. https://johnkeefe.net/nyc-police-precinct-and-census-data; Retrieved: August 31, 2020.

Knox, Dean, Will Lowe, and Jonathan Mummolo. 2020. “Administrative Records Mask Racially Biased Policing.” American Political Science Review, in press.

Montgomery, Jacob M, Brendan Nyhan, and Michelle Torres. 2018. “How conditioning on posttreatment variables can ruin your experiment and what to do about it.” American Journal of Political Science.

Paternoster, Tilling, and Davey Smith. 2017. PLoS Genetics.

Pearl. 1999. Synthese.

Pearl. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.

Richardson, Thomas S, and James M Robins. 2013. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Technical Report 128, Center for the Statistics and the Social Sciences, University of Washington.

Robins, James M, and Sander Greenland. 1992. “Identifiability and exchangeability for direct and indirect effects.” Epidemiology.

Rosenbaum. 1984. Journal of the Royal Statistical Society: Series A (General).

Stephens, Keele, and Joffe. 2016. Journal of Causal Inference, in press. Unpublished manuscript.

Swanson, Sonja A, and Miguel A Hernán. 2014. “Think globally, act globally: an epidemiologist’s perspective on instrumental variable estimation.” Statistical Science: a review journal of the Institute of Mathematical Statistics.

Appendix A: Average treatment effects conditional on the mediator

We assume the variables ( D, M, Y ) are generated from a nonparametric structural equationmodel: D = f D ( (cid:15) D ) , M = f M ( D, (cid:15) M ) , Y = f Y ( D, M, (cid:15) Y ) where (cid:15) D , (cid:15) M , (cid:15) Y are mutuallyindependent (Pearl 2009). Potential outcomes for M and Y can be deﬁned by replacingrandom variables in the functions by ﬁxed values; for example, M ( d ) = f M ( d, (cid:15) M ) , d = 0 , .Because the errors are independent, D , { M (0) , M (1) } , and { Y (0 , , Y (0 , , Y (1 , , Y (1 , } are mutually independent (Richardson and Robins 2013). We also make the mandatory assumption(Assumption 1). The derivations below do not need mediator monotonicity ( M (1) ≥ M (0) ).We next derive expressions of ATE M =1 and ATT M =1 using two basic causal eﬀects: β M = E [ M (1) − M (0)] , the racial bias in detainment, and β Y = E [ Y (1 , − Y (0 , , the controlleddirect eﬀect of race on police violence. To simplify the interpretation, we introduce a new variableto denote the the principal stratum (see Figure 2 in KLM): S =  always stop (al) , if M (0) = M (1) = 1 , minority stop (mi) , if M (0) = 0 , M (1) = 1 , majority stop (ma) , if M (0) = 1 , M (1) = 0 , never stop (ne) , if M (0) = M (1) = 0 , Let S = { al , mi , ma , ne } be all possible values for S . Using this notation, we have β M = (cid:88) s ∈S E [ M (1) − M (0) | S = s ] P ( S = s ) = P ( S = mi ) − P ( S = ma ) . By using the independence between M ( d ) and Y ( d, m ) and Assumption 1, it is easy to show20hat θ =  E [ Y (1) − Y (0) | S = al ] E [ Y (1) − Y (0) | S = mi ] E [ Y (1) − Y (0) | S = ma ] E [ Y (1) − Y (0) | S = ne ]  =  E [ Y (1 , − Y (0 , E [ Y (1 , − Y (0 , E [ Y (1 , − Y (0 , E [ Y (1 , − Y (0 ,  =  β Y β Y + E [ Y (0 , − E [ Y (0 ,  . Average treatment eﬀects, whether conditional on M or D or not, can be written as weightedaverages of the entries of θ . Proposition 1.

Suppose there is no unmeasured mediator-outcome confounder (i.e. no $U$) in Figure 1. Under Assumption 1, the estimands $\mathrm{ATE}_{M=1}$, $\mathrm{ATT}_{M=1}$, $\mathrm{ATE} = E[Y(1) - Y(0)]$, and $\mathrm{ATT} = E[Y(1) - Y(0) \mid D = 1]$ can be written as weighted averages $(w^T \theta) / (w^T \mathbf{1})$ ($\mathbf{1}$ is the all-ones vector) with weights given by, respectively,

$$w(\mathrm{ATE}_{M=1}) = \begin{pmatrix} P(S = \mathrm{al}) \\ \left[ P(S = \mathrm{ma}) + \beta_M \right] P(D = 1) \\ P(S = \mathrm{ma}) \, P(D = 0) \\ 0 \end{pmatrix}, \quad w(\mathrm{ATT}_{M=1}) = \begin{pmatrix} P(S = \mathrm{al}) \\ P(S = \mathrm{ma}) + \beta_M \\ 0 \\ 0 \end{pmatrix},$$

and

$$w(\mathrm{ATE}) = w(\mathrm{ATT}) = \begin{pmatrix} P(S = \mathrm{al}) \\ P(S = \mathrm{mi}) \\ P(S = \mathrm{ma}) \\ P(S = \mathrm{ne}) \end{pmatrix} = \begin{pmatrix} P(S = \mathrm{al}) \\ P(S = \mathrm{ma}) + \beta_M \\ P(S = \mathrm{ma}) \\ P(S = \mathrm{ne}) \end{pmatrix}.$$

Proof.

Let us first consider $\mathrm{ATE}_{M=1}$. By the law of total expectation, we can first decompose it into a weighted average of principal stratum effects:

$$\mathrm{ATE}_{M=1} = E[Y(1) - Y(0) \mid M = 1] = \sum_{s \in \mathcal{S}} E[Y(1) - Y(0) \mid M = 1, S = s] \cdot P(S = s \mid M = 1).$$

We can simplify the principal stratum effects using recursive substitution of the potential outcomes and the assumption that $D$, $\{M(0), M(1)\}$, and $\{Y(0,0), Y(0,1), Y(1,0), Y(1,1)\}$ are mutually independent. For $m_0, m_1 \in \{0, 1\}$,

$$\begin{aligned} & E[Y(1) - Y(0) \mid M = 1, M(0) = m_0, M(1) = m_1] \\ &= E[Y(1, M(1)) - Y(0, M(0)) \mid M = 1, M(0) = m_0, M(1) = m_1] \\ &= E[Y(1, m_1) - Y(0, m_0) \mid M = 1, M(0) = m_0, M(1) = m_1] \\ &= E[Y(1, m_1) - Y(0, m_0) \mid M(0) = m_0, M(1) = m_1] \\ &= E[Y(1, m_1) - Y(0, m_0)]. \end{aligned}$$

The third equality uses the fact that $M \perp\!\!\!\perp \{Y(1, m_1), Y(0, m_0)\} \mid \{M(0), M(1)\}$, because given $\{M(0), M(1)\}$ the only random term in $M = D \cdot M(1) + (1 - D) \cdot M(0)$ is $D$. Thus $\mathrm{ATE}_{M=1}$ can be written as

$$\mathrm{ATE}_{M=1} = \theta^T w(\mathrm{ATE}_{M=1}), \quad \text{where } w(\mathrm{ATE}_{M=1}) = \begin{pmatrix} P(S = \mathrm{al} \mid M = 1) \\ P(S = \mathrm{mi} \mid M = 1) \\ P(S = \mathrm{ma} \mid M = 1) \\ P(S = \mathrm{ne} \mid M = 1) \end{pmatrix}.$$

Similarly, $\mathrm{ATT}_{M=1}$, $\mathrm{ATE}$, and $\mathrm{ATT}$ can also be written as weighted averages of the entries of $\theta$, where the weights are

$$w(\mathrm{ATT}_{M=1}) = \begin{pmatrix} P(S = \mathrm{al} \mid D = 1, M = 1) \\ P(S = \mathrm{mi} \mid D = 1, M = 1) \\ P(S = \mathrm{ma} \mid D = 1, M = 1) \\ P(S = \mathrm{ne} \mid D = 1, M = 1) \end{pmatrix}, \quad w(\mathrm{ATE}) = w(\mathrm{ATT}) = \begin{pmatrix} P(S = \mathrm{al}) \\ P(S = \mathrm{mi}) \\ P(S = \mathrm{ma}) \\ P(S = \mathrm{ne}) \end{pmatrix}.$$

Next we compute the conditional probabilities for the principal strata in $w(\mathrm{ATE}_{M=1})$ and $w(\mathrm{ATT}_{M=1})$.
By using Bayes' formula, for any $m_0, m_1 \in \{0, 1\}$,

$$\begin{aligned} & P(M(0) = m_0, M(1) = m_1 \mid M = 1) \\ &\propto P(M(0) = m_0, M(1) = m_1) \cdot P(M = 1 \mid M(0) = m_0, M(1) = m_1) \\ &= P(M(0) = m_0, M(1) = m_1) \cdot \sum_{d=0}^{1} P(M = 1, D = d \mid M(0) = m_0, M(1) = m_1) \\ &= P(M(0) = m_0, M(1) = m_1) \cdot \sum_{d=0}^{1} 1_{\{m_d = 1\}} \, P(D = d \mid M(0) = m_0, M(1) = m_1) \\ &= P(M(0) = m_0, M(1) = m_1) \cdot \sum_{d=0}^{1} 1_{\{m_d = 1\}} \, P(D = d). \end{aligned}$$

The last two equalities used $M = M(D)$ and $D \perp\!\!\!\perp \{M(0), M(1)\}$. From this, it is straightforward to obtain the form of $w(\mathrm{ATE}_{M=1})$ in Proposition 1. Similarly,

$$P(M(0) = m_0, M(1) = m_1 \mid D = 1, M = 1) \propto P(M(0) = m_0, M(1) = m_1) \cdot 1_{\{m_1 = 1\}}.$$

From this we can derive the form of $w(\mathrm{ATT}_{M=1})$ in Proposition 1.

Proposition 2.

Under the same assumptions as above, $\mathrm{PIE} = \beta_M \cdot E[Y(1,1)]$ and $\mathrm{PDE} = \beta_Y \cdot E[M(0)]$.

Proof. This follows from the definition of pure direct and indirect effects and the following identity,

$$E\big[Y(d, M(d'))\big] = E\big[Y(d, 1) \mid M(d') = 1\big] \cdot P(M(d') = 1) = E\big[Y(d, 1)\big] \cdot P(M(d') = 1),$$

for any $d, d' \in \{0, 1\}$.

Using the forms of weighted averages in Proposition 1, we can make the following observation on the sign of the causal estimands when $\beta_M$ and $\beta_Y$ are both nonnegative or both nonpositive:

Corollary 1.

Let the assumptions in Proposition 1 be given. If $\beta_M \ge 0$ and $\beta_Y \ge 0$, then $\mathrm{ATE} = \mathrm{ATT} \ge 0$. Conversely, if $\beta_M \le 0$ and $\beta_Y \le 0$, then $\mathrm{ATE} = \mathrm{ATT} \le 0$. However, neither of these properties holds for $\mathrm{ATE}_{M=1}$, and the second property does not hold for $\mathrm{ATT}_{M=1}$.

The fact that ATT and ATE have the same sign as $\beta_M$ when $\beta_M$ and $\beta_Y$ have the same sign follows immediately from Proposition 2. However, this important property does not hold for $\mathrm{ATE}_{M=1}$ and $\mathrm{ATT}_{M=1}$. Here are some concrete counterexamples:

(i) When $\beta_M = \beta_Y = 0.[\ldots]$, $P(S = \mathrm{al}) = 0.[\ldots]$, $P(S = \mathrm{ma}) = 0.[\ldots]$, $E[Y(0,1)] = 0.[\ldots]$, and $P(D = 1) = 0.[\ldots]$, we have $\mathrm{ATE}_{M=1} = -0.[\ldots]$.

(ii) When $\beta_M = \beta_Y = -0.[\ldots]$, $P(S = \mathrm{al}) = 0.[\ldots]$, $P(S = \mathrm{ma}) = 0.[\ldots]$, $E[Y(0,1)] = 0.[\ldots]$, and $P(D = 1) = 0.[\ldots]$, we have $\mathrm{ATE}_{M=1} = 0.[\ldots]$.

(iii) When $\beta_M = \beta_Y = -0.[\ldots]$, $P(S = \mathrm{al}) = 0.[\ldots]$, $P(S = \mathrm{ma}) = 0.[\ldots]$, $E[Y(0,1)] = 0.[\ldots]$, and $P(D = 1) = 0.[\ldots]$, we have $\mathrm{ATT}_{M=1} = 0.[\ldots]$.

The problem is that conditioning on the post-treatment variable $M$ alters the weights on the principal strata, as shown in Proposition 1. $\mathrm{ATE}_{M=1}$ and $\mathrm{ATT}_{M=1}$ then depend not only on the racial bias in detainment and use of force (captured by $\beta_M$ and $\beta_Y$) but also on the baseline rate of violence $E[Y(0,1)]$ and the composition of race $P(D = 1)$. For instance, in the first counterexample above, the minority group $D = 1$ is discriminated against in both detainment and use of force. Yet because the baseline violence is high and the minority group is extremely small, $\mathrm{ATE}_{M=1}$ becomes mostly determined by the smaller bias (captured by $P(S = \mathrm{ma}) = P(M(0) = 1, M(1) = 0)$) experienced by the much larger majority group.

We make some further comments on the above paradox. First of all, the second counterexample can be eliminated if we additionally assume $P(D = 1) < 0.5$, that is, $D = 1$ indeed represents the minority group. With this benign assumption, one can show that $\mathrm{ATE}_{M=1} < 0$ whenever $\beta_M, \beta_Y < 0$. Furthermore, it can be shown that $\mathrm{ATT}_{M=1} > 0$ whenever $\beta_M, \beta_Y > 0$.
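As a rough numerical illustration of Proposition 1 and of the sign reversals discussed above, the following Python sketch evaluates the weighted averages directly. The parameter values are hypothetical and chosen only for illustration; they are not the values of counterexamples (i)-(iii). The function and argument names (`estimands`, `ey01` for $E[Y(0,1)]$, `p_d1` for $P(D = 1)$) are likewise our own labels.

```python
def estimands(beta_m, beta_y, p_al, p_ma, ey01, p_d1):
    """Evaluate the estimands via the weighted averages of Proposition 1.

    Argument names are illustrative: ey01 stands for E[Y(0,1)],
    p_d1 for P(D=1). All inputs here are hypothetical.
    """
    p_mi = p_ma + beta_m                 # beta_M = P(S=mi) - P(S=ma)
    p_ne = 1.0 - p_al - p_mi - p_ma
    if min(p_al, p_mi, p_ma, p_ne) < 0:
        raise ValueError("stratum probabilities must be nonnegative")

    # Principal-stratum effects (al, mi, ma, ne), using Y(d, 0) = 0.
    theta = [beta_y, beta_y + ey01, -ey01, 0.0]

    def wavg(w):
        # Normalized weighted average (w^T theta) / (w^T 1).
        return sum(wi * ti for wi, ti in zip(w, theta)) / sum(w)

    return {
        "ATE": wavg([p_al, p_mi, p_ma, p_ne]),
        "ATE_M1": wavg([p_al, p_mi * p_d1, p_ma * (1.0 - p_d1), 0.0]),
        "ATT_M1": wavg([p_al, p_mi, 0.0, 0.0]),
        "PIE": beta_m * (beta_y + ey01),  # beta_M * E[Y(1,1)]
        "PDE": beta_y * (p_al + p_ma),    # beta_Y * E[M(0)]
    }


# Hypothetical case A: both biases positive, very small minority group.
case_a = estimands(beta_m=0.1, beta_y=0.1, p_al=0.1, p_ma=0.1,
                   ey01=0.5, p_d1=0.01)
# Hypothetical case B: both biases negative.
case_b = estimands(beta_m=-0.1, beta_y=-0.1, p_al=0.1, p_ma=0.3,
                   ey01=0.5, p_d1=0.3)

for name, case in [("case_a", case_a), ("case_b", case_b)]:
    print(name, {k: round(v, 4) for k, v in case.items()})
```

In case A both $\beta_M$ and $\beta_Y$ are positive, yet `ATE_M1` comes out negative; in case B both are negative, yet `ATT_M1` comes out positive. In both cases `ATE` equals `PIE + PDE` and keeps the expected sign, in line with Proposition 2.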
So in a very rough sense we might say that, as causal estimands, $\mathrm{ATE}_{M=1}$ is unfavorable for the minority group (because $\mathrm{ATE}_{M=1}$ can be negative even if both $\beta_M, \beta_Y > 0$) and $\mathrm{ATT}_{M=1}$ is unfavorable for the majority group (because $\mathrm{ATT}_{M=1}$ can be positive even if both $\beta_M, \beta_Y < 0$).

Our second comment is about the first counterexample. We can eliminate this possibility by assuming mediator monotonicity, $P(S = \mathrm{ma}) = 0$, or in other words, by assuming that the majority race group is never discriminated against in any police-civilian encounter. KLM indeed used mediator monotonicity to obtain bounds on $\mathrm{ATE}_{M=1}$ and $\mathrm{ATT}_{M=1}$. So a supporter of the estimand $\mathrm{ATE}_{M=1}$ may argue that if one is willing to assume mediator monotonicity, there is no paradox regarding $\mathrm{ATE}_{M=1}$. However, it is worth pointing out that under mediator monotonicity, the pure indirect effect is guaranteed to be nonnegative because $\beta_M = P(S = \mathrm{mi}) - P(S = \mathrm{ma}) = P(S = \mathrm{mi}) \ge 0$. Empirical researchers should be mindful of and clearly communicate the consequences of the mediator monotonicity assumption unless it is compelling in the specific application. See KLM's discussion after their Assumption 2 on when mediator monotonicity may be violated. This concern can be alleviated if future work can incorporate non-zero $P(S = \mathrm{ma})$ as sensitivity parameters in KLM's bounds.

B Derivation of the causal risk ratio

To simplify the derivation, we will omit the conditioning on $X = x$ below. Fix a $d \in \{0, 1\}$. Using Assumption 1, $E[Y(d) \mid M(d) = 0] = E[Y(d, 0) \mid M(d) = 0] = 0$. Therefore

$$\begin{aligned} E[Y(d)] &= E[Y(d) \mid M(d) = 1] \cdot P(M(d) = 1) \\ &= E[Y(d, 1) \mid M(d) = 1] \cdot P(M(d) = 1) \\ &= E[Y(d, 1) \mid M(d) = 1, D = d] \cdot P(M(d) = 1) \\ &= E[Y \mid M = 1, D = d] \cdot P(M(d) = 1). \end{aligned}$$

The third equality above uses treatment ignorability: $D \perp\!\!\!\perp Y(d, 1) \mid M(d)$ (this follows from the single world intervention graph corresponding to Figure 1); the last equality follows from the consistency (or stable unit treatment value) assumption for potential outcomes. By further using $D \perp\!\!\!\perp M(d)$, we have $P(M(d) = 1) = P(M(d) = 1 \mid D = d) = P(M = 1 \mid D = d)$. Plugging this into the last display equation, we have

$$E[Y(d)] = E[Y \mid M = 1, D = d] \cdot P(M = 1 \mid D = d), \quad d = 0, 1.$$

Thus we have recovered KLM's Proposition 2 (point identification of ATE) without assuming their Assumption 2 (mediator monotonicity) and Assumption 3 (relative nonseverity of racial stops). To get the causal risk ratio, we only need to take a ratio between $E[Y(1)]$ and $E[Y(0)]$ P ( M = 1)= 1)