On Policy Recommendations from Causal Inference in Physics Education Research
11/29/21 - 1 -
Causal Inference and Policy Recommendations in Physics Education Research
M. B. Weissman
Department of Physics, University of Illinois at Urbana-Champaign 1110 West Green Street, Urbana, IL 61801-3080
Abstract
Sound policy recommendations require valid estimates of causal effects, but observational studies in physics education research often have undefined or loosely specified causal hypotheses. The connections between the observational data and the explicit or implicit causal conclusions are often misstated. The link between the causal conclusions and the policy recommendations is also often loose. Several papers, mostly from Physical Review Physics Education Research, are used to illustrate these issues.
Introduction
The central goal of most physics education research is to find ways to teach better, i.e. to improve educational outcomes. That requires determining the probable results of several possible courses of action, so that value judgments can be applied to choose the best policies. Problems of the form “If we do this vs. that, what will be the probable difference in results?” are by definition problems in counterfactual causation. (See ( ) for a primer, or ( ) for an up-to-date technical review.) Although the techniques for drawing causal conclusions from randomized controlled trials are generally understood, the traditional methods for drawing causal inferences from observational data (e.g. multiple regression) are often inappropriate or misapplied. (1, 2) The distinction between correlation and causation is usually acknowledged, but in practice unwarranted causal conclusions are often drawn explicitly or are implicit in policy recommendations. Even when causal conclusions are explicit, the policy recommendations that follow often have only loose relations to those conclusions. The purpose of this paper is to make these issues vivid using instances of papers published in this journal in which questionable procedures were used either to draw causal inferences from data or to draw policy recommendations from causal inferences, or both. The papers chosen are not meant to be anything like a comprehensive collection of the problematic papers even from this journal, nor do I mean to imply that they are unusually problematic. Rather, they are chosen to illustrate some of the most common issues. I point out these issues because the situation need not continue. In the last few decades more valid methods of causal inference have been developed (1, 2) with applications in epidemiology( ), biology( ), public health( ), economics( ), psychology( ), sociology( ), political science( ), and other fields. With care, and reasonable, explicit assumptions, approximate causal conclusions often can be drawn from observed correlations. The correlations involved, however, are usually not the ones that were traditionally used. (1, 2)

Correlation and Causation

“Correlation does not imply causation” is a truism that is easier to state than to follow in practice.
Typical pitfalls can be seen in papers concerning the effects of various factors on student attitudes. One common problem is the use of Structural Equation Models (SEMs) as if they were a method for discovering causal patterns. (1, 2) A linear SEM can give effect-size coefficients for each edge connecting a pair of variables in a Directed Acyclic Graph (DAG), ( ) in which the directions of causation are indicated by unidirectional arrows. Sometimes these are written in a more general form, Acyclic Directed Mixed Graphs (ADMGs), which include bidirectional arrows as shorthand for two single-ended arrows coming from a single unmeasured variable in an underlying DAG. ( ) The direction of the arrows is essential information needed to determine policy choices, since making a change in some quantity only affects variables downstream in the DAG. ( ) You can make a dog wag its tail by giving it a treat, but you cannot put a treat in its mouth by wagging its tail. Words like “effect” and “impact” are often used to describe correlations while suggesting causation but retaining plausible deniability. Nevertheless the statements “Giving treats has a tail-wagging effect” and “tail-wagging has a treat-giving effect” do not mean the same thing, although each may have some truth to it. Which arrows are bidirectional, and in which directions the unidirectional ones point, are not generally determined uniquely by the correlations used to find the SEM coefficients. ( ) Each DAG is a member of a Markov equivalence class, usually containing multiple DAGs, any one of which has the same number of adjustable parameters and can support an SEM reproducing the observed correlations equally well. ( ) Therefore auxiliary information not contained in those correlations is required to determine the causal directions. (1, 2, 4, 9) To re-emphasize the key point: graphs that are exactly equivalent in fitting correlations generally have completely different implications for what the outcomes of actions would be. ( ) I shall discuss two examples (10, 11) of papers that say little or nothing about how these directions were chosen, and whose causal conclusions are therefore unjustified.

These papers (10, 11) look at traits like “Physics Identity” and “Recognition” rather than events. Since traits develop over extended times, they really should be represented as time series. Unrolling these variables into functions of time would allow each trait to causally affect not only its own future value but also the future value of the other. It is then inappropriate to represent a snapshot in time of a set of traits by any DAG, since effects have been flowing both ways between past values of the traits. Although this is a fundamental problem for the SEM models I shall discuss, I will confine the discussion to narrower methodology issues.

a) Out-of-class activities.
The effects of out-of-class science and engineering activities (OCSE) on student attitudes were explored in a recent paper for which “…the primary goal of the current analysis is determining the impact of OCSE activities on physics identity”, i.e. estimating a causal effect. ( ) The issue raised in the abstract is a policy question based on understanding causal patterns: “Understanding the influence of students’ science and engineering experiences on career choices is critical in order to improve future efforts…”( ) The abstract also reaches a causal conclusion: “we find that out-of-class science and engineering activities have the largest influence on physics performance/competence beliefs…” ( ) Although at points non-causal associative wording is used, the body of the paper is peppered with causal conclusions, e.g. “Recognition beliefs, while having the largest impact on overall identity…” and “…physics identities have less impact on their career choice…”. ( ) In one case a conclusion is drawn explicitly about the expectation of what would happen if something is done: “…if performance/competence beliefs are developed in isolation from recognition beliefs and interest, a student is not more likely to develop an overall physics identity.” ( ) The causal conclusions are based on coefficients obtained from an SEM relating OCSE to various questionnaire-based measures of attitudes toward physics, grouped into four clusters called Performance/Competence, Interest, Recognition, and Physics Identity. ( ) (I will abbreviate the first and last as Competence and Identity.) Identity is then used as the sole predictor of Career Choice. The linear SEM model is based on an ADMG connecting these variables, Fig. 1 of ( ), redrawn in DAG form as Fig. 1 below.
This graph immediately implies that the “primary goal”, “determining the impact of OCSE activities on physics identity”, is given directly by the unconditional regression coefficient of Identity on OCSE, since no confounders are present. Oddly, this causal coefficient for the “primary goal” is not explicitly given in the results, nor is this simple relation mentioned in the extensive statistical analysis. ( )

Figure 1. The DAG used in reference ( ), with their bidirectional arrow shown instead as an explicit unmeasured U. The variable “Physics Career Choice”, connected only by an arrow from Identity, is omitted for simplicity. The graph was drawn using the online DAGitty tool.

Although the variables Interest, Competence, and Recognition serve only as mediators on the primary causal relation from OCSE to Identity, they still can be important for “understanding the influence”, i.e. seeing how it happens, which in principle could lead to changes that enhance the primary causal coefficient. ( ) In drawing their ADMG, Lock et al. give no particular justification for the choices made, e.g. why the arrows between different variables point the way they do, why no arrow is included between OCSE and Identity, or why a double-ended arrow is used between Interest and Recognition but not elsewhere. Three arrows come out from OCSE and none go into it. ( ) That choice apparently reflects the desire to find the effects of OCSE, but does not necessarily reflect common prior beliefs about how the world works. Do we really know that OCSE has much more causal effect on Interest than Interest has on OCSE, for example? There is some discussion of the quality of the SEM fit to the data, but no indication of any comparison with how well SEMs based on other possible DAGs would do. ( ) Was this one chosen to give the best fit without using too many parameters, e.g.
via the Akaike Information Criterion( )? We know that it couldn’t be the unique best, because elementary DAG rules( ) say that a DAG in the same equivalence class as that shown in Fig. 1 can be obtained by reversing the arrow from Competence to OCSE, which leaves both the skeleton and the set of unshielded collider triads unchanged. That reversal, however, would make the unconditional regression of Identity on OCSE not equal to the impact of OCSE activities on physics identity, and thus would change the estimate of “…the primary goal …the impact of OCSE activities on physics identity”. The paper contains no recognition that such a choice is possible, much less of the criteria by which it should be made. There are in fact many DAGs in the same Markov equivalence class as the one used, i.e. alternate causal pictures with the same conditional independence relations. ( ) The bidirectional arrow here has the properties needed to allow replacement with either unidirectional arrow while preserving equivalence. ( ) Then, even if we leave Identity fixed as a child of all but OCSE, there are 24 DAGs, obtained from the permutations of the other labels in the saturated graph of the other four variables, that must fit the data equally well. In addition there are many ways of substituting one or two bidirectional arrows for unidirectional ones.

Two of the equivalent graphs, fitting the data equally as well as the one used, would, after converting to an equivalent DAG by replacing the bidirectional arrow with either unidirectional one, simply reverse the directions of the three arrows from OCSE. For those equivalent graphs “determining the impact of OCSE activities on physics identity” would be easy, since it would be identically zero. Thus, even with the constraint that no arrows come from Identity, there are many SEM-equivalent graphs from which to choose to find one that tends to agree with auxiliary causal knowledge.
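The fit-equivalence of Markov-equivalent graphs is easy to verify numerically. The following Python sketch uses a toy two-variable linear-Gaussian model (invented data, not the data of the papers discussed): data are generated from X → Y, yet the SEMs X → Y and Y → X achieve exactly the same maximized likelihood, so goodness of fit alone cannot orient the arrow.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                        # exogenous "cause"
y = 0.6 * x + rng.normal(scale=0.8, size=n)   # data generated from the model X -> Y

def sem_loglik(a, b):
    """Maximized Gaussian log-likelihood of the two-variable SEM a -> b,
    i.e. the factorization p(a) * p(b | a) with MLE slope and variances."""
    m = len(a)
    var_a = a.var()
    slope = np.cov(a, b, bias=True)[0, 1] / var_a
    resid = (b - b.mean()) - slope * (a - a.mean())
    var_res = resid.var()
    return -0.5 * m * (2 + np.log(2 * np.pi * var_a) + np.log(2 * np.pi * var_res))

ll_xy = sem_loglik(x, y)   # fit with the arrow X -> Y
ll_yx = sem_loglik(y, x)   # fit with the arrow reversed, Y -> X
print(ll_xy, ll_yx)        # identical up to rounding error
```

The two likelihoods agree identically because both factorizations parametrize the same bivariate Gaussian; the data cannot choose between them.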
Nevertheless, there is at least one indication that the ADMG chosen does not represent causality very well. The authors remark upon the “surprising” negative sign of the regression coefficient for the direct effect of Competence on Identity. ( ) It seems implausible that boosting a student’s confidence in their own competence directly reduces their identification with the area even when mediators are held fixed. That interpretation results from inappropriately treating the negative SEM coefficient found in a causally unmotivated graph as if it described a causal value of what would happen “…if performance/competence beliefs are developed in isolation from recognition beliefs and interest….”. Viewed as an association, however, it is not especially surprising. A student who is very confident in their abilities but still neither has interest in a field nor has sought recognition for it is likely to be a student who finds that field unattractive and therefore does not identify with it. This verbal description could be translated to the graph description by a slight modification of the graph used. For Interest and Recognition to remain unchanged despite an increase in Competence would require a negative shift in the effect of the double-arrow’s implicit unmeasured (U) factor on Interest and Recognition. Including an arrow in Fig. 1 from U to Identity, with the same sign of coefficient as the arrows from U to Interest and Recognition, then could capture the negative correlation found between Competence and Identity when the Interest and Recognition mediators are held fixed, without implausible signs for the causal effects. In other words, Interest, Recognition, and Identity would be colliders between U and Competence, so that the regression coefficients for Competence on any of those variables would differ from the causal coefficient by an undetermined amount of collider stratification bias, allowing positive values for each causal coefficient( ).
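Collider stratification bias of this kind is easy to reproduce in simulation. In the following Python sketch (variable names and coefficients are invented for illustration, not taken from the surveyed data), the true direct effect of a "competence" trait on an "identity" trait is positive, yet regressing on the collider mediator flips the sign of its coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical standardized traits:
u = rng.normal(size=n)       # unmeasured U (e.g. affinity for the field)
comp = rng.normal(size=n)    # competence belief, independent of U here
interest = 0.7 * comp + 0.7 * u + rng.normal(scale=np.sqrt(0.02), size=n)
# True direct effect of competence on identity is +0.2:
identity = 0.2 * comp + 0.8 * u + rng.normal(scale=0.3, size=n)

# Marginal regression of identity on competence: recovers the positive effect.
b_marginal = np.polyfit(comp, identity, 1)[0]

# Regression of identity on competence AND interest: interest is a collider
# between competence and U, so conditioning on it biases the coefficient.
X = np.column_stack([comp, interest, np.ones(n)])
b_comp, b_int, _ = np.linalg.lstsq(X, identity, rcond=None)[0]

print(f"marginal slope: {b_marginal:+.2f}")           # positive, near +0.2
print(f"conditional on interest: {b_comp:+.2f}")      # sign flipped, near -0.57
```

The "surprising" negative conditional coefficient appears even though every causal coefficient in the simulation is positive, exactly the pattern discussed above.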
The implicit variable U might, for example, represent how the student viewed interacting with physicists on a scale from yucky to cool. No questions addressing such a variable were included in this questionnaire,( ) although they could have been. Perhaps a hint of such a variable appears in a previous study, which found a negative correlation between desiring a career with personal interactions and identifying as a Physics Person.( ) Regardless of whether this informal suggestion concerning the implicit U has any merit, the reasoning that leads to it is a reminder that a combination of explicit prior ideas about causality together with SEM results can suggest causal models with testable predictions.

The paper’s Conclusion draws policy recommendations based on what the effects would be of “Modifying programmed activities to better support recognition beliefs and interest…” or “…recognizing students in programmed OCSE activities…”. ( ) Since the analysis provides no basis for estimating those causal effects, it is perhaps fortunate that the policy recommendations are anodyne and not particularly reliant on the analysis: “…[ensure] that activities are not only fun and engaging but also provide challenge. Students need to be provided with sufficient guiding support…”( )

b) Gender

Another recent paper( ) explores the effects of gender on the same student attitudes, using some slightly more descriptive names: “Recognition” becomes “Perceived Recognition”, “Performance/Competency” becomes “Competency Belief”, and “Physics Identity” becomes “Physics Person”. A linear SEM model similar to that of ( ) is used, but based on a plain DAG without bidirectional arrows, as shown in their Fig. 2 and in Fig. 2 below.

Figure 2. The DAG from reference ( ), but using the same abbreviated variable names as used for reference ( ).
In contrast to ( ), the causal goal “to explore which motivational factors cause changes in other factors…” is stated clearly from the start, and the word “cause” is not avoided subsequently.( ) Since gender, unlike OCSE, almost always precedes the various attitudes examined, it can safely be assumed to have no incoming arrows in the DAG. (Selection effects( ) could complicate that relation, but I shall not explore them here.) The paper includes a brief explicit discussion of how a particular graph was chosen from the myriad possibilities, by “dropping connections or variables of low strength” from the saturated model in which all variables are connected. The paper does not say what particular criteria were used to decide when the cost of including an edge (i.e. another adjustable parameter) was justified by the improved fit. ( ) Fig. 2 shows that the two links that were dropped are those from Gender to Competence (a.k.a. Competency Belief) and to Identity (a.k.a. Physics Person). All six possible connections between the four core attitudinal variables are retained, as in ( ), but here all are unidirectional. Nevertheless, the causal conclusions have some ambiguity. In a 16-page paper explicitly intended to guide actions by finding causal relations, only one brief, vague sentence is used to describe how the directions of causality were determined: “…we only used the suggestions that were theoretically plausible”. ( ) In this case, even given the constraint that Gender is a cause only of Recognition and Interest, there are four Markov-equivalent DAGs.( ) The arrows between Interest and Recognition and between Identity and Competence can each independently be assigned either direction. Either of these two arrows could also be replaced by a bidirectional arrow, leaving a total of nine equivalent ADMGs. ( ) The theoretical plausibility criterion again seems inadequate for deciding the directions of these arrows, especially between Interest and Recognition.
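One standard way to make such an edge-dropping criterion explicit is the Akaike Information Criterion: an extra edge is retained only if it improves twice the maximized log-likelihood by more than its penalty of 2. The following Python sketch applies that test to simulated data with a deliberately weak direct edge (all coefficients invented for illustration; only the sample size of 559 is taken from the discussion here):

```python
import numpy as np

def aic_ols(X, y):
    """AIC of a Gaussian linear model y ~ X (X already includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / n                      # MLE residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * (k + 1) - 2 * loglik                 # k coefficients + 1 variance

rng = np.random.default_rng(2)
n = 559                                             # the quoted sample size
gender = rng.integers(0, 2, size=n).astype(float)
interest = 0.33 * gender + rng.normal(size=n)
# A deliberately weak direct Gender -> Recognition edge (coefficient 0.05):
recognition = 0.64 * interest + 0.05 * gender + rng.normal(size=n)

ones = np.ones(n)
aic_with = aic_ols(np.column_stack([ones, interest, gender]), recognition)
aic_without = aic_ols(np.column_stack([ones, interest]), recognition)
print(aic_with, aic_without)
# The edge earns its keep only if it raises 2*log-likelihood by more than 2;
# an edge this weak usually fails that test at this sample size.
```

Stating such a criterion, whatever one is chosen, would let readers see exactly when a weak edge survives.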
If some other plausible prior constraints were used to make these crucial choices, they are not described in the paper. ( ) If, for example, the arrow between Interest and Recognition started from Interest rather than from Recognition, the meaning of the equivalent graph would change substantially. The coefficient between Interest and Recognition would increase slightly, from 0.64 to 0.67. The coefficient from Gender to Interest would increase from 0.16 to 0.33, and the coefficient from Gender to Recognition would fall from 0.27 to 0.05. The qualitative impression would then be that gender differences in interests are most important, rather than gender differences in perceived recognition.

The very small coefficient for the Gender → Recognition arrow in this equivalent graph raises the question of whether it would meet the unspecified criteria for “dropping connections or variables of low strength”, since it is only one third as large as the smallest direct coefficient included. ( ) Based on the sample size of 559, ( ) the 95% confidence interval for this coefficient would be (-0.05, 0.15), so that conventionally one would expect the coefficient to be dropped from the model. If it were dropped, then in the resulting new equivalence class all Gender effects would be mediated through Interest. This equivalence class includes some DAGs for which Recognition has no causal effects on other traits. The claim of the title of ref. ( ) thus lacks robust statistical support in the data.

Ref. ( ), unlike ref. ( ), acknowledges that choices must be made in picking the graph that will be given a causal interpretation. Like ref. ( ), however, it does not acknowledge the problems involved in using any such graph to represent a snapshot of traits, nor does it treat the crucial choices of what causes what as deserving the attention and clarity devoted to various details of correlations.
Those choices, however, are essential for predicting the effects of actions and thus for recommending policies. As in ( ), the broad policy recommendations, centered on making the physics classroom gestalt more supportive, did not require careful identification of causal relations among the attitudinal variables. ( ) A controlled intervention along those general policy lines, from an overlapping group of authors, did report important improvements in physics course performance for females, statistically significant by the conventional p<0.05 criterion, and perhaps also for non-white students, although the effect did not reach conventional statistical significance for the small non-white sample.( ) That intervention( ) emphasized encouraging effort when faced with difficulty and facilitating supportive interactions among students. It did not include a survey of the traits studied in ( ) and ( ), so it is hard to tell which, if any, of these traits (or perhaps the implicit U of ( )) were most affected. This experiment serves as a reminder of how much more straightforward it is to find out what works from actual interventions than from static correlations.

c) Comparing (a) and (b)

Both studies ( ) and ( ) include each of the six possible connections among the four shared core attitudinal variables. One connection is represented in one study( ) with a unidirectional arrow and in the other study( ) with a bidirectional arrow representing the influence of an implicit unmeasured variable. Two connections are represented with arrows pointing in opposite directions in the two studies. Three are represented with arrows in the same direction in both studies. One of those three arrows has opposite signs of coefficients in the two studies. Of the six edges, only two are represented with arrows of the same type with the same sign of coefficient. It would be hard to argue that the two graphs representing these variables are close to converging toward a shared causal picture.
No amount of linear SEM analysis can decide which causal pattern is best without further inputs concerning realistic causal mechanisms. One obvious type of input would be longitudinal data. One possible result would be that no DAG could represent the causal relations between the different traits unless each trait were unrolled into a time series. Fortunately, methods have been developed for inferring causal relations in complicated time-dependent data. (2, 17) Specific interventions in randomized controlled trials would provide stronger evidence.
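To make the unrolling concrete, here is a hypothetical Python sketch of two mutually influencing traits observed over several waves (all coefficients and sample sizes invented for illustration). A lagged regression recovers directed coefficients running both ways between the traits, something no single-snapshot DAG over the two traits alone could represent:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 6, 50_000   # 6 survey waves, 50,000 simulated students

# Hypothetical cross-lagged dynamics: each trait feeds both its own
# and the other trait's next value.
A = np.zeros((T, n))
B = np.zeros((T, n))
A[0] = rng.normal(size=n)
B[0] = rng.normal(size=n)
for t in range(1, T):
    A[t] = 0.6 * A[t-1] + 0.3 * B[t-1] + rng.normal(scale=0.5, size=n)
    B[t] = 0.2 * A[t-1] + 0.7 * B[t-1] + rng.normal(scale=0.5, size=n)

# Pooled lagged ("unrolled") regressions recover the directed coefficients.
X = np.column_stack([A[:-1].ravel(), B[:-1].ravel(), np.ones((T - 1) * n)])
coef_A = np.linalg.lstsq(X, A[1:].ravel(), rcond=None)[0]
coef_B = np.linalg.lstsq(X, B[1:].ravel(), rcond=None)[0]
print(coef_A[:2])   # close to the true values [0.6, 0.3]
print(coef_B[:2])   # close to the true values [0.2, 0.7]
```

A snapshot correlation of A and B at one wave would collapse all of this mutual influence into a single undirected number.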
Characterizing Causal Variables
A recent paper by Salehi et al. ( ) claims that the differences found between major demographic groups in scores on tests in introductory college physics courses are due to differences in pre-course “preparation”. The explicit data analysis given shows that the college exam scores of individuals can be predicted fairly well from a combination of pre-course scores on math ACT/SAT tests and physics “concept inventory” (CI) exams. These tests are described as “admittedly crude proxies of incoming preparation”.( ) If demographic variables are added to the predictive model, their coefficients are not large enough to be considered statistically significant in this sample, and their point estimates are small for practical purposes.( ) That shows that the incoming test combination is an approximately demographically unbiased predictor of college physics exam scores, at least with respect to the demographic variables considered. This result was then interpreted to mean that “preparation gaps” are responsible for the demographic differences in college scores. ( ) That conclusion is not justified, since the initial test differences, like the final test differences, can serve as “crude proxies” for any number of common causes, including ones for which measures are not available. Any fairly stable individual trait measurable by these exams would give the results observed, so long as its effects on both the pre-course and college tests were fairly large and about the same in each demographic group, and other independent traits did not introduce demographic imbalance. Therefore the core result, the insignificance of direct demographic prediction terms for the college tests once pre-course tests are included in the model, does not provide adequate information to distinguish what causes the fairly stable individual differences.
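The point that any stable trait proxied by both tests would reproduce the observed pattern can be checked directly. In the following hypothetical Python sketch (invented coefficients, not the data of the paper under discussion), a single latent trait, deliberately left unnamed, drives both exams; the direct demographic term in the regression is then tiny even though “preparation” as such never enters the simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

group = rng.integers(0, 2, size=n).astype(float)   # hypothetical demographic flag
# One stable latent trait -- call it anything -- with a group mean difference:
trait = rng.normal(size=n) + 0.5 * group
pre = trait + rng.normal(scale=0.15, size=n)       # pre-course tests as proxy
post = trait + rng.normal(scale=0.15, size=n)      # college exam as proxy

gap = post[group == 1].mean() - post[group == 0].mean()

X = np.column_stack([pre, group, np.ones(n)])
b_pre, b_group, _ = np.linalg.lstsq(X, post, rcond=None)[0]
print(f"raw group gap in college scores: {gap:.2f}")
print(f"direct group coefficient given pre-scores: {b_group:.3f}")  # small
```

The group gap is large, the direct group term in the regression is small, and yet nothing in the data identifies *which* trait the tests proxy, which is exactly the identification problem described above.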
The ordinary-language interpretation of the title and the entire discussion implies that the key causal factor is “preparation” in something like the usual sense of the word. Lest there be any doubt that “preparation” is not meant as a catch-all term for all pre-course traits that might affect the test results, a subsequent editorial entitled “It’s Not ‘Talent,’ It’s ‘Privilege’” by the senior author explicitly claims that the paper shows that “talent” plays little role in the score differentials.( ) Other potential causal variables, e.g. interest and the other traits discussed in ( ) and ( ), are not even mentioned. The key sentence of the paper comes in the Discussion: “We initially expected that it would be differences in what high school physics courses were taken, but we analyzed that for HSWC [the highly selective west-coast university], and we found that all demographic groups at this institution had the same distribution of taking AP physics, regular high school physics, and no physics, even though the groups had different average CI prescores and math SAT or ACT scores.” ( ) In other words, the main conventional component of “preparation” was at least roughly measured and was nominally the same for the different groups. That would suggest that preparation is not the key causal variable that differs. Other factors, for which the pre-course tests also serve as crude proxies, have no such measured indicator of being matched between the different groups. Realistically, however, courses with the same name can be radically different in different U.S. schools, and those differences are likely to show major correlations with racial/ethnic differences. Therefore, although the results provide no evidence for the role of preparation, they do not provide strong evidence against its role in causing differences between those groups.
For the most part, however, males and females go to the same schools, so that if they took the same nominal courses they took the same actual courses. Thus the results provide evidence that it is not preparation, at least in the ordinary sense of the word, that accounts for the male/female differences in both pre-course and college exams. If “preparation” is taken in a much broader sense, to include all sorts of pre-high-school factors, it becomes impossible to distinguish from any other cause using these sorts of data.

These relations may again be easier to see with the help of a diagram. The paper ( ) does include several DAGs used for SEM calculations of the predictive role of math ACTs and CI tests for different demographic traits. Although the paper acknowledges that such SEM analysis “does not test for causality”, the actual SEMs used can reasonably be interpreted as part of a causal diagram, so long as the measured variables are interpreted as proxies for underlying traits, since the time order (demography, pre-tests, tests) is clear. ( ) Fig. 3 here shows a simplified version of that diagram, but expanded to include the possible causes about which the paper makes claims. For simplicity I include only gender as a demographic variable and do not disaggregate the pre-tests, whose details are irrelevant here and were described correctly in the original paper.( ) The central observational result reported is that the sum of the coefficients of any paths from gender (or other demographic variables) to post-scores that bypass pre-scores is negligible. The causal claims, however, concern the paths from gender to pre-scores. The only measured mediator on that path is high-school course preparation, but since the effect of gender on it within this limited sample is said to be about zero, gender differences give no information on the effect of this major type of preparation on downstream variables.
In other words, the data were consistent with erasing the arrow from Gender to HS Courses. All the other features are unmeasured. These include direct effects of gender on scores, and effects mediated by other forms of preparation or other experiences. All of these are moderated by the effects of the gender-asymmetric social environment. We can draw no more conclusions about the roles of all these unmeasured variables than we could from a diagram in which the pre-test step was simply omitted.

Figure 3. A DAG schematically representing the causal issues addressed in ( ). Variables shown by lightly-filled ovals are unmeasured. The combined effects of social variables and gender are presumably strongly non-linear, i.e. the effects of gender depend on social context. Since it happens that in this sample the variable HS Courses is approximately independent of Gender,( ) the arrow between them could be erased. Other demographic variables have similar diagrams, but with poorly measured HS Courses.

If preparation were the main causal factor, then one might expect that the obvious policy change would be to offer preparatory courses to students who need them to prepare for the usual track, as some universities already do. The tentative policy recommendations in the paper and subsequent editorial, however, included “changing the coverage and pace of some intro courses so they are optimized for the third of the distribution with the least preparation…”, i.e. making the main track less challenging. ( ) That policy recommendation seems to assume that the differences between the groups come from persistent traits not fixable by a preparatory course. Ironically, the recommended policy change would be consistent with precisely the sorts of causal interpretations that were rejected, rather than with the causal interpretation that was put forward.
Perhaps this double negative would result in a workable policy for teaching college physics, although not one for teaching causal reasoning.

Relating Causal Results to Theoretical Frameworks
Van Dusen and Nissen ( ) have analyzed the effect of learning assistants (LAs) on student outcomes within a “quantitative critical race theory (QuantCrit) perspective focused on the role of hegemonic power structures in perpetuating inequitable student outcomes”. After reading five pages of introductory material “challenging the ideas of objectivity” etc., ( ) a reader might be prepared for unsupported causal claims. Such a reader would be surprised, however. The core analysis of the paper simply compares successful course completion rates in semesters where LAs were used with semesters when they were not.( ) Reasonable controls were used to try to minimize causal confounding, by comparing semesters that were approximately otherwise matched. The choice of which demographic characteristics to include in comparisons of the differential effects of the treatment on subgroups was made via the Akaike Information Criterion,( ) avoiding both unreasonably strong prior assumptions and excessive focus on statistically shaky correlations. ( ) The conclusion that LAs help students succeed, especially students from demographic groups most at risk of failure, ( ) seems warranted.

Given the generally uncontroversial value that we want students to succeed, a policy recommendation for using LAs might easily follow. The “QuantCrit perspective” re-enters in the Conclusions, however. We are warned of the “danger” that “If well-resourced institutions disproportionately adopt LA programs, those programs will perpetuate existing racist and classist power structures by disproportionately benefiting White, middle-upper class students.” ( ) Fortunately the recommendation is to provide “support at institutions serving marginalized students”, where the results also suggest the benefits may be largest.
( ) At any rate, readers are nonetheless free to decide in an informed way on what they believe is the best policy, since the core causal analysis of the data is transparent and without obvious errors. The first three papers described above (10, 11, 18) implicitly introduced strong unjustified assumptions into their analysis of the data. In contrast, this paper( ) introduces very strong explicit claims that then seem to have no influence on the data analysis.

Weak Priors
The problems I have described so far fall loosely into the broad category of excessively informative Bayesian priors, both about causal relations and about valuable policies. Unwarranted assumptions were made about causal patterns, sometimes even overriding the implications of the data, and policy recommendations were made with little regard to the analyses. Nevertheless, it is also possible to reach erroneous conclusions from limited data by using overly weak priors. To illustrate that, I use an example from this field, though from a physics education research conference proceedings rather than from this journal. ( ) Machine learning was used to analyze what factors influence whether students are accepted to a physics graduate program, the most prominent factors found being undergraduate GPA and Physics GRE scores. ( ) The resulting plot (Fig. 2 of ( )) of greater or lesser acceptance probability as a function of those two variables looks reasonable at a coarse-grained level, with higher values of either predictor increasing the probability. In detail, however, the plot includes multiple non-monotonic regions, looking like a map of disputed territories in some fractious conflict. I do not believe that any admissions program that would have accepted some applicant would reject an otherwise identical applicant with a higher GRE score or GPA. In this case, I believe the machine should have been given the prior constraint that the acceptance probability function was monotonic in each predictor.
Discussion
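To make the preceding point concrete before turning to general conclusions: a monotonicity prior of the sort just described can be stated and enforced mechanically. The sketch below is purely illustrative, using synthetic data and a deliberately crude running-maximum projection of my own devising; it is not the model or the data of the proceedings paper discussed above.

```python
# Illustrative sketch of a monotonicity prior on an acceptance-probability
# surface. The data are synthetic and the running-maximum projection is a
# crude stand-in for more principled monotone estimators.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
gpa = rng.uniform(2.0, 4.0, n)
gre = rng.uniform(400, 990, n)
# Synthetic "true" acceptance probability, monotone in both predictors
true_p = 1 / (1 + np.exp(-(2.5 * (gpa - 3.0) + 0.01 * (gre - 700))))
accepted = rng.random(n) < true_p

# Raw binned estimate: noisy, and typically non-monotonic in places
gpa_bins = np.linspace(2.0, 4.0, 9)
gre_bins = np.linspace(400, 990, 9)
i = np.clip(np.digitize(gpa, gpa_bins) - 1, 0, 7)
j = np.clip(np.digitize(gre, gre_bins) - 1, 0, 7)
counts = np.zeros((8, 8))
hits = np.zeros((8, 8))
np.add.at(counts, (i, j), 1)
np.add.at(hits, (i, j), accepted)
p_hat = hits / np.maximum(counts, 1)

# Impose the prior: acceptance probability non-decreasing in each predictor,
# here by a running-maximum sweep along both axes
p_mono = np.maximum.accumulate(np.maximum.accumulate(p_hat, axis=0), axis=1)

# After the projection, raising either predictor never lowers the estimate
assert np.all(np.diff(p_mono, axis=0) >= 0)
assert np.all(np.diff(p_mono, axis=1) >= 0)
```

More principled versions of the same constraint exist, e.g. isotonic regression or gradient-boosted trees with monotonicity constraints; the point here is only that the prior is easy to impose once one decides to impose it.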
I feel some trepidation about drawing any conclusions in a paper devoted to criticizing the conclusions of others. Although I will not fully avoid the sins I pointed to above, at least I will try to be open about conclusions based on subjective prior beliefs.
We have seen that several papers fail in several ways. The first is that inattention to the variety of causal possibilities enables drawing causal pictures that are unsupported by the data. Alternative possibilities are sometimes not even mentioned; neither are any of the standard modern methods of causal inference. The second is that the policy recommendations are at best loosely connected with the causal interpretations, regardless of whether those are right or wrong. In addition, strong premises are sometimes invoked and then left unused, while common-sense prior knowledge is sometimes ignored.
If we care about educational consequences, our policy recommendations need to be grounded in reliable estimates of causal effects. The notorious difficulties in drawing reliable causal conclusions from observational data in fields such as education should be taken as a reason to pay even closer attention to the relation between models and data than we do in physics, rather than as a reason to relax that attention. That task requires getting up to speed on causal inference methods. It may also require a willingness to modify prior beliefs to which researchers and funding agencies are attached, since powerful research tools cannot be guaranteed to produce desired results.
How far is physics education research (PER) behind the curve? The papers discussed here are not a representative random sample of PER papers. They are mostly from a select group suggested to me by people in the field. Without a more comprehensive survey, it would be premature to estimate the prevalence of invalid causal reasoning and of policy recommendations unmoored from causal expectations.
Regardless of the precise frequency with which improper causal techniques are used in PER, it should be clear from this sample that there is room for improvement in the editorial process. I believe that adding some experts in causal inference to the editorial board of this journal (and others) could be helpful, although admittedly no one has done a randomized controlled trial to see whether that treatment would actually work. If a change were made abruptly, perhaps a regression discontinuity analysis ( ) could indicate whether it succeeded for the journal, at least if some reliable measure of quality were found. It would then be interesting to look for violations of the Stable Unit Treatment Value Assumption ( ), i.e., whether the level of causal reasoning was changed in just one journal, or perhaps raised throughout the field, or whether the same sorts of papers ended up published but just in different journals. It might also help to have a causal reasoning primer specifically for PER, similar to those mentioned for other areas ( ). Its authors should include at least one with domain-specific PER knowledge and one with a solid grounding in causal inference, neither of which is true of the present author.
Acknowledgements
I thank Carl Wieman for a cordial exchange and Jamie Robins, Sander Greenland, and Thomas Richardson for very helpful editorial comments on sections of this paper. Thomas and Jamie in particular guided me through some very basic graph algebra that I had not learned. I am intermittently grateful to the Physics Education Research group at UIUC for introducing me to this field. I am not sure whether to thank or blame editors of this journal for suggesting this project.
References
1. J. Pearl, M. Glymour, and N. P. Jewell, Causal Inference in Statistics: A Primer (Wiley, Chichester, U.K., 2016).
2. M. A. Hernán and J. M. Robins, Causal Inference: What If (Chapman & Hall/CRC, Boca Raton, 2020).
3. C. Glymour, K. Zhang, and P. Spirtes, Review of Causal Discovery Methods Based on Graphical Models. Frontiers in Genetics, 524 (2019).
4. T. A. Glass, S. N. Goodman, M. A. Hernán, and J. M. Samet, Causal Inference in Public Health. Ann. Rev. Public Health, 61 (2013).
5. H. R. Varian, Causal inference in economics and marketing. Proc. Natl. Acad. Sci., 7310 (2016).
6. E. M. Foster, Causal inference and developmental psychology. Dev. Psychol., 1454 (2010).
7. M. Gangl, Causal Inference in Sociological Research. Annual Review of Sociology, 21 (2010).
8. L. Keele, The Statistics of Causal Inference: A View from Political Methodology. Political Analysis, 313 (2015).
9. T. Richardson, Markov Properties for Acyclic Directed Mixed Graphs. Scandinavian J. Statistics: Theory and Applications, 145 (2003).
10. R. M. Lock, Z. Hazari, and G. Potvin, Impact of out-of-class science and engineering activities on physics identity and career intentions. Phys. Rev. Phys. Educ. Res., 020137 (2019).
11. Z. Y. Kalender, E. Marshman, C. D. Schunn, T. J. Nokes-Malach, and C. Singh, Why female science, technology, engineering, and mathematics majors do not identify with physics: They do not think others see them that way. Phys. Rev. Phys. Educ. Res., 020148 (2019).
12. H. Akaike, A new look at the statistical model identification. IEEE Transactions on Automatic Control, 716 (1974).
13. S. Greenland, Quantifying Biases in Causal Models: Classical Confounding vs Collider-Stratification Bias. Epidemiology, 300 (2003).
14. Z. Hazari, G. Sonnert, P. M. Sadler, and M.-C. Shanahan, Connecting High School Physics Experiences, Outcome Expectations, Physics Identity, and Physics Career Choice: A Gender Study. J. Res. Sci. Teach., 978 (2010).
15. M. A. Hernán, S. Hernández-Díaz, and J. M. Robins, A Structural Approach to Selection Bias. Epidemiology, 615 (2004).
16. K. R. Binning et al., Changing Social Contexts to Foster Equity in College Science Courses: An Ecological-Belonging Intervention. Psychological Science, 1059 (2020).
17. J. Robins, A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 1393 (1986).
18. S. Salehi, E. Burkholder, G. P. Lepage, S. Pollock, and C. Wieman, Demographic gaps or preparation gaps? The large impact of incoming preparation on performance of students in introductory physics. Phys. Rev. Phys. Educ. Res., 020114 (2019).
19. C. Wieman, It’s Not “Talent,” it’s “Privilege”. APS News, 8 (2020).
20. B. Van Dusen and J. Nissen, Associations between learning assistants, passing introductory physics, and equity: A quantitative critical race theory investigation. Phys. Rev. Phys. Educ. Res., 010117 (2020).
21. N. T. Young and M. D. Caballero, in Physics Education Research Conference (Provo, UT, 2019), p. 669.
22. D. S. Lee and T. Lemieux, Regression Discontinuity Designs in Economics. Journal of Economic Literature, 281 (2010).
23. S. Schwartz, N. M. Gatto, and U. B. Campbell, Extending the sufficient component cause model to describe the Stable Unit Treatment Value Assumption (SUTVA). Epidemiol. Perspect. Innov., 1 (2012).