[PDF] Teacher-to-classroom assignment and student achievement

Abstract

We study the effects of counterfactual teacher-to-classroom assignments on average student achievement in elementary and middle schools in the US. We use the Measures of Effective Teaching (MET) experiment to semiparametrically identify the average reallocation effects (AREs) of such assignments. Our findings suggest that changes in within-district teacher assignments could have appreciable effects on student achievement. Unlike policies which require hiring additional teachers (e.g., class-size reduction measures), or those aimed at changing the stock of teachers (e.g., VAM-guided teacher tenure policies), alternative teacher-to-classroom assignments are resource neutral; they raise student achievement through a more efficient deployment of existing teachers.

Full PDF

TTeacher-to-classroom assignmentand student achievement

Bryan S. Graham ∗ , Geert Ridder † , Petra Thiemann ‡ , Gema Zamarro §¶ September 2, 2020

Abstract

We study the eﬀects of counterfactual teacher-to-classroom assignments on averagestudent achievement in elementary and middle schools in the US. We use the Measuresof Eﬀective Teaching (MET) experiment to semiparametrically identify the average re-allocation eﬀects (AREs) of such assignments. Our ﬁndings suggest that changes inwithin-district teacher assignments could have appreciable eﬀects on student achieve-ment. Unlike policies which require hiring additional teachers (e.g., class-size reductionmeasures), or those aimed at changing the stock of teachers (e.g., VAM-guided teachertenure policies), alternative teacher-to-classroom assignments are resource neutral; theyraise student achievement through a more eﬃcient deployment of existing teachers.

JEL codes:

I20, I21, I24

Keywords: teacher quality, teacher assignment, education production, student achievement,average reallocation eﬀects, K–12 ∗ Department of Economics, University of California - Berkeley, and National Bureau ofEconomic Research, e-mail: [email protected] † Department of Economics, University of Southern California, e-mail: [email protected] ‡ Department of Economics, Lund University, and IZA, e-mail: [email protected] § Department of Education Reform, University of Arkansas, and CESR, e-mail: [email protected] ¶ We thank seminar audiences at the 2017 All-California Econometrics Conference at Stanford University,UC Riverside, NESG, University of Duisburg-Essen, Tinbergen Institute Amsterdam, Lund University, IFNStockholm, the University of Bristol, and the University of Southern California for helpful feedback. We thankTommy Andersson, Kirabo Jackson, Magne Mogstad, Hessel Oosterbeek, Daniele Paserman, and HashemPesaran for useful comments and discussions. All the usual disclaimers apply. Financial support for Grahamwas provided by the National Science Foundation (SES a r X i v : . [ ec on . E M ] S e p Introduction

Approximately four million teachers work in the public elementary and secondary educationsystem in the United States. These teachers provide instruction to almost ﬁfty million stu-dents, enrolled in nearly one hundred thousand schools, across more than thirteen thousandschool districts (McFarland et al., 2019, Snyder et al., 2017). Diﬀerences in measured stu-dent achievement are substantial across US schools as well as across classrooms within theseschools. Beginning with Hanushek (1971), a large economics of education literature attributescross-classroom variation in student achievement to corresponding variation in (largely) la-tent teacher attributes. These latent attributes are sometimes referred to as teacher qualityor teacher value-added.The implications of value-added measures (VAM) for education policy are controver-sial both within the academy and outside it. The most contentious applications of VAMinvolve their use in teacher tenure and termination decisions (cf., Chetty et al., 2012, Darling-Hammond, 2015). The premise of such applications is that changes in the stock of existingteachers – speciﬁcally rooting out teachers with low VAMs and retaining those with highones – could lead to large increases in student achievement and other life outcomes.In this paper we pose an entirely diﬀerent question: is it possible to raise student achieve-ment, without changes to the existing pool of teachers, by changing who teaches whom?Schools and school districts are the loci of teacher employment. To keep our analysis policy-relevant, we therefore focus on the achievement eﬀects of diﬀerent within-school and within-district teacher-to-classroom assignment policies.For teacher assignment policies to matter, teachers must vary in their eﬀectiveness inteaching diﬀerent types of students. For example, some teachers may be especially goodat teaching English language learners, minority students, or accelerated learners (c.f., Dee,2004, Loeb et al., 2014). Formally educational production must be non-separable in someteacher and student attributes (Graham et al., 2007, 2014, 2020).Using data collected in conjunction with the Measures of Eﬀective Teaching (MET)project, we present experimental evidence of such non-separabilities. We expand upon stan-dard models of educational production, which typically assume that the eﬀects of teacher andstudent inputs are separable. Our identiﬁcation strategy relies on the random assignment ofteachers to classrooms in the MET project. Speciﬁcally, we study assignments based upon(i) an observation-based, pre-experiment measure of teaching practice – Danielson’s (2011)Framework for Teaching (FFT) instrument – and (ii) students’ and classroom peers’ baselinetest scores, also measured pre-experiment.Alternative teacher assignments change the joint distribution of these variables. When See, for example, the March 2015 special issue of

Educational Researcher on value-added research andpolicy or the American Statistical Association’s statement on value-added measures (Morganstein and Wasser-stein, 2014). τ ×

100 percent ofteachers from classrooms – sorted according to their VAM – and replaces them with averageteachers (i.e., teachers with VAMs of zero). Assuming a Gaussian distribution for teachervalue-added, the eﬀect of such an intervention would be to increase the mean of the studenttest score distribution by (1 − τ ) σ φ (cid:0) q τ σ (cid:1) − Φ (cid:0) q τ σ (cid:1) standard deviations. Here σ corresponds to the standard deviation of teacher value-addedand q τ to its τ th quantile.Rockoﬀ (2004, Table 2) and Rothstein (2010, Table 6) estimate a standard deviationof teacher value-added of between 0.10 and 0.15. Taking the larger estimate and setting τ = 0 .

05 (0 .

10) generates an expected increase in student test scores of 0 .

015 (0 . As is common practice, we abstract from the eﬀects of possible behavioral responses to VAM-guidedteacher retention policies. For example, teachers might become demoralized by VAM retention systems andteach less eﬀectively as a result; or they could be motivated to teach more eﬀectively. Moreover, many teachers have preferences over school and student characteristics. Thus, one mightneed to pay bonuses to move teachers to certain schools or classrooms (e.g., classrooms with high fractionsof disadvantaged students). diﬀerent pair of teacher andclassroom attributes would lead to greater achievement gains. Working with appropriatelychosen linear combinations of multiple student and teacher attributes could generate evenlarger achievement gains. We leave explorations along these lines to future research.Readers may reasonably raise questions about the external validity of the results reportedbelow. We believe such skepticism is warranted and encourage readers to view what isreported below as provisional. We do, however, hope our results are suﬃciently compellingto motivate additional research and experimentation with teacher-to-classroom assignmentpolicies.Work by Susanna Loeb and co-authors suggests that US school districts have de facto teacher-to-classroom assignment policies (Kalogrides et al., 2011, Grissom et al., 2015). Forexample, they ﬁnd that less experienced, minority, and female teachers are more likely to beassigned to predominantly minority classrooms. They also present evidence that principalsuse teacher assignments as mechanisms for retaining teachers – as well as for encouragingless eﬀective teachers to leave – and that more experienced teachers exert more inﬂuence onclassroom assignment decisions.Our work helps researchers and policy-makers understand the achievement eﬀects of suchpolicies and the potential beneﬁts of alternative ones. The ﬁndings presented below suggestthat teacher-to-classroom assignment policies are consequential and that changes to themcould meaningfully increase average student achievement.In addition to our substantive results, we present new identiﬁcation results for averagereallocation eﬀects (AREs). Identiﬁcation and estimation of AREs under (conditional) exo-geneity is considered by Graham et al. (2014, 2020). Unfortunately these results do not applydirectly here. Although teachers were randomly assigned to classrooms as part of the METexperiment, compliance was imperfect. Furthermore some students moved across classroomsafter the random assignment of teachers, which raises concerns about bias due to endogenousstudent sorting. We develop a semiparametric instrumental variables estimator (e.g., Ai andChen, 2003) which corrects for student and teacher non-compliance. Our analysis highlightshow complex identiﬁcation can be in the context of multi-population matching models whereagents sort endogenously.In independent work, Aucejo et al. (2019) also use the MET data to explore comple-mentarity between student and teacher attributes in educational production. They do not That is, we did not consider the universe of possible assignment variables.

Our goal is to identify the average achievement eﬀects of alternative assignments of teachers toMET classrooms. These are average reallocation eﬀects (AREs), as introduced by Grahamet al. (2007, 2014). The identiﬁcation challenge is to use the observed MET teacher-to-classroom assignments and outcomes to recover these AREs.Our analysis is based upon experimentally generated combinations of student and teacherattributes, that is, it exploits the random assignment of teachers to classrooms in the METexperiment. Like in many other ﬁeld experiments, various deviations from MET’s intendedprotocol complicate our analysis. In this section we outline a semiparametric model of educa-tional production and consider its identiﬁcation based upon the MET project as implemented ,using the MET data as collected .It is useful, however, to ﬁrst explore nonparametric identiﬁcation of reallocation eﬀectsunder an ideal implementation of the MET project (henceforth MET as designed ). Such anapproach clariﬁes how the extra restrictions introduced below allow for the identiﬁcation ofreallocation eﬀects despite non-compliance, attrition, and other deviations from the intendedexperimental protocol.

Nonparametric identiﬁcation under ideal circumstances

Our setting features two populations, one of students and the other of teachers. Each studentis distinguished by an observed attribute X i , in our case a measure of baseline academicachievement, and an unobserved attribute, say “student ability,” V i . Similarly, each teacheris characterized by an observed attribute W i , in our case an observation-based measure ofteaching pedagogy, and an unobserved attribute, say “teacher quality,” U i . Let i = 1 , . . . , N index students. Let C be the total number of MET classrooms orequivalently teachers. We deﬁne G i to be a C × c th element of G i equals one if student i is in classroom c ∈ { , . . . , C } and zero otherwise. To be precise, the randomization was carried out within schools. See Section 3 for details. We use “ability” as a shorthand for latent student attributes associated with higher test scores; likewisewe use “quality” as a shorthand for latent teacher attributes associated with higher test scores. i (cid:48) s peers or classmates are therefore given by the index set p ( i ) = { j : G i = G j , i (cid:54) = j } . Next we deﬁne the peer average attribute as ¯ X p ( i ) = | p ( i ) | (cid:80) j ∈ p ( i ) X j (i.e., the average of thecharacteristic X across student i (cid:48) s peers). We deﬁne ¯ V p ( i ) similarly.The MET project protocol did not impose any requirements on how students, in a givenschool-by-grade cell, were divided into classrooms. Evidently schools followed their existingprocedures for dividing students within a grade into separate classrooms. An implication ofthis observation is that the MET experiment implies no restrictions on the joint density f X i ,V i , ¯ X p ( i ) , ¯ V p ( i ) ( x, v, ¯ x, ¯ v ) , (1)beyond the requirement that the density be feasible. For example, if most schools trackedstudents by prior test scores, then we would expect X i and ¯ X p ( i ) to positively covary. If,instead, students were randomly assigned to classrooms and hence peers, we would have,ignoring ﬁnite population issues, the factorization f X i ,V i , ¯ X p ( i ) , ¯ V p ( i ) ( x, v, ¯ x, ¯ v ) = f X i ,V i ( x, v ) f ¯ X p ( i ) , ¯ V p ( i ) (¯ x, ¯ v ) . Our analysis allows for arbitrary dependence between own and peer attributes, both observedand unobserved, and consequently is agnostic regarding the protocol used to group studentsinto classrooms.Two implications of this agnosticism are (i) our analysis is necessarily silent about thepresence and nature of any peer group eﬀects, and (ii) it is likely that more complicated poli-cies, involving simultaneously regrouping students into new classes and reassigning teachersto them, could raise achievement by more than what is feasible via reassignments of teachersto existing classrooms alone, which is the class of policies we consider. Although nothing about the MET protocol generates restrictions on the joint density (1),random assignment of teachers to classrooms – however formed – ensures that f X i ,V i , ¯ X p ( i ) , ¯ V p ( i ) ,W i ,U i ( x, v, ¯ x, ¯ v, w, u ) = f X i ,V i , ¯ X p ( i ) , ¯ V p ( i ) ( x, v, ¯ x, ¯ v ) f W i ,U i ( w, u ) . (2)Here W i and U i denote the observed and unobserved attributes of the teacher assigned tothe classroom of student i . A perfect implementation of MET as designed would ensure We cannot, for example, have an assignment which implies all students have above-average peers. Indeed additional restrictions on the second density to the right of the equality above would also hold;see Graham et al., 2010, and Graham, 2011, for additional discussion and details. Learning about the eﬀects of policies which simultaneously regroup students and reassign teachers wouldrequire double randomization. See Graham (2008) for an empirical example and Graham et al. (2010, 2020)for a formal development. Y i be an end-of-year measure of student achievement, generated according to Y i = g (cid:0) X i , ¯ X p ( i ) , W i , V i , ¯ V p ( i ) , U i (cid:1) . (3)Other than the restriction that observed and unobserved peer attributes enter as means,equation (3) imposes no restrictions on educational production. Under restriction (2) theconditional mean of the outcome given observed own, peer, and teacher attributes equals E (cid:2) Y i | X i = x, ¯ X p ( i ) = ¯ x, W i = w (cid:3) = (cid:90) (cid:90) (cid:90) (cid:104) g ( x, ¯ x, w, v, ¯ v, u ) f V i , ¯ V p ( i ) | X i ,X p ( i ) ( v, ¯ v | x, ¯ x ) × f U i | W i ( u i | w i ) (cid:3) d v d¯ v d u = m amf ( x, ¯ x, w ) . (4)Equation (4) coincides with (a variant of) the Average Match Function (AMF) estimanddiscussed by Graham et al. (2014, 2020). The AMF can be used to identify AREs. Oursetting – which involves multiple students being matched to a single teacher – is somewhatmore complicated than the one-to-one matching settings considered by Graham et al. (2014,2020). One solution to this “problem” would be to average equation (3) across all students inthe same classroom and work directly with those averages. As will become apparent below,however, working with a student-level model makes it easier to deal with non-complianceand attrition, which have distinctly student-level features. It also connects our results moredirectly with existing empirical work in the economics of K-to-12 education, where student-level modelling predominates, and results in greater statistical power.The decision to model outcomes at the student level makes the analysis of teacher re-assignments a bit more complicated, at least superﬁcially. To clarify the issues involved it ishelpful to consider an extended example. Assume there are two types of students, X i ∈ { , } ,and two types of teachers, W i ∈ { , } . For simplicity assume that the population fractionsof type X i = 1 students and type W i = 1 teachers both equal one-half, that is, half of thestudents are of type 1 and half of the students are taught by a teacher of type 1. Assume,again to keep things simple, that classrooms consist of three students each.Table 1 summarizes this basic set-up. Column 1 lists classroom types. For example, a000 classroom consists of all type-0 students. There are four possible classroom types, eachassumed to occur with a frequency of one-fourth. The status quo mechanism for groupingstudents into classrooms induces a joint distribution of own and peer average attributes. A fully nonparametric model would allow achievement to vary with any exchangeable function of peerattributes; see Graham et al. (2010) for more details. Within classrooms students are exchangeable. able 1: Feasible teacher reassignments Status quo CounterfactualClassroom Type Pr (cid:0) W i = 1 | X i , ¯ X p ( i ) (cid:1) Pr (cid:16) ˜ W i = 1 (cid:12)(cid:12)(cid:12) X i , ¯ X p ( i ) (cid:17) X i ¯ X p ( i ) f (cid:0) X i , ¯ X p ( i ) (cid:1)

12 13

12 12

12 16

12 12

12 16

12 23 Note:

The population fraction of type X i = 1 students is and that of type W i = 1 teachers is also .Classrooms of three students each are formed, such that the frequency of each of the four possible classroomconﬁgurations is in the population of classrooms of size 3. Under the status quo teachers are assigned toclassrooms at random; in the counterfactual teachers are assigned more assortatively. See the main text formore information. This joint distribution is given in the right-most column of Table 1. For instance, (3 outof 12) of the students are in a classroom with two type-0 peers, so that f X i , ¯ X p ( i ) (0 ,

0) = ,and (2 out of 12) of the students are in a classroom with one type-0 and one type-1 peer,so that f X i , ¯ X p ( i ) (0 , ) = . The MET experiment implies no restrictions on the joint density f X i , ¯ X p ( i ) ( x, ¯ x ), consequently we only consider policies which leave it unchanged.Next assume, as was the case in the MET experiment, that under the status quo teachersare randomly assigned to classrooms. This induces the conditional distribution of W i given X i and ¯ X p ( i ) reported in column 2 of Table 1. Of course, from this conditional distribution,and the marginal for X i and ¯ X p ( i ) , we can recover the joint distribution of own type, peeraverage type, and teacher type (i.e., of X i , ¯ X p ( i ) and W i ).Now consider the AMF: m amf ( x, ¯ x, w ). Consider the subpopulation of students with X i =1 and ¯ X p ( i ) = . Inspecting Table 1, this subpopulation represents of all students (right-most column of Table 1). If we assign to students in this subpopulation a teacher of type W i = 1, then the expected outcome coincides with m amf (cid:0) , , (cid:1) . Under random assignmentof teachers the probability of assigning a type-1 teacher is the same for all subpopulations ofstudents.Finally consider a counterfactual assignment of teachers to classrooms. Since we leave thecomposition of classrooms unchanged, f X i , ¯ X p ( i ) ( x, ¯ x ) is left unmodiﬁed. The counterfactual as-signment therefore corresponds to a conditional distribution for teacher type, ˜ f ˜ W i | X i , ¯ X p ( i ) ( w | x, ¯ x )7hich satisﬁes the feasibility condition: (cid:90) (cid:90) ˜ f ˜ W i | X i , ¯ X p ( i ) ( w | x, ¯ x ) f X i , ¯ X p ( i ) ( x, ¯ x ) d x d¯ x = f ( w ) (5)for all w ∈ W . Here ˜ f denotes a counterfactual distribution, while f denotes a statusquo one. We use ˜ W i to denote an assignment from the counterfactual distribution. Notethat by feasibility of an assignment ˜ W i D = W i marginally, but will diﬀer conditional onstudent attributes. Condition (5), as discussed by Graham et al. (2014), allows for degenerateconditional distributions, as might occur under a perfectly positive assortative matching.Average achievement under a counterfactual teacher-to-classroom assignment equals: β are (cid:16) ˜ f (cid:17) = (cid:90) (cid:90) (cid:20)(cid:90) m amf ( x, ¯ x, w ) ˜ f ˜ W i | X i , ¯ X p ( i ) ( w | x, ¯ x ) d w (cid:21) f X i , ¯ X p ( i ) ( x, ¯ x ) d x d¯ x. (6)Since all the terms to the right of the equality are identiﬁed, so too is the ARE. Conceptuallywe ﬁrst – see the inner integral in equation (6) – compute the expected outcome in each typeof classroom (e.g., X i = x and ¯ X p ( i ) = ¯ x ) given its new teacher assignment (e.g., to type˜ W i = w ). We then – see the outer two integrals in equation (6) – average over the status quodistribution of X i , ¯ X p ( i ) , which is left unchanged. This yields average student achievementunder the new assignment of teachers to classrooms.In addition to the feasibility condition (5) we need to also rule out allocations that assigndiﬀerent teachers to students in the same classrooms. Note that m amf ( x, ¯ x, w ) is the averageoutcome for the subpopulation of students of type X i = x with peers ¯ X p ( i ) = ¯ x . For example,in Table 1 classroom 001 has students from two subpopulations so deﬁned. Assignment ofteachers to subpopulations of students opens up the possibility that a classroom is assignedto teachers of diﬀerent types for its constituent subgroups of students. If, as indicatedin Table 1, the teacher-type assignment probability is the same for all subpopulations ofstudents represented in a classroom, then the ARE in equation (6) coincides with one based ondirect assignment of teachers to classrooms. This implicit restriction on teacher assignmentsprovides a link between models for individual outcomes and classroom-level reallocations. Semiparametric identiﬁcation under MET as implemented

In the MET experiment as implemented not all teachers and students appear in their assignedclassrooms. This occurs both due to attrition (e.g., when a student changes schools priorto follow-up) as well as actual non-compliance (e.g., when a teacher teaches in a classroomdiﬀerent from their randomly assigned one).In this section we describe our approach to identifying AREs in MET as implemented.Relative to the idealized analysis of the previous subsection we impose two types of additional8estrictions. First, we work with a semiparametric, as opposed to a nonparametric, educa-tional production function. Second, we make behavorial assumptions regarding the natureof non-compliance. Both sets of assumptions are (partially) testable.

Educational production function

Our ﬁrst set of restrictions involve the form of the educational production function. A key re-striction we impose is that unobserved student, peer, and teacher attributes enter separably.Although this assumption features in the majority of economics of education empirical work(e.g., Chetty et al., 2014a,b), it is restrictive. We also discretize the observed student andteacher attribute. This allows us to work with a parsimonously parameterized educationalproduction function that nevertheless accommodates complex patterns of complementaritybetween student and teacher attributes. Discretization also allows us to apply linear program-ming methods to study counterfactual assigments (c.f., Graham et al., 2007, Bhattacharya,2009).Speciﬁcally we let X i be a vector of indicators for each of K “types” of students. Typescorrespond to intervals of baseline test scores. Our preferred speciﬁcation works with K = 3types of students: those with low, medium, and high baseline test scores. In this case X i isa 2 × i ’s baseline test score was in the medium orhigh range (with the low range being the omitted group). This deﬁnition of X i means that¯ X p ( i ) equals the 2 × W i is a vector of indicatorsfor L diﬀerent ranges of FFT scores. In our preferred speciﬁcation we also work with L = 3types of teachers: those with low, medium, and high FFT scores. Hence W i is again a 2 × i ’s FFT score was in the medium orhigh range (with the low range again being the omitted group).We assess the sensitivity of our results to coarser and ﬁner discretizations of the baselinetest score and FFT distributions. Speciﬁcally we look at K = L = 2 and K = L = 4dicretizations.We posit that end-of-school year achievement for student i is generated according to Y i = α + X (cid:48) i β + V i (cid:124) (cid:123)(cid:122) (cid:125) Student Ability + ¯ X (cid:48) p ( i ) γ + ρ ¯ V p ( i ) (cid:124) (cid:123)(cid:122) (cid:125) Peer Eﬀect + W (cid:48) i δ + U i (cid:124) (cid:123)(cid:122) (cid:125) Teacher Quality + (cid:0) X i ⊗ ¯ X p ( i ) (cid:1) (cid:48) ζ (cid:124) (cid:123)(cid:122) (cid:125) Student-Peer Complementarity + ( X i ⊗ W i ) (cid:48) η + (cid:0) W i ⊗ ¯ X p ( i ) (cid:1) (cid:48) λ (cid:124) (cid:123)(cid:122) (cid:125) Student-Teacher Complementarity (7)Observe that – as noted above – own, V i , peer, ¯ V p ( i ) , and teacher, U i , unobservables enter Precise variable deﬁnitions are given below. unobserved peer attributes to inﬂuenceachievement.Conditional on working with a discrete student and teacher type space, equation (7) isunrestrictive in how own and teacher attributes interact to generate achievement. In contrast,equation (7) restricts the eﬀect of peers’ observed composition on the outcome. Partition ζ = ( ζ , . . . , ζ K − ) and similarly partition λ = ( λ , . . . , λ L − ). The ( K − × i ’s outcome with respect to peer composition is ∂Y i ∂ ¯ X p ( i ) = γ + K − (cid:88) k =1 X ki ζ k + L − (cid:88) l =1 W li λ l , (8)which is constant in ¯ X p ( i ) , although varying heterogenously with student and teacher type.Put diﬀerently convexity/concavity in ¯ X p ( i ) is ruled out by equation (7). It should be notedthat the MET data, in which the assignment of peers is not random, are not suitable forestimating peer eﬀects (non-linear or otherwise).For completeness we also include the interaction of teacher type with peer composition– the (cid:0) W i ⊗ ¯ X p ( i ) (cid:1) regressor in equation (7) – although λ is poorly identiﬁed in practice. Due to our limited sample size, we do not include the third order interactions of own, peer,and teacher types. Relative to a standard “linear-in-means” type model typically ﬁtted to datasets like ours(e.g., Hanushek et al., 2004): Y i = α + X (cid:48) i β + ¯ X (cid:48) p ( i ) γ + W (cid:48) i δ + V i + U i , (9)equation (7) is rather ﬂexible. It allows for rich interactions in observed own, peer, andteacher attributes and is explicit in that both observed and unobserved peer attributes mayinﬂuence own achievement. The “linear-in-means” model (9) presumes homogenous eﬀects Admittedly, the (cid:0) W i ⊗ ¯ X p ( i ) (cid:1) term is not entirely straightforward to interpret; however it seemed ad hocto include some second-order interactions while a priori excluding others. We also report speciﬁcations whichexclude this term below (as its coeﬃcient is always insigniﬁcantly diﬀerent from zero). The educational production function in equation (7) includes J = dim( α ) + dim( β ) + dim( γ ) + dim( δ ) +dim( ζ ) + dim( η ) + dim( λ ) = 1 + 2 + 2 + 2 + 4 + 4 + 4 = 19 parameters. A fully interacted model wouldintroduce 8 additional parameters, for 3 × × X p ( i ) is not binary-valued. A more ﬂexiblemodel would therefore also include, for example, interactions of X i with the squares of the elements of ¯ X p ( i ) (and so on.) As mentioned above, a student’s assigned teacher and peers may deviate from her realizedones due to attrition and non-compliance. To coherently discuss our assumptions about theseissues we require some additional notation. Let W ∗ i and ¯ X p ∗ ( i ) denote student i (cid:48) s assigned teacher and peer attribute (here p ∗ ( i ) is the index set of i (cid:48) s assigned classmates). Randomassignment of teachers to classrooms ensures that a student’s assigned teacher’s attributesare independent of her own unobservables: E (cid:2) V i | X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = E (cid:2) V i | X i , ¯ X p ∗ ( i ) (cid:3) def ≡ g (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) . (10)Here g ( x, ¯ x ) is unrestricted. Under double randomization, with students additionally groupedinto classes at random, we would have the further restriction E (cid:2) V i | X i , ¯ X p ∗ ( i ) (cid:3) = E [ V i | X i ] . However, since the MET experiment placed no restrictions on how students were groupedinto classrooms, we cannot rule out the possibility that a student’s peer characteristics, ¯ X p ∗ ( i ) ,predict her own unobserved ability, V i . Consequently our data are necessarily silent about thepresence and nature of any peer group eﬀects in learning. This limitation does not limit ourability to study the eﬀects of teacher reallocations, because we leave the student compositionof classrooms – and hence the “peer eﬀect” – ﬁxed in our counterfactual experiments.Finally, even with double randomization, we would still have E [ V i | X i ] (cid:54) = 0. Observedand unobserved attributes may naturally covary in any population (e.g., for example, averagehours of sleep, which is latent in our setting, plausibly covaries with baseline achievementand also inﬂuences the outcome). Such covariance is only a problem if, as is true in thetraditional program evaluation setting, the policies of interest induce changes in the marginaldistribution of X i – and hence the joint distribution of X i and V i . This is not the case here:any reallocations leave the joint distribution of X i and V i unchanged.The MET protocol also ensures that assigned peer unobservables, ¯ V p ∗ ( i ) , are independentof the observed attributes of one’s assigned teacher: E (cid:2) ¯ V p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = E (cid:2) ¯ V p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3) def ≡ g (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) , (11)with g (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) , like g (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) , unrestricted.Random assignment of teachers to classrooms also ensures independence of the unobserved Manski (1993) did allow for unobserved peer attributes as did Graham (2008); but these cases areexceptional. A prototypical example is a policy which increases years of completed schooling. Such a policy necessarilychanges the joint distribution of schooling and unobserved labor market ability relative to its status quodistribution. assigned teacher and observed student and peer characteristics: E (cid:2) U ∗ i | X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = E [ U ∗ i | W ∗ i ] = 0 . (12)The second equality is just a normalization. Under MET as designed we could identify AREs using equations (10), (11), and (12).To see this let, as would be true under perfect compliance, W i = W ∗ i and ¯ X p ( i ) = ¯ X p ∗ ( i ) forall i = 1 , . . . , N . Using equations (10), (11), and (12) yields, after some manipulation, thepartially linear regression model (e.g., Robinson, 1988): Y i = W (cid:48) i δ + ( X i ⊗ W i ) (cid:48) η + (cid:0) W i ⊗ ¯ X p ( i ) (cid:1) (cid:48) λ + h (cid:0) X i , ¯ X p ( i ) (cid:1) + A i (13)with E (cid:2) A i | X i , ¯ X p ( i ) , W i (cid:3) = 0 for A i def ≡ (cid:2) V i − g (cid:0) X i , ¯ X p ( i ) (cid:1)(cid:3) + ρ (cid:2) ¯ V p ( i ) − g (cid:0) X i , ¯ X p ( i ) (cid:1)(cid:3) + U i , (14)and where the nonparametric regression component equals h (cid:0) X i , ¯ X p ( i ) (cid:1) def ≡ α + X (cid:48) i β + g (cid:0) X i , ¯ X p ( i ) (cid:1) + ¯ X (cid:48) p ( i ) γ + ρg (cid:0) X i , ¯ X p ( i ) (cid:1) + (cid:0) X i ⊗ ¯ X p ( i ) (cid:1) (cid:48) ζ. (15)Note, even under this perfect experiment, we cannot identify β , γ , and ζ ; these terms areconfounded by g (cid:0) X i , ¯ X p ( i ) (cid:1) and g (cid:0) X i , ¯ X p ( i ) (cid:1) and hence absorbed into the nonparametriccomponent of the regression model. This lack of identiﬁcation reﬂects the inherent inabilityof the MET experiment to tell us anything about peer group eﬀects. Nevertheless, as detailedbelow, knowledge of δ , η , and λ is suﬃcient to identify the class of reallocation eﬀects wefocus upon. Patterns of non-compliance

Unfortunately, we do not observe student outcomes under full compliance. Non-compliancemay induce correlation between A i and X i , ¯ X p ( i ) and W i in regression model (13). Oursolution to this problem is to construct instrumental variables for observed teacher and peerattributes, W i and ¯ X p ( i ) – which necessarily reﬂect any non-compliance and attrition on thepart of teachers and students – from the assigned values, W ∗ i and ¯ X p ∗ ( i ) .Rigorously justifying this approach requires imposing restrictions on how, for example,realized and assigned teacher quality relate to one another. The ﬁrst assumption we makealong these lines is: Reallocations leave the joint distribution of U i and W i unchanged, so we are free to normalize this meanto zero. ssumption 1. (Idiosyncratic Teacher Deviations) E (cid:2) U i − U ∗ i | X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = 0 . (16)Assumption 1 implies that the diﬀerence between realized and assigned (unobserved)teacher “quality” cannot be predicted by own and assigned peer and teacher observables.While Assumption 1 is not directly testable, we can perform the following plausibility test.Let R i − R ∗ i be the diﬀerence between the realized and assigned value of some observedteacher attribute other than W i (e.g., years of teaching experience). Under equation (16),if we compute the OLS ﬁt of this diﬀerence onto 1, X i , W ∗ i , and ¯ X p ∗ ( i ) , a test for the jointsigniﬁcance of the non-constant regressors should accept the null of no eﬀect. Finding that,for example, students assigned to classrooms with low average peer prior-year achievement,tend to move into classrooms with more experienced teachers suggests that Assumption 1may be implausible.Assumption 1 and equation (12) above yield the mean independence restriction E (cid:2) U i | X i , W ∗ i , ¯ X p ∗ ( i ) (cid:3) = 0 . (17)This equation imposes restrictions on the unobserved attribute of student i ’s realized teacher.It is this latent variable which drives the student outcome actually observed.Our second assumption involves the relationship between the unobserved attributes of astudent’s assigned peers and those of her realized peers. These two variables will diﬀer ifsome students switch out of their assigned classrooms. Assumption 2. (Conditionally Idiosyncratic Peer Deviations) E (cid:2) ¯ V p ( i ) − ¯ V p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = E (cid:2) ¯ V p ( i ) − ¯ V p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3) (18)Assumption 2 implies that the diﬀerence between realized and assigned unobserved peer“quality” cannot be predicted by assigned teacher observables. We do allow for these devia-tions to covary with a student’s type and the assigned composition of her peers. Assumption2 and equation (11) yield a second mean independence restriction of E (cid:2) ¯ V p ( i ) (cid:12)(cid:12) W ∗ i , X i , ¯ X p ∗ ( i ) (cid:3) = g ∗ (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) , (19)where g ∗ (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) def ≡ E (cid:2) ¯ V p ( i ) − ¯ V p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3) + g (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) is unrestricted. We canalso pseudo-test Assumption 2 using observed peer attributes. Finding, for example, that– conditional on own type, X i , and assigned peers’ average type, ¯ X ∗ p ( i ) – assigned teacherquality, W ∗ i , predicts diﬀerences between the realized and assigned values of other observedpeer attributes provides evidence against Assumption 2.13he experiment-generated restrictions – equations (10), (11), and (12) above – in conjunc-tion with our two (informally testable) assumptions about deviations from the experimentprotocol – Assumptions 1 and 2 above – together imply the following conditional momentrestriction: E (cid:2) U i + V i + ρ ¯ V p ( i ) (cid:12)(cid:12) W ∗ i , X i , ¯ X p ∗ ( i ) (cid:3) = g (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) + ρg ∗ (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) . (20)We wish to emphasize two features of restriction (20). First, the conditioning variablesare assigned peer and teacher attributes, not their realized counterparts. This reﬂects ourstrategy of using assignment constructs as instruments. Second, any function of W ∗ i , as wellas interactions of such functions with functions of X i and ¯ X p ∗ ( i ) do not predict the compositeerror U i + V i + ρ ¯ V p ( i ) conditional on X i and ¯ X p ∗ ( i ) ; hence such terms are valid instrumentalvariables.More speciﬁcally we redeﬁne h to equal h (cid:0) X i , X p ∗ ( i ) , X p ( i ) (cid:1) def ≡ α + X (cid:48) i β + g (cid:0) X i , X p ∗ ( i ) (cid:1) + X (cid:48) p ( i ) γ + ρg ∗ (cid:0) X i , X p ∗ ( i ) (cid:1) + (cid:0) X i ⊗ X p ( i ) (cid:1) (cid:48) ζ (21)and A i to equal A i def ≡ (cid:0) V i − g (cid:0) X i , X p ∗ ( i ) (cid:1)(cid:1) + ρ (cid:0) V p ( i ) − g ∗ (cid:0) X i , X p ∗ ( i ) (cid:1)(cid:1) + U i . (22)Equations (7), (21), and (22) yield an outcome equation of Y i = W (cid:48) i δ + ( X i ⊗ W i ) (cid:48) η + (cid:0) W i ⊗ ¯ X p ( i ) (cid:1) (cid:48) λ + h (cid:0) X i , ¯ X p ∗ ( i ) , X p ( i ) (cid:1) + A i . (23)Condition (20) implies that A i is conditionally mean zero given X i , ¯ X p ∗ ( i ) and W ∗ i .Summarizing, the experimentally-induced restrictions (10), (11), and (12), and our As-sumptions 1 and 2 together imply that: E (cid:2) A i | X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = 0 . (24)The estimation simpliﬁes if we impose a restriction on the peer attrition/non-compliancethat is similar to Assumption 2, but is on the observable peer average: E (cid:2) ¯ X p ( i ) − ¯ X p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = E (cid:2) ¯ X p ( i ) − ¯ X p ∗ ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3) . (25)This restriction is directly testable. By equation (25), E (cid:2) ¯ X p ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) , W ∗ i (cid:3) = E (cid:2) ¯ X p ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3) .

14f this restriction holds the outcome equation is as in (23), but with a redeﬁned nonparametric h that is a function of X i and ¯ X p ∗ ( i ) only, h (cid:0) X i , X p ∗ ( i ) (cid:1) def ≡ α + X (cid:48) i β + g (cid:0) X i , X p ∗ ( i ) (cid:1) + E (cid:2) ¯ X p ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3) (cid:48) γ + ρg ∗ (cid:0) X i , X p ∗ ( i ) (cid:1) + (cid:0) X i ⊗ E (cid:2) ¯ X p ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3)(cid:1) (cid:48) ζ, (26)and A i is A i def ≡ (cid:0) V i − g (cid:0) X i , X p ∗ ( i ) (cid:1)(cid:1) + ρ (cid:0) V p ( i ) − g ∗ (cid:0) X i , X p ∗ ( i ) (cid:1)(cid:1) + U i + (cid:0) ¯ X p ( i ) − E (cid:2) ¯ X p ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3)(cid:1) (cid:48) γ + (cid:0) X i ⊗ (cid:0) ¯ X p ( i ) − E (cid:2) ¯ X p ( i ) (cid:12)(cid:12) X i , ¯ X p ∗ ( i ) (cid:3)(cid:1)(cid:1) (cid:48) ζ. (27)The conditional moment restriction in equation (24) also holds for this error.Equations (23) and (24) jointly deﬁne a partially linear model with an endogenous para-metric component. This is a well-studied semiparametric model (see, for example, Chenet al., 2003). The parameters δ , η , and λ are identiﬁed; h (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) is a nonparametricnuisance function.We implement the partial linear IV estimator using the following approximation for h ( x, ¯ x ): h (cid:0) X i , ¯ X p ∗ ( i ) (cid:1) ≈ X (cid:48) i b + ¯ X (cid:48) p ∗ ( i ) d + (cid:0) X i ⊗ ¯ X p ∗ ( i ) (cid:1) (cid:48) f. For this approximation we estimate δ , η , and λ by linear IV ﬁt of Y i onto a constant, X i , ¯ X p ∗ ( i ) , (cid:0) X i ⊗ ¯ X p ∗ ( i ) (cid:1) , W i , ( X i ⊗ W i ), and (cid:0) W i ⊗ ¯ X p ( i ) (cid:1) using the excluded instruments W ∗ i , ( X i ⊗ W ∗ i ), and (cid:0) W ∗ i ⊗ ¯ X p ∗ ( i ) (cid:1) . Note that both assigned and realized peer groups enterthe main equation.As in the case with perfect compliance, we do not identify β , γ , and ζ , again reﬂectingthe inherent inability of the MET experiment to tell us anything about peer group eﬀects.Nevertheless knowledge of δ , η , and λ is suﬃcient to identify the class of reallocation eﬀectswe focus upon. The Measures of Eﬀective Teaching (MET) study

The MET study was conducted during the 2009/10 and 2010/11 school years in elementary,middle, and high schools located in six large urban school districts in the United States. The school districts are: Charlotte-Mecklenburg (North Carolina), Dallas Independent School District(Texas), Denver Public Schools (Colorado), Hillsborough County Public Schools (Florida), Memphis CitySchools (Tennessee), and the New York City Department of Education (New York). The participation ofschool districts and schools in the MET study was voluntary.

Sample and summary statistics

In constructing our estimation sample, we closely follow Garrett and Steinberg (2015), whoinvestigate the impact of FFT on students’ test score outcomes in the MET study. Werestrict our sample to all elementary- and middle-school students (grades 4–8) who took partin the randomization. Furthermore, we only include students with non-missing informationon baseline and ﬁnal test score outcomes as well as students with non-missing information onthe characteristics of the assigned and realized teacher, and the characteristics of the assignedand realized classroom peers. Details of the data construction from the MET data ﬁles areprovided in Appendix B.Our ﬁnal sample consists of about 8,500 students and 614 teachers in math and of about9,600 students and 649 teachers in ELA. The majority of the students (60 percent) are ele-mentary school students (grades 4–5), and the remaining students are middle school students Appendix Table A.1 summarizes all student and teacher variables. The student sample is diverse, with equal proportionsof white, black, and Hispanic students.Student test scores are centered at the relevant district-level mean and standardized bythe relevant district-level standard deviation. MET students exceed their district averagetest score by about 0.15 standard deviations on average. The majority of the teachers arefemale (about 85 percent), and about two-thirds of them are white. Most of the teachershave substantial teaching experience – seven years on average – and about 40 percent of theteachers have graduated from a Master’s program. Teaching practices

The MET data includes ratings of teaching practices for two domains of the FFT, “ClassroomEnvironment” and “Instruction.” These domains are divided into four components each, and each component is rated on a four-point scale by each rater (unsatisfactory, basic, pro-ﬁcient, distinguished), with high scores in a category indicating that a teacher is closer toan ideal teaching practice according to the FFT. The teachers are video-taped at least twiceduring the school year during lessons, and each video is rated independently by at least tworaters. In our analysis, we average the scores across all videos, raters, components, and do-mains. This aggregation of the scores is the most common one in the literature (cf., Garrettand Steinberg, 2015, Aucejo et al., 2019).The FFT is on average 2.5 in math and 2.6 in ELA, which corresponds to a rating between“basic” and “proﬁcient”; for the purpose of our analysis, we create three categories of FFT(see Appendix Figure A.1). We set the cutoﬀs at FFTs of 2.25 and 2.75. In our sample, low-FFT teachers have “basic” teaching practices on average (average FFT of 2.1 in both mathand ELA), and high-FFT teachers have “proﬁcient” teaching practices on average (averageFFT of 2.9 in both math and ELA); 18 percent of the teachers in math and 14 percent ofthe teachers in ELA are classiﬁed as having a low FFT, 62 percent of the teachers in mathand 58 percent of the teachers in ELA are classiﬁed as having a middle FFT, and 20 percentof the teachers in math and 28 percent of the teachers in ELA are classiﬁed as having a highFFT. Our results are not sensitive to the exact position of the cutoﬀs, and we investigatea model with two or four, instead of three, FFT categories in our sensitivity checks below.This categorization also creates variation in FFT levels within randomization blocks: in This variable is missing for 30 percent of the students in the math sample and for 25 percent of studentsin the ELA sample. Teaching experience and teacher education are missing for about 30 percent of teachers. The components of the domain “Classroom Environment” are: creating an environment of respect andrapport; establishing a culture for learning; managing classroom procedures; managing student behavior. Thecomponents of the domain “Instruction” are: communicating with students; using questioning and discussiontechniques; engaging students in learning; using assessment in instruction.

Test score outcomes

The end-of-year test scores provided in the MET data are z -scores, i.e., they are standardizedsuch that district-wide, they will be mean zero with unit standard deviation. Since METschools are not representative at the district level, the mean test scores may – and do –deviate from zero. We use the 2010/11 z -score as our outcome variable.To identify reallocation eﬀects, we split 2009/10 baseline test scores into three bins,corresponding to terciles of the within-district z -score distribution. In our estimation sample,27 percent of the students in math, and 26 percent in ELA, have “low” baseline test scores,36 percent of the students in math, and 35 percent of the students in ELA, have “middle”baseline test scores, and 36 percent of the students in math, and 39 percent of the studentsin ELA, have “high” baseline test scores. To include classroom peers into the analysis, wecompute the fraction of each student’s classmates with high, middle, and low baseline testscores (leave-own-out means).

Non-compliance

As outlined in Section 2, some teachers and students switched classrooms or schools beforethe start of the school year, which we take into account in our identiﬁcation strategy. Inour sample, 69 percent of the students in math, and 73 percent of the students in ELA, areactually taught by their randomly assigned teachers. This level of non-compliance is highenough to make analyses which ignore it potentially problematic.We also observe changes in classroom peers after the randomization but before the be-ginning of the school year. Changes in assigned peers were driven by students who leaveschools, repeat a grade or, in some cases, by the need to adjust their schedule. On average,students’ realized peers, however, are not appreciably diﬀerent from their assigned ones. Thediﬀerence in baseline z -scores between the assigned and the realized peers amounts to just0.02 standard deviations on average both in math and in ELA. Tests of identifying assumptions and restrictions

In this section, we report the results of the series of speciﬁcation tests discussed earlier; specif-ically tests designed to assess the plausibility of Assumptions 1 and 2. These two assumptionsimpose restrictions on the nature of non-compliance by students and teachers. We further These numbers are not exactly 33 percent in each category, because the students in our sample have onaverage higher baseline test scores than the full student population in a district. irectly test restriction (25), which is a restriction on the nature of non-compliance by peers.We also assess the quality of the initial MET randomization of teachers to classrooms.To test whether the randomization was successful in balancing student characteristicsacross teachers with diﬀerent levels of FFT we regress the FFT of a student’s assigned teacheron the student’s own characteristics, controlling for randomization block ﬁxed eﬀects. Noneof the student characteristics predict assigned teacher’s FFT, individually or jointly, whichconﬁrms that the randomization indeed “worked” (see Appendix Table A.2).Covariate balance, however, is not a suﬃcient condition to identify reallocation eﬀectsunder non-compliance by both students and teachers (see Section 2). Assumption 1 requiresthat own and assigned peer and teacher observables should not predict the diﬀerence betweenrealized and assigned unobserved teacher quality. Since this assumption involves a statementabout unobserved variables, we cannot test it directly. Instead we “test” it indirectly asdescribed in Section 2. Speciﬁcally we use those teacher background characteristics that arenot part of the the model as replacements for the unobserved quality of a teacher: a teacher’sdemographics, experience, and education. We regress the diﬀerence between realized andassigned teacher characteristics on the student’s baseline test score, the FFT of the assignedteacher, and the average baseline test score of the assigned peers. Consistent with Assumption1, these variables do not jointly predict diﬀerences between the characteristics of the assignedand realized teacher in any of the regression ﬁts (see Appendix Table A.3).Assumption 2 states that diﬀerences between realized and assigned unobserved peer qual-ity cannot be predicted by assigned teacher observables, conditional on own baseline achieve-ment and the baseline achievement of assigned peers. Again, we can only perform an indirecttest of this assumption. To do so we regress diﬀerences between the assigned and realizedcharacteristics of classroom peers onto the FFT of the assigned teacher, controlling for ownbaseline test scores and assigned peers’ average baseline test scores. We do not ﬁnd, consis-tent with Assumption 2, that teacher FFT predicts diﬀerences between the characteristics ofthe assigned and realized peers (see Appendix Tables A.4 and A.5).Finally, we directly assess restriction (25), which implies that the baseline test scoresof realized peers cannot be predicted by assigned teacher FFT. We test this restriction byregressing the baseline test scores of realized peers onto a student’s baseline test score, thebaseline test scores of her assigned peers, as well as the FFT of her assigned teacher. Weﬁnd that this restriction also holds in our data (see Appendix Table A.6). To construct estimates of the average eﬀect of reassigning teachers across classrooms weproceed in three steps. First, as outlined in Section 2, we estimate (a subset of) the param-eters of the educational production function. Second, we feed these parameters into a linear19rogramming problem to compute an optimal assignment (i.e., an assignment that maxi-mizes aggregate achievement). Third, we compare the aggregate outcome under the optimalassignment to the aggregate outcome under the status quo assignment, as well as to thatunder the worst assignment (i.e., the one which minimizes aggregate achievement). Thesecomparisons provide a sense of the magnitude of achievement gains available from improvedteacher assignment policies.

Estimation of the education production function

The speciﬁcation that we use to estimate the parameters of the educational production func-tion coincides with equation (23). Since randomization was carried out within randomizationblocks, we additionally include randomization block ﬁxed eﬀects in this regression model. Wethen estimate the model’s parameters by the method of instrumental variables (IV) as de-scribed in Section 2. Tests for instrument relevance following Sanderson and Windmeijer(2016) conﬁrm that our IV estimates are unlikely to suﬀer from weak-instrument bias (seeAppendix Table A.7).

Computing and characterizing an optimal assignment

The aim of the analysis is to ﬁnd an assignment of teachers to classrooms that improvesstudent outcomes. What is meant by an “optimal” assignment depends, of course, on theobjective function.In this paper we choose to maximize aggregate outcomes (i.e., the sum of all students’ testscores). This is the “simplest” objective we can consider, it is straightforward to compute,justiﬁable from a utilitarian perspective, and easy to interpret. We would like to emphasize,however, that our analysis can be modiﬁed to accommodate other objective functions – insome cases easily, in others with more diﬃculty. In practice, for example, the social plannermay care about both the aggregate outcome as well as inequality, especially across identiﬁablesubgroups.Our intention is not to advocate for maximization of the aggregate outcome in practice,although doing so is appropriate in some circumstances; rather this objective provides aconvenient starting point and makes our analysis comparable to that of other educationalpolicy evaluations (e.g., the typical regression-based class-size analysis provides an estimateof the eﬀect of class-size on average achievement).To determine the optimal allocation, we feed the estimated parameters from equation (23),speciﬁcally ˆ δ , ˆ η , and ˆ λ , into a linear program. Given these parameters, we can compute, foreach student, her predicted outcome when taught by a low-, middle-, or high-FFT teacher,leaving the original classroom composition unchanged.20ormally, for each student i , we compute three counterfactual outcomes (cid:98) Y i ( w ), w ∈{ w L , w M , w H } , corresponding to an assignment to a low-, middle- or high-FFT teacher. Ag-gregated to the classroom level this yields, in an abuse of notation, three counterfactualclassroom-level test score aggregates, (cid:98) Y c ( w ) = (cid:80) i ∈ s ( c ) (cid:98) Y i ( w ), with w ∈ { w L , w M , w H } (here s ( c ) denotes the set of indices for students in classroom c ).By aggregating the counterfactual outcomes to the classroom level, we transform the as-signment problem from a many-to-one matching problem to a one-to-one matching problem.This approach is only suitable because the conﬁguration of students across classrooms re-mains ﬁxed; we only consider the eﬀects of reassigning teachers across existing classrooms ofstudents. The one-to-one matching problem is a special linear program, a transportationproblem, which is easily solvable. There are a few additional constraints we impose to make the reallocation exercise real-istic. First, we do not allow teachers to be reassigned across districts or across school types(i.e., elementary or middle school). We also present, as a sensitivity check, a version of theallocation where we only allow teachers to be reassigned within their randomization block.Second, there are a few teachers that teach several sections of a class. In this case, we treatthese sections as clusters, and allocate one teacher to all sections in each such cluster.When computing counterfactual outcomes for each student, we omit the nonparametriccomponent, h ( x, ¯ x ), as well as the randomization block ﬁxed eﬀects and the main eﬀects ofown baseline achievement. This is appropriate because we report the diﬀerence between thesum of aggregate outcomes for two allocations where the nonparametric component, the ﬁxedeﬀects, and the main eﬀects of own baseline achievement cancel out.In addition to the optimal assignment, we also compute a “worst” possible assignment, i.e.,the assignment that minimizes aggregate test scores. The diﬀerence between the aggregateoutcome for the best and worst assignment is the maximal reallocation gain. This providesan upper bound on the magnitude of student achievement gains that teacher reassignmentsmight yield in practice. Average Reallocation Eﬀects

An individual reallocation eﬀect, or reallocation gain, is deﬁned as the diﬀerence betweenan individual student’s outcome under two assignments. For example, one can compute, foreach student, (cid:98) Y i ( ˜ W opti ), i.e., the predicted outcome under the optimal allocation, where ˜ W opti takes the values w L , w M , or w H , depending on whether the student would be assigned toa low-, middle-, or high-FFT teacher in an optimal allocation. Similarly, one can compute As noted earlier the eﬀects of classroom composition changes on student achievement are unidentiﬁablehere in any case. Appendix C contains the details on how we specify the transportation problem. To compute the optimalassignment, we use the function lpSolve in R . (cid:98) Y i ( W i ). InMET schools the status quo assignment was induced by random assignment of teachers toclassrooms within randomization blocks. The allocation of teachers across schools within agiven district was, of course, non-random, and possibly non-optimal from the standpoint ofmaximizing the aggregate outcome. An individual reallocation gain can then be computedas the diﬀerence between these two outcomes, (cid:98) Y i ( ˜ W opti ) − (cid:98) Y i ( W i ). The optimal assignment isnot the assignment that maximizes the predicted outcome for student i , but the assignmentthat maximizes the aggregate outcome across all the classrooms in the sample (subject to thevarious constraints we impose on the problem – like ruling out reassignments across schooldistricts).These individual gains can be aggregated in many ways in order to create policy-relevantparameters. We start by deﬁning our key parameter of interest, the average reallocationeﬀect , as (cid:91) ARE = 1 N N (cid:88) i =1 (cid:104) (cid:98) Y i ( ˜ W opti ) − (cid:98) Y i ( W i ) (cid:105) . (28)We also consider conditional average reallocation eﬀects, that is, average reallocationeﬀects for students with varying baseline characteristic x (e.g., students with low, middle,and high baseline test scores): (cid:91) ARE ( x ) = 1 (cid:80) Ni =1 ( X i = x ) N (cid:88) i =1 ( X i = x ) (cid:104) (cid:98) Y i ( ˜ W opti ) − (cid:98) Y i ( W i ) (cid:105) . (29)Below we show that only about one-half of MET students would experience a teacherchange when switching from the status quo teacher assignment to an optimal one. Thereallocation eﬀect for students in classrooms not assigned a diﬀerent teacher is, of course,zero. This motivates a focus on the average achievement gain among those students who are assigned a diﬀerent teacher. We deﬁne the reallocation eﬀect conditional on being reassigned as: (cid:92) AREC ( x ) = 1 (cid:80) Ni =1 ( X i = x ) ( W i (cid:54) = ˜ W opti ) × N (cid:88) i =1 ( X i = x ) ( W i (cid:54) = ˜ W opti ) (cid:104) (cid:98) Y i ( ˜ W opti ) − (cid:98) Y i ( W i ) (cid:105) . (30)We can, of course, not condition on x as well.Finally, the maximal reallocation gain is obtained if we compare the average outcomeunder an optimal assignment with that under a worst assignment22 ARE max ( x ) = 1 (cid:80) Ni =1 ( X i = x ) N (cid:88) i =1 ( X i = x ) (cid:104) (cid:98) Y i ( ˜ W opti ) − (cid:98) Y i ( ˜ W worsti ) (cid:105) , (31)where (cid:98) Y i ( ˜ W worsti ) is the predicted outcome for student i under an allocation that minimizesaggregate test scores. Inference on reallocation eﬀects

We use the Bayesian bootstrap to quantify our (posterior) uncertainty about AREs. We treateach teacher-classroom pair as an i.i.d. draw from some unknown (population) distribution.Following Chamberlain and Imbens (2003), we approximate this unknown population bya multinomial, to which we assign an improper Dirichlet prior. This leads to a posteriordistribution which (i) is also Dirichlet and (ii) conveniently only places probability mass ondata points observed in our sample.Mechanically, we draw C standard exponential random variables and weigh each studentin section/classroom c = 1 , . . . , C with the c th weight (i.e., all students in the same sectionhave the same weight). Using this weighted sample, we compute the IV regression ﬁt, solvefor the optimal (worst) assignment, and compute the various reallocation eﬀects. We repeatthis procedure 1,000 times. This generates 1,000 independent draws from the posteriordistribution of the ARE.Formally this approach to inference is Bayesian. Consequently the “standard errors” wepresent for our ARE estimates summarize dispersion in the relevant posterior distribution(not variability across repeated samples). An alternative, frequentist, approach to inferenceis provided by Hsieh et al. (2018). They transform the problem of inference on the solutionto a linear program into inference on a set of linear moment inequalities. If the bindingconstraints are the same over the bootstrap distribution, then inference based on the Bayesianbootstrap will be similar to that based on moment inequalities (see also Graham et al., 2007and Bhattacharya, 2009). The “standard errors” for the AREs are constructed as follows. Let ˆ θ ( b ) be the estimate of the reallocationeﬀect in b th the bootstrap sample (or equivalently the b th draw from the posterior distribution for θ ) andˆ θ the reallocation eﬀect in the original sample. Consider the centered statistic ˆ θ ( b ) − ˆ θ ; let q (0 . q (0 . θ − q (0 . , ˆ θ − q (0 . − (0 . ≈ .

92, where Φ( . ) is the standard normal CDF. Results

Regression results

Before reporting parameter estimates for our preferred model – equation (23) above withthree categories of FFT, own and peer baseline achievement each – we present those for amore conventional model with teacher FFT, student and peer average baseline test scoresall entering linearly. For this speciﬁcation we leave the FFT and baseline test scores undis-cretized. These initial results replicate and expand upon the prior work of Garrett andSteinberg (2015), who study the impact of FFT on student achievement in (approximately)the same sample. These initial results indicate that teacher FFT does not aﬀect students’ test scores onaverage (see Table 2, Panel A). This is consistent with the ﬁndings of Garrett and Steinberg(2015). Table 2 presents estimates, using the instrumental variables described in Section2. Here this involves using the FFT score of a student’s randomly assigned teacher as aninstrument for that of her realized teacher (Panels A–C), and the average baseline test scoreof her assigned peers as an instrument for the average baseline test score of her realized peers(Panel B).OLS estimates of the same model are reported in Appendix Table A.8. The coeﬃcienton FFT is statistically signiﬁcant and positive in the OLS results. The discrepancy betweenthe OLS and IV results may reﬂect the impact of correcting for non-compliance, as describedabove, or simply reﬂect the greater sampling variability of the IV estimates.Next we add the interactions of teacher FFT with both the baseline student, and peeraverage, test scores. This provides an initial indication of whether any complementaritybetween teacher FFT and student baseline achievement is present. These results suggestthat high-FFT teachers are more eﬀective at teaching students with higher baseline scores(see Table 2, Panels B and C). This result is signiﬁcant at the 5-percent level in the mathsample, but only weakly signiﬁcant in the ELA sample. The magnitudes and precision ofthe teacher-student match eﬀects are similar across the IV and OLS estimates (the latterreported in Appendix Table A.8). The coeﬃcient on the interaction between teacher FFTand peer average baseline achievement (Panel B) is poorly determined (whether estimatedby IV or OLS).In Table 3 we report IV estimates of our preferred 3 × W i and X i consisting of dummy variables for middle and high FFTand baseline test scores respectively. In this speciﬁcation, the teacher FFT main eﬀectsremain insigniﬁcant. We do, however, ﬁnd positive match eﬀects between teacher FFT andstudent baseline scores. These are signiﬁcant for the high-FFT teachers (i.e., teachers with Aucejo et al. (2019) study the impact of FFT in the MET data, but focus on math classrooms inelementary schools. able 2: IV regression results of the linear model. Dependent variables: student test score outcomes(1) (2) (3) (4) (5) (6)A. Only teacher B. Full model C. Withouteﬀects teacher × peerinteractionsMath ELA Math ELA Math ELA δ FFT 0.013 0.079 -0.004 0.104 -0.004 0.076(0.093) (0.076) (0.091) (0.075) (0.091) (0.075) η FFT × baseline 0.096** 0.073* 0.098** 0.049(0.042) (0.039) (0.046) (0.040) λ FFT × avg. peer baseline 0.016 -0.172(0.103) (0.197) β Baseline 0.749*** 0.693*** 0.505*** 0.512*** 0.498*** 0.565***(0.011) (0.011) (0.109) (0.099) (0.119) (0.103) R Note:

The dependent variables are subject-speciﬁc test score outcomes in math and ELA. The speciﬁcationsinclude linear terms for FFT, individual and peer baseline test scores. The instrumental variables are basedon assigned teacher FFT (Panels A–C) and assigned peer baseline test scores (Panel B). All regressionscontrol for the h ( x, ¯ x ) function (see Section 2) and for randomization block ﬁxed eﬀects. Analytic standarderrors, clustered by randomization block, are in parentheses.** signiﬁcant at the 1%-level ** signiﬁcant at the 5%-level * signiﬁcant at the 10%-level. a “proﬁcient” score on average). Students with middle or high baseline test score levelsscore signiﬁcantly higher on end-of-year achievement tests when matched with a high-FFTteacher, compared to students with low baseline test scores (see Table 3, Panels B and C). The OLS estimation results, reported in the appendix, are qualitatively similar to the IVones; although the interactions of the middle-FFT dummy variable with the middle- andhigh-baseline student test score dummies are also signiﬁcant when estimated by OLS (seeAppendix Table A.9).To compute optimal assignments and average reallocation eﬀects, we use the IV estimatespresented in columns 5–6 of Table 3. These speciﬁcations omit the FFT-by-peer compositioninteraction terms, whose coeﬃcients are poorly determined in all speciﬁcations (whetherﬁtted by IV or OLS). Omitting these terms has little eﬀect on either the location or theprecision of the coeﬃcients on the FFT-by-baseline interactions. In Appendix Tables A.15and A.16 we also present average rellocation eﬀects based upon the speciﬁcations in columns3–4 of Table 3. These ARE estimates are larger, albeit less precisely determined. In contrast, the coeﬃcients on interactions of the teacher FFT dummy variables with the peer compositionvariables are imprecisely determined. able 3: IV regression results of the 3 × × peerinteractionsMath ELA Math ELA Math ELA δ FFT middle 0.069 -0.029 -0.145 -0.219 0.027 -0.082(0.053) (0.048) (0.138) (0.173) (0.060) (0.059)FFT high 0.038 -0.058 -0.547 -0.137 -0.155 -0.154*(0.067) (0.065) (0.380) (0.203) (0.103) (0.083) η FFT middle × baseline middle 0.040 0.057 0.052 0.067(0.060) (0.067) (0.060) (0.063) × baseline high 0.015 0.076 0.050 0.097(0.084) (0.081) (0.076) (0.082)FFT high × baseline middle 0.184** 0.150** 0.196** 0.127*(0.082) (0.074) (0.084) (0.072) × baseline high 0.226** 0.187** 0.265** 0.149(0.099) (0.092) (0.100) (0.095) λ FFT middle × fraction peers middle 0.318 0.288(0.299) (0.401) × fraction peers high 0.239 0.161(0.205) (0.288)FFT high × fraction peers middle 0.641 0.227(0.513) (0.389) × fraction peers high 0.460 -0.355(0.409) (0.343) β Baseline middle 0.888*** 0.739*** 0.843*** 0.699*** 0.840*** 0.682***(0.077) (0.084) (0.092) (0.096) (0.092) (0.094)Baseline high 1.622*** 1.555*** 1.578*** 1.503*** 1.550*** 1.470***(0.103) (0.105) (0.124) (0.118) (0.119) (0.115) R Note:

The dependent variables are subject-speciﬁc test score outcomes in math and ELA. The instrumentalvariables are based on assigned teacher FFT (Panels A–C) and assigned peer baseline test scores (PanelB). All regressions control for the h ( x, ¯ x ) function (see Section 2) and for randomization block ﬁxed eﬀects.Analytic standard errors, clustered by randomization block, are in parentheses.*** signiﬁcant at the 1%-level ** signiﬁcant at the 5%-level * signiﬁcant at the 10%-level. haracterization of an optimal reassignment If an optimal assignment is similar to the status quo one, then large reallocation eﬀectsare unlikely. Therefore, before quantifying any reallocation eﬀects, we ﬁrst discuss how anoptimal assignment and a worst assignment diﬀer from the status quo.Our primary reassignment policy considers reassignments of teachers across classroomsdistrict-wide (that is, we allow teachers to move to a diﬀerent school within their district).We do restrict teachers to teach at the same level (e.g., elementary school teachers do notmove to middle school classrooms). Under this scenario we ﬁnd that, when moving from thestatus quo to an optimal assignment, 49 percent of the students in the math sample, and 47percent of the students in the ELA sample, are assigned to a new teacher. The balance ofstudents remain with their status quo teacher.It is interesting to examine how the reallocation changes the “assortativeness” of theassignment. An assignment is characterized as positive assortative if students with higherbaseline test scores are more frequently matched with higher-FFT teachers, compared to arandom assignment. Indeed, we observe this in our data for the optimal assignment. Figure 1displays, for each level of FFT, the distribution of student baseline test scores in the averageclass a teacher of that FFT level is assigned to. In both the math and in the ELA sample,the optimal allocation is more assortative than the status quo. In the status quo in math,for instance, a high-FFT teacher has on average 23 percent of students with low baseline testscores and 40 percent of students with high baseline test scores. In the optimal allocation,the fraction of students with low baseline test scores drops to 9 percent, and the fractionof students with high baseline test scores increases to 61 percent on average. The optimalallocation in the ELA sample displays a similar pattern. The positive assortativeness arisesin the optimal allocation because students with high baseline test scores beneﬁt more froma high-FFT teacher, compared to students with low baseline test scores (see Table 3). Theworst allocation displays negative assortativeness (see Appendix Figure A.2).

Average reallocation eﬀects

Tables 4 and 5 present ARE estimates. In Panel A an optimal assignment is compared withthe status quo; while in Panel B optimal assignments are compared with “worst” assignments.We presents results for all students (Panels A.I and B.I), as well as for just those studentswho experience a teacher change as part of the reallocation (Panels A.II and B.II).In the math sample (Table 4), the optimal allocation improves average test scores by 1.7percent of a test score standard deviation compared to the status quo. This eﬀect is preciselydetermined with a Bayesian bootstrap standard error of 0.6 percent. The reported eﬀects arelargely driven by students with high baseline test scores. These students gain 2.8 percent ofa test score standard deviation on average; students with middle and low baseline test scores,27 igure 1: Assortativeness of the optimal allocation in comparison with the status quo . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (1) Math: status quo . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (2) Math: optimal allocation . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (3) ELA: status quo . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (4) ELA: optimal allocation

Baseline low Baseline middleBaseline high

Note:

The ﬁgure compares the status quo allocation (Panels 1 and 3) with the optimal allocation (Panels 2and 4). The bars represent the fractions of students with low, middle, and high baseline test scores that ateacher with low, middle, and high FFT is assigned to on average under each of the allocations. For eachteacher type, the fractions add up to 1. The optimization is carried out within school types (elementary ormiddle school) and districts.

28n contrast, gain only 1.2 and 1.4 percent of a test score standard deviation, respectively (onaverage).Since only half of the students experience a change in their teacher, the average eﬀectrepresents an equal-weighted mixture of a zero eﬀect and a positive eﬀect on those studentswho do experience a change in teachers. The average eﬀect for the latter group is 3.6 percentof a test score standard deviation (Panel A.II, SE = 1.2 percent).

Table 4: Average reallocation gains in math. Gains expressed in test score standard deviations(1) (2) (3) (4) (5) (6) (7) (8)Panel A. Optimal versus status quoA.I Full sample A.II Conditional on being reallocatedall students high middle low all students high middle lowGain 0.017 0.028 0.012 0.014 0.036 0.059 0.028 0.026SE (0.006) (0.014) (0.008) (0.011) (0.012) (0.029) (0.019) (0.019)N 8,534 2,332 3,108 3,094 4,107 1,121 1,380 1,606Panel B. Optimal versus worst allocationB.I Full sample B.II Conditional on being reallocatedall students high middle low all students high middle lowGain 0.040 0.072 0.019 0.038 0.060 0.103 0.032 0.053SE (0.012) (0.034) (0.015) (0.026) (0.019) (0.048) (0.022) (0.037)N 8,534 2,332 3,108 3,094 5,746 1,626 1,912 2,208

Note:

The table shows the average reallocation gains from implementing the optimal assignment insteadof the random assignment (status quo) in Panel A, and the average reallocation gains from implementingthe optimal assignment instead of the worst assignment in Panel B. The gains are expressed in test scorestandard deviations. The computations are based on a 3 × The comparison of an optimal allocation with a worst allocation yields improvements thatare about twice as large. Relative to a worst allocation, an optimal allocation improves testscores by 4.0 percent of a standard deviation on average (SE = 1.2 percent). The gains are7.2 percent of a standard deviation for students with high baseline test scores, and 1.9 and3.8 percent of a standard deviation for students with middle and low baseline test scores,respectively. If one considers only those students who are reassigned to a new teacher, thereallocation eﬀect amounts 6.0 percent of a standard deviation on average (Panel B.II, SE =1.9 percent).One way to benchmark the magnitude of these AREs is to compare them with the eﬀectsof hypothetical policies aimed at improving teacher value-added measures (VAMs). As noted29n the introduction, such policies are controversial, as is the evidence marshalled to supportthem. Here we oﬀer no commentary on the advisability of actually adopting VAM-guidedteacher personnel policies; nor do we oﬀer an assessment of VAM studies. Rather we simplyuse these studies, and the policy thought experiments they motivate, to benchmark our AREﬁndings.Teacher value-added is typically conceptualized as an invariant intercept-shifter, whichuniformly raises or lowers the achievement of all students in a classroom. In this frameworkreplacing a low value-added teacher with a high one will raise achievement for all studentsin a classroom. Rockoﬀ (2004) estimates that the standard deviation of the populationdistribution of teacher value-added (in a New Jersey school district) is around 0.10 testscore standard deviations in both math and reading. Recent studies ﬁnd somewhat higherestimates: Chetty et al. (2014a) estimates that the standard deviation of teacher value-addedis 0.16 in math and 0.12 in reading; similarly, Rothstein (2017) ﬁnds values of 0.19 in mathand 0.12 in reading.Using a standard deviation of 0.15 we can consider the eﬀect of a policy which removesthe bottom τ ×

100 percent of teachers, sorted by VAM, and replaces them with teachersat the ˜ τ th quantile of the VAM distribution. Under normality the eﬀect of such a policy onaverage student achievement is to increase test scores by(1 − τ ) σ φ (cid:0) q τ σ (cid:1) − Φ (cid:0) q τ σ (cid:1) + σ Φ − (˜ τ )standard deviations. Setting τ = 0 .

05 and ˜ τ = 0 .

75 this expression gives an estimate of thepolicy eﬀect of 0.021 (i.e., 2.1 percent of a test score standard deviation). This is comparableto the average eﬀect on math achievement associated with moving from the status quo METassignment to an optimal one. In practice correctly identifying, and removing from class-rooms, the bottom ﬁve percent of teachers would be diﬃcult to do. Replacing them withteachers in the top quartile of the VAM distribution perhaps even more so. Contextualizedin this way the AREs we ﬁnd are large.An attractive feature of the policies we consider is that they are based on measurablestudent and teacher attributes, not noisily measured latent ones. At the same time we aremindful that most school districts would not ﬁnd it costless to reallocate teachers freely acrossclassrooms and schools.Another way to calibrate the size of the eﬀects we ﬁnd is as follows. We ﬁnd that whenimplementing an optimal teacher-to-classroom assignment, only about one half of studentsexperience a change in teachers. We ﬁnd that test scores increase by about 3.6 percent of astandard deviation for these students. This is comparable to increasing the average VAM ofthese students’ teachers from zero (i.e., the median) to the 0.6 quantile of the teacher VAMdistribution. Again, increasing the VAM of half of all teachers by such an amount may be30iﬃcult to do in practice. For English language arts (ELA) achievement we ﬁnd smaller reallocation eﬀects. Mov-ing from the status quo to an optimal allocation is estimated to raise achievement by 0.8percent of a test score standard deviation (SE = 0.6 percent). As with math, these gainsare concentrated among students with high baseline scores who are assigned a new teacher.These students experience an average gain of 5 percent of a test score standard deviation(SE = 2.5 percent).

Table 5: Average reallocation gains in ELA. Gains expressed in test score standard deviations.(1) (2) (3) (4) (5) (6) (7) (8)Panel A. Optimal versus status quoA.I Full sample A.II Conditional on being reallocatedall students high middle low all students high middle lowGain 0.008 0.024 0.002 0.004 0.017 0.050 0.004 0.008SE (0.006) (0.011) (0.007) (0.010) (0.012) (0.025) (0.013) (0.020)N 9,641 2,480 3,402 3,759 4,529 1,167 1,627 1,735Panel B. Optimal versus worst allocationB.I Full sample B.II Conditional on being reallocatedall students high middle low all students high middle lowGain 0.018 0.057 0.003 0.005 0.025 0.080 0.004 0.007SE (0.010) (0.028) (0.011) (0.019) (0.015) (0.038) (0.017) (0.024)N 9,641 2,480 3,402 3,759 6,675 1,770 2,229 2,676

Note:

Teacher and student categories

Working with a discrete categorization of baseline achievement and teacher FFT provides asimple, but also highly ﬂexible, way of capturing non-linearities in educational production.We tested the sensitivity of our results to alternative categorizations of teachers and students.Speciﬁcally, we test (a) a coarser speciﬁcation with two levels of teacher FFT (cutoﬀ pointat 2.5) and two levels of student baseline test scores and (b) a ﬁner speciﬁcation with fourlevels of teacher FFT (cutoﬀ points at 2.25, 2.5, and 2.75) and four levels of student baselinetest scores.In the coarser speciﬁcation, we do not detect any signiﬁcant match eﬀects between teacherFFT and student baseline test scores (see Appendix Table A.10): the model with only twolevels of teacher FFT does not capture the positive eﬀect of the high-FFT teachers on studentswith middle or high baseline test scores. By contrast, the results of the ﬁner speciﬁcation,with four levels of FFT, are very similar to the results of our preferred speciﬁcation (seeAppendix Table A.11).

Measure of teaching practices

We also assess the sensitivity of our results to the measure of teaching practices that we usein the analysis. The MET data also contains an alternative measure, the CLASS (ClassroomAssessment Scoring System), for a subset of the sample (6,320 observations in the mathsample and 6,999 observations in the ELA sample). The CLASS is a teacher observation32rotocol that uses diﬀerent domains and evaluation criteria than the FFT (see Section B.4 inthe Data Appendix for details on the observation protocol and on how we process the data).The CLASS is also widely used in research on teacher quality (e.g. Araujo et al., 2016). Weﬁnd that our results are similar when using the CLASS instead of the FFT (see AppendixTable A.12).

Restrictions on the reassignment process

In our preferred optimization procedure, we optimize the assignment of teachers to classroomswithin types of schools (elementary or middle school) and school districts. We choose theserestrictions because teachers might not be willing or able to teach in a diﬀerent school typeor a diﬀerent school district. As a sensitivity check, we also calculated the results of both amore restrictive and a less restrictive optimization procedure.Our least restrictive allocation optimizes the assignment within school types, but allowsfor reassignments across districts. The magnitudes of the reallocation eﬀects in this caseare only slightly larger than those where reallocations are within districts (see AppendixTables A.13 and A.14, columns 1–4). This ﬁnding may reﬂect the similarity of the distribu-tions of baseline student achievement and teacher FFT in MET school districts. It wouldbe interesting to repeat our analysis in a metro area consisting of an urban core district andmultiple suburban ones. In such a setting is it seems plausible that moving teachers acrossschool districts might raise average achievement.Our most restrictive allocation reassigns teachers only within school-grade-subject cells(i.e., randomization blocks). This restriction is strong because each randomization blocktypically contains only two or three sections. Under this restriction, the reallocation gainsare overall negligible (see Appendix Tables A.13 and A.14, columns 5–8).

Accounting for teacher-peer match eﬀects in the assignment

Our main results for the reallocation eﬀects are based on a model without teacher-to-peermatch eﬀects (i.e., in a model which imposes the restriction that λ = 0), since these eﬀectsare very noisily estimated using our data. Using a model which does not impose this restric-tion generates reallocation eﬀects about twice as large. These eﬀects are also more noisilymeasured (see Appendix Tables A.15 and A.16). We provide an econometric framework that allows us to semiparametrically characterizecomplementarity between teaching practices and student baseline test scores. Our framework33xploits the random assignment of teachers to classrooms available in the MET dataset, whileformally dealing with non-compliance by both teachers and students.Our results show that the potential gains associated with an outcome-maximizing assign-ment of teachers to classrooms are large. They are comparable to fairly large (hypothetical)interventions to raise teacher VAM. An attractive feature is that they are, at least in theory,resource neutral. No new teachers are required to implement the policies we consider.Our focus on the objective of maximizing the population average test score leads to anoptimal assignment that increases the gap between less and more prepared students. Wecould also consider objective functions that do no focus on average test scores, but insteadon test score gaps or proﬁciency levels. We could, for instance, choose an assignment whichmaximizes the number of students who reach a “proﬁcient” level on their end-of-year assess-ment. It is possible that this objective function would suggest a less assortative assignment:students with high baseline scores would likely reach the proﬁcient level regardless of theirteacher’s FFT, while the incremental eﬀect of having a high-FFT teacher on the probabilityof attaining proﬁciency may be larger for students with lower baseline test scores. Outcomesother than math and ELA achievement (e.g., socio-emotional skills) may also be of interest.We consider this paper as a ﬁrst pass that establishes the feasibility of recovering matcheﬀects from imperfect experimental data and shows that the resulting reallocation eﬀectsthat depend upon these match eﬀects and the supply of teachers are substantial.34 eferences

Ai, C. and X. Chen (2003): “Eﬃcient Estimation of Models with Conditional MomentRestrictions Containing Unknown Functions,”

Econometrica , 71, 1795 – 1843.

Araujo, M. C., P. Carneiro, Y. Cruz-Aguayo, and N. Schady (2016): “Teacherquality and learning outcomes in kindergarten,”

The Quarterly Journal of Economics , 131,1415 – 1453.

Aucejo, E., P. Coate, J. C. Fruehwirth, S. Kelly, and Z. Mozenter (2019):“Teacher Eﬀectiveness and Classroom Composition,”

Manuscript . Bhattacharya, D. (2009): “Inferring optimal peer assignment from experimental data,”

Journal of the American Statistical Association , 104, 486 – 500.

Chamberlain, G. and G. W. Imbens (2003): “Nonparametric applications of Bayesianinference,”

Journal of Business and Economic Statistics , 21, 12 – 18.

Chen, X., O. Linton, and I. Van Keilegom (2003): “Estimation of semiparametricmodels when the criterion function is not smooth,”

Econometrica , 71, 1591 – 1608.

Chetty, R., J. N. Friedman, and J. E. Rockoff (2012): “Great Teaching,”

EducationNext , 12, 59 – 64.——— (2014a): “Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates,”

American Economic Review , 104, 2593 – 2632.——— (2014b): “Measuring the Impacts of Teachers II: Teacher Value-Added and StudentOutcomes in Adulthood,”

American Economic Review , 104, 2633 – 2679.

Cohen-Vogel, L., L. Feng, and L. Osborne-Lampkin (2013): “Seniority provisionsin collective bargaining agreements and the ‘teacher quality gap’,”

Educational Evaluationand Policy Analysis , 35, 324 – 343.

Danielson, C. (2011): “The Framework for Teaching Evaluation Instrument, 2011 Edi-tion,” Retrieved on May 31, 2020, from https://danielsongroup.org/downloads/2011-framework-teaching-evaluation-instrument . Darling-Hammond, L. (2015): “Can Value Added Add Value to Teacher Evaluation?”

Educational Researcher , 44, 132 – 137.

Dee, T. S. (2004): “Teachers, race, and student achievement in a randomized experiment,”

Review of Economics and Statistics , 86, 195 – 210.35—— (2005): “A Teacher like Me: Does Race, Ethnicity, or Gender Matter?”

The AmericanEconomic Review , 95, 158 – 165.

Garrett, R. and M. P. Steinberg (2015): “Examining Teacher Eﬀectiveness UsingClassroom Observation Scores: Evidence From the Randomization of Teachers to Stu-dents,”

Educational Evaluation and Policy Analysis , 37, 224 – 242.

Graham, B. (2008): “Identifying social interactions through conditional variance restric-tions,”

Econometrica , 76, 643 – 660.

Graham, B. S. (2011): “Econometric methods for the analysis of assignment problems inthe presence of complementarity and social spillovers,” in

Handbook of social economics ,ed. by J. Benhabib, A. Bisin, and M. O. Jackson, Amsterdam: North-Holland, vol. 1B,965 – 1052.

Graham, B. S., G. W. Imbens, and G. Ridder (2007): “Redistributive eﬀects fordiscretely-valued inputs,” IEPR Working Paper 07.7, University of Southern California.——— (2010): “Measuring the eﬀects of segregation in the presence of social spillovers: anonparametric approach,” Working Paper 16499, NBER.——— (2014): “Complementarity and aggregate implications of assortative matching: anonparametric analysis,”

Quantitative Economics , 5, 29 – 66.——— (2020): “Identiﬁcation and Eﬃciency Bounds for the Average Match Function UnderConditionally Exogenous Matching,”

Journal of Business & Economic Statistics , 38, 303– 316.

Grissom, J. A., D. Kalogrides, and S. Loeb (2015): “The micropolitics of educationalinequality: The case of teacher–student assignments,”

Peabody Journal of Education , 90,601 – 614.

Hansen, B. E. (2020): “Econometrics,” Retrieved on July 4, 2020, from . Hanushek, E. (1971): “Teacher Characteristics and Gains in Student Achievement: Esti-mation Using Micro Data.”

American Economic Review , 61, 280 – 288.

Hanushek, E. A., J. F. Kain, and S. G. Rivkin (2004): “Disruption versus Tieboutimprovement: The costs and beneﬁts of switching schools,”

Journal of Public Economics ,88, 1721 – 1746.

Hsieh, Y.-W., X. Shi, and M. Shum (2018): “Inference on estimators deﬁned by math-ematical programming,”

Manuscript . 36 alogrides, D., S. Loeb, and T. Beteille (2011): “Power Play? Teacher Char-acteristics and Class Assignments,” Working Paper 59, National Center for Analysis ofLongitudinal Data in Education Research.

Kane, Thomas, J., D. F. McCaffrey, T. Miller, and D. O. Staiger (2013):“Have we identiﬁed eﬀective teachers? Validating measures of eﬀect teaching using randomassignment,” Met project research paper, Bill & Melinda Gates Foundation.

Kraft, M. A. (forthcoming): “Interpreting Eﬀect Sizes of Education Interventions,”

Edu-cational Researcher . Loeb, S., J. Soland, and L. Fox (2014): “Is a Good Teacher a Good Teacher forAll? Comparing Value-Added of Teachers with Their English Learners and Non-EnglishLearners,”

Educational Evaluation and Policy Analysis , 36, 457 – 475.

Manski, C. F. (1993): “Identiﬁcation of endogenous social eﬀects: the reﬂection problem,”

Review of Economic Studies , 60, 531 – 542.

McFarland, J. et al. (2019): “The Condition of Education 2019,” NCES 2019-144,National Center for Education Statistics.

Morganstein, D. and R. Wasserstein (2014): “ASA Statement on Value-Added Mod-els,”

Statistics and Public Policy , 1, 108 – 110.

Robinson, P. M. (1988): “Root-N-Consistent Semiparametric Regression,”

Econometrica ,56, 931 – 954.

Rockoff, J. E. (2004): “The Impact of Individual Teachers on Student Achievement:Evidence from Panel Data,”

American Economic Review , 94, 247 – 252.

Rothstein, J. (2010): “Teacher Quality in Educational Production: Tracking, Decay, andStudent Achievement 2010,”

Quarterly Journal of Economics , 125, 175 – 214.——— (2017): “Measuring the impacts of teachers: Comment,”

American Economic Review ,107, 1656–84.

Sanderson, E. and F. Windmeijer (2016): “A weak instrument F-test in linear IVmodels with multiple endogenous variables,”

Journal of Econometrics , 190, 212 – 221.

Snyder, T. D. et al. (2017): “Digest of Education Statistics,” NCES 2018-070, NationalCenter for Education Statistics.

Stock, J. H. and M. Yogo (2005): “Testing for weak instruments in Linear IV regres-sion,” in

Identiﬁcation and Inference for Econometric Models: Essays in Honor of ThomasRothenberg , Cambridge University Press, 80 – 108.37 hite, M., B. Rowen, G. Alter, L. Blankenship, C. Greene, and S. Windish (2019):

User Guide to Measures of Eﬀective Teaching Longitudinal Database (MET LDB) ,Ann Arbor: Inter-University Consortium for Political and Social Research, The Universityof Michigan. 38

Figures and Tables

Figure A.1: Distribution of teacher FFT and student baseline test scores . . . . F r a c t i on o f t ea c he r s FFTNumber of teachers = 614, mean = 2.52 (1) FFT - math teachers . . . F r a c t i on o f s e c t i on s -2 -1 0 1 2 Baseline test scores (section averages)Number of sections = 792, mean = .02 (2) Math baseline test scores . . . . F r a c t i on o f t ea c he r s FFTNumber of teachers = 649, mean = 2.58 (3) FFT - ELA teachers . . . F r a c t i on o f s e c t i on s -2 -1 0 1 2 Baseline test scores (section averages)Number of sections = 796, mean = .07 (4) ELA baseline test scores

Note:

Distribution of teacher FFT (Panels 1 and 3) and student baseline test scores (Panels 2 and 4) in theestimation sample. Student baseline test scores are section averages. igure A.2: Assortativeness of the worst allocation in comparison with the status quo . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (1) Math: status quo . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (2) Math: worst allocation . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (3) ELA: status quo . . . F r a c t i on o f s t uden t s FFT low FFT middle FFT high (4) ELA: worst allocation

Baseline low Baseline middleBaseline high

Note:

The ﬁgure compares the status quo allocation (Panels 1 and 3) with the worst allocation (Panels 2and 4). The bars represent the fractions of students with low, middle, and high baseline test scores that ateacher with low, middle, and high FFT is assigned to on average under each of the allocations. For eachteacher type, the fractions add up to 1. The optimization is carried out within school types (elementary ormiddle school) and districts. able A.1: Summary statistics of the estimation samples(1) (2) (3) (4)Math sample ELA sampleMean SD Mean SDStudent characteristicsAge 10.32 (1.56) 10.28 (1.48)Fourth grade 26% - 26% -Fifth grade 35% - 35% -Sixth grade 16% - 15% -Seventh grade 10% - 12% -Eighth grade 12% - 12% -Male 49% - 49% -Gifted 6% - 11% -Special education 7% - 7% -ELL 15% - 13% -Free/reduced-price lunch 58% - 57% -White 27% - 28% -Black 31% - 30% -Hispanic 31% - 33% -Asian 8% - 7% -Other race 3% - 3% -Subject test score 2010 (baseline) 0.11 (0.89) 0.17 (0.93)Subject test score 2011 (outcome) 0.15 (0.90) 0.17 (0.91)Teacher characteristicsMale 14% - 11% -White 65% - 64% -Black 28% - 29% -Hispanic 6% - 6% -Other race 2% - 1% -Years in district 7.39 (6.73) 7.36 (6.33)Master’s or higher 40% - 37% -FFT 2.53 (0.30) 2.59 (0.30)Classroom characterisicsClass size 25.54 (5.48) 25.60 (5.44)Sample sizeStudents 8,534 9,641Teachers 614 649Schools 153 160 Note:

Summary statistics of the estimation samples. Standard deviations are in parentheses. ELL: Englishlanguage learner. FFT: Framework for teaching. For details on the sample construction, see Appendix B. able A.2: Balancing tests. Dependent variable: FFT of the assigned teacher(1) (2) (3) (4)Math ELAcoeﬀ SE coeﬀ SEBaseline test score 0.004 (0.005) 0.000 (0.004)Age -0.003 (0.006) -0.006 (0.005)Male 0.001 (0.005) -0.002 (0.003)Gifted -0.015 (0.015) 0.010 (0.016)Special education 0.013 (0.012) -0.002 (0.009)ELL -0.004 (0.011) -0.014 (0.011)Black -0.004 (0.008) -0.010 (0.007)Hispanic -0.013 (0.009) 0.002 (0.006)Asian 0.009 (0.014) 0.016 (0.011)Other race 0.014 (0.012) 0.010 (0.010)Free/reduced-price lunch 0.005 (0.010) -0.002 (0.006)F-test for joint signiﬁ-cance (p-value) 0.465 0.241 R Note:

The table presents results of OLS regressions of the FFT of the assigned teacher on individual studentcharacteristics. Free/reduced-price lunch eligibility is coded as 0 for students who do not have information onthis variable in the dataset. All regressions control for randomization block ﬁxed eﬀects. Analytic standarderrors, clustered by randomization block, are in parentheses. ELL: English language learner. FFT: Frameworkfor teaching. able A.3: Tests of Assumption 1. Dependent variables: Diﬀerences between assigned and realizedteacher characteristics (1) (2) (3) (4) (5) (6)Panel I. Math sampleMale Teacher’s race Experience Master’swhite black hispanic (years) degreeFFT, assigned teacher 0.009 -0.039 -0.061 0.059 -2.727 -0.018(0.060) (0.107) (0.068) (0.059) (2.236) (0.099)Baseline test score 0.003 -0.003 0.005 -0.002 0.071 -0.006(0.005) (0.005) (0.004) (0.002) (0.082) (0.006)Avg. peer baseline test score, 0.027 0.037 0.007 -0.037 0.738 -0.052assigned peers (0.076) (0.058) (0.034) (0.041) (0.931) (0.089)F-test for joint signiﬁcance(p-value) 0.956 0.799 0.537 0.731 0.635 0.849 R R Note:

The table presents results of OLS regressions of diﬀerences between assigned and realized teachercharacteristics on individual baseline test scores, the FFT of the assigned teacher, and the average baselinetest scores of the assigned peers. Each column represents a regression with a diﬀerent dependent variable. Allregressions control for randomization block ﬁxed eﬀects. Analytic standard errors, clustered by randomizationblock, are in parentheses. Teachers with ethnicity ”white” and ethnicity ”other” are pooled into one category.FFT: Framework for teaching.**signiﬁcant at the 5%-level, *signiﬁcant at the 10%-level. able A.4: Tests of Assumption 2 in the math sample. Dependent variables: diﬀerences between assigned and realized peer characteristics(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)Diﬀerence between assigned Age Male Gifted Special ELL FRL Race/ethnicityand realized peer characteristics education White Black Hispanic AsianFFT, assigned teacher 0.008 0.015 0.006 -0.017 0.006 -0.005 0.004 -0.010 0.006 -0.001(0.019) (0.014) (0.009) (0.011) (0.012) (0.016) (0.011) (0.011) (0.013) (0.007)Baseline test score -0.006 0.000 0.004** -0.001 0.000 0.000 0.001 -0.002 0.000 0.001*(0.003) (0.001) (0.002) (0.001) (0.001) (0.002) (0.001) (0.002) (0.001) (0.001)Avg. peer baseline test score, -0.030* -0.006 -0.013 0.009 0.033** -0.001 -0.012 0.012 -0.007 0.008assigned peers (0.017) (0.013) (0.009) (0.008) (0.011) (0.014) (0.010) (0.011) (0.011) (0.005) R Note:

The table presents results of OLS regressions of diﬀerences between assigned and realized peer characteristics on the FFT of the assigned teacher,individual baseline test scores, and the average baseline test scores of the assigned peers. Each column represents a regression with a diﬀerent dependentvariable. All regressions control for randomization block ﬁxed eﬀects. Analytic standard errors, clustered by randomization block, are in parentheses. ELL:English language learner, FRL: Free/reduced-price lunch eligible, FFT: Framework for teaching.** signiﬁcant at the 5%-level * signiﬁcant at the 10%-level. able A.5: Tests of Assumption 2 in the ELA sample. Dependent variables: diﬀerences between assigned and realized peer characteristics(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)Diﬀerence between assigned Age Male Gifted Special ELL FRL Race/ethnicityand realized peer characteristics education White Black Hispanic AsianFFT, assigned teacher 0.000 0.008 0.008 -0.012 0.022 -0.003 -0.008 -0.001 0.012 -0.002(0.020) (0.014) (0.009) (0.011) (0.017) (0.014) (0.012) (0.013) (0.014) (0.009)Baseline test score -0.005** -0.001 0.002 0.000 -0.001 -0.002 0.001 0.000 -0.001 0.001(0.003) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.000)Avg. peer baseline test score, -0.009 0.000 -0.029*** 0.013 0.028** 0.007 -0.002 0.001 0.001 0.000assigned peers (0.013) (0.011) (0.007) (0.009) (0.009) (0.011) (0.008) (0.008) (0.010) (0.006) R Note:

The table presents results of OLS regressions of the average peer baseline test scores of the realizedpeers on individual baseline test scores, FFT of the assigned teacher, and the average peer baseline test scoresof the assigned peers in the math and ELA samples. All regressions control for randomization block ﬁxedeﬀects. Analytic standard errors, clustered by randomization block, are in parentheses. FFT: Framework forteaching.*** signiﬁcant at the 1%-level. able A.7: Tests for instrument relevance(1) (2) (3)Panel A. Math First-stage F-statisticsunconditional SWVariable F(10,238) p-value F(1, 238)FFT middle 17.57 (0.000) 60.17FFT high 11.62 (0.000) 57.95FFT middle × baseline middle 82.49 (0.000) 498.96FFT high × baseline middle 33.30 (0.000) 387.27FFT middle × baseline high 39.59 (0.000) 322.43FFT high × baseline high 36.92 (0.000) 378.59FFT middle × fraction peers middle 17.47 (0.000) 113.27FFT high × fraction peers middle 12.85 (0.000) 102.70FFT middle × fraction peers high 8.07 (0.000) 182.12FFT middle × fraction peers high 12.01 (0.000) 137.55Panel B. ELA First-stage F-statisticsunconditional SWVariable F(10,238) p-value F(1, 238)FFT middle 26.21 (0.000) 75.61FFT high 22.34 (0.000) 102.87FFT middle × baseline middle 146.86 (0.000) 836.59FFT high × baseline middle 121.72 (0.000) 1500.06FFT middle × baseline high 128.34 (0.000) 1226.48FFT high × baseline high 88.44 (0.000) 1724.38FFT middle × fraction peers middle 25.06 (0.000) 164.19FFT high × fraction peers middle 21.57 (0.000) 213.78FFT middle × fraction peers high 13.9 (0.000) 362.49FFT middle × fraction peers high 13.21 (0.000) 477.11 Note:

The table presents weak instrument F-tests following Sanderson and Windmeijer (2016). Columns 1and 2 present the unconditional F-statistics from ﬁrst-stage regressions of equation (23). Column 3 presentsthe Sanderson-Windmeijer conditional F-statistics for each of the endogenous variables. The F-statistics incolumn 3 need to be compared to the critical values of the Stock-Yogo weak identiﬁcation F-test (Stock andYogo, 2005). All of the F-statistics exceed the Stock-Yogo critical values for a 5% maximal IV relative bias(critical value is 20.74). able A.8: OLS regression results of the linear model. Dependent variables: student test scoreoutcomes (1) (2) (3) (4) (5) (6)A. Only teacher B. Full model C. Withouteﬀects teacher × peerinteractionsMath ELA Math ELA Math ELA δ FFT 0.088** 0.118** 0.081** 0.116** 0.080** 0.116**(0.040) (0.045) (0.040) (0.045) (0.040) (0.044) η FFT × baseline 0.060* 0.088** 0.059** 0.078**(0.031) (0.032) (0.028) (0.030) λ FFT × avg. peer baseline -0.006 -0.066(0.066) (0.081) β Baseline 0.749*** 0.690*** 0.595*** 0.460*** 0.598*** 0.487***(0.011) (0.011) (0.079) (0.081) (0.073) (0.078) R Note:

The dependent variables are subject-speciﬁc test score outcomes in math and ELA. The speciﬁcationsinclude linear terms for FFT, individual and peer baseline test scores. All regressions control for the h ( x, x )function (see Section 2) and for randomization block ﬁxed eﬀects. Analytic standard errors, clustered byrandomization block, are in parentheses. FFT: Framework for teaching.*** signiﬁcant at the 1%-level ** signiﬁcant at the 5%-level * signiﬁcant at the 10%-level. able A.9: OLS regression results of the 3 × × peerinteractionsMath ELA Math ELA Math ELA δ FFT middle 0.062* 0.028 0.032 -0.058 0.032 -0.127**(0.032) (0.034) (0.097) (0.091) (0.038) (0.045)FFT high 0.066 0.048 -0.060 -0.043 -0.016 -0.141**(0.041) (0.043) (0.113) (0.125) (0.052) (0.054) η FFT middle × baseline middle 0.030 0.108** 0.033 0.102**(0.047) (0.054) (0.045) (0.051) × baseline high 0.039 0.136** 0.053 0.137**(0.060) (0.067) (0.055) (0.062)FFT high × baseline middle 0.100* 0.113* 0.098* 0.094(0.059) (0.060) (0.059) (0.059) × baseline high 0.120* 0.225** 0.104 0.193**(0.068) (0.075) (0.067) (0.072) λ FFT middle × fraction peers middle -0.024 -0.204(0.198) (0.187) × fraction peers high -0.045 -0.126(0.187) (0.154)FFT high × fraction peers middle 0.155 -0.187(0.224) (0.232) × fraction peers high 0.460 -0.355(0.409) (0.343) β Baseline middle 0.911*** 0.823*** 0.782*** 0.713*** 0.783*** 0.723***(0.072) (0.075) (0.076) (0.093) (0.075) (0.095)Baseline high 1.678*** 1.636*** 1.551*** 1.443*** 1.544*** 1.454***(0.105) (0.101) (0.111) (0.113) (0.106) (0.116) R Note:

The dependent variables are subject-speciﬁc test score outcomes in math and ELA. All regressionscontrol for the h ( x, x ) function (see Section 2) and for randomization block ﬁxed eﬀects. Analytic standarderrors, clustered by randomization block, are in parentheses. FFT: Framework for teaching.*** signiﬁcant at the 1%-level ** signiﬁcant at the 5%-level * signiﬁcant at the 10%-level. able A.10: IV regression results of the 2 × × peerinteractionsMath ELA Math ELA Math ELA δ FFT high 0.072* -0.060 0.088 0.170 0.056 -0.027(0.038) (0.043) (0.087) (0.107) (0.047) (0.046) η × baseline high 0.039 -0.014 0.030 -0.062(0.053) (0.055) (0.052) (0.055) λ × fraction peers high -0.068 -0.422*(0.149) (0.229) β Baseline high 0.935*** 0.927*** 0.920*** 0.938*** 0.924*** 0.956***(0.058) (0.054) (0.058) (0.060) (0.057) (0.060) R Note:

The dependent variables are subject-speciﬁc test score outcomes in math and ELA. The instrumentalvariables are based on assigned teacher FFT (Panels A–C) and assigned peer baseline test scores (Panel B). Allregressions control for the h ( x, x ) function (see Section 2) and for randomization block ﬁxed eﬀects. Analyticstandard errors, clustered by randomization block, are in parentheses. FFT: Framework for teaching.*** signiﬁcant at the 1%-level ** signiﬁcant at the 5%-level * signiﬁcant at the 10%-level. able A.11: IV regression results of the 4 × × peerinteractionsMath ELA Math ELA Math ELA δ FFT lower middle 0.146** -0.037 0.029 0.097 0.114 -0.056(0.063) (0.048) (0.307) (0.188) (0.092) (0.059)FFT upper middle 0.029 0.013 -0.656* -0.222 -0.035 -0.114(0.059) (0.055) (0.386) (0.268) (0.078) (0.072)FFT high 0.017 -0.006 -1.622** -0.044 -0.162 -0.132(0.069) (0.060) (0.815) (0.213) (0.122) (0.086) η FFT lower middle × baseline lower middle 0.111 0.084 0.083 0.074(0.096) (0.075) (0.095) (0.072) × baseline upper middle 0.028 0.023 -0.001 0.007(0.104) (0.087) (0.107) (0.079) × baseline high 0.053 0.028 0.016 0.002(0.125) (0.107) (0.124) (0.104)FFT upper middle × baseline lower middle 0.101 0.103 0.096 0.103(0.087) (0.070) (0.078) (0.069) × baseline upper middle 0.001 0.141* 0.015 0.142*(0.087) (0.084) (0.083) (0.080) × baseline high 0.110 0.214** 0.111 0.222**(0.115) (0.092) (0.100) (0.093)FFT high × baseline lower middle 0.269** 0.144* 0.200* 0.118(0.120) (0.075) (0.104) (0.074) × baseline upper middle 0.136 0.232** 0.140 0.181**(0.111) (0.089) (0.113) (0.085) × baseline high 0.266** 0.251** 0.253** 0.182*(0.130) (0.097) (0.124) (0.099) β Baseline lower middle 0.635*** 0.780*** 0.517** 0.747*** 0.563*** 0.721***(0.131) (0.098) (0.195) (0.105) (0.156) (0.107)Baseline upper middle 1.228*** 1.273*** 1.143*** 1.246*** 1.201*** 1.203***(0.160) (0.116) (0.218) (0.124) (0.177) (0.127)Baseline high 1.913*** 2.170*** 1.750*** 2.121*** 1.835*** 2.072***(0.172) (0.141) (0.246) (0.148) (0.201) (0.151) λ (cid:88) (cid:88) R Note:

The dependent variables are subject-speciﬁc test score outcomes in math and ELA. The instrumentalvariables are based on assigned teacher CLASS (Panels A–C) and assigned peer baseline test scores (PanelB). For details on the CLASS measure of teaching practices, see Appendix B. All regressions control for the h ( x, x ) function (see Section 2) and for randomization block ﬁxed eﬀects. Analytic standard errors, clusteredby randomization block, are in parentheses.***signiﬁcant at the 1%-level, **signiﬁcant at the 5%-level, *signiﬁcant at the 10%-level. able A.13: Average reallocation gains in math: sensitivity to restrictions on possible assignments.Gains expressed in test score standard deviations.(1) (2) (3) (4) (5) (6) (7) (8)Panel A. Optimal versus status quoA.I Within school type A.II Within randomization blockall students high middle low all students high middle lowGain 0.019 0.031 0.015 0.015 0.005 0.009 0.004 0.003SE (0.007) (0.015) (0.010) (0.012) (0.002) (0.004) (0.003) (0.003)N 8,534 2,332 3,108 3,094 8,534 2,332 3,108 3,094Panel B. Optimal versus worst allocationB.I Within school type B.II Within randomization blockall students high middle low all students high middle lowGain 0.048 0.089 0.025 0.041 0.011 0.020 0.008 0.008SE (0.015) (0.039) (0.019) (0.028) (0.003) (0.009) (0.006) (0.007)N 8,534 2,332 3,108 3,094 8,534 2,332 3,108 3,094 Note:

B.1 Construction of the dataset from the MET ﬁles

Our dataset combines eight diﬀerent data ﬁles from the MET study (2018 release): therandomization ﬁle, the teacher ﬁle, the class section ﬁle, the student ﬁle, two classroomobservation score ﬁles (the CLASS and the FFT ﬁle), as well as the district-wide ﬁles for theschool years 2009/10 and 2010/11.The basis of our data construction is the randomization ﬁle. It contains identiﬁers forall students who were randomly assigned to a teacher in the second year of the MET study,and an identiﬁer (identiﬁers) for their assigned teacher(s). Through the teacher identiﬁers,we merge this dataset to the FFT and CLASS ﬁles, and thus obtain the FFT and CLASSof the assigned teachers. Moreover, we use the teacher identiﬁers to merge the data to theteacher ﬁle, in order to obtain teacher background characteristics of the assigned teachers.Student characteristics and test score outcomes come from the student ﬁle and the district-wide ﬁles. Through a student identiﬁer, we ﬁrst merge the randomization ﬁle with thestudent ﬁle, which contains individual information for all students who were part of theMET study and still had a MET teacher at the end of the school year; i.e., the studentﬁle does not contain any information on students who switch to a non-MET teacher withinthe same school, a non-MET school, or a non-MET district. We use the student ﬁle toextract students’ demographic characteristics, baseline test scores (test scores in school year2009/10), and test score outcomes (test scores in school year 2010/11) in math and ELA.While there are no missings for student background characteristics, some of the studentshave missing baseline test scores or missing test score outcomes. Therefore, we obtain themissing test score information from the district-wide ﬁles.To construct information on the peer group composition in the assigned classroom, weaverage the student background characteristics and baseline test scores at the level of theassigned teacher and randomization block, since each teacher can only be assigned to oneclassroom within a randomization block. To be precise, we construct the leave-own-out mean,i.e., the classroom mean excluding the student herself.Information on the realized teacher and the realized peers are constructed based on thestudent ﬁle. This ﬁle contains an identiﬁer for the section that the student attended in schoolyear 2010/11, as well as an identiﬁer for the teacher who taught the section. Based on thisteacher identifer, we merge information on the FFT, CLASS, and background characteristicsof the realized teacher. Furthermore, we construct leave-own-out means of peer characteristicsbased on the section identiﬁers. We add information on the size of the realized classroomfrom the class section ﬁle.

B.2 Construction of the estimation sample

In our analysis, we focus on students in grades 4-8 who were randomized to a teacher beforethe start of the school year. We create separate samples for math and ELA. In total, therandomization sample contains information on about 16,000 students who were randomizedto a teacher in math and ELA in grades 4-8.57 able B.1: Sample construction from the randomization ﬁle(1) (2) (3) (4)Restrictions Math ELAN percent N percentof base of basesample sample1. Students in randomization ﬁle 15,749 - 16,252 -2. Record in the student ﬁle 10,268 - 11,271 -3. At least two classrooms per randomizationblock (base sample) 9,824 100% 10,856 100%4. Baseline test scores and test score outcomesavailable 9,245 94% 10,136 93%5. FFT of the assigned teacher available 9,066 92% 10,057 93%6. FFT of the realized teacher available 8,724 89% 9,767 90%7. Information on the assigned peers available 8,718 89% 9,762 90%8. Information on the realized peers available 8,717 89% 9,761 90%9. At least two classrooms per randomizationblock after applying all restrictions 8,534 87% 9,641 89%

Note:

Sample restrictions used to create the estimation sample from MET data ﬁles, school year 2010/11.

Table B.1 details further restrictions that we apply to construct our estimation dataset.We require each student to have a record in the student ﬁle, since we use this ﬁle to identifythe realized teacher, the realized peer group, and the student background variables. Abouttwo-thirds of the students in the randomization sample can be identiﬁed in the student ﬁle.Moreover, we restrict our sample to only randomization blocks with at least two classrooms.The resulting dataset forms our base sample.After constructing our base sample, we remove observations with missing informationon baseline test scores or test score outcomes, with missing information on the FFT ofthe assigned or realized teacher, and with missing information on the baseline test scoresof the assigned or realized peers. We further remove all randomization blocks with onlyone classroom after applying these restrictions. Our resulting estimation sample containsinformation on about 8,500 students in math and 9,600 students in ELA. Thus, we retainabout 87 percent of observations from the base sample in math, and 89 percent of observationsfrom the base sample in ELA.Our sample size is close to the sample size reported by Garrett and Steinberg (2015).We obtain a slightly larger sample size because we complete missing test score informationin the student ﬁle with information from the district-wide ﬁles. These ﬁles are part of the2018 MET release and were not available when Garrett and Steinberg (2015) published theirstudy.Our estimation dataset does not contain any missing information for the main variablesused in the analysis; however, it contains some missings in teacher and student demographics.With the exception of free/reduced-price lunch eligibility (30 percent missings in math and 25percent missings in ELA), the student background variables contain (virtually) no missings.Further missings occur in the teacher demographics: teacher gender and race/ethnicity aremissing for 3–4 percent of the estimation sample. Teacher experience and teachers’ education58re missing for about 30 percent of the estimation sample each, because one school districtdid not provide information on teachers’ education, and another district did not provide in-formation on teachers’ experience. Full information on teacher demographics is only availablefor 42 percent of the math sample, and for 48 percent of the ELA sample.

B.3 Sample comparisons

This section investigates in how far the randomization sample diﬀers from the estimationsample. It compares the distribution of student baseline test scores and teacher FFT in theoriginal randomization sample with the distribution in the estimation sample.The randomization sample consists of about 15,700 students in math and 16,300 studentsin ELA; baseline test scores are available for about 13,900 students in math and for about14,400 students in ELA (see Figure B.1). The remaining student observations can neitherbe matched to the district-wide ﬁles nor be matched to the student ﬁle. Our estimationsample contains only students in the student ﬁle – because we use this ﬁle to identify realizedteachers and peers – and is thus considerably smaller (about 8,500 students in math andabout 9,600 students in ELA). Students in the estimation dataset have on average higherbaseline test scores than the students in the randomization dataset. The diﬀerence amountsto 0.06 standard deviations in math and to 0.05 standard deviations in ELA.

Figure B.1: Distribution of baseline test scores in the randomization sample and the estimationsample . . . . . F r a c t i on o f s t uden t s -4 -2 0 2 4 Baseline test scoresNumber of students = 13920, mean = .05 (1) Math baseline test scores- randomization sample . . . . . F r a c t i on o f s t uden t s -4 -2 0 2 4 Baseline test scoresNumber of students = 8534, mean = .11 (2) Math baseline test scores- estimation sample . . . . . F r a c t i on o f s t uden t s -4 -2 0 2 4 Baseline test scoresNumber of students = 14424, mean = .12 (3) ELA baseline test scores- randomization sample . . . . . F r a c t i on o f s t uden t s -4 -2 0 2 4 Baseline test scoresNumber of students = 9641, mean = .17 (4) ELA baseline test scores- estimation sample

Note:

The ﬁgure compares the distributions of student baseline test scores across the randomization sampleand the estimation sample for math (Panels 1 and 2) and ELA (Panels 3 and 4).

Teacher FFT does not diﬀer appreciably between the randomization sample and theestimation sample (see Figure B.2). In the math randomization sample for grades 4–8, 66659eachers have non-missing information on FFT; the estimation sample includes 614 teachers.In both samples, the average FFT is 2.52. In the ELA randomization sample for these grades,705 teacher have non-missing FFT, and the estimation sample contains 649 teachers. Theaverage FFT in both samples are nearly identical with 2.57 and 2.58.

Figure B.2: Distribution of FFT in the randomization sample and the estimation sample . . . . F r a c t i on o f t ea c he r s FFTNumber of teachers = 666, mean = 2.52 (1) FFT (math) - randomization sample . . . . F r a c t i on o f t ea c he r s FFTNumber of teachers = 614, mean = 2.52 (2) FFT (math) - estimation sample . . . . F r a c t i on o f t ea c he r s FFTNumber of teachers = 705, mean = 2.57 (3) FFT (ELA) - randomization sample . . . . F r a c t i on o f t ea c he r s FFTNumber of teachers = 649, mean = 2.58 (4) FFT (ELA) - estimation sample

Note:

The ﬁgure compares the distributions of teacher FFT across the randomization sample and the esti-mation sample for math teachers (Panels 1 and 2) and ELA teachers (Panels 3 and 4). The red lines denotethe FFT cutoﬀs we use to classify teachers in our preferred speciﬁcation.

B.4 Information on the CLASS measure

Information from the MET User Guide.

The following information about the CLASS(Classroom Assessment Scoring System) protocol is taken from the MET User Guide (Whiteet al., 2019, p. 32–33):“CLASS is an observational protocol designed to measure the extent to which teacherseﬀectively support children’s social and academic development. Two diﬀerent versions ofCLASS were used in the MET Study: the Upper Elementary (Grades 4-5) and the Secondary(Grades 6-9).“The CLASS instrument is divided into three broad domains of measurement: EmotionalSupport, Classroom Organization, and Instructional Support. Each domain, in turn, is mea-sured by a number of dimensions. The domain “Emotional Support,” for example, refers tothe emotional tone in a classroom, which can be measured along four dimensions: positive cli-mate, negative climate, teacher sensitivity, and regard for student perspectives. The domain“Classroom Organization” refers to the ways a classroom is structured to manage students’behavior, time, and attention, which can be measured along three dimensions: behavior60anagement, productivity, and instructional learning formats. The domain “InstructionalSupports” refers to the ways a teacher provides supports to encourage student conceptualunderstanding and student problem solving and can be measured along four dimensions:content understanding, analysis and problem solving, instructional dialogue, and quality offeedback. [...]“CLASS scoring is done using a detailed scoring rubric. In this rubric, a classroom isscored on each instructional dimension at 15-minute intervals using a 7-point scale. For theMET Study, only the ﬁrst 30 minutes of each video was scored. Scores are assigned basedon anchor descriptions of what is to be observed in order for a classroom to be scored at“high,” “mid,” and “low” points on the 7-point scale. In the MET Study, dimension scoresare often aggregated to higher levels of analysis simply by averaging raters’ scores to geta single segment score and then calculating the harmonic mean of segment scores acrossall segments for a particular target of measurement (e.g., a day, a class section, a teacher).Standard errors of measurement for these derived scores are not generally reported.”

Use of the CLASS measure in our study.

We use the CLASS measure to test thesensitivity of our results to the classroom observation protocol used. To construct a uniquemeasure for each teacher, we take the average across the three CLASS domains.Our estimation sample contains 466 teachers in the math sample and 495 teachers in theELA sample. In the math sample, CLASS ranges from 2.54 to 5.58, with a mean of 4.34. Inthe ELA sample, CLASS ranges from 2.95 to 5.58, with a mean of 4.39. We split teachersin three categories according to this measure. We choose the cutoﬀ values of 4 and 4.5 tocarry out the split. Both in the math and in the ELA sample, 20 percent of the teachers areclassiﬁed as low, 42 percent as middle, and 38 percent as high.61

Optimal allocation: linear program

The linear program aims at maximizing the aggregate test score outcomes in the data. Weuse (cid:98) Y c ( w ) to denote the predicted aggregate outcome of classroom c when assigned a teacherof level w , where w = w L when assigned a low-, w = w M when assigned a middle-, and w = w H when assigned a high-FFT teacher (see Section 4 for details). We use C to denotethe total number of teachers in the dataset, and C w to denote the number of teachers of level w in the dataset. We deﬁne an indicator variable α cw , which takes the value 1 if classroom c is taught by a teacher of level w , and 0 otherwise. We also deﬁne an assignment matrix A ,which contains all α cw . The linear program can then be written as:max A C (cid:88) c =1 (cid:88) w ∈{ w L ,w M ,w H } α cw (cid:98) Y c ( w )subject to (cid:88) w ∈{ w L ,w M ,w H } α cw = 1 ∀ c ∈ C C (cid:88) c =1 α cw = C w for w ∈ { w L , w M , w H } α cw ∈ { , } This is a transportation problem with C + 3 + (3 × C ) constraints. We solve the trans-portation problem in R using lpSolve.lpSolve.