Recruitment, effort, and retention effects of performance contracts for civil servants: Experimental evidence from Rwandan primary schools

Abstract

This paper reports on a two-tiered experiment designed to separately identify the selection and effort margins of pay-for-performance (P4P). At the recruitment stage, teacher labor markets were randomly assigned to a 'pay-for-percentile' or fixed-wage contract. Once recruits were placed, an unexpected, incentive-compatible, school-level re-randomization was performed, so that some teachers who applied for a fixed-wage contract ended up being paid by P4P, and vice versa. By the second year of the study, the within-year effort effect of P4P was 0.16 standard deviations of pupil learning, with the total effect rising to 0.20 standard deviations after allowing for selection.

Clare Leaver, Owen Ozier, Pieter Serneels, and Andrew Zeitlin∗

January 31, 2021

∗ Leaver: Blavatnik School of Government, University of Oxford and CEPR. Ozier: Department of Economics, Williams College, World Bank Development Research Group, BREAD, and IZA. Serneels: School of International Development, University of East Anglia, EGAP, and IZA. Zeitlin: McCourt School of Public Policy, Georgetown University, and CGD. We thank counterparts at REB and MINEDUC for advice and collaboration, and David Johnson for help with the design of student and teacher assessments. We are grateful to the three anonymous referees, Katherine Casey, Jasper Cooper, Ernesto Dal Bó, Erika Deserranno, David Evans, Dean Eckles, Frederico Finan, James Habyarimana, Caroline Hoxby, Macartan Humphreys, Pamela Jakiela, Julien Labonne, David McKenzie, Ben Olken, Berk Özler, Cyrus Samii, Kunal Sen, Martin Williams, and audiences at BREAD, DfID, EDI, NBER, SIOE, and SREE for helpful comments. IPA staff members Kris Cox, Stephanie De Mel, Olive Karekezi Kemirembe, Doug Kirke-Smith, Emmanuel Musafiri, and Phillip Okull, and research assistants Claire Cullen, Robbie Dean, Ali Hamza, Gerald Ipapa, and Saahil Karpe all provided excellent support. Financial support was provided by the U.K. Department for International Development (DfID) via the International Growth Centre and the Economic Development and Institutions Programme, by Oxford University's John Fell Fund, and by the World Bank's SIEF and REACH trust funds. Leaver is grateful for the hospitality of the Toulouse School of Economics, 2018–2019. Research was conducted under Rwanda Ministry of Education permit number MINEDUC/S&T/308/2015 and received IRB approval from the Rwanda National Ethics Committee (protocol 00001497) and from Innovations for Poverty Action (protocol 1502). This study is registered as AEA RCT Registry ID AEARCTR-0002565 (Leaver et al., 2018). The findings in this paper are the opinions of the authors, and do not represent the opinions of the World Bank, its Executive Directors, or the governments they represent. All errors and omissions are our own.

The ability to recruit, elicit effort from, and retain civil servants is a central issue for any government. This is particularly true in a sector such as education where people (that is, human rather than physical resources) play a key role. Effective teachers generate private returns for students through learning gains, educational attainment, and higher earnings (Chetty, Friedman and Rockoff, 2014a,b), as well as social returns through improved labor-market skills that drive economic growth (Hanushek and Woessmann, 2012). And yet in varying contexts around the world, governments struggle to maintain a skilled and motivated teacher workforce (Bold et al., 2017).

One policy option in this context is pay-for-performance. These compensation schemes typically reward teacher inputs such as presence and conduct in the classroom, teacher value added based on student learning, or both (see, e.g., Muralidharan and Sundararaman, 2011b).
In principle, they can address the difficulty of screening for teacher quality ex ante (Staiger and Rockoff, 2010), as well as the limited oversight of teachers on the job (Chaudhury et al., 2006).

Yet pay-for-performance divides opinion. Critics, drawing upon public administration, social psychology, and behavioral economics, argue that pay-for-performance could dampen the effort of workers (Bénabou and Tirole, 2003; Deci and Ryan, 1985; Kreps, 1997). Concerns are that such schemes may: recruit the wrong types, individuals who are "in it for the money"; lower effort by eroding intrinsic motivation; and fail to retain the right types because good teachers become de-motivated and quit. By contrast, proponents point to classic contract theory (Lazear, 2003; Rothstein, 2015) and evidence from private-sector jobs with readily measurable output (Lazear, 2000) to argue that pay-for-performance will have positive effects on both compositional and effort margins. Under this view, such schemes: recruit the right types, individuals who anticipate performing well in the classroom; raise effort by strengthening extrinsic motivation; and retain the right types because good teachers feel rewarded and stay put.

This paper conducts the first prospective, randomized controlled trial designed to identify both the compositional and effort margins of pay-for-performance. A novel, two-tiered experiment separately identifies these effects. This is combined with detailed data on applicants to jobs, the skills and motivations of actual hires, and their performance over two years on the job, to evaluate the effects of pay-for-performance on the recruitment, effort, and retention of civil servant teachers.

At the center of this study is a pay-for-performance (hereafter P4P) contract, designed jointly with the Rwanda Education Board and Ministry of Education.
Building on extensive consultations and a pilot year, this P4P contract rewards the top 20 percent of teachers with extra pay using a metric that equally weights learning outcomes in teachers' classrooms alongside three measures of teachers' inputs into the classroom (presence, lesson planning, and observed pedagogy). The measure of learning used was based on a pay-for-percentile scheme that makes student performance at all levels relevant to teacher rewards (Barlevy and Neal, 2012). The tournament nature of this contract allows us to compare it to a fixed-wage (hereafter FW) contract that is equal in expected payout.

Our two-tiered experiment first randomly assigns labor markets to either P4P or FW advertisements, and then uses a surprise re-randomization of experienced contracts at the school level to enable estimation of pure compositional effects within each realized contract type. The first stage was undertaken during recruitment for teacher placements for the 2016 school year. Teacher labor markets are defined at the district-by-subject-family level. We conducted the experiment in six districts (18 labor markets) which, together, cover more than half the upper-primary teacher hiring lines for the 2016 school year. We recruited into the study all primary schools that received such a teacher to fill an upper-primary teaching post (a total of 164 schools). The second stage was undertaken once 2016 teacher placements had been finalized. Here, we randomly re-assigned each of these 164 study schools in their entirety to either P4P or FW contracts; all teachers who taught core-curricular classes to upper-primary students, including both newly placed recruits and incumbents, were eligible for the relevant contracts. We offered a signing bonus to ensure that no recruit, regardless of her belief about the probability of winning, could be made worse off by the re-randomization and, consistent with this, no one turned down their (re-)randomized contract.
As advertised at the time of recruitment, incentives were in place for two years, enabling us to study retention as well as to estimate higher-powered tests of effects using outcomes from both years.

Our three main findings are as follows. First, on recruitment, advertised P4P contracts did not change the distribution of measured teacher skill either among applicants in general or among new hires in particular. This is estimated sufficiently precisely to rule out even small negative effects of P4P on measured skills. Advertised P4P contracts did, however, select teachers who contributed less in a framed Dictator Game played at baseline to measure intrinsic motivation. In spite of this, teachers recruited under P4P were at least as effective in promoting learning as were those recruited under FW (holding experienced contracts constant).

Second, in terms of incentivizing effort, placed teachers working under P4P contracts elicited better performance from their students than teachers working under FW contracts (holding advertised contracts constant). Averaging over the two years of the study, the within-year effort effect of P4P was 0.11 standard deviations of pupil learning and for the second year alone, the within-year effort effect of P4P was 0.16 standard deviations. There is no evidence of a differential impact of experienced contracts by type of advertisement.

In addition to teacher characteristics and student outcomes, we observe a range of teacher behaviors. These behaviors corroborate our first finding: P4P recruits performed no worse than the FW recruits in terms of their presence, preparation, and observed pedagogy. They also indicate that the learning gains brought about by those experiencing P4P contracts may have been driven, at least in part, by improved teacher presence and pedagogy. Teacher presence was 8 percentage points higher among recruits who experienced the P4P contract compared to recruits who experienced the FW contract.
This is a sizeable impact given that baseline teacher presence was close to 90 percent. And teachers who experienced P4P were more effective in their classroom practices than teachers who experienced FW by 0.10 points, as measured on a 4-point scale.

Third, on retention, teachers working under P4P contracts were no more likely to quit during the two years of the study than teachers working under FW contracts. There was also no evidence of differential selection-out on baseline teacher characteristics by experienced contract, either in terms of skills or measured motivation. On the retention margin, we therefore find little evidence to support claims made by either proponents or opponents of pay-for-performance.

To sum up, by the second year of the study, we estimate the within-year effort effect of P4P to be 0.16 standard deviations of pupil learning, with the total effect rising to 0.20 standard deviations after allowing for selection. Despite evidence of lower intrinsic motivation among those recruited under P4P, these teachers were at least as effective in promoting learning as were those recruited under FW. These results support the view that pay-for-performance can improve effort while also allaying fears of harmful effects on selection. Of course, we have studied a two-year intervention; impacts of a long-term policy might be different, particularly if P4P influences individuals' early-career decisions to train as a teacher.

Our findings bring new experimental results on pay-for-performance to the literature on the recruitment of civil servants in low- and middle-income countries. Existing papers have examined the impact of advertising higher unconditional salaries and career-track motivations, with mixed results. In Mexico, Dal Bó, Finan and Rossi (2013) find that higher base salaries attracted both skilled and motivated applicants for civil service jobs.
In Uganda, Deserranno (2019) finds that the expectation of higher earnings discouraged pro-social applicants for village promoter roles, resulting in lower effort and retention. And in Zambia, Ashraf et al. (2020) find that emphasis on career-track motivations for community health work, while attracting some applicants who were less pro-social, resulted in hires of equal pro-sociality and greater talent overall, leading to improvements in a range of health outcomes. By studying pay-for-performance and by separately manipulating advertised and experienced contracts, we add evidence on the compositional and effort margins of a different, and widely debated, compensation policy for civil servants.

How the teaching workforce changes in response to pay-for-performance is of interest in high-income contexts as well. In the United States, there is a large (but chiefly observational) literature on the impact of compensation on who enters and leaves the teaching workforce. Well-known studies have simulated the consequences of dismissal policies (Chetty, Friedman and Rockoff, 2014b; Neal, 2011; Rothstein, 2015) or examined the role of teachers' outside options in labor supply (Chingos and West, 2012). Recent work has examined the District of Columbia's teacher evaluation system, where financial incentives are linked to measures of teacher performance (including student test scores): Dee and Wyckoff (2015) use a regression discontinuity design to show that low-performing teachers were more likely to quit voluntarily, while Adnot et al. (2017) confirm that these 'quitters' were replaced by higher-performers. In Wisconsin, a reform permitted approximately half of the state's school districts to introduce flexible salary schemes that allow pay to vary with performance.
In that setting, Biasi (2019) finds that high-value-added teachers were more likely to move to districts with flexible pay, and were less likely to quit, than their low-value-added counterparts. Our prospective, experimental study of pay-for-performance contributes to this literature methodologically but also substantively since the Rwandan labor market shares important features with high-income contexts.

While our paper is not the first on the broader topic of incentive-based contracts for teachers, we go to some length to address two challenges thought to be important for policy implementation at scale. One is that the structure of the incentive should not unfairly disadvantage any particular group (Barlevy and Neal, 2012); the other is that the incentive should not be inappropriately narrow (Stecher et al., 2018). We address the first issue by using a measure of learning based on a pay-for-percentile scheme that makes student performance at all levels relevant to teacher rewards, and the second by combining this with measures of teachers' inputs into the classroom to create a broad, composite metric. There is a small but growing literature studying pay-for-percentile schemes in education: Loyalka et al. (2019) in China, Gilligan et al. (forthcoming) in Uganda, and Mbiti, Romero and Schipper (2018) in Tanzania. Our contribution is to compare the effectiveness of contracts, P4P versus FW, that are based on a composite metric and are budget neutral in salary.

A final, methodological contribution of the paper, in addition to the experimental design, is the way in which we develop a pre-analysis plan. In our registered plan (Leaver et al., 2018), we pose three questions. What outcomes to study? What hypotheses to test for each outcome? And how to test each hypothesis? We answered the 'what' questions on the basis of theory, policy relevance, and available data. With these questions settled, we then answered the 'how' question using blinded data.
Specifically, we used a blinded dataset that allowed us to learn about a subset of the statistical properties of our data without deriving hypotheses from realized treatment responses, as advocated by, e.g., Olken (2015). This approach achieves power gains by choosing from among specifications and test statistics on the basis of simulated power, while protecting against the risk of false positives that could arise if specifications were chosen on the basis of their realized statistical significance. The spirit of this approach is similar to recent work by Anderson and Magruder (2017) and Fafchamps and Labonne (2017). For an experimental study in which one important dimension of variation occurs at the labor-market level, and so is potentially limited in power, the gains from these specification choices are particularly important. The results reported in our pre-analysis plan demonstrate that, with specifications appropriately chosen, the study design is well powered, such that even null effects would be of both policy and academic interest.

In the remainder of the paper, Sections 1 and 2 describe the study design and data, Sections 3 and 4 report and discuss the results, and Section 5 concludes.

Footnote: Notably, there is no public sector pay premium in Rwanda, which is unusual for a low-income country and more typical of high-income countries (Finan, Olken and Pande, 2017). The 2017 Rwanda Labour Force Survey includes a small sample of recent Teacher Training College graduates (aged below 30). Of these, 37 percent were in teaching jobs earning an average monthly salary of 43,431 RWF, while 15 percent were in non-teaching jobs earning a higher average monthly salary of 56,347 RWF, a private sector premium of close to 30 percent (National Institute of Statistics of Rwanda, 2017).

Footnote: See, e.g., Imberman (2015) and Jackson, Rockoff and Staiger (2014) who provide a review.

Footnote: We have not found prior examples of such blinding in economics.
Humphreys, Sanchez de la Sierra and van der Windt (2013) argue for, and undertake, a related approach with partial endline data in a political science application.

Footnote: In contrast to those two papers, we forsake the opportunity to undertake exploratory analysis because our primary hypotheses were determined a priori by theory and policy relevance. In return, we avoid having to discard part of our sample, with associated power loss.

Study design

The first tier of the study took place during the actual recruitment for civil service teaching jobs in upper primary in six districts of Rwanda in 2016. To apply for a civil service teaching job, an individual needs to hold a Teacher Training College (TTC) degree. Eligibility is further defined by specialization. Districts solicit applications at the district-by-subject-family level, aggregating curricula subjects into three 'families' that correspond to the degree types issued by TTCs: math and science (TMS); modern languages (TML); and social studies (TSS). Districts invite applications between November and December, for the academic year beginning in late January/early February. Individuals keen to teach in a particular district submit one application and are then considered for all eligible teaching posts in that district in that hiring round.

Given this institutional setting, we can think of district-by-subject-family pairs as labor markets. The subject-family boundaries of these labor markets are rigid; within each district, TTC degree holders are considered for jobs in pools alongside others with the same qualification. The district boundaries may be more porous, though three quarters of the new teaching jobs in our study were filled by recruits living within the district at the time of application. Since this is the majority of jobs, we proceed by treating these labor markets as distinct for our primary analysis and provide robustness checks for cross-district applications in Online Appendix C.

There are 18 such labor markets in our study. This is a small number in terms of statistical power (as we address below) but not from a system-scale perspective. The study covers more than 600 hiring lines constituting over 60 percent of the country's planned recruitment in 2016. Importantly, it is not a foregone conclusion that TTC graduates will apply for these civil service teaching jobs.
Data from the 2017 Rwanda Labour Force Survey indicate that only 37 percent of recent TTC graduates were in teaching jobs, with 15 percent in non-teaching, salaried employment (National Institute of Statistics of Rwanda, 2017). This is not because the teacher labor market is tight; nationwide close to a quarter of vacancies created by a teacher leaving a school remain unfilled in the following school year (Zeitlin, 2021). A more plausible explanation is that the recent graduates in the outside sector earned a premium of close to 30 percent, making occupational choice after TTC a meaningful decision.

Contract structure

The experiment was built around the comparison of two contracts paying a bonus on top of teacher salaries in each of the 2016 and 2017 school years, and was managed by Innovations for Poverty Action (IPA) in coordination with REB. The first of these was a P4P contract, which paid a bonus of RWF 100,000 to the top 20 percent of teachers as ranked by a composite performance metric. This metric equally weighted student learning alongside three measures of teachers' inputs into the classroom (presence, lesson preparation, and observed pedagogy). The measure of learning was based on a pay-for-percentile scheme that makes student performance at all levels relevant to teacher rewards (Barlevy and Neal, 2012). The 2016 performance award was conditional on remaining in post during the entire 2016 school year, and was to be paid early in 2017. Likewise, the 2017 performance award was conditional on remaining in post during the entire 2017 school year, and was to be paid early in 2018. The second was a fixed-wage (FW) contract that paid RWF 20,000 to all upper-primary teachers. This bonus was paid at the same time as the performance award in the P4P contract.

Although P4P contracts based on a composite metric of teacher inputs and student performance have been used in a number of policy settings in the US (Imberman, 2015; Stecher et al., 2018), such contracts have been relatively less studied in low- and middle-income countries. In their comprehensive review, Glewwe and Muralidharan (2016) discuss several evaluations of teacher incentives based on student test scores or attendance checks, but none based on a combination of both. After extensive discussions with REB about what would be suitable in this policy setting, a decision was made to use the P4P contract described above, based on a composite metric.

Footnote: Upper primary refers to grades 4, 5, and 6; schools typically include grades 1 through 6.

Footnote: As we note in the appendix, cross-district applications would not lead us to find a selection effect where none existed but we might overstate the magnitude of any selection effect.

Footnote: Inference based on asymptotics could easily be invalid with 18 randomizable markets. We address this risk by committing to randomization inference for all aspects of statistical testing.
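To make the contract mechanics concrete, the following is a minimal Python sketch of a pay-for-percentile learning score folded into an equally weighted composite, with the top 20 percent of teachers flagged for the award. It is an illustration under stated assumptions, not the scoring code used in the study: the function names, the tie-handling rule, the single bracket label per student, and the scaling of every component to the unit interval are hypothetical choices made here.

```python
from collections import defaultdict


def percentile_ranks(values):
    """Rank each value in [0, 1] among its peers (assumed rule: ties count as half)."""
    n = len(values)
    if n == 1:
        return [0.5]
    ranks = []
    for v in values:
        below = sum(1 for u in values if u < v)
        ties = sum(1 for u in values if u == v) - 1  # exclude the value itself
        ranks.append((below + 0.5 * ties) / (n - 1))
    return ranks


def teacher_learning_scores(students):
    """Mean within-bracket percentile of each teacher's students.

    `students` is a list of dicts with hypothetical keys 'teacher', 'bracket'
    (a group of students with similar prior performance) and 'score'.
    """
    by_bracket = defaultdict(list)
    for s in students:
        by_bracket[s["bracket"]].append(s)
    per_teacher = defaultdict(list)
    for group in by_bracket.values():
        ranks = percentile_ranks([s["score"] for s in group])
        for s, r in zip(group, ranks):
            per_teacher[s["teacher"]].append(r)
    return {t: sum(r) / len(r) for t, r in per_teacher.items()}


def composite_metric(learning, presence, preparation, pedagogy):
    """Equal weights on learning and the three input measures (all assumed in [0, 1])."""
    return (learning + presence + preparation + pedagogy) / 4


def bonus_winners(scores, share=0.20):
    """Teachers in the top `share` of the composite ranking (at least one winner)."""
    k = max(1, round(share * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


if __name__ == "__main__":
    roster = [
        {"teacher": "A", "bracket": "low", "score": 30},
        {"teacher": "B", "bracket": "low", "score": 20},
        {"teacher": "A", "bracket": "high", "score": 70},
        {"teacher": "B", "bracket": "high", "score": 90},
    ]
    print(teacher_learning_scores(roster))
```

Because each student is ranked only against others in the same bracket, a teacher with mostly low-performing pupils is compared with teachers facing a similar mix, which is the property the pay-for-percentile design is meant to deliver.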

Design overview

The design, summarized visually in Figure A.1 in Online Appendix A, draws on a two-tiered experiment, as used elsewhere (see Karlan and Zinman (2009), Ashraf, Berry and Shapiro (2010), and Cohen and Dupas (2010) in credit-market and public-health contexts). Both tiers employ the contract variation described above.

Potential applicants, not all of whom were observed, were assigned to either advertised FW or advertised P4P contracts, depending on the labor market in which they resided. Those who actually applied, and were placed into schools, fall into one of the four groups summarized in Figure 1. For example, group a denotes teachers who applied to jobs advertised as FW, and who were placed in schools assigned to FW contracts, while group c denotes teachers who applied to jobs advertised as FW and who were then placed in schools re-randomized to P4P contracts. Under this experimental design, comparisons between groups a and b, and between groups c and d, allow us to learn about a pure compositional effect of P4P contracts on teacher performance, whereas comparisons along the diagonal of a–d are informative about the total effect of such contracts, along both margins.

Footnote: The exchange rate on January 1, 2016 was 734 RWF to 1 USD, so the RWF 100,000 bonus was worth roughly 136 USD.

Footnote: Student learning contributed to an individual teacher's score via percentiles within student-based brackets so that a teacher with a particular mix of low-performing and high-performing students was, in effect, competing with other teachers with similar mixes of students. The data used to construct this measure, and the measures of teachers' inputs, are described in Sections 2.3 and 2.4 respectively, and we explain the adaptation of the Barlevy-Neal measure of learning outcomes to a repeated cross-section of pupils in Online Appendix D.

Figure 1. Groups of placed recruits, by advertised and experienced contract

                     Advertised FW    Advertised P4P
Experienced FW             a                b
Experienced P4P            c                d
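The comparisons implied by this design can be written down mechanically. The Python sketch below takes a mean outcome for each of groups a to d and returns the differences described in the text: a versus b and c versus d for composition, a versus c and b versus d for effort (holding the advertised contract constant), and a versus d for the total effect. The function name and the numbers in the usage example are hypothetical, and because the paper refers to regression specifications for its estimates, raw differences in group means should be read as a simplification.

```python
def decompose(mean_a, mean_b, mean_c, mean_d):
    """Differences in mean outcomes across the four groups of placed recruits.

    a: advertised FW,  experienced FW     b: advertised P4P, experienced FW
    c: advertised FW,  experienced P4P    d: advertised P4P, experienced P4P
    """
    return {
        # Same experienced contract, different advertised contract.
        "composition_under_fw": mean_b - mean_a,
        "composition_under_p4p": mean_d - mean_c,
        # Same advertised contract, different experienced contract.
        "effort_among_fw_applicants": mean_c - mean_a,
        "effort_among_p4p_applicants": mean_d - mean_b,
        # Diagonal comparison: both margins move together.
        "total": mean_d - mean_a,
    }


if __name__ == "__main__":
    # Hypothetical group means, in arbitrary test-score units.
    for name, value in decompose(50, 51, 54, 55).items():
        print(name, value)
```

When the two compositional differences coincide, as in the hypothetical numbers above, the total effect is simply the sum of one compositional and one effort difference; an interaction between advertised and experienced contracts would break that additivity.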

First tier randomization: Advertised contracts

Our aim in the first tier was to randomize the 18 distinct labor markets to contracts, 'treating' all potential applicants in a given market so that we could detect the supply-side response to a particular contract. The result of the randomized assignment is that 7 of these labor markets can be thought of as being in a 'P4P only' advertised treatment, 7 in a 'FW only' advertised treatment, and 4 in a 'Mixed' advertised treatment. Empirically, we consider the Mixed treatment as a separate arm; we estimate a corresponding advertisement effect only as an incidental parameter.

This first-tier randomization was accompanied by an advertising campaign to increase awareness of the new posts and their associated contracts. In November 2015, as soon as districts revealed the positions to be filled, we announced the advertised contract assignment. In addition to radio, poster, and flyer advertisements, and the presence of a person to explain the advertised contracts at District Education Offices, we also held three job fairs at TTCs to promote the interventions. These job fairs were advertised through WhatsApp networks of TTC graduates. All advertisements emphasized that the contracts were available for recruits placed in the 2016 school year and that the payments would continue into the 2017 school year. Applications were then submitted in December 2015. In January 2016, all districts held screening examinations for potential candidates. Successful candidates were placed into schools by districts during February–March 2016, and were then assigned to particular grades, subjects, and streams by their head teachers.

Second-tier randomization: Experienced contracts

Our aim in the second tier was to randomize the schools to which REB had allocated the new posts to contracts. A school was included in the sample if it had at least one new post that was filled and assigned to an upper-primary grade. Following a full baseline survey, schools were randomly assigned to either P4P or FW. Of the 164 schools in the second tier of the experiment, 85 were assigned to P4P and 79 were assigned to FW.

Footnote: This randomization was performed in MATLAB by the authors. The Mixed advertised treatment arose due to logistical challenges detailed in the pre-analysis plan: the first-tier randomization was carried out at the level of the subject rather than the subject-family. An example of a district-by-subject-family assigned to the Mixed treatment is Ngoma-TML. An individual living in Ngoma with a TML qualification could have applied for an advertised Ngoma post in English on a FW contract, or an advertised Ngoma post in Kinyarwanda on a P4P contract. In contrast, Kirehe-TML is in the P4P only treatment. So someone in Kirehe with a TML qualification could have applied for either an English or Kinyarwanda post, but both would have been on a P4P contract.

Footnote: Details of the promotional materials used in this campaign are provided in Online Appendix E.

Footnote: Because schools could receive multiple recruits, for different teaching specializations, it was possible for enrolled schools to contain two recruits who had experienced distinct advertised treatments. Recruits hired under the mixed advertisement treatment, and the schools in which they were placed, also met our enrollment criteria. These were similarly re-randomized to either experienced P4P or experienced FW in the second-tier randomization.

Recruits were offered an end-of-year retention bonus of RWF 80,000 on top of their school-randomized P4P or FW contract.
An individual who applied under advertised P4P in the hope of receiving RWF 100,000 from the scheme, but who was subsequently re-randomized to experienced FW, was therefore still eligible to receive RWF 100,000 (RWF 20,000 from the FW contract plus RWF 80,000 as a retention bonus). Conversely, an individual who applied under advertised FW safe in the knowledge of receiving RWF 20,000 from the scheme, but who was subsequently re-randomized to experienced P4P, was still eligible for at least RWF 80,000. None of the recruits objected to the (re-)randomization or turned down their re-randomized contract.

Of course, surprise effects, disappointment or otherwise, may still be present in on-the-job performance. When testing hypotheses relating to student learning, we include a secondary specification with an interaction term to allow the estimated impact of experienced P4P to differ by advertised treatment. We also explore whether surprise effects are evident in either retention or job satisfaction. We find no evidence for any surprise effect.

To ensure that teachers in P4P schools understood the new contract, we held a compulsory half-day briefing session in every P4P school to explain the intervention. This session was conducted by a team of qualified enumerators and District Education Office staff, who themselves received three days of training from the Principal Investigators in cooperation with IPA. Online Appendix E reproduces an extract of the English version of the enumerator manual, which was piloted before use. The sessions provided ample space for discussion and made use of practical examples. Teachers' understanding was tested informally at the end of the session. We also held a comparable (but simpler) half-day briefing session in every FW school.
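The payoff arithmetic behind the re-randomization can be checked directly from amounts stated in the text: RWF 100,000 for a P4P award, the top 20 percent of teachers rewarded, RWF 20,000 under FW, RWF 80,000 as a retention bonus, and an exchange rate of 734 RWF per USD. The first line rests on an assumption made here for illustration, namely that the expectation is taken across all eligible teachers and that the award is either RWF 100,000 or zero.

```latex
\begin{gather*}
\mathbb{E}[\text{P4P award}] = 0.20 \times 100{,}000 = 20{,}000 \ \text{RWF} = \text{FW bonus} \\
\text{advertised P4P, experienced FW:} \quad 20{,}000 + 80{,}000 = 100{,}000 \ \text{RWF} \\
\text{advertised FW, experienced P4P:} \quad 80{,}000 + \text{P4P award} \;\geq\; 80{,}000 \ \text{RWF} > 20{,}000 \ \text{RWF} \\
100{,}000 \ \text{RWF} \,/\, 734 \ \text{RWF per USD} \approx 136 \ \text{USD}
\end{gather*}
```

Read this way, the retention bonus is sized so that a recruit who applied for P4P can still reach the maximum she hoped for, while a recruit who applied for FW is guaranteed more than the amount she was promised.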

Pre-commitment to an analytical approach can forestall p-hacking, but requires clear specification of both what to test and how to test it; this presents an opportunity, as we now discuss. A theoretical model, discussed briefly below, and included in our pre-analysis plan and Online Appendix B, guides our choice of what hypotheses to test. However, exactly how to test these hypotheses in a way that maximizes statistical power is not fully determined by theory, as statistical power may depend on features of the data that could not be known in advance: the distribution of outcomes, their relationships with possible baseline predictors, and so on. We used blinded data to help decide how to test the hypotheses. In what follows we first briefly describe the theoretical model, and then discuss our statistical approach.

Theory

The model considers a fresh graduate from teacher training who decides whether to apply for a state school teaching post, or a job in another sector (a composite 'outside sector'). The risk neutral individual cares about compensation $w$ and effort $e$. Her payoff is sector specific: in teaching it is $w - (e^2 - \tau e)$, while in the outside sector it is $w - e^2$. The parameter $\tau \geq 0$ captures her intrinsic motivation to teach, which is perfectly observed by the individual herself but not by the employer at the time of hiring. Effort generates a performance metric $m = e\theta + \varepsilon$, where $\theta \geq 1$ represents her ability, which is also private information at the time of hiring. Compensation corresponds to one of the four cells in Figure 1. The timing is as follows. Teacher vacancies are advertised as either P4P or FW. The individual, of type $(\tau, \theta)$, applies either to a teaching job or to an outside job. Employers hire, at random, from the set of $(\tau, \theta)$ types that apply. Thereafter, contracts are re-randomized. If the individual applies to, and is placed in a school, she learns about her experienced contract and chooses her effort level, which results in performance $m$ at the end of the year.
Compensation is paid according to the experienced contract.

This model leads to the following hypotheses, as set out in our pre-analysis plan:

I. Advertised P4P induces differential application qualities;
II. Advertised P4P affects the observable skills of recruits placed in schools;
III. Advertised P4P induces differentially intrinsically motivated recruits to be placed in schools;
IV. Advertised P4P induces the supply-side selection-in of higher- (or lower-) performing teachers, as measured by the learning outcomes of their students;
V. Experienced P4P creates effort incentives which contribute to higher (or lower) teacher performance, as measured by the learning outcomes of their students;
VI. These selection and incentive effects are apparent in the composite performance metric.

The model predicts that the set of (τ, θ) types preferring a teaching job advertised under P4P to a job in the outside sector is different from the set of types preferring a teaching job advertised under FW to a job in the outside sector. This gives Hypothesis I. Since the model abstracts from labor demand effects (by assuming employers hire at random from the set of (τ, θ) types that apply), this prediction simply maps through to placed recruits; i.e. to Hypothesis II via θ, Hypothesis III via τ, and Hypotheses IV to VI via the effect of θ and τ on performance. The model also predicts that any given (τ, θ) type who applies to, and is placed in, a teaching job will exert more effort under experienced P4P than experienced FW. This gives Hypotheses V and VI via the effect of e on performance.

See Delfgaauw and Dur (2007) for a related approach to modeling differential worker motivation across sectors.
When mapping the theory to our empirical context, we distinguish between these hypotheses for two reasons: we have better data for placed recruits because we were able to administer detailed survey instruments to this well-defined sub-sample; and for placed recruits we can identify the advertised treatment effect from student learning outcomes, avoiding the use of proxies for (τ, θ). A further consideration is that the impact of advertised treatment might differ between placed recruits and applicants due to labor-demand effects. We discuss this important issue in Section 4.

Analysis of blinded data

Combining several previously-known insights, we used blinded data to maximize statistical power for our main hypothesis tests.

The first insights, pertaining to simulation, are due to Humphreys, Sanchez de la Sierra and van der Windt (2013) and Olken (2015). Researchers can use actual outcome data with the treatment variable scrambled or removed to estimate specifications in ‘mock’ data. This permits navigation of an otherwise intractable ‘analysis tree’. They can also improve statistical power by simulating treatment effects and choosing the specification that minimizes the standard error. Without true treatment assignments, the influence of any decision over eventual treatment effect estimates is unknown; thus, these benefits are garnered without risk of p-hacking.

The second set of insights pertains to randomization inference. Since the market-level randomization in our study involves 18 randomizable units, asymptotic inference is unsuitable, so we use randomization inference. It is known that any scalar function of treatment and comparison groups is a statistic upon which a (correctly-sized) randomization-inference-based test of the sharp null hypothesis could be built, but also that such statistics may vary in their statistical power in the face of any particular alternative hypothesis (Imbens and Rubin, 2015).
We anticipated that, even with correctly-sized tests, the market-level portion of our design may present relatively low statistical power. Consequently, we conducted blinded analysis to choose, on the basis of statistical power, among testing approaches for several hypotheses: Hypothesis I, and a common framework for Hypotheses IV and V.

Hypothesis I is the test of whether applicants to different contracts vary in their TTC scores. Blinded analysis, in which we simulated additive treatment effects and calculated the statistical power under different approaches, suggested that ordinary least squares regression (OLS) would yield lower statistical power than would a Kolmogorov-Smirnov (KS) test of the equality of two distributions. Over a range of simulations, the KS test had between one and four times the power of OLS. We therefore committed to KS (over OLS and two other alternatives) as our primary test of this hypothesis. This prediction is borne out in Table C.1 in Online Appendix C.

Hypotheses IV and V relate to the effects of advertised and experienced contracts on student test scores. Here, with the re-randomization taking place at the school level, we had many possible specifications to choose from. We examined 14 specifications (modeling random effects or fixed effects at different levels), and committed to the one with the highest power. Simulations suggested

This would not be true if, for example, an outcome in question was known to have different support as a function of treatment, allowing the ‘blinded’ researcher to infer treatment from the outcome variable. For our blinded pre-analysis, we only consider outcomes (TTC score, and student test scores) that are nearly continuously distributed and which we believe are likely to have the same support in all study arms. To make this analysis possible, we drew inspiration from Fafchamps and Labonne (2017), who suggest dividing labor within a research team. In our case, IPA oversaw the data-blinding process.
Results of the blinded analysis (for which IPA certified that we used only blinded data) are in our pre-analysis plan. Our RCT registry entry (Leaver et al., 2018, AEARCTR-0002565) is accompanied by IPA’s letter specifying the date after which treatment was unblinded.

Hypotheses II and III employ data that our team collected, so did not have power concerns associated with them; Hypothesis VI offered fewer degrees of freedom.

This refers to Leaver et al. (2018), Table C.1, comparing the first and third rows. The confidence interval for the KS test is roughly half the width of the corresponding OLS confidence interval: a gain in precision commensurate with more than tripling the sample size.

Comparing Table 3 to Table A.4 in Online Appendix A, this was substantively borne out.

On the basis of this theory and analysis of blinded data, we settled on six primary tests: an outcome, a sample, a specification and associated test statistic, and an inference procedure for each of Hypotheses I-VI, as set out in Table A.1 in Online Appendix A. We also included a small number of secondary tests based on different outcomes, samples, and/or specifications. In Section 3, we report results for every primary test; secondary tests are in Section 3 or in an appendix. To aid interpretation, we also include a small amount of supplementary analysis that was not discussed in the pre-analysis plan (e.g. impacts of advertised P4P on teacher attributes beyond observable skill and intrinsic motivation, and estimates from a teacher value added model), but we are cautious and make clear when this is post-hoc.
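The power comparison on blinded data described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the procedure used in the pre-analysis plan: the outcome vector `y_blind` is simulated rather than real, the design is reduced to two arms with nine of 18 clusters treated, the difference in means stands in for OLS, and all function names are hypothetical.

```python
import numpy as np

def mean_gap(y, t):
    """Absolute difference in means (the OLS coefficient on a lone treatment dummy)."""
    return abs(y[t].mean() - y[~t].mean())

def ks_gap(y, t):
    """Kolmogorov-Smirnov distance between treated and comparison outcomes."""
    a, b = np.sort(y[t]), np.sort(y[~t])
    grid = np.sort(y)
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

def placebo(rng, cluster, n_treated):
    """Assign n_treated whole clusters to a placebo treatment (assumed design)."""
    chosen = rng.choice(np.unique(cluster), size=n_treated, replace=False)
    return np.isin(cluster, chosen)

def simulated_power(y_blind, cluster, stat, effect, n_treated=9,
                    n_sims=20, n_perm=99, alpha=0.05, seed=0):
    """Share of simulated experiments in which randomization inference rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        t = placebo(rng, cluster, n_treated)      # scrambled assignment
        y = y_blind + effect * t                  # simulated additive effect
        observed = stat(y, t)
        null = [stat(y, placebo(rng, cluster, n_treated)) for _ in range(n_perm)]
        p = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
        rejections += int(p <= alpha)
    return rejections / n_sims

# Simulated stand-in for blinded outcome data: 18 clusters of 40 applicants.
rng = np.random.default_rng(7)
cluster = np.repeat(np.arange(18), 40)
y_blind = rng.normal(0.53, 0.15, size=cluster.size)

for name, stat in [("difference in means", mean_gap), ("KS", ks_gap)]:
    print(name, simulated_power(y_blind, cluster, stat, effect=0.03))
```

Because the true assignment never enters the calculation, a researcher can compare candidate statistics this way and commit to the better-powered one before unblinding.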

The primary analyses make use of several distinct types of data. Conceptually, these trace out the causal chain from the advertisement intervention to a sequence of outcomes: that is, from the candidates’ application decisions, to the set (and attributes) of candidates hired into schools, to the learning outcomes that they deliver, and, finally, to the teachers’ decisions to remain in the schools. In this section, we describe the administrative, survey, and assessment data available for each of these steps in the causal chain. Our understanding of these data informs our choices of specification for analysis, as discussed in detail in the pre-analysis plan.

Table 1 summarizes the applications for the newly advertised jobs, submitted in January 2016, across the six districts. Of the 2,184 applications, 1,962 come from candidates with a TTC degree; we term these qualified since a TTC degree is required for the placements at stake. In the table, we present TTC scores, genders, and ages (the other observed CV characteristics) for all qualified applicants. Besides these two demographic variables, TTC scores are the only consistently measured characteristics of all applicants.

The 2,184 applications come from 1,424 unique individuals, of whom 1,246 have a TTC qualification. The majority (62 percent) of qualified applicants complete only one application, with 22 percent applying to two districts and 16 percent applying to three or more. Multiple applications are possible but not the norm, most likely because each district requires its own exam. Of those applying twice, 92 percent applied to adjacent pairs of districts. In Online Appendix C, we use this geographical feature of applications to test for cross-district labor-supply effects and fail to reject the null that these effects are zero.

Table 1

                    Gatsibo  Kayonza  Kirehe  Ngoma  Nyagatare  Rwamagana    All
Applicants              390      310     462    380        327        315  2,184
Qualified               333      258     458    364        272        277  1,962
Has TTC score           317      233     405    337        260        163  1,715
Mean TTC score         0.53     0.54    0.50   0.53       0.54       0.55   0.53
SD TTC score           0.14     0.15    0.19   0.15       0.14       0.12   0.15
Qualified female       0.53     0.47    0.45   0.50       0.44       0.45   0.48
Qualified age         27.32    27.78   27.23  27.25      26.98      27.50  27.33

All data generated by the study and used in this paper are made available in the replication materials (Leaver et al., 2020).

These data were obtained from the six district offices and represent a census of applications for the new posts across these districts.

This refers to Leaver et al. (2018), Table C.3, comparing row 12 to row 1. For the pooled advertised treatment effect, the pre-committed random effects model yields a confidence interval that is 67 percent as wide as the interval from OLS: a gain in precision commensurate with increasing the sample size by 125 percent. The gain in precision for the pooled experienced treatment effect is smaller and commensurate with increasing sample size by 22 percent.
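The correspondence between interval width and sample size quoted in the note on the random effects model can be checked with the usual square-root rule. The following is a rough worked calculation, assuming that the width $w$ of a confidence interval scales as $n^{-1/2}$; that scaling, and the symbols $w_0, w_1, n_0, n_1$, are assumptions introduced here for illustration.

```latex
% Widths w_0 (OLS) and w_1 (pre-committed model), sample sizes n_0 and n_1:
\frac{w_1}{w_0} = \sqrt{\frac{n_0}{n_1}}
\quad\Longrightarrow\quad
\frac{n_1}{n_0} = \left(\frac{w_0}{w_1}\right)^{2}.

% Advertised effect: an interval 67 percent as wide
\frac{n_1}{n_0} = \frac{1}{0.67^{2}} \approx 2.23,
\qquad
\left(\tfrac{3}{2}\right)^{2} = 2.25 \text{ if the ratio is exactly } \tfrac{2}{3}.

% Experienced effect: a 22 percent larger effective sample
\frac{w_1}{w_0} = \frac{1}{\sqrt{1.22}} \approx 0.91.
```

Under this approximation, a width ratio of about two thirds corresponds to an effective sample roughly 2.2 to 2.25 times as large, in line with the 125 percent increase reported, and a 22 percent gain in effective sample size corresponds to an interval about 9 percent narrower.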

During February and March 2016, we visited schools soon after they were enrolled in the study to collect baseline data using surveys and ‘lab-in-the-field’ instruments. School surveys were administered to head teachers or their deputies, and included a variety of data on management practices (not documented here) as well as administrative records of teacher attributes, including age, gender, and qualifications. The data cover all teachers in the school, regardless of their eligibility for the intervention. Teacher surveys were administered to all teachers responsible for at least one upper-primary, core-curricular subject and included questions about demographics, training, qualifications and experience, earnings, and other characteristics.

The ‘lab-in-the-field’ instruments were administered to the same set of teachers, and were intended to measure the two characteristics introduced in the theory: intrinsic motivation and ability. In the model, more intrinsically motivated teachers derive a higher benefit (or lower cost) from their efforts to promote learning. To capture this idea of other-regarding preferences towards students, taking inspiration from the work of Ashraf, Bandiera and Jack (2014), we used a framed version of the Dictator Game (Eckel and Grossman, 1996). Teachers were given 2,000 Rwandan francs (RWF) and asked how much of this money they wished to allocate to the provision of school supply packets for students in their schools, and how much they wished to keep for themselves. Each packet contained one notebook and pen and was worth 200 RWF. Teachers could decide to allocate any amount, from zero to all 2,000 RWF, which would supply ten randomly chosen students with a packet.

We also asked teachers to undertake a Grading Task which measured their mastery of the curriculum in the main subject that they teach. Teachers were asked to grade a student examination

Previous work shows the reliability of the DG as a measure of other-regarding preferences related to intrinsic motivation (Banuri and Keefer, 2016; Brock, Lange and Leonard, 2016; Deserranno, 2019).

See Bold et al. (2017) who use a similar approach to assess teacher content knowledge.

Student learning was measured in three rounds of assessment: baseline, the end of the 2016 school year, and the end of the 2017 school year (indexed by r = 0, 1, 2). In each round, we randomly sampled a subset of students from each grade to take the test. In Year 1, both baseline and endline student samples were drawn from the official school register of enrolled students compiled by the head teacher at the beginning of the year. This ensured that the sampling protocol did not create incentives for strategic exclusion of students. In Year 2, students were assessed at the end of the year only, and were sampled from a listing that we collected in the second trimester.

Student samples were stratified by teaching streams (subgroups of students taught together for all subjects). In Round 0, we sampled a minimum of 5 pupils per stream, and oversampled streams taught in at least one subject by a new recruit to fill available spaces, up to a maximum of 20 pupils per stream and 40 per grade. In rare cases of grades with more than 8 streams, we sampled 5 pupils from all streams. In Round 1, we sampled 10 pupils from each stream: 5 pupils retained from the baseline (if the stream was sampled at baseline) and 5 randomly sampled new pupils. We included the new students to alleviate concerns that teachers in P4P schools might teach (only) to previously sampled students. In Round 2, we randomly sampled 10 pupils from each stream using the listing for that year.

The tests were orally administered by trained enumerators. Students listened to an enumerator as he/she read through the instructions and test questions, prompting students to answer.

Test scores are approximately normally distributed with a mean of close to 50 percent of questions answered correctly. A validation exercise of the test at baseline found its scores to be predictive of the national Primary Leaving Exam scores (both measured in school averages).

Consequently, the number of pupils assessed in Year 2 who have also been assessed in Year 1 is limited. Because streams are reshuffled across years and because we were not able to match Year 2 pupil registers to Year 1 registers in advance of the assessment, it was not possible to sample pupils to maintain a panel across years while continuing to stratify by stream.
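The Round 1 sampling rule (5 pupils retained from the baseline sample plus 5 new pupils per stream) can be sketched as follows. This is an illustrative reading of the protocol, not the study's sampling code: the function name is hypothetical, and the handling of streams with no baseline sample (all 10 pupils drawn fresh) and of streams with fewer than 5 remaining baseline pupils are assumptions made here.

```python
import random

def sample_round1(register, baseline_sample, rng, per_stream=10, n_retained=5):
    """Pick pupils to assess in one stream at the end of Year 1.

    register        -- pupil ids enrolled in the stream
    baseline_sample -- pupil ids assessed in this stream at baseline (may be empty)
    """
    enrolled = set(register)
    # Retain up to n_retained pupils who were already assessed at baseline.
    retained_pool = sorted(p for p in baseline_sample if p in enrolled)
    retained = rng.sample(retained_pool, min(n_retained, len(retained_pool)))
    # Fill the remaining places with pupils who were not assessed at baseline.
    fresh_pool = sorted(enrolled - set(baseline_sample))
    n_fresh = min(per_stream - len(retained), len(fresh_pool))
    fresh = rng.sample(fresh_pool, n_fresh)
    return retained, fresh

rng = random.Random(0)
register = list(range(40))          # hypothetical stream of 40 pupils
baseline = [0, 1, 2, 3, 4, 5, 6, 7]  # hypothetical baseline sample
retained, fresh = sample_round1(register, baseline, rng)
print("retained:", sorted(retained))
print("new:", sorted(fresh))
```

Drawing the new pupils only from those not previously assessed mirrors the stated aim of discouraging teaching to previously sampled students.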

We collected data on several dimensions of teachers’ inputs into the classroom. This was undertaken in P4P schools only during Year 1, and in both P4P and FW schools in Year 2. This composite metric is based on three teacher input measures (presence, lesson preparation, and observed pedagogy), and one output measure (pupil learning): the ‘4Ps’. Here we describe the input components measured.

To assess the three inputs, P4P schools received three unannounced surprise visits: two spot checks during Summer 2016, and one spot check in Summer 2017. During these visits, Sector Education Officers (SEOs) from the District Education Offices (in Year 1) or IPA staff (for logistical reasons, in Year 2) observed teachers and monitored their presence, preparation and pedagogy with the aid of specially designed tools. FW schools also received an unannounced visit in Year 2, at the same time as the P4P schools. Table A.2 in Online Appendix A shows summary statistics for each of these three input measures over the three rounds of the study.

Presence is defined as the fraction of spot-check days that the teacher is present at the start of the school day. For the SEO to record a teacher present, the head teacher had to physically show the SEO that the teacher was in school.

Lesson preparation is defined as the planning involved with daily lessons, and is measured through a review of teachers’ weekly lesson plans. Prior to any spot checks, teachers in grades 4, 5, and 6 in P4P schools were reminded how to fill out a lesson plan in accordance with REB guidelines. Specifically, SEOs provided teachers with a template to record their lesson preparation, focusing on three key components of a lesson: the lesson objective, the instructional activities, and the types of assessment to be used. A ‘hands-on’ session enabled teachers to practice using this template. During the SEO’s unannounced visit, he/she collected the daily lesson plans (if any had been prepared) from each teacher. Field staff subsequently used a lesson-planning scoring rubric to provide a subjective measure of quality. Because a substantial share of upper-primary teachers did not have a lesson plan on a randomly chosen audit day, we used the presence of such a lesson plan as a summary measure both in the incentivized contracts and as an outcome for analysis.

Training of SEOs took place over two days. Day 1 consisted of an overview of the study and its objectives and focused on how to explain the intervention (in particular the 4Ps) to teachers in P4P schools using the enumerator manual in Online Appendix E. During Day 2, SEOs learned how to use the teacher monitoring tools and how to conduct unannounced school visits. SEOs were shown videos recorded during pilot visits. SEOs were briefed on the importance of not informing teachers or head teachers ahead of the visits. Field staff monitored the SEOs’ adherence to protocol.

Pedagogy is defined as the practices and methods that teachers use in order to impact student learning. We collaborated with both the Ministry of Education and REB to develop a monitoring instrument to measure teacher pedagogy through classroom observation. Our classroom observation instrument measured objective teacher actions and skills as an input into scoring teachers’ pedagogical performance. Our rubric was adapted from the Danielson Framework for Teaching, which is widely used in the U.S. (Danielson, 2007). The observer evaluated the teachers’ effective use of 21 different activities over the course of a full 45-minute lesson. Based on these observations and a detailed rubric, the observer provided a subjective score, on a scale from zero to three, of four components of the lesson: communication of lesson objectives, delivery of material, use of assessment, and student engagement. The teacher’s incentivized score, as well as the measure of pedagogy used in our analysis, is defined as the average of these ratings across the four domains.
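The three input measures defined above reduce to simple calculations. The sketch below is illustrative only: the function names, the dictionary keys for the four pedagogy components, and the data layout are hypothetical and are not taken from the study's monitoring tools.

```python
from statistics import mean

# Hypothetical labels for the four rated components of an observed lesson.
PEDAGOGY_COMPONENTS = ("objectives", "delivery", "assessment", "engagement")

def presence(spot_checks):
    """Fraction of spot-check days on which the teacher was present at the start of the day."""
    return sum(bool(day) for day in spot_checks) / len(spot_checks)

def has_lesson_plan(n_plans_collected):
    """Summary preparation measure: 1 if any lesson plan was collected on the audit day."""
    return int(n_plans_collected > 0)

def pedagogy_score(ratings):
    """Average of the four component ratings, each on a zero-to-three scale."""
    for name in PEDAGOGY_COMPONENTS:
        if not 0 <= ratings[name] <= 3:
            raise ValueError(f"rating for {name} must lie between 0 and 3")
    return mean(ratings[name] for name in PEDAGOGY_COMPONENTS)

print(presence([True, False]))
print(has_lesson_plan(0))
print(pedagogy_score({"objectives": 2, "delivery": 3, "assessment": 1, "engagement": 2}))
```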

We use the baseline data described in this section to check whether the second-tier randomization produced an appropriately ‘balanced’ experienced treatment assignment. Table 2 confirms that across a wide range of school, teacher, and student characteristics there are no statistically significant differences in means between the experienced P4P and FW treatment arms.

Our two-tiered experiment allows us to estimate impacts of pay-for-performance on the type of individuals applying to, and being placed in, primary teaching posts (the compositional margin), and on the activities undertaken by these new recruits (the effort margin). We report these results in Sections 3.1 and 3.2 respectively. Of course, the long-run effects of pay-for-performance will depend not only on selection-in, but also selection-out, as well as the dynamics of the behavioral response on the part of teachers who stay. We address dynamic issues in Section 3.3, and postpone a substantive discussion of results until Section 4. All statistical tests are conducted via randomization inference with 2,000 permutations of the experienced treatment.

We study three types of compositional effect of pay-for-performance. These are impacts on: the quality of applicants; the observable skill and motivation of placed recruits on arrival; and the student learning induced by these placed recruits during their first and second year on the job.

Since the teacher inputs described in Section 2.4 were collected after the second-tier randomization, they are not included in Table 2. See instead Table A.2 in Online Appendix A.

Table 2

                          Control mean  [St. Dev.]  Experienced P4P  (p-value)    Obs.

Panel A. School attributes
Number of streams                 9.99      [4.48]            -0.10    (0.881)     164
Number of teachers               20.47      [8.49]             0.56    (0.732)     164
Number of new recruits            1.94      [1.30]             0.13    (0.505)     164
Number of students              410.06    [206.71]             1.42    (0.985)     164
Share female students             0.58      [0.09]             0.00    (0.777)     164

Panel B. Upper-primary teacher recruit attributes
Female                            0.36      [0.48]            -0.02    (0.770)     242
Age                              25.82      [4.05]            -0.25    (0.616)     242
DG share sent                     0.28      [0.33]            -0.04    (0.450)     242
Grading task score               -0.24      [0.93]             0.12    (0.293)     242

Panel C. Pupil learning assessments
English                          -0.00      [1.00]             0.04    (0.551)   13826
Kinyarwanda                      -0.00      [1.00]             0.05    (0.292)   13831
Mathematics                       0.00      [1.00]            -0.00    (0.950)   13826
Science                          -0.00      [1.00]             0.03    (0.607)   13829
Social Studies                   -0.00      [1.00]             0.02    (0.670)   13829

The table provides summary statistics for attributes of schools, teachers (new recruits placed in upper primary only), and students collected at baseline. The first column presents means in FW schools (with standard deviations in brackets); the second column presents estimated differences between FW and P4P schools (with randomization inference p-values in parentheses). The sample in Panel B consists of new recruits placed in upper-primary classrooms at baseline, who undertook the lab-in-the-field exercises. In Panel B, Grading Task IRT scores are standardized based on the distribution among incumbent teachers. In Panel C, student learning IRT scores are standardized based on the distribution in the experienced FW arm.

Quality of applicants

Motivated by the theoretical model sketched in Section 1.3, we begin by testing for impacts of advertised P4P on the quality of applicants to a given district-by-qualification pool (Hypothesis I). We focus on Teacher Training College final exam score since this is the only consistently measured quality-related characteristic we observe for all applicants.

Our primary test uses a Kolmogorov-Smirnov (henceforth, KS) statistic to test the null that there is no difference in the distribution of TTC scores across advertised P4P and advertised FW labor markets. This test statistic can be written as

$$T_{KS} = \sup_y \left| \hat{F}_{P4P}(y) - \hat{F}_{FW}(y) \right| = \max_{i=1,\dots,N} \left| \hat{F}_{P4P}(y_i) - \hat{F}_{FW}(y_i) \right|. \tag{1}$$

Here, $\hat{F}_{P4P}(y)$ denotes the empirical cumulative distribution function of TTC scores among applicants who applied under advertised P4P, evaluated at some specific TTC score $y$.
Likewise, $\hat{F}_{FW}(y)$ denotes the empirical cumulative distribution function of TTC scores among applicants who applied under advertised FW, evaluated at the same TTC score $y$. We test the statistical significance of this difference in distributions by randomization inference. To do so, we repeatedly sample from the set of potential (advertised) treatment assignments $\mathcal{T}^A$ and, for each such permutation, calculate the KS test statistic. The p-value is then the share of such test statistics larger in absolute value than the statistic calculated from the actual assignment.

Figure 2: Distribution of applicant TTC score, by advertised treatment arm
(Empirical CDF plotted against TTC scores; series: FW, P4P, Mixed.)
KS test statistic is 0.026, with a p-value of 0.909.

Figure 2 depicts the distribution of applicant TTC score, by advertised treatment arm. These distributions are statistically indistinguishable between advertised P4P and advertised FW. The KS test-statistic has a value of 0.026, with a p-value of 0.909. Randomization inference is well-powered, meaning that we can rule out even small effects on the TTC score distribution: a 95 percent confidence interval based on inversion of the randomization inference test rules out additive treatment effects outside of the range [ − . , .

Below, we move on to consider impacts of advertised P4P on the quality of applicants who were offered a post and chose to accept it, a subset that we term placed recruits. It is worth emphasizing that we may find results here even though there is no evidence of an impact on the distribution of TTC score of applicants. This is because, for this well-defined set of placed recruits, we have access to far richer data: lab-in-the-field instruments measuring attributes on arrival, as well as measures of student learning in the first and second years on the job.
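The KS statistic in equation (1), its randomization-inference p-value, and the test inversion used for the confidence interval can be sketched in a few lines. This is a minimal illustration under assumptions that differ from the study: two arms only (the mixed arm is ignored), complete randomization of a fixed number of treated markets rather than the study's actual set of feasible assignments, a conventional add-one p-value, and simulated scores. All function names are hypothetical.

```python
import numpy as np

def ks_statistic(y, treated):
    """Largest gap between the two empirical CDFs, evaluated at every observed score."""
    y1, y0 = np.sort(y[treated]), np.sort(y[~treated])
    grid = np.sort(y)
    cdf1 = np.searchsorted(y1, grid, side="right") / y1.size
    cdf0 = np.searchsorted(y0, grid, side="right") / y0.size
    return float(np.max(np.abs(cdf1 - cdf0)))

def ri_pvalue(y, market, treated_markets, n_perm=999, seed=0):
    """Share of re-drawn market-level assignments whose KS statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    markets = np.unique(market)
    observed = ks_statistic(y, np.isin(market, treated_markets))
    hits = 0
    for _ in range(n_perm):
        draw = rng.choice(markets, size=len(treated_markets), replace=False)
        if ks_statistic(y, np.isin(market, draw)) >= observed - 1e-12:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)

def ri_confidence_set(y, market, treated_markets, deltas, alpha=0.05, **kw):
    """Additive effects delta that the randomization test fails to reject."""
    treated = np.isin(market, treated_markets)
    kept = []
    for d in deltas:
        y0 = y - d * treated  # remove the hypothesized effect from treated units
        _, p = ri_pvalue(y0, market, treated_markets, **kw)
        if p > alpha:
            kept.append(float(d))
    return kept

# Simulated stand-in data: 18 markets, 30 applicants each, no true effect.
rng = np.random.default_rng(1)
market = np.repeat(np.arange(18), 30)
y = rng.normal(0.53, 0.15, size=market.size)
treated_markets = np.arange(0, 18, 2)

stat, p = ri_pvalue(y, market, treated_markets)
print("KS statistic:", round(stat, 3), "p-value:", round(p, 3))
print("not rejected:", ri_confidence_set(y, market, treated_markets,
                                         np.linspace(-0.1, 0.1, 5), n_perm=199))
```

The confidence set is the collection of hypothesized additive effects that survive the test; with a fine grid of candidate effects its endpoints approximate an interval of the kind reported in the text.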

Skill and motivation of placed recruits

Along the lines suggested by Dal Bó and Finan (2016), we explore whether institutions can attract the most capable or the most intrinsically motivated into public service. We include multidimensional skill and motivation types in the theoretical model and test the resulting hypotheses (Hypotheses II and III) using the data described in Section 2.2. Specifically, we use the Grading Task IRT score to measure a placed recruit’s skill on arrival, and the framed Dictator Game share sent to capture baseline intrinsic motivation.

Our primary tests use these baseline attributes of placed recruits as outcomes. For attribute $x$ of teacher $j$ with qualification $q$ in district $d$, we estimate a regression of the form

$$x_{jqd} = \tau_A T^A_{qd} + \gamma_q + \delta_d + e_{jqd}, \tag{2}$$

where treatment $T^A_{qd}$ denotes the contractual condition under which a candidate applied. Our test of the null hypothesis is the t statistic associated with coefficient $\tau_A$. We obtain a randomization distribution for this t statistic under the sharp null of no effects for any hire by estimating equation (2) under the set of feasible randomizations of advertised treatments, $T^A \in \mathcal{T}^A$.

Before reporting these t statistics, it is instructive to view the data graphically. Figure 3a shows the distribution of Grading Task IRT score, and Figure 3b the framed Dictator Game share sent, by advertised treatment arm and measured on placed recruits’ arrival in schools. A difference in the distributions across treatment arms is clearly visible for the measure of intrinsic motivation but not for the measure of skill. Our regression results tell the same story. In the Grading Task IRT score specification, our estimate of $\tau_A$ is − . p-value of 0 . $\tau_A$ is − . p-value of 0 .

This conclusion is further substantiated by the battery of secondary tests in Online Appendix C.

Here and throughout the empirical specifications, we will define $T^A_{qd}$ as a vector that includes indicators for both the P4P and mixed-treatment advertisement condition.
However, for hypothesis testing, we are interested only in the coefficient on the pure P4P treatment. Defining treatment in this way ensures that only candidates who applied (and were placed) under the pure FW treatment are considered as the omitted category here, to which P4P recruits will be compared.

less to the students on average.

Figure 3: Distribution of placed recruit attributes on arrival, by advertised treatment arm
(a) Grading task score (empirical CDF plotted against teacher ability; series: FW, P4P, Mixed)
(b) Dictator Game contribution (empirical CDF plotted against teacher DG contribution; series: FW, P4P, Mixed)
In Figure 3a, the t statistic for a difference in mean Grading Task IRT score across the P4P and FW treatments is − . p-value of 0 . t statistic for a difference in mean DG share sent across the P4P and FW treatments is − . p-value of 0 .

We chose not to include additional teacher attributes in the theoretical model, or in the list of pre-specified hypotheses, to avoid multiple hypothesis testing concerns. Notwithstanding this decision, we did collect additional data on placed recruits at baseline, meaning that we can use our two-tiered experimental design to conduct further exploratory analysis of the impact of advertised P4P. Specifically, we estimate regressions of the form given in equation (2) for four additional teacher attributes: age, gender, risk aversion, and an index capturing the Big Five personality traits. Results are reported in Table A.3 in Online Appendix A, with details of the variable construction provided in the table-note. We are unable to reject the sharp null of no advertised P4P treatment effect for any of these exploratory outcomes.
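The regression in equation (2) and the randomization distribution of its t statistic can be sketched as follows. This is a simplified illustration rather than the study's estimation code: treatment is reduced to a single P4P indicator (no mixed arm), market-level labels are permuted freely instead of being drawn from the study's set of feasible randomizations, standard errors are the classical OLS ones, and the data are simulated. Function and variable names are hypothetical.

```python
import numpy as np

def t_stat(x, treat, qual, district):
    """t statistic on the treatment dummy in an OLS fit with qualification and district dummies."""
    cols = [np.ones(x.size), treat.astype(float)]
    cols += [(qual == q).astype(float) for q in np.unique(qual)[1:]]          # gamma_q
    cols += [(district == d).astype(float) for d in np.unique(district)[1:]]  # delta_d
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    resid = x - X @ beta
    dof = x.size - np.linalg.matrix_rank(X)
    cov = (resid @ resid / dof) * np.linalg.pinv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

def ri_pvalue(x, market, qual, district, market_treat, n_perm=499, seed=0):
    """Two-sided randomization-inference p-value for the t statistic on treatment."""
    rng = np.random.default_rng(seed)
    observed = t_stat(x, market_treat[market], qual, district)
    draws = np.array([
        t_stat(x, rng.permutation(market_treat)[market], qual, district)
        for _ in range(n_perm)
    ])
    p = (1 + np.sum(np.abs(draws) >= abs(observed))) / (1 + n_perm)
    return float(observed), float(p)

# Simulated stand-in data: 6 districts x 3 qualification pools = 18 markets.
rng = np.random.default_rng(42)
n_district, n_qual, per_market = 6, 3, 10
district = np.repeat(np.arange(n_district), n_qual * per_market)
qual = np.tile(np.repeat(np.arange(n_qual), per_market), n_district)
market = district * n_qual + qual
market_treat = rng.permutation(np.repeat([0, 1], 9))
x = rng.normal(size=market.size)  # attribute with no true treatment effect

t_obs, p = ri_pvalue(x, market, qual, district, market_treat)
print("t statistic:", round(t_obs, 3), "RI p-value:", round(p, 3))
```

Permuting whole markets, rather than individual teachers, respects the level at which the advertised contract was assigned.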

Student learning induced by placed recruits

The skill and motivation of placed recruits on arrival are policy relevant insofar as these attributes translate into teacher effectiveness. To assess this, we combine experimental variation in the advertised contracts to which recruits applied, with the second-stage randomization in experienced contracts under which they worked. This allows us to estimate the impact of advertised P4P on the student learning induced by these recruits, holding constant the experienced contract: a pure compositional effect (Hypothesis IV).

Our primary test is derived from estimates on student-subject-year level data. The advertised treatment is denoted by $T^A_{qd}$ for teacher $j$ with qualification type $q$ in district $d$, and we suppress the dependence of the teacher’s qualification $q$ on the subject $b$, stream $k$, school $s$, and round $r$, which implies that $q = q(bksr)$. The experienced treatment is assigned at the school level, and is denoted by $T^E_s$. We pool data across the two years of intervention to estimate a specification of the type

$$z_{ibksr} = \tau_A T^A_{qd} + \tau_E T^E_s + \lambda_I I_j + \lambda_E T^E_s I_j + \rho_{br} \bar{z}_{ks,r-1} + \delta_d + \psi_r + e_{ibksr} \tag{3}$$

for the learning outcome of student $i$ in subject $b$, stream $k$, school $s$, and round $r$. We define $j = j(bksr)$ as an identifier for the teacher assigned to that subject-stream-school-round. The variable $I_j$ is an indicator for whether the teacher is an incumbent, and the index $q = q(j)$ denotes the qualification type of teacher $j$ if that teacher is a recruit (and is undefined if the teacher is an incumbent, so that $T^A_{qd}$ is always zero for incumbents).

Here we follow Dal Bó, Finan and Rossi (2013) who measure the risk preferences and Big Five personality traits of applicants for civil service jobs in Mexico, and Callen et al. (2018) who study the relevance of Big Five personality traits for the performance of health workers in Pakistan.
Drawing on the pseudo-panel of student outcomes, the variable $\bar{z}_{ks,r-1}$ denotes the vector of average outcomes in the once-lagged assessment among students placed in that stream, and its coefficient, $\rho_{br}$, is subject- and round-specific. The coefficient of interest is $\tau_A$: the average of the within-year effect of advertised P4P on pupil learning in Year 1 and the within-year effect of advertised P4P on pupil learning in Year 2.

The theoretical model of Online Appendix B, as well as empirical evidence from other contractual settings (Einav et al., 2013), suggests that pay-for-performance may induce selection on the responsiveness to performance incentives. If so, then the impact of advertised treatment will depend on the contractual environment into which recruits are placed. Consequently, we also estimate a specification that allows advertised treatment effects to differ by experienced treatment, including an interaction term between the two treatments. This interacted model takes the form

$$z_{ibksr} = \tau_A T^A_{qd} + \tau_E T^E_s + \tau_{AE} T^A_{qd} T^E_s + \lambda_I I_j + \lambda_E T^E_s I_j + \rho_{br} \bar{z}_{ks,r-1} + \delta_d + \psi_r + e_{ibksr}. \tag{4}$$

Here, the compositional effect of advertised P4P among recruits placed in FW schools is given by $\tau_A$ (a comparison of on-the-job performance across groups a and b, as defined in Figure 1). Likewise, the compositional effect of advertised P4P among recruits placed in P4P schools is given by $\tau_A + \tau_{AE}$ (a comparison of groups c and d). If $\tau_{AE}$ is not zero, then this interacted model yields the more policy relevant estimands (Muralidharan, Romero and Wüthrich, 2019). Noting the distinction between estimands and test statistics (Imbens and Rubin, 2015), we pre-specified the pooled coefficient $\tau_A$ from equation (3) as the primary test statistic for the presence of compositional effects. Our simulations, using blinded data, show that this pooled test is better powered under circumstances where the interaction term, $\tau_{AE}$, is small.
We focus on within-year impacts because there is not a well-defined cumulative treatment effect. Individual students receive differing degrees of exposure to the advertised treatments depending on their path through streams (and hence teachers) over Years 1 and 2.
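To make the reading of the interacted model concrete, the expected outcomes for students of recruits ($I_j = 0$) in the four advertised-by-experienced cells can be written out, net of the lagged-score control and fixed effects. This is a sketch that treats the advertised treatment as a single P4P indicator (setting aside the mixed-treatment arm) and uses $\mu$, a symbol introduced here, as shorthand for the common component.

```latex
\begin{aligned}
E[z \mid T^A = 0,\ T^E = 0] &= \mu, \\
E[z \mid T^A = 1,\ T^E = 0] &= \mu + \tau_A, \\
E[z \mid T^A = 0,\ T^E = 1] &= \mu + \tau_E, \\
E[z \mid T^A = 1,\ T^E = 1] &= \mu + \tau_A + \tau_E + \tau_{AE}.
\end{aligned}

% Compositional effect of advertised P4P, holding the experienced contract fixed:
\begin{aligned}
\text{placed in FW schools:}\quad & (\mu + \tau_A) - \mu = \tau_A, \\
\text{placed in P4P schools:}\quad & (\mu + \tau_A + \tau_E + \tau_{AE}) - (\mu + \tau_E) = \tau_A + \tau_{AE}.
\end{aligned}
```

By the same differencing, the effect of experienced P4P is $\tau_E$ among recruits who applied under FW and $\tau_E + \tau_{AE}$ among those who applied under P4P, so $\tau_{AE}$ captures any selection on responsiveness to incentives.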

We estimate equations (3) and (4) by a linear mixed effects model, allowing for normally distributed random effects at the student-round level. Randomization inference is used throughout. To do so, we focus on the distribution of the estimated $z$-statistic (i.e., the coefficient divided by its estimated standard error), which allows rejections of the sharp null of no effect on any student’s performance to be interpreted, asymptotically, as rejection of the non-sharp null that the coefficient is equal to zero (DiCiccio and Romano, 2017). Inference for $\tau_A$ is undertaken by permutation of the advertised treatment, $T^A \in \mathcal{T}^A$, while inference for $\tau_E$ likewise proceeds by permuting the experienced treatment $T^E \in \mathcal{T}^E$. To conduct inference about the interaction term, $\tau_{AE}$ in equation (4), we simultaneously permute both dimensions of the treatment, considering pairs $(T^A, T^E)$ from the set $\mathcal{T}^A \times \mathcal{T}^E$.

Results are presented in Table 3. Pooling across years, the compositional effect of advertised P4P is small in point-estimate terms, and statistically indistinguishable from zero (Model A, first row). We do not find evidence of selection on responsiveness to incentives; if anything, the effect of P4P is stronger among recruits who applied under advertised FW contracts, although the difference is not statistically significant and the 95 percent confidence interval for this estimate is wide (Model B, third row). The effect of advertised P4P on student learning does, however, appear to strengthen over time. By the second year of the study, the within-year compositional effect of P4P was 0.04 standard deviations of pupil learning. OLS estimates of this effect are larger, at 0.08 standard deviations, with a $p$-value of 0.10, as shown in Table A.4.

For the purposes of interpretation, it is useful to recast the data in terms of teacher value added.
As detailed in Online Appendix D, we do so by estimating a teacher value-added (TVA) model that controls for students’ lagged test scores, as well as school fixed effects, with the latter absorbing differences across schools attributable to the experienced P4P treatment. This TVA model gives a sense of magnitude to the student learning estimates in Table 3. Applying the Year 2 point estimate for the effect of advertised P4P would raise a teacher from the 50th to above the 73rd percentile in the distribution of (empirical Bayes estimates of) teacher value added for placed recruits who applied under FW. The TVA model also reveals the impact of advertised P4P on the distribution of teacher effectiveness. Figure 4b shows that the distribution of teacher value added among recruits in their second year on the job is better, by first order stochastic dominance, under advertised P4P than advertised FW. This finding is consistent with the view that a contract that rewards the top quintile of teachers attracts individuals who deliver greater learning.

Having studied the type of individuals applying to, and being placed in, upper-primary posts, we now consider the activities undertaken by these new recruits.

In our pre-analysis plan, simulations using the blinded data indicated that the linear mixed effects model with student-round normal random effects would maximize statistical power. We found precisely this in the unblinded data. For completeness, and purely as supplementary analysis, we also present estimates and hypothesis tests via ordinary least squares. See Table A.4 in Online Appendix A. These OLS estimates are generally larger in magnitude and stronger in statistical significance.
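The randomization inference procedure described above (permute the treatment, recompute a studentized statistic, compare with the observed value) can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a simple difference in means with a Welch standard error in place of the linear mixed effects model, and it reshuffles labels freely rather than drawing from the study's actual set of feasible assignments. The function names `z_statistic` and `ri_p_value` are hypothetical.

```python
import random
import statistics


def z_statistic(outcomes, treated):
    """Difference in means divided by its Welch standard error.

    Assumption: stands in for 'coefficient / estimated standard error'
    from the paper's mixed effects model.
    """
    y1 = [y for y, t in zip(outcomes, treated) if t]
    y0 = [y for y, t in zip(outcomes, treated) if not t]
    diff = statistics.fmean(y1) - statistics.fmean(y0)
    se = (statistics.variance(y1) / len(y1)
          + statistics.variance(y0) / len(y0)) ** 0.5
    return diff / se


def ri_p_value(outcomes, treated, draws=2000, seed=0):
    """Two-sided randomization inference p-value for the sharp null.

    Assumption: every reshuffling of the labels is treated as a feasible
    assignment; a blocked design would restrict the permutations.
    """
    rng = random.Random(seed)
    observed = abs(z_statistic(outcomes, treated))
    labels = list(treated)
    exceed = 0
    for _ in range(draws):
        rng.shuffle(labels)  # one counterfactual assignment
        if abs(z_statistic(outcomes, labels)) >= observed:
            exceed += 1
    return exceed / draws
```

Working with the studentized statistic rather than the raw coefficient is what lets a rejection of the sharp null be read, asymptotically, as a rejection of a zero average effect.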

Table 3

| | Pooled | Year 1 | Year 2 |
|---|---|---|---|
| Model A: Direct effects only | | | |
| Advertised P4P ($\tau_A$) | 0.01 | -0.03 | 0.04 |
| | [-0.04, 0.08] | [-0.06, 0.03] | [-0.05, 0.16] |
| | (0.75) | (0.20) | (0.31) |
| Experienced P4P ($\tau_E$) | 0.11 | 0.06 | 0.16 |
| | [0.02, 0.21] | [-0.03, 0.15] | [0.04, 0.28] |
| | (0.02) | (0.17) | (0.00) |
| Experienced P4P × Incumbent ($\lambda_E$) | -0.06 | -0.05 | -0.09 |
| | [-0.20, 0.07] | [-0.19, 0.11] | [-0.24, 0.06] |
| | (0.36) | (0.54) | (0.27) |
| Model B: Interactions between advertised and experienced contracts | | | |
| Advertised P4P ($\tau_A$) | 0.01 | -0.02 | 0.03 |
| | [-0.05, 0.14] | [-0.06, 0.07] | [-0.05, 0.21] |
| | (0.46) | (0.62) | (0.22) |
| Experienced P4P ($\tau_E$) | 0.12 | 0.06 | 0.18 |
| | [0.05, 0.25] | [-0.01, 0.19] | [0.08, 0.33] |
| | (0.01) | (0.10) | (0.00) |
| Advertised P4P × Experienced P4P ($\tau_{AE}$) | -0.03 | -0.01 | -0.04 |
| | [-0.17, 0.09] | [-0.15, 0.10] | [-0.22, 0.13] |
| | (0.51) | (0.65) | (0.58) |
| Experienced P4P × Incumbent ($\lambda_E$) | -0.08 | -0.05 | -0.11 |
| | [-0.31, 0.15] | [-0.30, 0.18] | [-0.36, 0.14] |
| | (0.43) | (0.56) | (0.38) |
| Observations | 154594 | 70821 | 83773 |

For each estimated parameter, or combination of parameters, the table reports the point estimate (stated in standard deviations of student learning), 95 percent confidence interval in brackets, and $p$-value in parentheses. Randomization inference is conducted on the associated $z$ statistic. The measure of student learning is based on the empirical Bayes estimate of student ability from a two-parameter IRT model, as described in Section 2.3.

Student learning induced by placed recruits

We start by using the two-tiered experimental variation to estimate the impact of experienced P4P on the student learning induced by the placed recruits, holding constant the advertised contract: a pure effort effect (Hypothesis V). Our primary test uses the specification in equation (3), again estimated by a linear mixed effects model. The coefficient of interest is now $\tau_E$. To investigate possible ‘surprise effects’ from the re-randomization, we also consider the interacted specification of equation (4). In this model, $\tau_E$ gives the effect of experienced P4P among recruits who applied under FW contractual conditions (a comparison of groups a and c, as defined in Figure 1), while $\tau_E + \tau_{AE}$ gives the effect of experienced P4P among recruits who applied under P4P contractual conditions (a comparison of groups b and d). If recruits are disappointed, because it is groups b and c who received the ‘surprise’, $\tau_E$ should be smaller than $\tau_E + \tau_{AE}$.

Results are presented in Table 3. Pooling across years, the within-year effect of experienced P4P is 0.11 standard deviations of pupil learning (Model A, second row). The randomization inference $p$-value is 0.02, implying that we can reject the sharp null of no experienced P4P treatment effect on placed recruits at the 5 percent level. We do not find evidence of disappointment caused by the re-randomization. The interaction term is insignificant (Model B, third row) and, in point-estimate terms, $\tau_E$ is larger than $\tau_E + \tau_{AE}$. As was the case for the compositional margin, the effort effect of experienced P4P on student learning appears to strengthen over time. By the second year of the study, the within-year effort effect of P4P was 0.16 standard deviations of pupil learning.

We are grateful to a referee for highlighting a further interpretation: $\tau_E$ in Model B is the policy-relevant estimate of experienced P4P at the start of any unexpected transition to P4P, while $\tau_E + \tau_{AE}$ is the policy-relevant estimate for that effect slightly further into a transition, that is, the effect of P4P on a cohort anticipating P4P.

Across all specifications, the interaction term between experienced P4P and an indicator for incumbent teachers

Figure 4: Teacher value added among recruits, by advertised treatment and year. Panels: (a) Year 1; (b) Year 2. Series in each panel: FW, P4P.

The figures plot distributions of teacher value added under advertised P4P and advertised FW in Years 1 and 2. Value-added models estimated with school fixed effects. Randomization inference $p$-value for equality in distributions between P4P and FW applicants, based on one-sided KS test, is 0.796 using Year 1 data; 0.123 using Year 2 data; and 0.097 using pooled estimates of teacher value added (not pre-specified).
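As an illustration of the kind of test reported in the figure note, the sketch below computes a one-sided Kolmogorov-Smirnov statistic and a permutation $p$-value for it. It is a simplified stand-in, not the authors' procedure: it assumes teacher-level value-added estimates are given as plain lists and permutes labels across individual teachers, whereas the study's advertised treatment was assigned at the labor-market level. The function names are hypothetical.

```python
import random


def one_sided_ks(sample_a, sample_b):
    """Largest gap F_a(x) - F_b(x) between the two empirical CDFs.

    The statistic is large when sample_b lies to the right of sample_a,
    and is floored at zero.
    """
    points = sorted(set(sample_a) | set(sample_b))
    best = 0.0
    for x in points:
        f_a = sum(v <= x for v in sample_a) / len(sample_a)
        f_b = sum(v <= x for v in sample_b) / len(sample_b)
        best = max(best, f_a - f_b)
    return best


def ks_permutation_p(fw, p4p, draws=2000, seed=0):
    """Share of label reshufflings with a statistic at least as large as observed.

    Assumption: labels are exchangeable across individual teachers.
    """
    rng = random.Random(seed)
    observed = one_sided_ks(fw, p4p)
    pooled = list(fw) + list(p4p)
    n_fw = len(fw)
    exceed = 0
    for _ in range(draws):
        rng.shuffle(pooled)
        if one_sided_ks(pooled[:n_fw], pooled[n_fw:]) >= observed:
            exceed += 1
    return exceed / draws
```

Calling `ks_permutation_p(fw_values, p4p_values)` tests whether the P4P distribution sits to the right of the FW distribution; swapping the arguments tests the opposite direction.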

To put this in perspective, we compare the magnitude of this effort effect to impacts in similar studies in the US, and beyond. Sojourner, Mykerezi and West (2014) study pay-for-performance schemes in Minnesota, typically based on a composite metric of subjective teacher evaluation and student performance, and find an effect of 0.03 standard deviations of pupil learning. Dee and Wyckoff (2015) study a high-stakes incentive over a composite metric in Washington, DC, and find effects consistent with those of Sojourner, Mykerezi and West (2014) in terms of the implied magnitude of effects on pupil learning. Glewwe and Muralidharan (2016) review a range of studies, including several in Benin, China, India and Kenya that employ incentives for either students or teachers based solely on student performance; effect sizes are larger, typically above 0.2 standard deviations of pupil learning. Our effort effect falls within this range and is of a comparable magnitude to the impact in Duflo, Hanna and Ryan (2012), who study incentives for teacher attendance in India.

Dimensions of the composite performance metric

The results in Table 3 speak to the obvious policy question, namely whether there are impacts of advertised and experienced P4P contracts on student learning. For completeness, and to gain an understanding of mechanisms, we complete our analysis by studying whether there are impacts on the contracted metrics which are calculated at teacher-level (Hypothesis VI). For these tests, we use the following specifications:

$$m_{jqsdr} = \tau_A T^A_{qd} + \tau_E T^E_s + \lambda_I I_j + \lambda_E T^E_s I_j + \gamma_q + \delta_d + \psi_r + e_{jqsdr} \tag{5}$$

$$m_{jqsdr} = \tau_A T^A_{qd} + \tau_E T^E_s + \tau_{AE} T^A_{qd} T^E_s + \lambda_I I_j + \lambda_E T^E_s I_j + \gamma_q + \delta_d + \psi_r + e_{jqsdr}, \tag{6}$$

for the metric of teacher $j$ with qualification $q$ in school $s$ of district $d$, as observed in post-treatment round $r$. As above, the variable $I_j$ is an indicator for whether the teacher is an incumbent (recall that $T^A_{qd}$ is always zero for incumbents). A linear mixed effects model with student-level random effects is no longer applicable; outcomes are constructed at the teacher-level, and given their rank-based construction, normality does not seem a helpful approximation to the distribution of error terms. As stated in our pre-analysis plan, we therefore estimate equations (5) and (6) with a round-school random-effects estimator to improve efficiency. The permutations of treatments used for inferential purposes mirror those above.

Results are reported in Table 4 and, to the extent available, are based on pooled data. Consistent with the pooled results in Table 3, we see a positive and significant impact of experienced P4P on both the summary metric and the learning sub-component. The specifications with teacher inputs as dependent variables suggest that this impact on student learning is driven, at least in part, by improvements in teacher presence and pedagogy.
Teacher presence was 8 percentage points

is negative, though statistically insignificant, and smaller in magnitude than the direct effect of experienced P4P, implying a weaker (though still positive) effect of P4P on incumbents in point-estimate terms.

Note that any attribute of recruits themselves, even if observed at baseline, suffers from the ‘bad controls’ problem, as the observed values of this covariate could be an outcome of the advertised treatment. These variables are therefore not included as independent variables.

As discussed in Section 2.4, FW schools only received unannounced visits to measure teacher inputs in Year 2.

Table 4

| | Summary metric | Preparation | Presence | Pedagogy | Pupil learning |
|---|---|---|---|---|---|
| Model A: Direct effects only | | | | | |
| Advertised P4P ($\tau_A$) | -0.04 | 0.07 | 0.00 | 0.03 | -0.02 |
| | [-0.09, 0.01] | [-0.13, 0.32] | [-0.05, 0.07] | [-0.06, 0.10] | [-0.08, 0.02] |
| | (0.11) | (0.40) | (0.93) | (0.42) | (0.27) |
| Experienced P4P ($\tau_E$) | 0.23 | 0.02 | 0.08 | 0.10 | 0.09 |
| | [0.19, 0.28] | [-0.13, 0.16] | [0.02, 0.14] | [-0.00, 0.21] | [0.03, 0.15] |
| | (0.00) | (0.84) | (0.01) | (0.05) | (0.00) |
| Experienced P4P × Incumbent ($\lambda_E$) | 0.03 | 0.07 | -0.01 | 0.07 | -0.00 |
| | [-0.01, 0.07] | [-0.03, 0.18] | [-0.06, 0.05] | [-0.01, 0.16] | [-0.04, 0.03] |
| | (0.10) | (0.17) | (0.70) | (0.11) | (0.86) |
| Model B: Interactions between advertised and experienced contracts | | | | | |
| Advertised P4P ($\tau_A$) | -0.03 | 0.16 | -0.01 | 0.12 | -0.01 |
| | [-0.12, 0.05] | [-0.11, 0.48] | [-0.16, 0.17] | [-0.27, 0.55] | [-0.12, 0.11] |
| | (0.42) | (0.19) | (0.86) | (0.44) | (0.91) |
| Experienced P4P ($\tau_E$) | 0.22 | -0.00 | 0.08 | 0.17 | 0.08 |
| | [0.15, 0.29] | [-0.26, 0.25] | [-0.01, 0.16] | [-0.05, 0.38] | [0.00, 0.16] |
| | (0.00) | (0.97) | (0.07) | (0.12) | (0.04) |
| Advertised P4P × Experienced P4P ($\tau_{AE}$) | -0.02 | -0.11 | 0.02 | -0.11 | -0.03 |
| | [-0.11, 0.07] | [-0.45, 0.23] | [-0.12, 0.16] | [-0.45, 0.24] | [-0.15, 0.08] |
| | (0.65) | (0.53) | (0.69) | (0.53) | (0.64) |
| Experienced P4P × Incumbent ($\lambda_E$) | 0.05 | 0.09 | -0.01 | 0.00 | 0.00 |
| | [-0.01, 0.10] | [-0.07, 0.26] | [-0.09, 0.07] | [-0.13, 0.14] | [-0.05, 0.06] |
| | (0.07) | (0.27) | (0.82) | (0.96) | (0.90) |
| Observations | 3996 | 2514 | 3455 | 2136 | 3049 |
| FW recruit mean | 0.49 | 0.65 | 0.89 | 1.98 | 0.48 |
| (SD) | (0.22) | (0.49) | (0.31) | (0.57) | (0.27) |
| FW incumbent mean | 0.37 | 0.50 | 0.87 | 2.05 | 0.45 |
| (SD) | (0.24) | (0.50) | (0.33) | (0.49) | (0.28) |

For each estimated parameter, the table reports the point estimate, 95 percent confidence interval in brackets, and $p$-value (or for FW means, standard deviations) in parentheses. Randomization inference is conducted on the associated $t$ statistic. All estimates are pooled across years, but outcomes are observed in the FW arm during the second year only. Outcomes are constructed at teacher-round-level as follows: preparation is a binary indicator for existence of a lesson plan on a randomly chosen spot-check day; presence is the fraction of spot-check days present at the start of the school day; pedagogy is the classroom observation score, measured on a four-point scale; and pupil learning is the Barlevy-Neal percentile rank. The summary metric places 50 percent weight on learning and 50 percent on teacher inputs, and is measured in percentile ranks.

Our two-tiered experiment was designed to evaluate the impact of pay-for-performance and, in particular, to quantify the relative importance of a compositional margin at the recruitment stage versus an effort margin on the job.
The hypotheses specified in our pre-analysis plan refer to selection-in and incentives among placed recruits. Since within-year teacher turnover was limited by design and within-year changes in teacher skill and motivation are likely small, the total effect of P4P in Year 1 can plausibly only be driven by a change in the type of teachers recruited and/or a change in effort resulting from the provision of extrinsic incentives.

Interpreting the total effect of P4P in Year 2 is more complex, however. First, we made no attempt to discourage between-year teacher turnover, and so there is the possibility of a further compositional margin at the retention stage (cf. Muralidharan and Sundararaman 2011). Experienced P4P may have selected-out the low skilled (Lazear, 2000) or, more pessimistically, the highly intrinsically motivated. Second, given the longer time frame, teacher characteristics could have changed. Experienced P4P may have eroded a given teacher’s intrinsic motivation (as hypothesized in the largely theoretical literature on motivational crowding out) or, more optimistically, encouraged a given teacher to improve her classroom skills. In this section, we conduct an exploratory analysis of these dynamic effects.

Retention effects

We begin by exploring whether experienced P4P affects retention rates among recruits. Specifically, we look for an impact on the likelihood that a recruit is still employed at midline in February 2017 at the start of Year 2, i.e., after experiencing pay-for-performance in Year 1, although before the performance awards were announced. To do so, we use a linear probability model of the form

$$\Pr[\text{employed}_{iqd} = 1] = \tau_E T^E_s + \gamma_q + \delta_d, \tag{7}$$

where $\text{employed}_{iqd}$ is an indicator for whether teacher $i$ with subject-family qualification $q$ in district $d$ is still employed by the school at the start of Year 2, and $\gamma_q$ and $\delta_d$ are the usual subject-family qualification and district indicators.

We emphasize that this material is exploratory; the hypotheses tested in this section were not part of our pre-analysis plan. That said, the structure of the analysis in this section does follow a related pre-analysis plan (intended for a companion paper) which we uploaded to our trial registry on October 3, 2018, prior to unblinding of our data.
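To make the linear probability model concrete, here is a minimal sketch of its simplest case. It assumes a single binary regressor (experienced treatment) and omits the qualification and district indicators of equation (7), so it is an illustration rather than the specification estimated in the paper. In this stripped-down case the OLS slope equals the difference in retention rates between the two arms. The function names are hypothetical.

```python
import statistics


def lpm_slope(employed, treated):
    """OLS slope of a 0/1 outcome on a 0/1 treatment, with an intercept.

    Assumption: no fixed effects or other controls are included.
    """
    mean_t = statistics.fmean(treated)
    mean_y = statistics.fmean(employed)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(treated, employed))
    var = sum((t - mean_t) ** 2 for t in treated)
    return cov / var


def retention_gap(employed, treated):
    """Retention rate among treated minus retention rate among controls."""
    rate_treated = statistics.fmean(y for y, t in zip(employed, treated) if t)
    rate_control = statistics.fmean(y for y, t in zip(employed, treated) if not t)
    return rate_treated - rate_control
```

With equal retention in both arms the slope is zero, which is the pattern the retention analysis looks for; adding indicator controls would require a multivariate least-squares solver.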

As the first column of Table 5 reports, our estimate of $\tau_E$ is zero with a randomization inference $p$-value of 0.94. There is no statistically significant impact of experienced P4P on retention of recruits; the retention rate is practically identical, at around 80 percent, among recruits experiencing P4P and those experiencing FW.

Table 5: Retention of placed recruits

| | (1) | (2) | (3) |
|---|---|---|---|
| Experienced P4P | 0.00 | -0.… | -0.… |
| | (0.94) | (0.42) | (0.…) |
| Interaction | | -0.05 | 0.15 |
| | | (0.39) | (0.37) |

$p$-value in parentheses. Randomization inference is conducted on the associated $t$ statistic. In each column, the outcome is an indicator for whether the teacher is still employed at the start of Year 2. The mean of this dependent variable for FW recruits is 0.80. In the second column, the specification includes an interaction of experienced treatment with the teacher’s baseline Grading Task IRT score (not de-meaned); in the third column, the interaction is with the teacher’s share sent in the baseline framed Dictator Game (again not de-meaned). All specifications include controls for districts and subjects of teacher qualification.

It is worth noting that there is also no impact of experienced P4P on intentions to leave in Year 3. In the endline survey in November 2017, we asked teachers the question: “How likely is it that you will leave your job at this school over the coming year?”. Answers were given on a 5-point scale. For analytical purposes we collapse these answers into a binary indicator coded to 1 for ‘very likely’ or ‘likely’ and 0 otherwise, and estimate specifications analogous to equations (5) and (6). As the second column of Table A.5 in Online Appendix A shows, there is no statistically significant impact of experienced P4P on recruits’ self-reported likelihood of leaving in Year 3. Our estimate of $\tau_E$ is -0.06 with a randomization inference $p$-value of 0.39.

Of course, a retention rate of 80 percent implies 20 percent attrition from Year 1 to Year 2, which is non-negligible. And the fact that retention rates are similar does not rule out the possibility of an impact of experienced P4P on the type of recruits retained. To explore this, we test whether experienced P4P induces differentially skilled recruits to be retained. Here, we use teachers’ performance on the baseline Grading Task in the primary subject they teach to obtain an IRT estimate of their ability in this subject, denoted $z_i$, and estimate an interacted model of the form

$$\Pr[\text{employed}_{iqd} = 1] = \tau_E T^E_s + \zeta T^E_s z_i + \beta z_i + \gamma_q + \delta_d. \tag{8}$$

Inference for the key parameter, $\zeta$, is undertaken by performing randomization inference for alternative assignments of the school-level experienced treatment indicator. As the second column of Table 5 reports, our estimate of $\zeta$ is -0.05, with a randomization inference $p$-value of 0.39. There is not a significant difference in selection-out on baseline teacher skill across the experienced treatments. Hence, there is no evidence that experienced P4P induces differentially skilled recruits to be retained.

Table 6: Characteristics of retained recruits at endline

| | Grading Task | Dictator Game |
|---|---|---|
| Experienced P4P | 0.68 | -0.04 |
| | (0.57) | (0.06) |

$p$-value in parentheses. Randomization inference is conducted on the associated $t$ statistic. In the first column, the outcome is the Grading Task score of the teacher at endline on a (raw) scale from 0 to 30; in the second column, it is the teacher’s share sent in the framed Dictator Game played at endline. All specifications include the outcome measured at baseline and controls for district and subject-of-qualification.

We also test whether experienced P4P induces differentially intrinsically motivated recruits to be retained. Here, we use the contribution sent in the framed Dictator Game played by all recruits at baseline, denoted $x_i$, and re-estimate the interacted model in equation (8), replacing $z_i$ with $x_i$. As the third column of Table 5 reports, our estimate of $\zeta$ in this specification is 0.15, with a randomization inference $p$-value of 0.37. There is not a significant difference in selection-out on baseline teacher intrinsic motivation across the experienced treatments. Hence, there is also no evidence that experienced P4P induces differentially intrinsically motivated recruits to be retained.

Changes in retained teacher characteristics

To assess whether experienced P4P changes within-retained-recruit teacher skill or intrinsic motivation from baseline to endline, we estimate the following ANCOVA specification

$$y_{iqsd2} = \tau_E T^E_s + \rho\, y_{iqsd0} + \gamma_q + \delta_d + e_{iqsd}, \tag{9}$$

where $y_{iqsd2}$ is the characteristic (raw Grading Task score or framed Dictator Game contribution) of retained recruit $i$ with qualification $q$ in school $s$ and district $d$ at endline (round 2), and $y_{iqsd0}$ is this characteristic of retained recruit $i$ at baseline (round 0). As the first column of Table 6 reports, our estimate of $\tau_E$ in the Grading Task specification is 0.68, with a randomization inference $p$-value of 0.57. Our estimate of $\tau_E$ in the Dictator Game specification is -0.04, with a randomization inference $p$-value of 0.06. Both estimates are small in magnitude and, in the case of the Dictator Game share sent, we reject the sharp null only at the 10 percent level. Hence, to the extent that contributions in the Dictator Game are positively associated with teachers’ intrinsic motivation, we find no evidence that the rising effects of experienced P4P from Year 1 to Year 2 are driven by positive changes in our measures of within-retained-recruit teacher skill or intrinsic motivation.

Before moving on, it is worth noting that the Dictator Game result could be interpreted as weak evidence that the experience of P4P contracts crowded out the intrinsic motivation of recruits. We do not have any related measures observed at both baseline and endline with which to further probe changes in motivation. However, we do have a range of related measures at endline: job satisfaction, likelihood of leaving, and positive/negative affect. As Table A.5 shows, there is no statistically significant impact of experienced P4P on any of these measures.

Further substantiating this point, Table A.6 in Online Appendix A shows the distribution of answers to the endline survey question: “What is your overall opinion about the idea of providing high-performing teachers with bonus payments on the basis of objective measures of student performance improvement?” The proportion giving a favorable answer exceeds 75 percent in every study arm. In terms of Figure 1, group a (recruits who both applied for and experienced FW) had the most negative view of pay-for-performance, while group c (who applied for FW but experienced P4P) had the most positive view. Hence it seems that it was the idea, rather than the reality, of pay-for-performance that was unpopular with (a minority of) recruits.

Although repeated play of lab experimental games may complicate interpretation in some contexts, several factors allay this concern here. First, unlike strategic games, the ‘Dictator Game’ has no second ‘player’ about whom to learn. Second, the two rounds of play were fully two years apart.

Compositional margin

To recap from Section 3.1, we find no evidence of an advertised treatment impact on the measured quality of applicants for upper-primary teaching posts in study districts, but we do find evidence of an advertised treatment impact on the measured intrinsic motivation of individuals who are placed into study schools. We draw three conclusions from these results.

First, potential applicants were aware of, and responded to, the labor market intervention. The differences in distributions across advertised treatment arms in Figure 3b (Dictator Game share sent) and Figure 4b (teacher value added in Year 2) show that the intervention changed behavior. Since these differences are for placed recruits not applicants, it could be that this behavior change was on the labor demand rather than supply side. In Figure A.2 in Online Appendix A, we plot the empirical probability of hiring as a quadratic function of the rank of an applicant’s TTC score within the set of applicants in their district. It is clear from the figure that the predicted probabilities are similar across P4P and FW labor markets. We also test formally whether the probability of hiring, as a function of CV characteristics (TTC score, age and gender), is the same under both P4P and FW advertisements. We find no statistically significant differences across advertised treatment arms.

We follow Bloom et al. (2015) in using the Maslach Burnout Index to capture job satisfaction and the Clark-Tellegen Index of positive and negative affect to capture the overall attitude of teachers.

We follow the phrasing used in the surveys run by Muralidharan and Sundararaman (2011a).

Consistent with our failure to find ‘surprise effects’ in student learning, there is no evidence that the re-randomization resulted in hostility toward pay-for-performance; if anything the reverse.

Note that this is a sufficient but not necessary test of the absence of a demand-side response. It is sufficient because districts do not interview applicants, so CVs give us the full set of characteristics that could determine hiring. It is not necessary, however, because we observe hires rather than offers. The probability that an offer is accepted could be affected by the advertised contract associated with that post, even if applicants apply to jobs of both types and even if DEOs do not take contract offer types into account when selecting the individuals to whom they would like to make offers.

positive effects on learning by recruits’ second year on the job. It therefore appears that only positively selected attribute(s) mattered, at least in the five core subjects that we assessed.

Finally, districts would struggle to achieve this compositional effect directly via the hiring process. The positively selected attribute(s) were not evident in the metrics observed at baseline, either in TTC scores or in the Grading Task scores that districts could in principle adopt. This suggests that there is not an obvious demand-side policy alternative to contractually induced supply-side selection.

Eﬀort margin

To recap from Section 3.2, we find evidence of a positive impact of experienced P4P on student learning, which is considerably larger (almost tripling in magnitude) in recruits’ second year on the job. In light of Section 3.3, we draw the following conclusions from these results.

The additional learning achieved by recruits working under P4P, relative to recruits working under FW, is unlikely to be due to selection-out, the compositional margin famously highlighted by Lazear (2000). Within-year teacher turnover was limited by design. Between-year turnover did happen but cannot explain the experienced P4P effect. In Online Appendix D, we show that the rank correlation between recruits’ baseline Grading Task IRT score and their teacher value added is positive. However, in Section 3.3 we reported that, if anything, selection-out on baseline teacher skill runs the wrong way to explain the experienced P4P effect.

Neither is the experienced P4P effect likely to be due to within-teacher changes in skill or motivation. We find no evidence that recruits working under P4P made greater gains on the Grading Task from baseline to endline than did recruits working under FW. As already noted, recruits’ Dictator Game share sent is not a good predictor of teacher value added. But even if it were, we find no evidence that recruits working under P4P contributed more from baseline to endline than did recruits working under FW; if anything, the reverse.

Instead, the experienced P4P effect is most plausibly driven by teacher effort. This conclusion follows from the arguments above and the direct evidence that recruits working under P4P provided greater inputs than did recruits working under FW. Specifically, the P4P contract encouraged recruits to be present in school more often and to use better pedagogy in the classroom, behaviors that were incentivized components of the P4P performance metric.
An alternative explanation for the null KS test on applicant TTC scores is that individuals applied everywhere. If this were true, we would expect to see most candidates make multiple applications, and a rejection of the null in a KS test on placed recruits’ TTC scores (if the supply-side response occurred at acceptance rather than application). We do not see either in the data.

Total effect

The total effect of the P4P contract combines both the advertised and experienced impacts: $\tau_A + \tau_E$. By the second year of the study, the within-year total effect of P4P is 0.04 + 0.16 = 0.20 standard deviations of pupil learning, which is statistically significant at the one percent level. Roughly four fifths of the total effect can thus be attributed to increased teacher effort, while the remainder arises from supply-side selection during recruitment. At a minimum, our results suggest that in relation to positive effort-margin effects, fears of pay-for-performance causing motivational crowd-out among new public-sector employees may be overstated.

Our estimates raise the question of why this effect is so much stronger in Year 2 compared to Year 1, particularly on the effort margin. One interpretation is that this is because it takes time for recruits to settle into the job and for the signal-to-noise ratio in our student learning measures to improve (Staiger and Rockoff, 2010). Consistent with this interpretation, we note that the impact of experienced P4P on incumbents did not increase in the second year. This interpretation suggests that Year 2 effects are the best available estimates of longer-term impacts.
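The decomposition of the Year 2 total effect can be written out as a short worked calculation, using the Model A point estimates reported in Table 3 (in standard deviations of pupil learning). The shares below are simple ratios of point estimates and ignore sampling uncertainty.

```latex
\begin{align*}
\tau_A + \tau_E &= 0.04 + 0.16 = 0.20, \\
\frac{\tau_E}{\tau_A + \tau_E} &= \frac{0.16}{0.20} = 0.80, \qquad
\frac{\tau_A}{\tau_A + \tau_E} = \frac{0.04}{0.20} = 0.20.
\end{align*}
```

The first ratio is the "roughly four fifths" attributed to effort; the second is the remaining fifth attributed to supply-side selection.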

This two-tier, two-year, randomized controlled trial featuring extensive data on teachers (their skills and motivations before starting work, multiple dimensions of their on-the-job performance, and whether they left their jobs) offers new insights into the compositional and effort margins of pay-for-performance. We found that potential applicants were aware of, and responded to, the first-tier labor market intervention. This supply-side response to advertised P4P was, if anything, beneficial for student learning. We also found a positive impact of experienced P4P that appears to stem from increased teacher effort, rather than selection-out or changes in measured skill or intrinsic motivation.

Given these encouraging results, it is natural to ask whether it would be feasible and cost effective to implement this P4P contract at scale. We worked closely with the government to design a contract that was contextually feasible and well-grounded in theory. A composite P4P metric was used to avoid narrowly emphasizing any single aspect of teacher performance and, when measuring learning, we followed the pay-for-percentile approach that aims to give all teachers a fair chance, regardless of the composition of the students they teach. We also took care to ensure that the P4P contract, if successful, could be built into the growth path of teacher wages. While a larger bonus might have elicited stronger impacts, the expected value of the P4P bonus was set at three percent of teacher salaries to be commensurate with annual teacher salary increments (and discretionary pay in other sectors under Rwanda’s imihigo system of performance contracts for civil servants).

The fact that we compared a P4P contract with an expenditure-equivalent fixed wage alternative that is equal in magnitude to annual teacher salary increments means that it is reasonable to think about cost effectiveness primarily in terms of measurement. For pupil learning, the minimum requirement for the P4P contract we study is a system of repeated annual assessments across grades and key subjects. Measurement of the other aspects of performance (teacher presence, preparation, and pedagogy) can in principle be conducted by head teachers or district staff (who are increasingly being asked to monitor teacher performance) at modest cost.

There are nonetheless limitations of our work. Inasmuch as the impacts on either the compositional or effort margin might differ after five or ten years, there is certainly scope for further study of this topic in low- and middle-income countries. For instance, it would be interesting to explore whether long-term P4P commitments influence early-career decisions to train as a teacher; our study restricts attention to employment choices by individuals who have already received TTC degrees.

Another set of issues relates to unintended consequences of pay-for-performance. We found that advertised P4P attracted teachers with lower intrinsic motivation, as measured by the share sent in the framed baseline Dictator Game. It is possible that the students taught by these more self-regarding teachers became more self-regarding themselves or otherwise developed different soft skills. We also found that experienced P4P improved performance on three of the four incentivized dimensions of the composite metric: teacher presence and pedagogy, and pupil learning. It is conceivable that the students in P4P schools may have been impacted by ‘multi-tasking’ as teachers focused on these dimensions to the detriment of others.
Since we did not measure aspects of studentdevelopment beyond test score gains, it would be interesting to explore these issues in future work.Rwanda’s labor market has a characteristic that is unusual for low- and middle-income countries:it has no public sector pay premium, and consequently many of those qualiﬁed to teach choose notto, making it more similar to high- income country labor markets in this regard. Whether thepositive eﬀects we ﬁnd in Rwanda of a multidimensional, pay-for-percentile contract—improvingperformance without dampening employee satisfaction—will generalize to settings where public-sector wage premiums diﬀer remains an open question, for the education sector and beyond. eferences Adnot, Melinda, Thomas Dee, Veronica Katz, and James Wyckoﬀ.

Educational Evaluation and Policy Analysis, 39(1): 54–76.

Anderson, Michael L, and Jeremy Magruder.

NBER Working Paper No. 23544.

Ashraf, Nava, James Berry, and Jesse M Shapiro.

American Economic Review, 100(5): 2382–2413.

Ashraf, Nava, Oriana Bandiera, and B. Kelsey Jack.

Journal of Public Economics, 120: 1–17.

Ashraf, Nava, Oriana Bandiera, Edward Davenport, and Scott S. Lee.

American Economic Review, 110(5): 1355–1394.

Banuri, Sheheryar, and Philip Keefer.

European Economic Review, 83: 139–164.

Barlevy, Gadi, and Derek Neal.

American Economic Review, 102(5): 1805–1831.

Bénabou, Roland, and Jean Tirole.

Review of Economic Studies, 70: 489–520.

Biasi, Barbara.

NBER Working Paper No. 24813.

Bloom, Nicholas, James Liang, John Roberts, and Zhichun Jenny Ying.

Quarterly Journal of Economics, 165–218.

Bold, Tessa, Deon Filmer, Gayle Martin, Ezequiel Molina, Brian Stacy, Christophe Rockmore, Jakob Svensson, and Waly Wane.

Journal of Economic Perspectives, 31(4): 185–204.

Brock, Michelle, Andreas Lange, and Kenneth Leonard.

Journal of Human Resources, 51(1): 133–162.

Callen, Michael, Saad Gulzar, Ali Hasanain, Yasir Khan, and Arman Rezaee.

NBER Working Paper no. 21180.

Chaudhury, Nazmul, Jeffrey Hammer, Michael Kremer, Karthik Muralidharan, and F. Halsey Rogers.

Journal of Economic Perspectives, 20(1): 91–116.

Chetty, Raj, John N Friedman, and Jonah E Rockoff. a. “Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates.” American Economic Review.

Chetty, Raj, John N Friedman, and Jonah E Rockoff. b. “Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood.” American Economic Review, 104(9): 2633–2679.

Chingos, Matthew M, and Martin R West.

Education Finance and Policy, 7(1): 8–43.

Cohen, Jessica, and Pascaline Dupas.

Quarterly Journal of Economics, 125(1): 1–45.

Dal Bó, Ernesto, and Frederico Finan.

EDI Working Paper.

Dal Bó, Ernesto, Frederico Finan, and Martin Rossi.

Quarterly Journal of Economics, 128(3): 1169–1218.

Danielson, Charlotte.

Enhancing professional practice: A framework for teaching. 2 ed., Alexandria, VA: Association for Supervision and Curriculum Development.

Deci, Edward L., and Richard M. Ryan.

Intrinsic motivation and self-determination in human behavior.

New York: Plenum.

Dee, Thomas, and James Wyckoff.

Journal of Policy, 34(2): 267–297.

Delfgaauw, Josse, and Robert Dur.

Economic Journal, 118(525): 171–191.

Deserranno, Erika.

American Economic Journal: Applied Economics, 11(1): 277–317.

DiCiccio, Cyrus J, and Joseph P Romano.

Journal of the American Statistical Association, 112(519): 1211–1220.

Duflo, Esther, Rema Hanna, and Stephen P Ryan.

American Economic Review, 102(4): 1241–78.

Eckel, Catherine, and Philip Grossman.

Games and Economic Behavior, 16(1): 181–191.

Einav, Liran, Amy Finkelstein, Stephen P. Ryan, Paul Schrimpf, and Mark R. Cullen.

American Economic Review, 103(1): 178–219.

Fafchamps, Marcel, and Julien Labonne.

Political Analysis, 25: 465–482.

Finan, Frederico, Benjamin A Olken, and Rohini Pande.

Handbook of Field Experiments. Vol. 2, ed. Abhijit Banerjee and Esther Duflo, 467–514. Elsevier.

Gilligan, Daniel O., Naureen Karachiwalla, Ibrahim Kasirye, Adrienne Lucas, and Derek A. Neal. forthcoming. “Educator incentives and educational triage in rural primary schools.”

Journal of Human Resources.

Glewwe, Paul, and Karthik Muralidharan.

Handbook of the Economics of Education. Vol. 5, 653–743. Elsevier.

Hanushek, Eric A, and Ludger Woessmann.

Journal of Economic Growth, 17: 267–321.

Humphreys, Macartan, Raul Sanchez de la Sierra, and Peter van der Windt.

Political Analysis, 21(1): 1–20.

Imbens, Guido W, and Donald B Rubin.

Causal inference for statistics, social, and biomedical sciences: An introduction.

Cambridge, U.K.: Cambridge University Press.

Imberman, Scott.

IZA World of Labor, No. 158.

Jackson, C. Kirabo, Jonah E. Rockoff, and Douglas O. Staiger.

Annual Review of Economics, 6: 801–825.

Karlan, Dean, and Jonathan Zinman.

Econometrica, 77(6): 1993–2008.

Kreps, David.

American Economic Review, 87(2): 359–64.

Lazear, Edward P.

American Economic Review, 90(5): 1346–1361.

Lazear, Edward P.

Swedish Economic Policy Review, 10(3): 179–214.

Leaver, Clare, Owen Ozier, Pieter Serneels, and Andrew Zeitlin.

AEA RCT Registry, October 23, https://doi.org/10.1257/rct.2565-5.0.

Leaver, Clare, Owen Ozier, Pieter Serneels, and Andrew Zeitlin.

American Economic Association [publisher], Inter-university Consortium for Political and Social Research [distributor], https://doi.org/10.3886/E121941V1.

Loyalka, Prashant, Sean Sylvia, Chengfang Liu, James Chu, and Yaojiang Shi.

Journal of Labor Economics, 37(3): 621–662.

Mbiti, Isaac, Mauricio Romero, and Youdi Schipper.

NBER Working Paper No. 25903.

Muralidharan, Karthik, and Venkatesh Sundararaman. a. “Teacher opinions on performance pay: Evidence from India.” Economics of Education Review, 30: 394–403.

Muralidharan, Karthik, and Venkatesh Sundararaman. b. “Teacher performance pay: Experimental evidence from India.” Journal of Political Economy, 119(1): 39–77.

Muralidharan, Karthik, Mauricio Romero, and Kaspar Wüthrich.

NBER Working Paper 26562.

National Institute of Statistics of Rwanda.

Republic of Rwanda, https://microdata.statistics.gov.rw/index.php/catalog/81 (accessed December 1, 2019).

Neal, Derek A.

Handbook of the Economics of Education. Vol. 4, ed. Eric A. Hanushek, Stephen J. Machin and Ludger Woessmann. Amsterdam: North Holland.

Olken, Benjamin A.

Journal of Economic Perspectives, 29(3): 61–80.

Rothstein, Jesse.

American Economic Review, 105(1): 100–130.

Sojourner, Aaron J, Elton Mykerezi, and Kristine L West.

Journal of Human Resources, 49(4): 945–981.

Staiger, Douglas O, and Jonah E Rockoff.

Journal of Economic Perspectives, 24(3): 97–118.

Stecher, Brian M., Deborah J. Holtzman, Michael S. Garet, Laura S. Hamilton, John Engberg, Elizabeth D. Steiner, Abby Robyn, Matthew D. Baird, Italo A. Gutierrez, Evan D. Peet, Iliana Brodziak de los Reyes, Kaitlin Fronberg, Gabriel Weinberger, Gerald P. Hunter, and Jay Chambers.

Santa Monica, CA: RAND Corporation.

Zeitlin, Andrew.

Journal of African Economies, 30(1): 81–102.

Online Appendix

Recruitment, effort, and retention effects of performance contracts for civil servants: Experimental evidence from Rwandan primary schools

Clare Leaver, Owen Ozier, Pieter Serneels, and Andrew Zeitlin

Appendix A Supplemental figures and tables

Figure A.1: Study proﬁle

Study sample deﬁnition

Randomization of labor markets to advertised contracts

Advertised P4P; Advertised FW; Advertised mixed

Applications placed at District Education Offices

Teachers placed into schools and assigned to classes

Baseline schools enrolled

164 schools enrolled in study

Randomization of schools to experienced contracts

Experienced P4P contracts

85 schools
176 new recruits at baseline (131 upper primary)
1,608 incumbent and other teachers at baseline (657 upper primary of these 1,608)
7,229 pupils assessed

Year 1 teacher inputs measured

Presence, preparation, pedagogy

Year 1 endline

Year 2 teacher inputs measured

Year 2 endline

Experienced FW contracts

79 schools
153 new recruits at baseline (125 upper primary)
1,459 incumbent and other teachers at baseline (595 upper primary of these 1,459)
6,602 pupils assessed

Year 1 endline

Year 2 teacher inputs measured

Year 2 endline

Figure A.2: Probability of hiring as a function of TTC score, by advertised treatment arm

[Figure: probability of placement (vertical axis) against TTC score (horizontal axis), shown separately for the FW and P4P arms.]

Note: The figure illustrates estimated hiring probability as a (quadratic) function of the rank of an applicant’s TTC final exam score within the set of applicants in their district.

Table A.1: Summary of hypotheses, outcomes, samples, and specifications

| Outcome | Sample | Test statistic | Randomization inference |
|---|---|---|---|
| Hypothesis I: Advertised P4P induces differential application qualities | | | |
| ∗ TTC exam scores | Universe of applications | KS test of eq. (1) | $\mathcal{T}^A$ |
| District exam scores | Universe of applications | KS test of eq. (1) | $\mathcal{T}^A$ |
| TTC exam scores | Universe of applications | $t_A$ in eq. (10) | $\mathcal{T}^A$ |
| TTC exam scores | Applicants in the top $\hat{H}$ number of applicants, where $\hat{H}$ is the predicted number of hires based on subject and district, estimated off of FW applicant pools | $t_A$ in eq. (10) | $\mathcal{T}^A$ |
| TTC exam scores | Universe of application, weighted by probability of placement | $t_A$ in eq. (10) | $\mathcal{T}^A$ |
| Number of applicants | Universe of applications | $t_A$ in eq. (11) | $\mathcal{T}^A$ |
| Hypothesis II: Advertised P4P affects the observable skills of placed recruits in schools | | | |
| ∗ Teacher skills assessment IRT model EB score | Placed recruits | $t_A$ in eq. (2) | $\mathcal{T}^A$ |
| Hypothesis III: Advertised P4P induces differentially ‘intrinsically’ motivated recruits to be placed in schools | | | |
| ∗ Dictator-game donations | Placed recruits | $t_A$ in eq. (2) | $\mathcal{T}^A$ |
| Perry PSM instrument | Placed recruits retained through Year 2 | $t_A$ in eq. (2) | $\mathcal{T}^A$ |
| Hypothesis IV: Advertised P4P induces the selection of higher- (or lower-) value-added teachers | | | |
| ∗ Student assessments (IRT EB predictions) | Pooled Year 1 & Year 2 students | $t_A$ in eq. (3) | $\mathcal{T}^A$ |
| Student assessments | Pooled Year 1 & Year 2 students | $t_A$ and $t_{A+AE}$; $t_{AE}$ in eq. (4) | $\mathcal{T}^A$; $\mathcal{T}^A \times \mathcal{T}^E$ |
| Student assessments | Year 1 students | $t_A$ in eq. (3) | $\mathcal{T}^A$ |
| Student assessments | Year 2 students | $t_A$ in eq. (3) | $\mathcal{T}^A$ |
| Hypothesis V: Experienced P4P creates incentives which contribute to higher (or lower) teacher value-added | | | |
| ∗ Student assessments (IRT EB predictions) | Pooled Year 1 & Year 2 students | $t_E$ in eq. (3) | $\mathcal{T}^E$ |
| Student assessments | Pooled Year 1 & Year 2 students | $t_E$ and $t_{E+AE}$; $t_{AE}$ in eq. (4) | $\mathcal{T}^E$; $\mathcal{T}^A \times \mathcal{T}^E$ |
| Student assessments | Year 1 students | $t_E$ in eq. (3) | $\mathcal{T}^E$ |
| Student assessments | Year 2 students | $t_E$ in eq. (3) | $\mathcal{T}^E$ |
| Hypothesis VI: Selection and incentive effects are apparent in the 4P performance metric | | | |
| ∗ Composite 4P metric | Teachers, pooled Year 1 (experienced P4P only) & Year 2 | $t_A$ in eq. (5) | $\mathcal{T}^A$ |
| Composite 4P metric | Teachers, pooled Year 1 (experienced P4P only) & Year 2 | $t_A$ and $t_{A+AE}$; $t_E$ and $t_{E+AE}$; $t_{AE}$ in eq. (6) | $\mathcal{T}^A$; $\mathcal{T}^E$; $\mathcal{T}^A \times \mathcal{T}^E$ |
| Barlevy-Neal rank | As above | | |
| Teacher attendance | As above | | |
| Classroom observation | As above | | |
| Lesson plan (indicator) | As above | | |

Note: Primary tests of each family of hypotheses appear first, preceded by a superscript ∗; those that appear subsequently under each family without the superscript ∗ are secondary hypotheses. Under inference, $\mathcal{T}^A$ refers to randomization inference involving the permutation of the advertised contractual status of the recruit only; $\mathcal{T}^E$ refers to randomization inference that includes the permutation of the experienced contractual status of the school; $\mathcal{T}^A \times \mathcal{T}^E$ indicates that randomization inference will permute both treatment vectors to determine a distribution for the relevant test statistic. Test statistic is a studentized coefficient or studentized sum of coefficients (a $t$ statistic), except where otherwise noted (as in Hypothesis I); in linear mixed effects estimates of equation (3) and (4), which are estimated by maximum likelihood, this is a $z$ rather than $t$ statistic, but we maintain notation to avoid confusion with the test score outcome, $z_{jbksr}$.

Table A.2: Measures of teacher inputs in P4P schools

| | Mean | St Dev | Obs |
|---|---|---|---|
| Year 1, Round 1 | | | |
| Teacher present | 0.97 | (0.18) | 640 |
| Has lesson plan | 0.53 | (0.50) | 569 |
| Classroom observation: Overall score | 2.01 | (0.40) | 631 |
| Lesson objective | 2.00 | (0.71) | 631 |
| Teaching activities | 1.94 | (0.47) | 631 |
| Use of assessment | 1.98 | (0.50) | 629 |
| Student engagement | 2.12 | (0.56) | 631 |
| Year 1, Round 2 | | | |
| Teacher present | 0.97 | (0.18) | 629 |
| Has lesson plan | 0.53 | (0.50) | 587 |
| Classroom observation: Overall score | 2.27 | (0.41) | 628 |
| Lesson objective | 2.22 | (0.76) | 627 |
| Teaching activities | 2.18 | (0.46) | 627 |
| Use of assessment | 2.23 | (0.48) | 627 |
| Student engagement | 2.46 | (0.49) | 628 |
| Year 2, Round 1 | | | |
| Teacher present | 0.91 | (0.29) | 675 |
| Has lesson plan | 0.79 | (0.41) | 568 |
| Classroom observation: Overall score | 2.37 | (0.34) | 520 |
| Lesson objective | 2.45 | (0.68) | 520 |
| Teaching activities | 2.28 | (0.43) | 518 |
| Use of assessment | 2.25 | (0.47) | 519 |
| Student engagement | 2.49 | (0.45) | 520 |

Note: Descriptive statistics for upper-primary teachers only. Overall score for the classroom observation is the average of four components: lesson objective, teaching activities, use of assessment, and student engagement, with each component scored on a scale from 0 to 3.

Table A.3: Impacts of advertised P4P on characteristics of placed recruits

Primary outcomes: Teacher skills, DG contribution. Exploratory outcomes: Age, Female, Risk aversion, Big Five.

| | Teacher skills | DG contribution | Age | Female | Risk aversion | Big Five |
|---|---|---|---|---|---|---|
| Advertised P4P | -0.184 | -0.100 | -0.161 | 0.095 | 0.010 | -0.007 |
| | [-0.836, 0.265] | [-0.160, -0.022] | [-1.648, 1.236] | [-0.151, 0.255] | [-0.125, 0.208] | [-0.270, 0.310] |
| | (0.367) | (0.029) | (0.782) | (0.325) | (0.859) | (0.951) |
| Observations | 242 | 242 | 242 | 242 | 242 | 241 |

Note: The table reports the point estimate of $\tau_A$, together with the 95 percent confidence interval in brackets, and the randomization inference $p$-value in parentheses, from the specification in equation (2). The primary outcomes are described in detail in Section 2.2. In the third column, the outcome is placed recruit age, measured in years. In the fourth column, the outcome is coded to 1 for female recruits and 0 for males. In the fifth column, the outcome is a binary measure of risk aversion constructed from placed recruits’ responses in a hypothetical lottery choice game (Chetan et al., 2010; Eckel and Grossman, 2008). It is coded to 1 when the respondent chooses either of the two riskiest of the five available lotteries, and 0 otherwise (53 percent of the sample make one of these choices). In the final column, the outcome is an index of the Big Five personality traits constructed from the 15 item version, validated by Lang et al. (2011) and following Dohmen and Falk (2010).

Table A.4: Impacts on student learning, OLS model

| | Pooled | Year 1 | Year 2 |
|---|---|---|---|
| Model A: Direct effects only | | | |
| Advertised P4P ($\tau_A$) | 0.03 | -0.03 | 0.08 |
| | [-0.04, 0.14] | [-0.10, 0.08] | [-0.03, 0.24] |
| | (0.37) | (0.51) | (0.10) |
| Experienced P4P ($\tau_E$) | 0.13 | 0.10 | 0.17 |
| | [0.03, 0.24] | [0.00, 0.20] | [0.04, 0.32] |
| | (0.01) | (0.05) | (0.02) |
| Experienced P4P × Incumbent ($\lambda_E$) | -0.09 | -0.10 | -0.09 |
| | [-0.31, 0.15] | [-0.32, 0.16] | [-0.34, 0.16] |
| | (0.44) | (0.40) | (0.48) |
| Model B: Interactions between advertised and experienced contracts | | | |
| Advertised P4P ($\tau_A$) | 0.04 | -0.03 | 0.12 |
| | [-0.07, 0.23] | [-0.14, 0.13] | [-0.03, 0.33] |
| | (0.41) | (0.59) | (0.10) |
| Experienced P4P ($\tau_E$) | 0.14 | 0.10 | 0.17 |
| | [0.03, 0.26] | [-0.02, 0.22] | [0.02, 0.35] |
| | (0.01) | (0.11) | (0.03) |
| Advertised P4P × Experienced P4P ($\tau_{AE}$) | -0.03 | 0.01 | -0.06 |
| | [-0.22, 0.17] | [-0.18, 0.21] | [-0.32, 0.18] |
| | (0.72) | (0.97) | (0.60) |
| Experienced P4P × Incumbent ($\lambda_E$) | -0.09 | -0.09 | -0.09 |
| | [-0.52, 0.36] | [-0.47, 0.40] | [-0.56, 0.51] |
| | (0.62) | (0.62) | (0.68) |
| Observations | 154594 | 70821 | 83773 |

Note: For each estimated parameter, or combination of parameters, the table reports the point estimate (stated in standard deviations of student learning), 95 percent confidence interval in brackets, and $p$-value in parentheses. Randomization inference is conducted on the associated $t$ statistic. The measure of student learning is based on the empirical Bayes estimate of student ability from a two-parameter IRT model, as described in Section 2.3.

Table A.5: Teacher endline survey responses

| | Job satisfaction | Likelihood of leaving | Positive affect | Negative affect |
|---|---|---|---|---|
| Model A: Direct effects only | | | | |
| Advertised P4P | -0.04 | -0.07 | -0.06 | -0.02 |
| | [-0.41, 0.48] | [-0.27, 0.08] | [-0.44, 0.33] | [-0.29, 0.32] |
| | (0.82) | (0.36) | (0.74) | (0.86) |
| Experienced P4P | 0.05 | -0.06 | -0.00 | 0.09 |
| | [-0.25, 0.36] | [-0.18, 0.06] | [-0.28, 0.28] | [-0.14, 0.33] |
| | (0.72) | (0.39) | (0.99) | (0.47) |
| Experienced P4P × Incumbent | -0.00 | 0.04 | 0.04 | -0.07 |
| | [-0.45, 0.48] | [-0.13, 0.21] | [-0.45, 0.52] | [-0.50, 0.37] |
| | (0.99) | (0.61) | (0.84) | (0.70) |
| Model B: Interactions between advertised and experienced contracts | | | | |
| Advertised P4P | -0.10 | -0.01 | 0.02 | -0.33 |
| | [-0.57, 0.55] | [-0.26, 0.18] | [-0.52, 0.44] | [-0.75, 0.30] |
| | (0.67) | (0.93) | (0.89) | (0.20) |
| Experienced P4P | 0.08 | -0.07 | -0.02 | -0.25 |
| | [-0.42, 0.54] | [-0.27, 0.14] | [-0.56, 0.47] | [-0.67, 0.17] |
| | (0.75) | (0.50) | (0.93) | (0.23) |
| Advertised P4P × Experienced P4P | 0.13 | -0.13 | -0.16 | 0.64 |
| | [-0.66, 0.85] | [-0.42, 0.14] | [-0.81, 0.43] | [0.04, 1.28] |
| | (0.71) | (0.34) | (0.59) | (0.03) |
| Experienced P4P × Incumbent | -0.03 | 0.05 | 0.06 | 0.27 |
| | [-0.90, 0.90] | [-0.28, 0.37] | [-0.86, 0.90] | [-0.54, 1.09] |
| | (0.92) | (0.69) | (0.84) | (0.40) |
| Observations | 1483 | 1492 | 1474 | 1447 |
| FW recruit mean (SD) | 5.42 | 0.26 | 0.31 | 0.00 |
| | (0.90) | (0.44) | (0.93) | (0.99) |
| FW incumbent mean (SD) | 5.26 | 0.29 | -0.05 | 0.00 |
| | (1.10) | (0.46) | (1.00) | (1.04) |

Note: For each estimated parameter, or combination of parameters, the table reports the point estimate (stated in standard deviations of student learning), 95 percent confidence interval in brackets, and $p$-value in parentheses. Randomization inference is conducted on the associated $t$ statistic. Outcomes are constructed as follows: job satisfaction is scored on a 7-point scale with higher numbers representing greater satisfaction; likelihood of leaving is a binary indicator coded to 1 if the teacher responds that they are likely or very likely to leave their job at the current school over the coming year; positive affect and negative affect are standardized indices derived from responses on a 5-point Likert scale.

Table A.6: Teacher attitudes toward pay-for-performance at endline

| | Very unfavorable | Somewhat unfavorable | Neutral | Somewhat favorable | Very favorable |
|---|---|---|---|---|---|
| Recruits applying under FW (64) | 4.7% | 4.7% | 7.8% | 10.9% | 71.9% |
| – Experiencing FW (33) | 6.1% | 9.1% | 9.1% | 3.0% | 72.7% |
| – Experiencing P4P (31) | 3.2% | 0.0% | 6.5% | 19.4% | 71.0% |
| Recruits applying under P4P (60) | 5.0% | 3.3% | 8.3% | 1.7% | 81.7% |
| – Experiencing FW (32) | 6.2% | 0.0% | 6.2% | 0.0% | 87.5% |
| – Experiencing P4P (28) | 3.6% | 7.1% | 10.7% | 3.6% | 75.0% |
| Incumbent teachers (1,113) | 5.0% | 7.5% | 7.2% | 9.9% | 70.4% |
| – Experiencing FW (537) | 5.2% | 8.6% | 8.0% | 8.6% | 69.6% |
| – Experiencing P4P (576) | 4.9% | 6.6% | 6.4% | 11.1% | 71.0% |

Note: The table reports the distribution of answers to the following question on the endline teacher survey: “What is your overall opinion about the idea of providing high-performing teachers with bonus payments on the basis of objective measures of student performance improvement?” Figures in parentheses give the number of respondents in each treatment category.

Appendix B Theory

This appendix sets out a simple theoretical framework, adapted from Leaver, Lemos and Scur (2019), that closely mirrors the experimental design described in Section 1. We used this framework as a device to organize our thinking when choosing what hypotheses to test in our pre-analysis plan. We did not view the framework as a means to deliver sharp predictions for one-tailed tests.

The model

We focus on an individual who has just completed teacher training, and who must decide whether to apply for a teaching post in a public school, or a job in a generic ‘outside sector’.

Preferences. The individual is risk neutral and cares about compensation $w$ and effort $e$. Effort costs are sector-specific. The individual’s payoff in the education sector is $w - (e^2 - \tau e)$, while her payoff in the outside sector is $w - e^2$. The parameter $\tau \geq 0$ captures her intrinsic motivation to teach, and can be thought of as the realization of a random variable. The individual observes her realization $\tau$ perfectly, while (at the time of hiring) employers observe nothing.

Performance metrics. Irrespective of where the individual works, her effort generates a performance metric $m = e\theta + \varepsilon$. The parameter $\theta \geq 1$ represents her ability, and can also be thought of as the realization of a random variable. The individual observes her realization of $\theta$ perfectly, while (at the time of hiring) employers observe nothing. Draws of the error term $\varepsilon$ are made from $U[\underline{\varepsilon}, \bar{\varepsilon}]$, and are independent across employments.

Compensation schemes. Different compensation schemes are available depending on advertised treatment status. In the advertised P4P treatment, individuals choose between: (i) an education contract of the form $w^G + B$ if $m \geq \underline{m}$, or $w^G$ otherwise; and (ii) an outside option of the form $w^O$ if $m \geq \bar{m}$, or 0 otherwise. In the advertised FW treatment, individuals choose between: (i) an education contract of the form $w^F$; and (ii) the same outside option. In our experiment, the bonus $B$ was valued at RWF 100,000, and the fixed-wage contract exceeded the guaranteed income in the P4P contract by RWF 20,000 (i.e. $w^F - w^G = 20{,}000$).

Timing. The timing of the game is as follows.

1. Outside options and education contract offers are announced.
2. Nature chooses type $(\tau, \theta)$.
3. Individuals observe their type $(\tau, \theta)$, and choose which sector to apply to.
4. Employers hire (at random) from the set of applicants.
5. Surprise re-randomization occurs.
6. Individuals make effort choice $e$.
7. Individuals’ performance metric $m$ is realized, with $\varepsilon \sim U[\underline{\varepsilon}, \bar{\varepsilon}]$.
8. Compensation paid in line with (experienced) contract offers.

Footnote: Leaver, Lemos and Scur (2019) focus on a teacher who chooses between three alternatives: (i) accepting an offer of a job in a public school on a fixed wage contract, (ii) declining and applying for a job in a private school on a pay-for-performance contract, and (iii) declining and applying for a job in an outside sector on a different performance contract.

Figure B.1: Compensation schemes in the numerical example

[Figure: compensation (vertical axis) against the performance metric (horizontal axis) for the fixed-wage, P4P, and outside-option schemes.]

Numerical example. To illustrate how predictions can be made using this framework, we draw on a numerical example. First, in terms of the compensation schemes, we assume that $w^O = 50$, $B = 40$, $w^G = 15$, $\underline{m} = 1$, and $\bar{m} = 4$ […]. These values, together with $\bar{\varepsilon} = -\underline{\varepsilon} = 5$, pin down effort and occupational choices by a given $(\tau, \theta)$-type. If, in addition, we make assumptions concerning the distributions of $\tau$ and $\theta$, then we can also make statements about the expected intrinsic motivation and expected ability of applicants, and the expected performance of placed recruits. Here, since our objective is primarily pedagogical, we go for the simplest case possible and assume that $\tau$ and $\theta$ are drawn independently from uniform distributions. Specifically, $\tau$ is drawn from $U[0, …]$ and $\theta$ is drawn from $U[1, …]$.

Analysis

As usual, we solve backwards, starting with effort choices.

Effort incentives. Effort choices under the three compensation schemes are:

$$e^F = \frac{\tau}{2}, \qquad e^P = \frac{\theta B}{2(\bar{\varepsilon} - \underline{\varepsilon})} + \frac{\tau}{2}, \qquad e^O = \frac{\theta w^O}{2(\bar{\varepsilon} - \underline{\varepsilon})},$$

where we have used the fact that $\varepsilon$ is drawn from a uniform distribution. Intuitively, effort incentives are higher under P4P than under FW, i.e. $e^P > e^F$.

Figure B.2: Decision rules under alternative contract offer treatments

[Figure: two panels with $\theta$ and $\tau$ on the axes. The P4P panel shows the regions ‘Education P4P’ and ‘Other’, separated by $\tau^*(\theta)$; the FW panel shows the regions ‘Education FW’ and ‘Other’, separated by $\tau^{**}(\theta)$.]

Supply-side selection. The individual applies for a teaching post advertised under P4P if, given her $(\tau, \theta)$ type, she expects to receive a higher payoff teaching in a school on the P4P contract than working in the outside sector. We denote the set of such $(\tau, \theta)$ types by $\mathcal{T}^P$. Similarly, the individual applies for a teaching post advertised under FW if, given her $(\tau, \theta)$ type, she expects to receive a higher payoff teaching in a school on the FW contract than working in the outside sector. We denote the set of such $(\tau, \theta)$ types by $\mathcal{T}^F$. Figure B.2 illustrates these sets for the numerical example. Note that the function $\tau^*(\theta)$ traces out motivational types who, given their ability, are just indifferent between applying to the education sector under advertised P4P and applying to the outside sector, i.e.:

$$\Pr\left[\theta e^P + \varepsilon > \underline{m}\right] B + w^G - (e^P)^2 + \tau^* e^P = \Pr\left[\theta e^O + \varepsilon > \bar{m}\right] w^O - (e^O)^2,$$

where $\underline{m}$ is the performance threshold in the education contract and $\bar{m}$ is the threshold in the outside option. Similarly, the function $\tau^{**}(\theta)$ traces out motivational types who, given their ability, are just indifferent between applying to the education sector under advertised FW and applying to the outside sector, i.e.:

$$w^F - (e^F)^2 + \tau^{**} e^F = \Pr\left[\theta e^O + \varepsilon > \bar{m}\right] w^O - (e^O)^2.$$

In the numerical example, we see a case of positive selection on intrinsic motivation and negative selection on ability under both the FW and P4P treatments. But there is less negative selection on ability under P4P than under FW.
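The effort choices above can be checked numerically. The following is a minimal sketch, not the authors' code: it takes $B = 40$, $w^O = 50$, $w^G = 15$ and $\bar{\varepsilon} = -\underline{\varepsilon} = 5$ from the numerical example, and it assumes a quadratic effort cost, a fixed wage of 20, and performance thresholds of 1 and 4 purely for illustration. The function names are hypothetical. The closed forms only hold where the probability of clearing the threshold is strictly between 0 and 1, which is the case at the type used below.

```python
import numpy as np

# Taken from the numerical example in the text.
B, W_O, W_G = 40.0, 50.0, 15.0
EPS_LO, EPS_HI = -5.0, 5.0
# ASSUMPTIONS for illustration only: fixed wage and both thresholds.
W_F = 20.0
M_EDU, M_OUT = 1.0, 4.0


def prob_above(x, m):
    """Pr[x + eps > m] for eps ~ U[EPS_LO, EPS_HI], clipped to [0, 1]."""
    return np.clip((EPS_HI - (m - x)) / (EPS_HI - EPS_LO), 0.0, 1.0)


def payoff_fw(e, tau, theta):
    # Fixed wage: pay does not depend on the performance metric.
    return W_F - e**2 + tau * e


def payoff_p4p(e, tau, theta):
    # Guaranteed income plus bonus B when the metric clears the threshold.
    return prob_above(theta * e, M_EDU) * B + W_G - e**2 + tau * e


def payoff_outside(e, tau, theta):
    # Outside option: no intrinsic-motivation term in the effort cost.
    return prob_above(theta * e, M_OUT) * W_O - e**2


def closed_form_efforts(tau, theta):
    """Interior first-order conditions under the three schemes."""
    spread = EPS_HI - EPS_LO
    return {
        "FW": tau / 2,
        "P4P": theta * B / (2 * spread) + tau / 2,
        "outside": theta * W_O / (2 * spread),
    }


def grid_argmax(payoff, tau, theta, e_max=10.0, n=100001):
    """Brute-force maximizer of expected payoff over an effort grid."""
    grid = np.linspace(0.0, e_max, n)
    return float(grid[np.argmax(payoff(grid, tau, theta))])
```

For a type with $\tau = 1$ and $\theta = 1.5$, `closed_form_efforts(1.0, 1.5)` gives 0.5 under FW, 3.5 under P4P and 3.75 in the outside sector, and `grid_argmax` agrees with each value to within the grid spacing.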

Empirical implications

We used this theoretical framework when writing our pre-analysis plan to clarify what hypotheses to test. We summarize this process for Hypotheses I and VI below.

Hypothesis I: Advertised P4P induces differential application qualities.

Define $\mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}$ and $\mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}$ as indicator functions for the application event in the advertised FW and P4P treatments respectively. The difference in expected intrinsic motivation and expected ability across the two advertised treatments can be written as:

$$\mathrm{E}\left[\tau \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}\right] - \mathrm{E}\left[\tau \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}\right] \quad \text{and} \quad \mathrm{E}\left[\theta \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}\right] - \mathrm{E}\left[\theta \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}\right].$$

In the numerical example, both differences are negative: expected intrinsic motivation and expected ability are higher in the P4P treatment than in the FW treatment.
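The two differences in Hypothesis I can be approximated by simulation. The sketch below is illustrative only and is not the authors' code: it draws types from uniform distributions, finds each type's best payoff in each sector over an effort grid, forms the application indicators, and averages. The values $w^O = 50$, $B = 40$, $w^G = 15$ and $\bar{\varepsilon} = -\underline{\varepsilon} = 5$ come from the numerical example. The fixed wage, the two thresholds, the upper bounds of the type distributions and the quadratic effort cost are assumptions, so the output need not match the signs reported in the text.

```python
import numpy as np

# Taken from the numerical example in the text.
W_O, B, W_G = 50.0, 40.0, 15.0
EPS_LO, EPS_HI = -5.0, 5.0
# ASSUMPTIONS (hypothetical values, chosen only for illustration).
W_F = 20.0
M_EDU, M_OUT = 1.0, 4.0
TAU_MAX, THETA_MAX = 10.0, 5.0


def prob_above(x, m):
    """Pr[x + eps > m] for eps ~ U[EPS_LO, EPS_HI], clipped to [0, 1]."""
    return np.clip((EPS_HI - (m - x)) / (EPS_HI - EPS_LO), 0.0, 1.0)


def best_payoffs(tau, theta, n_grid=401, e_max=20.0):
    """Maximized expected payoff per type under FW, P4P and the outside option."""
    e = np.linspace(0.0, e_max, n_grid)[None, :]  # effort grid, shape (1, G)
    t, th = tau[:, None], theta[:, None]          # types, shape (N, 1)
    v_fw = (W_F - e**2 + t * e).max(axis=1)
    v_p4p = (prob_above(th * e, M_EDU) * B + W_G - e**2 + t * e).max(axis=1)
    v_out = (prob_above(th * e, M_OUT) * W_O - e**2).max(axis=1)
    return v_fw, v_p4p, v_out


def selection_differences(n=5000, seed=0):
    """Monte Carlo estimates of E[x * 1{T^F}] - E[x * 1{T^P}] for x = tau, theta."""
    rng = np.random.default_rng(seed)
    tau = rng.uniform(0.0, TAU_MAX, n)
    theta = rng.uniform(1.0, THETA_MAX, n)
    v_fw, v_p4p, v_out = best_payoffs(tau, theta)
    in_tf = v_fw > v_out    # applies to education under advertised FW
    in_tp = v_p4p > v_out   # applies to education under advertised P4P
    return {
        "share_apply_FW": float(in_tf.mean()),
        "share_apply_P4P": float(in_tp.mean()),
        "diff_tau": float((tau * in_tf).mean() - (tau * in_tp).mean()),
        "diff_theta": float((theta * in_tf).mean() - (theta * in_tp).mean()),
    }
```

Calling `selection_differences()` returns the application shares under each advertised contract together with the two differences; changing the assumed constants shows how the selection sets move with the contract terms.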

Hypothesis VI: Selection and incentive effects are apparent in the composite 4P performance metric.

We start with the selection effect. Maintaining the assumption of no demand-side selection treatment effects, and using the decomposition in Leaver, Lemos and Scur (2019), we can write the difference in expected performance across sub-groups $a$ and $b$ (i.e. placed recruits who experienced FW) as:

$$\mathrm{E}[m_a] - \mathrm{E}[m_b] = \underbrace{\mathrm{E}\left[(\theta e^F - \theta e^F) \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}\right]}_{\text{incentive effect} = 0} + \underbrace{\mathrm{E}\left[\theta e^F \cdot \left(\mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\} - \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}\right)\right]}_{\text{selection effect}}.$$

Similarly, the difference in expected performance across sub-groups $c$ and $d$ (i.e. placed recruits who experienced P4P) can be written as:

$$\mathrm{E}[m_c] - \mathrm{E}[m_d] = \underbrace{\mathrm{E}\left[(\theta e^P - \theta e^P) \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}\right]}_{\text{incentive effect} = 0} + \underbrace{\mathrm{E}\left[\theta e^P \cdot \left(\mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\} - \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}\right)\right]}_{\text{selection effect}}.$$

In the numerical example, both differences are negative, and the second is larger than the first.

Turning to the incentive effect, we can write the difference in expected performance across sub-groups $a$ and $c$ (i.e. placed recruits who applied under advertised FW) as:

$$\mathrm{E}[m_a] - \mathrm{E}[m_c] = \underbrace{\mathrm{E}\left[(\theta e^F - \theta e^P) \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}\right]}_{\text{incentive effect}} + \underbrace{\mathrm{E}\left[\theta e^F \cdot \left(\mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\} - \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^F\}\right)\right]}_{\text{selection effect} = 0}.$$

Similarly, the difference in expected performance across sub-groups $b$ and $d$ (i.e. placed recruits who applied under advertised P4P) can be written as:

$$\mathrm{E}[m_b] - \mathrm{E}[m_d] = \underbrace{\mathrm{E}\left[(\theta e^F - \theta e^P) \cdot \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}\right]}_{\text{incentive effect}} + \underbrace{\mathrm{E}\left[\theta e^P \cdot \left(\mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\} - \mathbf{1}\{(\tau,\theta) \in \mathcal{T}^P\}\right)\right]}_{\text{selection effect} = 0}.$$

In the numerical example, both differences are negative, and the second is larger than the first.

Hypotheses IV and V focus on one component of the performance metric (student performance) and follow from the above.

Appendix C Applications

Here, we report results from secondary tests of Hypothesis I: advertised P4P induces differential application qualities, and also provide a robustness check of our assumption that district-by-subject-family labour markets are distinct.

Secondary tests

Our pre-analysis plan included a small number of secondary tests of Hypothesis I (see Table A.1). Three of these tests use estimates from TTC score regressions of the form

$$y_{iqd} = \tau_A T^A_{qd} + \gamma_q + \delta_d + e_{iqd}, \quad \text{with weights } w_{iqd}, \tag{10}$$

where $y_{iqd}$ denotes the TTC exam score of applicant teacher $i$ with qualification $q$ in district $d$ and treatment $T^A_{qd}$ denotes the contractual condition under which a candidate applied. The weighted regression parameter $\tau_A$ estimates the difference in (weighted) mean applicant skill induced by advertised P4P. The fourth test is for a difference in the number of applicants by treatment status, conditional on district and subject-family indicators. Here, we use a specification of the form

$$\log N_{qd} = \tau_A T^A_{qd} + \gamma_q + \delta_d + e_{qd}, \tag{11}$$

where $q$ indexes subject families and $d$ indexes districts; $N_{qd}$ measures the number of qualified applicants in each district. Although our pre-analysis plan proposes a fifth test (a KS test of equation (1) using district exam scores) we did not do this because our sample of these scores was incomplete.

Footnote: ‘Qualified’ here means that the applicant has a TTC degree. In addition to being a useful filter for policy-relevant applications, since only qualified applicants can be hired, in some districts’ administrative data this is also necessary in order to determine the subject-family under which an individual has applied.

To undertake inference about these differences in means, we use randomization inference, sampling repeatedly from the set of potential (advertised) treatment assignments $\mathcal{T}^A$. Following Chung and Romano (2013), we studentize this parameter by dividing it by its (cluster-robust, clustered at the district-subject level) standard error to control the asymptotic rejection probability against the null hypothesis of equality of means. These are two-sided tests. The absolute value of the resulting test statistic, $|t_A|$, is compared to its randomization distribution in order to provide a test of the hypothesis that $\tau_A = 0$.

Footnote: We calculated $p$-values for two-sided tests as provided in Rosenbaum (2010) and in the ‘Standard Operating Procedures’ of Donald Green’s Lab at Columbia (Lin, Green and Coppock, 2016).

Results are in Table C.1. The first column restates the confidence interval and $p$-value from the KS test for comparison purposes. The second column reports results for the TTC score regression where all observations are weighted equally (i.e. a random hiring rule, as assumed in the theory). Our estimate of $\tau_A$ is $-0.001$, and the $p$-value is 0.984.

Table C.1: Secondary tests of impacts on teacher ability in application pool

| | KS | Unweighted | Empirical weights | Top $\hat{H}$ | Number of applicants |
|---|---|---|---|---|---|
| Advertised P4P | n.a. | -0.001 | -0.001 | -0.009 | -0.040 |
| | [-0.020, 0.020] | [-0.040, 0.036] | [-0.038, 0.032] | [-0.025, 0.008] | [-0.306, 0.292] |
| | (0.909) | (0.984) | (0.948) | (0.331) | (0.811) |
| Observations | 1715 | 1715 | 1715 | 1715 | 18 |

Note: The first column shows the confidence interval in brackets, and the $p$-value in parentheses, from the primary KS test discussed in Section 3.1. The second column reports the (unweighted OLS) point estimate of $\tau_A$ from the applicant TTC exam score specification in (10). The third and fourth columns report the point estimate of $\tau_A$ from the same specification with the stated weights. The fifth column reports the point estimate of $\tau_A$ from the number of applicants per labor market specification in (11), with the outcome $N_{qd}$ in logs.

The third column reports results for the TTC score regression with weights $w_{iqd} = \hat{p}_{iqd}$, where $\hat{p}_{iqd}$ is the estimated probability of being hired as a function of district and subject indicators, as well as a fifth-order polynomial in TTC exam scores, estimated using FW applicant pools only (i.e. the status quo mapping from TTC scores to hiring probabilities). The fourth column reports results for the TTC score regression with weights $w_{iqd} = 1$ for the top $\hat{H}$ teachers in their application pool, and zero otherwise (i.e. a meritocratic hiring rule based on TTC scores alone). Here, we test for impacts on the average ability of the top $\hat{H}$ applicants, where $\hat{H}$ is the predicted number hired in that district and subject based on outcomes in advertised FW district-subjects. Neither set of weights changes the conclusion from the second column: we cannot reject the sharp null of no impact of advertised P4P. The final column reports results for the (logged) application volume regression. Our estimate of $\tau_A$ is $-0.040$, with a $p$-value of 0.811.

Robustness

To illustrate the implications of cross-district applications, consider an individual living in, say, Ngoma with the TTC qualification of TSS. On the assumption that this individual is willing to travel only to the neighbouring district of Rwamagana, she could be impacted by the contractual offer of P4P in her home ‘Ngoma-TSS’ market and/or the contractual offer of P4P in the adjacent ‘Rwamagana-TSS’ market. That is, she might apply in both markets, or in Rwamagana instead of Ngoma, which is what we term a cross-district labor-supply effect. The former behavior would simply make it harder to detect a selection effect at the application stage (although not at the placement stage since only one job can be accepted). But the latter cross-district labor-supply effect would be more worrying. We would not find a selection effect where none existed (without a direct effect of advertised P4P on a given market, there cannot be cross-district effects by this posited mechanism), but we might overstate the magnitude of any selection effect.

Our random assignment provides us with an opportunity to test for the presence of cross-district labor-supply effects. To do so, we construct an adjacency matrix, defining two labor markets as adjacent if they share a physical border and the same TTC subject-family qualification. We then construct a count of the number of adjacent markets that are assigned to Advertised P4P, and an analogous count for ‘mixed’ treatment status. Conditional on the number of adjacent markets, this measure of the local saturation of P4P is randomly determined by the experimental assignment of districts to advertised contractual conditions.
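The construction of the local saturation measure can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `saturation_counts`, the district labels `D1` to `D3`, the second subject-family label `OTHER`, the border pairs, and the treatment assignment are all hypothetical and chosen only to show how the counts are formed.

```python
from itertools import product


def saturation_counts(markets, borders, assignment):
    """Count adjacent markets, overall and by treatment arm, for each market.

    markets:    list of (district, subject_family) tuples
    borders:    set of frozenset({district_a, district_b}) pairs that share a border
    assignment: dict mapping each market to "P4P", "FW", or "mixed"
    """
    counts = {}
    for district, subject in markets:
        # Two markets are adjacent if they have the same subject family
        # and their districts share a physical border.
        neighbours = [
            (d, s) for d, s in markets
            if s == subject and frozenset((district, d)) in borders
        ]
        counts[(district, subject)] = {
            "n_adjacent": len(neighbours),
            "n_adjacent_p4p": sum(assignment[m] == "P4P" for m in neighbours),
            "n_adjacent_mixed": sum(assignment[m] == "mixed" for m in neighbours),
        }
    return counts


# Hypothetical toy geography: D1 borders D2, and D2 borders D3.
districts = ["D1", "D2", "D3"]
subjects = ["TSS", "OTHER"]
markets = list(product(districts, subjects))
borders = {frozenset(("D1", "D2")), frozenset(("D2", "D3"))}

# Hypothetical treatment assignment, for illustration only.
assignment = {
    ("D1", "TSS"): "P4P", ("D2", "TSS"): "FW", ("D3", "TSS"): "mixed",
    ("D1", "OTHER"): "FW", ("D2", "OTHER"): "P4P", ("D3", "OTHER"): "FW",
}

counts = saturation_counts(markets, borders, assignment)
print(counts[("D2", "TSS")])
```

In a regression of a market-level outcome on own treatment status and `n_adjacent_p4p`, the variable `n_adjacent` would enter as the control, since the saturation count is as-good-as-random only conditional on the number of neighbours.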
A regression of labor-market outcomes in a given district on both its own advertised contractual status (direct effect) and this measure of local saturation, conditional on the number of neighboring labor markets, provides an estimate of cross-district labor-supply effects and, by randomization inference, a test for their presence.

Table C.2: Cross-district effects in teacher labor market outcomes

                        TTC scores         Number of applicants
Advertised P4P          0.032              -0.085
                        [-0.050, 0.103]    [-0.469, 0.972]
                        (0.297)            (0.900)
Adjacent P4P markets    0.027              -0.047
                        [-0.022, 0.087]    [-0.833, 0.573]
                        (0.115)            (0.710)
Observations            1715               18

Note: The table shows point estimates for the direct and local saturation effects of P4P contracts, with confidence intervals in brackets and randomization inference p-values in parentheses. In the first column, the unit of analysis is the application and the outcome is the TTC score of the applicant. In the second column, the unit of analysis is the labor market and the outcome is the number of applications, in logs. All specifications control for the total number of adjacent markets.

Table C.2 shows results of this analysis for two key labor-market outcomes: applicant TTC scores analyzed at the application level, and the number of applications per labor market analyzed at the labor-market level. The direct effects of advertised P4P on each of these outcomes are presented for comparison and remain qualitatively unchanged relative to the estimates in Table C.1, which did not allow for saturation effects. Estimated saturation effects of neighboring P4P markets are modest in estimated size and statistically insignificant for both outcomes. This suggests that saturation effects were of limited consequence in our setting.

Appendix D Test-score constructs

Barlevy-Neal metric

At the core of our teacher evaluation metric is a measure of the learning gains that teachers bring about, measured by their students’ performance on assessments. (See Section 2 for a description of assessment procedures; throughout, we use students’ IRT-based predicted abilities to capture their learning outcomes in a given subject and round.) To address concerns over dysfunctional strategic behavior, our objective was to follow Barlevy and Neal’s pay-for-percentile scheme as closely as was practically possible (Barlevy and Neal, 2012, henceforth BN).

The logic behind the BN scheme is that it creates a series of ‘seeded tournaments’ that incentivize teachers to promote learning gains at all points in the student performance distribution. In short, a teacher expects to be rewarded equally for enabling a weak student to outperform his/her comparable peers as for enabling a strong student to outperform his/her comparable peers. Roughly speaking, the implemented BN scheme works as follows. Test all students in the district in each subject at the start of the year. Take student i in stream k for subject b at grade g and find that student’s percentile rank in the district-wide distribution of performance in that subject and grade at baseline. Call that percentile (or interval of percentiles if data is sparse) student i’s baseline bin. Re-test all students in each subject at the end of the year. Establish student i’s end-of-year percentile rank within the comparison set defined by his/her baseline bin. This metric constitutes student i’s contribution to the performance score of the teacher who taught that subject-stream-grade that school year. Repeat for all students in all subjects-streams-grades taught by that teacher in that school year, and take the average to give the BN performance metric at teacher level.

We adapt the student test score component of the BN scheme to allow for the fact that we observe only a sample of students in each round in each school-subject-stream-grade.
(This was done for budgetary reasons and is a plausible feature of the cost-effective implementation of such a scheme at scale, in an environment in which centrally administered standardized tests are not otherwise taken by all students in all subjects.) To avoid gaming behavior, and in particular the risk that teachers would distort effort toward those students sampled at baseline, we re-sampled (most) students across rounds, and informed teachers in advance that we would do so.

Specifically, we construct pseudo-baseline bins as follows. Students sampled for testing at the end of the year are allocated to district-wide comparison bins using empirical CDFs of start-of-year performance (of different students). To illustrate, suppose there are 20 baseline bins within a district, and that the best baseline student in a given school-stream-subject-grade is in the (top) bin 20. Then the best endline student in the same school-stream-subject-grade will be assigned to bin 20, and will be compared against all other endline students within the district who have also been placed in bin 20.

Footnote: In settings such as ours where the number of students is modest, there is a tradeoff in determining how wide to make the percentile bins. As these become very narrowly defined, they contain few students, and the potential for measurement error to add noise to the results increases. But larger bins make it harder for teachers to demonstrate learning gains in cases where their students start at the bottom of a bin. In practice, we use vigintiles of the district-subject distribution.

To guard against the possibility that schools might selectively withhold particular students selected for the exam, all test takers were drawn from beginning-of-year administrative registers of students in each round. Any student who did not take the test was assigned the minimum theoretically possible score. This feature of our design parallels similar measures to mitigate incentives for selective test-taking in Glewwe, Ilias and Kremer (2010).

Denote by $z_{ibkgdr}$ the IRT estimate of the ability of student i in subject b, stream k, grade g, district d, and round r. We can outline the resulting algorithm for producing the student learning component of the assessment score for rounds $r \in \{1, 2\}$ in the following steps:

1. Create baseline bins.
   • Separately for each subject and grade, form a within-district ranking of the students sampled at round $r-1$ on the basis of $z_{ibkgd,r-1}$. Use this ranking to place these round $r-1$ students into $B$ baseline bins.
   • For each subject-grade-school-stream within a school, calculate the empirical CDF of these baseline bins.

2. Place end-of-year students into pseudo-baseline bins.
   • Form a within subject-stream-grade-school percentile ranking of the students sampled at round r on the basis of $z_{ibkgdr}$. In practice, the number of sampled students varies for a given stream between baseline and endline, so we use percentile ranks rather than simple counts. Assign the lowest possible learning level to students who were sampled to take the test but did not do so.
   • Map percentile-ranked students at endline onto baseline bins through the empirical CDF of baseline bins. For example, if there are 20 bins and the best round 1 student in that subject-stream-grade-school was in the top bin, then the best round 2 student in that subject-stream-grade-school will be placed in pseudo-bin 20.

3. BN performance metric at student-subject level. Separately for each subject, grade, and district, form a within-pseudo-baseline bin ranking of the students sampled at round r on the basis of $z_{ibkgdr}$. This is the BN performance metric at student-subject level, which we denote by $\pi_{ibkgdr}$. It constitutes student i’s contribution to the performance score of the teacher who taught subject b stream k at grade g for school year r.

4. BN performance metric at teacher-level. For each teacher, compute the weighted average of the $\pi_{ibkgdr}$ for all the students in the subject-stream-grades that they taught in the round r school year. This is the BN performance metric at teacher-level. Weights $w_{ik}$ are given by the (inverse of the) probability that student i was sampled in stream k: the number of sampled students in that stream divided by the number of students enrolled in the same stream. Note these weights are determined by the number of students sampled for the test, not the number of students who actually took the test (which may be smaller), since our implementation of the BN metric includes, with the penalty described above, students who were sampled for but did not sit the test.

Footnote: There are 40 subject-grade-school streams (out of a total of 4,175) for which no baseline students were sampled. In such cases, we use the average of the CDFs for the same subject in other streams of the same school and grade (if available) or in the school as a whole to impute baseline learning distributions for performance award purposes.

To construct the BN performance metric at teacher-level for the second performance round, r = 2, we must deal with a further wrinkle, namely the fact that we did not sample students at the start of the year. We follow the same procedure as above except that at Step 2 we use the set of students who were sampled for and actually sat the round 1 endline exam, and can be linked to an enrollment status in a specific stream in round 2, to create the baseline bins and CDFs for that year.

Teacher value added

This section briefly summarizes how we construct the measure of teacher value added for the placed recruits, referred to at the end of Section 3.1.

We adapt the approach taken in prior literature, most notably Kane and Staiger (2008) and Bau and Das (2020). Denoting as in equations (3) and (4) the learning outcomes of student i in subject b, stream k of grade g, taught by teacher j in school s and round r by $z_{ibgjsr}$, we express the data-generating process as:

$$z_{ibgjsr} = \rho_{bgr} \bar{z}_{ks,r-1} + \mu_{bgr} + \lambda_s + \theta_j + \eta_{jr} + \varepsilon_{ibgjr} \qquad (12)$$

This adapts a standard TVA framework to use the full pseudo-panel of student learning measures. Our sampling strategy implies that most students are not observed in consecutive assessments, as discussed in Section 2.3. We proxy for students’ baseline abilities using the vector of means of lagged learning outcomes in all subjects, $\bar{z}_{ks,r-1}$, where the parameter $\rho_{bgr}$ allows these lagged mean outcomes to have distinct own- and cross-subject associations with subsequent learning for all subjects, grades, and rounds. In a manner similar to including means instead of fixed effects (Chamberlain, 1982; Mundlak, 1978), these baseline peer means block any association between teacher ability (value added) and the baseline learning status of sampled students.

Footnote: Our endline sampling frame covered all grades, streams, and subjects. In practice, out of 4,200 school-grade-stream-subjects in the P4P schools, we have data for a sample of students in all but five of these, which were missed in the examination.

In equation (12), the parameter $\theta_j$ is the time-invariant effect of teacher j: her value added. We allow for fixed effects by subject-grade-rounds, $\mu_{bgr}$, and schools, $\lambda_s$, estimating these within the model. We then form empirical Bayes estimates of TVA as follows.

1. Estimate the variance of the TVA, teacher-year, and student-level errors, $\theta_j$, $\eta_{jr}$, $\varepsilon_{ibgjr}$ respectively, from equation (12). Defining the sum of these errors as $v_{ibgjr} = \theta_j + \eta_{jr} + \varepsilon_{ibgjr}$: the last variance term can be directly estimated by the variance of student test scores around their teacher-year means, $\hat{\sigma}^2_\varepsilon = \mathrm{Var}(v_{ibgjr} - \bar{v}_{jr})$; the variance of TVA can be estimated from the covariance in teacher mean outcomes across years, $\hat{\sigma}^2_\theta = \mathrm{Cov}(\bar{v}_{jr}, \bar{v}_{j,r-1})$, where this covariance calculation is weighted by the number of students taught by each teacher; and the variance of teacher-year shocks can be estimated as the residual, $\hat{\sigma}^2_\eta = \mathrm{Var}(v_{ibgjr}) - \hat{\sigma}^2_\theta - \hat{\sigma}^2_\varepsilon$.

2. Form a weighted average of teacher-year residuals $\bar{v}_{jr}$ for each teacher.

3. Construct the empirical Bayes estimate of each teacher’s value added by multiplying this weighted average of classroom residuals, $\bar{v}_j$, by an estimate of its reliability:

$$\widehat{VA}_j = \bar{v}_j \left( \frac{\hat{\sigma}^2_\theta}{\mathrm{Var}(\bar{v}_j)} \right) \qquad (13)$$

where $\mathrm{Var}(\bar{v}_j) = \hat{\sigma}^2_\theta + \left( \sum_r h_{jr} \right)^{-1}$, with $h_{jr} = \mathrm{Var}(\bar{v}_{jr} \mid \theta_j)^{-1} = \left( \hat{\sigma}^2_\eta + \frac{\hat{\sigma}^2_\varepsilon}{n_{jr}} \right)^{-1}$.

Following this procedure, we obtain a distribution of (empirical Bayes estimates of) teacher value added for placed recruits who applied under advertised FW. The Round 2 point estimate from the student learning model in Equation (3) would raise a teacher from the 50th to above the 76th percentile in this distribution. Figure 4 plots the distributions of (empirical Bayes estimates of) $\theta_j + \eta_{jr}$ separately for r = 1, 2, and for recruits applying under advertised FW and advertised P4P.

It is of interest to know whether the measures of teacher ability and intrinsic motivation that we use in Section 3.1 are predictive of TVA. This is undertaken in Table D.1, where TVA is the estimate obtained pooling across rounds and treatments. Interestingly, the measure of teacher ability that we observe among recruits at baseline, Grading Task IRT score, is positively correlated with TVA (rank correlation of 0 . p -value of 0 . . p -value of 0 .

Footnote: We obtain qualitatively similar results for the FW sub-sample, where TVA cannot be impacted by treatment with P4P.
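The shrinkage step in equation (13) can be sketched in a few lines of Python. This is an illustrative sketch rather than the estimation code used for the paper: the function name `eb_value_added` and all numerical inputs are hypothetical, and the weighted average in step 2 is assumed here to use the precision weights $h_{jr}$, which the text does not state explicitly.

```python
def eb_value_added(year_means, n_students, var_theta, var_eta, var_eps):
    """Empirical Bayes value added for one teacher, following equation (13).

    year_means: teacher-year mean residuals, one per round
    n_students: number of students behind each teacher-year mean
    var_theta, var_eta, var_eps: estimated variance components
    """
    # Precision of each teacher-year mean: h_jr = (var_eta + var_eps / n_jr)^(-1)
    h = [1.0 / (var_eta + var_eps / n) for n in n_students]

    # ASSUMPTION: teacher-year means are combined using precision weights h_jr.
    v_bar = sum(w * v for w, v in zip(h, year_means)) / sum(h)

    # Var(v_bar_j) = var_theta + (sum_r h_jr)^(-1)
    var_v_bar = var_theta + 1.0 / sum(h)

    # Reliability is the share of Var(v_bar_j) attributable to persistent value added.
    reliability = var_theta / var_v_bar
    return v_bar * reliability, reliability


# Hypothetical inputs, for illustration only.
va, reliability = eb_value_added(
    year_means=[0.2, 0.4],
    n_students=[10, 10],
    var_theta=0.04,
    var_eta=0.01,
    var_eps=0.9,
)
print(round(va, 4), round(reliability, 4))
```

Because the reliability ratio lies between zero and one, noisier teacher means (fewer students, or fewer rounds) are pulled more strongly toward zero.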

Table D.1: Rank correlation between TVA estimates, TTC scores, Grading Task IRT scores, and Dictator Game behavior among new recruits

                 TVA        TTC score   Grading task
TTC score        -0.087     .           .
                 (0.178)
Grading task     0.132      0.150       .
                 (0.039)    (0.029)
DG share sent    -0.078     0.062       -0.047
                 (0.203)    (0.349)     (0.468)

Note: The table provides rank correlations and associated p-values (in parentheses) for relationships between recruits’ teacher value added and various measures of skill and motivation: TTC final exam scores, baseline Grading Task IRT scores, and baseline Dictator Game share sent. We obtain the empirical Bayes estimate of TVA from $\theta_j$ estimated in the school fixed-effects model in equation (12).

Appendix E Communication about the intervention

Promotion to potential applicants

The subsections below give details of the (translated) promotional materials that were used in November and December 2015.

Leaﬂets and posters in district oﬃces

A help desk was set up in every District Education Office. Staffers explained the advertised contracts to individuals interested in applying, and distributed the leaflet shown in Figure E.1, and stickers. Permanent posters, like the example shown in Figure E.2, further summarised the programme. Staffers kept records of the number of visitors and most frequent questions, and reported back to head office.

Radio Ads

Radio ads were broadcast on Radio Rwanda, the national public broadcaster, during November/December 2015 to promote awareness of the intervention. The scripts below were developed in partnership with a local advertising agency.

Radio script 1

SFX: Noise of busy environment like a trading centre

FVO: Hey, have you seen how good Gasasira’s children look? [This is a cultural reference implying that teachers are smart, respected individuals and nothing literal about how the child looks.]

MVO: Yeah! That’s not surprising though, their parents are teachers.

FVO: Hahahahah... [Sarcastic laugh as if to say, what is so great about that.]

MVO: Don’t laugh... haven’t you heard about the new programme in the district to recognize and reward good teachers? I wouldn’t be surprised if Gasasira was amongst those that have been recognised.

ANNOUNCER: Innovations for Poverty Action, in collaboration with REB and MINEDUC, is running the STARS program in the districts Kayonza, Ngoma, Rwamagana, Kirehe, Gatsibo, and Nyagatare for the 2016 academic year. Some new teachers applying to these districts will be eligible for STARS, which rewards the hardest working, most prepared and best performing teachers. Eligible districts are still being finalized. Keep an eye out for further announcements!

Radio script 2

SFX: Sound of a street with traﬃc and cars hooting

VO1: Mari, hey Mariko!.... What’s the rush, is everything OK?

VO2: Oh yes, everything is fine. I am rushing to apply for a job and don’t want to find all the places taken.

VO1: Oh that’s good. And you studied to be a teacher right?

VO2: Exactly! Now I am going to submit my papers at the District Office and hope I get lucky on this new programme that will be recognizing good teachers!

ANNOUNCER: Innovations for Poverty Action, in collaboration with REB and MINEDUC, is running the STARS program in the districts Kayonza, Ngoma, Rwamagana, Kirehe, Gatsibo, and Nyagatare for the 2016 academic year. Some new teachers applying to these districts will be eligible for STARS, which rewards the hardest working, most prepared and best performing teachers. Eligible districts are still being finalized. Keep an eye out for further announcements!

Footnote: The respective number of visitors were: Gatsibo 305, Kayonza 241, Kirehe 411, Ngoma 320, Nyagatare 350, and Rwamagana 447.

Figure E.1: Leaflet advertising treatments

Figure E.2: Poster explaining the programme

Radio script 3

SFX: Calm peaceful environment

VO1: Yes honestly, Kalisa is a very good teacher!

VO2: You are right, ever since he started teaching my son, the boy now understands maths!

VO1: Yes and because of him other parents want to take their children to his school.

VO2: Aaah!... That must be why he was selected for the programme that rewards good teachers.

VO1: He definitely deserves it, he is an excellent teacher.

ANNOUNCER: Innovations for Poverty Action, in collaboration with REB and MINEDUC, is running the STARS program in the districts Kayonza, Ngoma, Rwamagana, Kirehe, Gatsibo, and Nyagatare for the 2016 academic year. Some new teachers applying to these districts will be eligible for STARS, which rewards the hardest working, most prepared and best performing teachers. Eligible districts are still being finalized. Keep an eye out for further announcements!

Brieﬁng in P4P schools

The subsections below provide extracts of the (translated) script that was used during briefing sessions with teachers in P4P schools in April 2016. The main purpose of these sessions was to explain the intervention and maximise understanding of the new contract.

Introduction

[Facilitator speaks.] You have been selected to participate in a pilot program that Rwanda Education Board (REB) and Innovations for Poverty Action (IPA) are undertaking together on paid incentives and teacher performance. As a participant in this study, you will be eligible to receive a competitive bonus based on your performance in the study. The top 20 percent of teachers in participating schools in your district will receive this bonus. All participants will be considered for this paid bonus. It is important to note that your employment status will not be affected by your participation in this study. It will not affect whether you keep your job, receive a promotion, etc.

You will be evaluated on four different categories:

1. Presence, which we will measure through whether you are present in school on days when we visit;
2. Preparation, which we will measure through lesson planning;
3. Pedagogy, which we will measure through teacher observation; and
4. Performance, which we will measure through student learning assessments.

You will receive additional information on each of these categories throughout this training.

In your evaluation, the first three categories (presence, preparation, and pedagogy) will contribute equally to your ‘inputs’ score. This will be averaged with your ‘performance’ score (based on student learning assessments) which will therefore be worth half of your overall score. [Teachers are then provided with a visual aid.]

The SEO will now tell you how we are going to measure each of these components of your performance. Before I do so, are there any questions?

Presence: Teacher attendance score

[SEO now speaks.] I will now explain to you the first component of your performance score: Teacher Presence. During this pilot program, I will visit your school approximately one time per term. Sometimes I will come twice or more; you will not know in advance how many times I plan to visit in any term. These visits to your school will be unannounced. Neither your Head Teacher nor you will know in advance when I plan to visit your school. I will arrive approximately at the start of the school day. Teachers who are present at that time will be marked ‘present’; those that are not will be marked ‘late’ or ‘absent’. The type of absence will be recorded. Teachers who have excused reasons for not being present in school will be marked ‘excused’. These reasons include paid leaves of absence, official trainings, and sick leaves that have been granted in advance by the Head Teacher. If you are not present because you feel unwell but have not received advance permission from the Head Teacher, you will be marked as absent.

It is in your best interest to be present every day, or in the case of emergency, notify the head teacher of your absence with an appropriate excuse before the beginning of classes. I will also record what time you arrive to school. You will be marked for arriving on time and arriving late to work. It is in your best interest to arrive on time to school every day.

Preparation: Lesson planning score

Later in this session, you’ll be shown how to use a lesson planning form. Lesson planning is a tool to help you improve both your organization and teaching skills. The lesson planning form will help you to include the following components into your lesson:

• A clear lesson objective to guide the lesson.
• Purposeful teaching activities that help students learn the skill.
• Strong assessment opportunities or exercise to assess students’ understanding of the skill.

This lesson planning form consists of three categories: lesson objective, teaching activities, assessment/exercises. You will be evaluated on these three categories. I will not evaluate your lesson plans. Instead, I will collect your lesson planning forms at the end of the study. An IPA education specialist will review your lesson plans and score them. The specialist will compare your lesson plans to other teachers’ plans in the district. Please be aware that these lesson plans will only be used for this study and will not be reviewed by any MINEDUC officials. The specialist will use the following scoring scale, with 0 being the lowest score and 3 being the highest score. [Teachers are then provided with a visual aid.]

You will be responsible for filling out the lesson planning form to be eligible for the paid bonus. You will fill out a lesson plan for each day and each subject you teach. You will fill out the lesson planning form in addition to your MINEDUC lesson journal. Later in this session, you will have a chance to practice using the lesson planning form. You will also see examples of strong and weak lesson plans to help you understand our expectations.

Pedagogy: Teacher observation score

The third component that will affect your eligibility for the paid bonus is your observation score. I will observe your classrooms during the next few weeks at least once, and again next term. I will score your lesson in comparison to other teachers in your district using a rubric. During the observation, I will record all the activities and teaching strategies you use in your lesson. At the end of your lesson, I will use my notes to evaluate your performance in the following four categories:

• Lesson objective, does your lesson have a clear objective?;
• Teaching activities, does your lesson include activities that will help students learn the lesson?;
• Assessment and exercises, does your lesson include exercises for students to practice the skill?; and
• Student engagement, are students engaged during the lesson and activities?

I will use a scoring rubric designed by IPA, Georgetown, and Oxford University to evaluate your performance in each category. You will receive a score from 0 (unsatisfactory) to 3 (exemplary) in each category. I will observe your entire lesson, from beginning to end. I will then evaluate your performance based on the observation. You will not know when I am coming to observe your lesson, so it is in your best interest to plan your lessons every day as if I were coming to observe. After the lesson, I will share your results with the Head Teacher. You will be able to obtain a copy of your scores, together with an explanation, from the Head Teacher.

Performance: Student test scores

[Field supervisor now speaks.] Half of your overall evaluation will be determined by the learning achievements of your students. We have devised a system to make sure that all teachers compete on a level playing field. If students in your school are not as well off as students in other schools, you do not have to worry: we are rewarding teachers for how much their students can improve, not for where they start.

Here is how this works. We randomly selected a sample of your students to take a cumulative test, testing their knowledge of grade level content. These tests were designed based on the curriculum, to allow us to measure the learning of students for each subject separately. The performance of each teacher will be measured by the learning outcomes of students in the subjects and streams that they themselves teach. (So, if you are a P4 Maths Teacher, your performance will not be affected by students’ scores in P4 English. And if you teach P4 Maths for Stream A but not Stream B, your performance measure will not depend on students’ scores in Stream B.) We will compare the marks for this test with those from other students in the same district, and place each student into one of ten groups, with Group 1 being the best performing, Group 2 being the next-best performing, and so on, down to Group 10. In the district as a whole, there are equal numbers of students placed in each of these groups, but some of your students may be in the same group, and there may be some groups in which you do not have any students at all.

At the end of this school year, we will return to your school and we will sample 10 new pupils from every stream in Upper Primary school to take a new test. This will be a random sample. We do not know in advance who will be drawn, and students who participated in the initial assessment have the same chance of appearing in the end-of-year sample as anyone else. We will draw students for this assessment based on the student enrollment register. If any student from that register is asked to participate in the test but is no longer enrolled at the school, they will receive a score of zero. So, you should do your best to encourage students to remain enrolled and to participate in the assessment if asked. Once the new sample has taken the assessment, we will sort them into groups, with the best-performing student from the final assessment being placed into the group that was determined by the best-performing student in the initial assessment. The second-best student from the final assessment will be placed into the second-highest group achieved from the initial assessment, and so on, until all students have been placed into groups. We will then compare your students’ learning levels with the learning levels of other students in the same group only. Each of your students will receive a rank, with 1 being the best, 2 being the next, and so on, within their group. (This means there will be a 1st-ranked student in Group 1, and another student ranked first in Group 2, and so on.) The measure of your performance that we will use for your score is the average of these within-group ranks of the students whom you teach.

This all means that you do not have to have the highest performing students in the District in order to be ranked well. It is possible to be evaluated very well even if, for example, all of your students are in Group 10, the lowest-performing group: what matters is how they perform relative to other students at the same starting point. I will now demonstrate how this works with some examples. Please feel free to ask questions as we go along.

Worked example 1

[Field supervisor sets up Student Test Scores Poster and uses the Student Test Scores Figures to explain this example step by step.] Let us see how the learning outcomes score works with a first example.
For this example, suppose that we were to sample 5 students from your class in both the beginning-of-year and end-of-year assessments. (In reality there will be at least ten, but this is to make the explanation easier.) Now, suppose in the initial assessment, we drew 5 students. And those students’ scores on the assessment might mean that they are placed as follows:

• One student in Group 1 (top);
• One student in Group 3;
• One student in Group 6;
• One student in Group 9; and
• One student in Group 10.

Then, at the end of the school year, we will return and we will ask 5 new students to sit for a different assessment. These are unlikely to be the same students as before. Once they have taken the test, we will rank them, and we will put the best-performing of the new students into Group 1, the next-best-performing of the new students into Group 3, the next-best performing of the new students into Group 6, then Group 9, and Group 10. So, the Groups into which the new students are placed are determined by the scores of the original students.

Finally, we will compare the actual scores of the new students to the other new students from schools in this district who have been placed into the same groups. For example:

• The new student placed into Group 1 might be ranked 1st within that group;
• The new student placed into Group 3 might be ranked 7th within that group;
• The new student placed into Group 6 might be ranked 4th within that group;
• The new student placed in Group 9 might also be ranked 4th within her group;
• The new student placed into Group 10 might be ranked 1st within his group.

Then, we add up these ranks to determine your score: in this case, it is 1 + 7 + 4 + 4 + 1 = 17. That is pretty good! Remember, the lower the sum of these ranks, the better. And notice that even though the student in Group 10 did not have a very high score compared to everyone else in the district, he really helped your performance measure by doing very well within his group.
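The arithmetic of this worked example can be reproduced with a short Python sketch. It is an illustration only, not part of the briefing script: the function name `assign_groups` and the end-of-year test scores are hypothetical, the within-group ranks are taken as given from the example rather than computed from district data, and the sketch assumes the same number of students is sampled at the start and end of the year.

```python
def assign_groups(baseline_groups, endline_scores):
    """Place end-of-year students into the groups earned by start-of-year students.

    The best end-of-year student takes the best baseline group (lowest number),
    the second-best takes the next-best baseline group, and so on.
    """
    if len(baseline_groups) != len(endline_scores):
        raise ValueError("this sketch assumes equal sample sizes in both rounds")
    ordered_groups = sorted(baseline_groups)  # Group 1 is the top group
    by_score = sorted(
        range(len(endline_scores)),
        key=lambda i: endline_scores[i],
        reverse=True,
    )
    assigned = [None] * len(endline_scores)
    for group, student in zip(ordered_groups, by_score):
        assigned[student] = group
    return assigned


# Groups of the five students sampled at the start of the year (from the example).
baseline_groups = [1, 3, 6, 9, 10]

# Hypothetical end-of-year scores for five newly sampled students.
endline_scores = [55, 80, 30, 62, 41]
assigned = assign_groups(baseline_groups, endline_scores)

# Within-group ranks as stated in the example, keyed by group.
within_group_ranks = {1: 1, 3: 7, 6: 4, 9: 4, 10: 1}
teacher_score = sum(within_group_ranks[g] for g in assigned)
print(assigned, teacher_score)
```

The group a new student lands in depends only on that student's position within the class and on the baseline groups, so a class that starts in low groups is compared only with similar starting points elsewhere in the district.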

Worked example 2

Now, let us try a second example. Again let us suppose that we were to sample 5 students from your class in both the beginning-of-year and end-of-year assessments. (Remember: in reality there will be at least ten, but this is to make the explanation easier.) Now, suppose in the initial assessment, we drew 5 students. And those students’ scores on the assessment might mean that they are placed as follows:

• One student in Group 1 (top);
• TWO students in Group 3;
• One student in Group 4; and
• One student in Group 5.

Notice that it is possible for two or more of your students to be in the same group. Then, at the end of the school year, we will return and we will ask 5 new students to sit for a different assessment. Again, these are unlikely to be the same students as before. Now, suppose that one out of the five students that we ask for has dropped out of school, or fails to appear for the test. They will still be counted, but their exam will be scored as if they answered zero questions correctly, which is the worst possible score. Once they have taken the test, we will rank them, and we will put the best-performing of the new students into Group 1, the two next-best-performing of the new students into Group 3, the next-best-performing of the new students into Group 4. The student who was not present for the test because they had dropped out of school is placed into Group 5. As in the previous example, notice that the groups into which the new students are placed are determined by the scores of the original students.

Finally, we will compare the actual scores of the new students to the other new students from schools in this district who have been placed into the same groups. For example:

• The new student placed into Group 1 might be ranked 1st within that group;
• The new students placed into Group 3 might be ranked 4th and 7th in that group;
• The new student placed into Group 4 might be ranked 8th within that group;
• The new student placed into Group 5, who did not actually take the test, will be placed last in his group. If there are 40 students in the group from across the whole district, then this would mean that his rank in that group is 40th.

Then, we add up these ranks to determine your score: in this case, it is 1 + 4 + 7 + 8 + 40 = 60.

Notice three points. First, even though in this example, your students did better on the initial assessment than in the first example, this does not mean that you scored better overall.
All groups are counted equally, so that no school or teacher will be disadvantaged in this process. Second, notice that the student who dropped out was ranked worst out of the group to which he was assigned. Since the lowest-performing student in the initial assessment was in Group 5, the student who had dropped out was compared with other students placed into Group 5. Since he received the worst possible score, he was ranked last (in this case, fortieth) within that group. This was bad for the teacher’s overall performance rank. Third, teachers will be evaluated based on the same number of students. So even if a teacher teaches in several streams, resulting in more students taking the tests, his final score will be based on a random subsample of students, such that all teachers are evaluated on the same number of students.
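The second worked example adds two features: repeated groups and a student who is absent at the end of the year. A minimal Python sketch of that case follows. The names `place_students` and `rank_in_group`, the use of `None` to mark an absent student, all test scores, and the rule that tied peers share the better rank are assumptions made for illustration; only the group list, the last-place rule for absentees, and the ranks 1, 4, 7, 8, 40 come from the example.

```python
def place_students(baseline_groups, endline_scores):
    """Place new students into baseline groups; absent students (None) score zero."""
    scores = [0 if s is None else s for s in endline_scores]
    groups = sorted(baseline_groups)  # repeated groups are allowed, e.g. [1, 3, 3, 4, 5]
    best_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    placement = [None] * len(scores)
    for position, student in enumerate(best_first):
        placement[student] = groups[position]
    return placement


def rank_in_group(own_score, peer_scores, absent=False):
    """Rank among district students placed into the same group (1 is best).

    peer_scores excludes the student's own score. An absent student is
    ranked last, as the example states. Tie handling is an assumption.
    """
    if absent:
        return len(peer_scores) + 1
    return 1 + sum(1 for s in peer_scores if s > own_score)


# Groups from the initial assessment in the second example.
baseline_groups = [1, 3, 3, 4, 5]
# Hypothetical end-of-year scores; the second sampled student did not appear.
endline_scores = [70, None, 52, 61, 45]
placement = place_students(baseline_groups, endline_scores)

# Hypothetical Group 5 with 39 other district students, so 40 in total.
group5_peers = list(range(1, 40))
absent_rank = rank_in_group(0, group5_peers, absent=True)

# Sum of the within-group ranks quoted in the example.
score = sum([1, 4, 7, 8, absent_rank])
```

Because the absent student is scored as zero, he is sorted last among the five and therefore lands in the lowest baseline group, which is Group 5 here.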
