Long-Lived Consequences of Rapid Scale-Up? The Case of Free Primary Education in Six Sub-Saharan African Countries

Across six Sub-Saharan African countries, grade 4 students of teachers who were hired after a free primary education reform perform worse, on average, on language and math tests (statistically significantly so in language) than students of teachers who were hired before the reform. Teachers who were hired just after the reform also perform worse, on average, on tests of subject content knowledge than those hired before the reform. The results are sensitive to the time frames considered in the analysis, and aggregate results mask substantial variation across countries: gaps are large and significant in some countries but negligible in others. Analysis of teacher demographic and education characteristics (including education level and teacher certification), as well as teacher classroom-level behaviors, reveals few systematic differences associated with being hired pre- or post-reform.


Introduction
In the late 1990s to early 2000s, Sub-Saharan Africa experienced a surge in primary school enrollments linked to a movement to make schooling at that level free, a reform dubbed Free Primary Education (FPE). FPE has been linked to a number of positive impacts: more children enrolling in and completing schooling, and positive impacts on health, fertility, and other non-education outcomes. But the rapid scale-up, sometimes within education systems that were not prepared for the personnel and infrastructure demands of the surge in students, has led many to comment on the potential harm to the average quality of education delivered after scale-up. In particular, there were many reports of overcrowding (i.e., very high student-teacher ratios) and of a lowering of teacher qualifications to address that overcrowding in the short run after the implementation of FPE in many countries.
This study explores the narrow question of whether the impacts of this rapid increase in the number of teachers associated with FPE can be felt "now" (at the time of the survey), years after the reform, in classrooms in terms of lower student learning outcomes and worse teacher "quality." While the question has an interesting historical dimension, what makes it even more important and relevant is that at least two rapid scale-up efforts are currently being advocated for: universal access to early childhood education, and universal free secondary education. Understanding the consequences of the FPE effort can shed light on how to think about these new efforts.[1]
The data used to analyze this question are from six Sub-Saharan African countries (Kenya, Madagascar, Mozambique, Tanzania, Togo, and Uganda) and include detailed student and teacher data collected as a part of the Service Delivery Indicators project. The analysis compares the learning outcomes of students at the time of the survey (which is, depending on the country, between 5 (Togo) and 16 (Uganda) years after FPE was implemented) whose teachers were hired just before versus just after the policy. It also compares the characteristics of teachers at the time of the survey, again contrasting those who were hired just before and just after the policy was implemented.
The main finding of the analysis is that, on average, the students of teachers who were hired after FPE do appear to perform worse on the tests administered, and statistically significantly worse in language. Teachers who were hired just after the reform also perform worse, on average, on tests of subject content knowledge than those hired before. However, these results are not always statistically significantly different from zero (depending, for example, on the number of years surrounding the FPE policy one includes in the analysis), and they vary substantially across countries. Analyzing teacher demographic and education characteristics (including education degrees and teacher certification), as well as teacher classroom-level behaviors, reveals few systematic differences associated with being hired pre- versus post-FPE. So, while there is some evidence that rapid scale-up led to lower student learning outcomes, further analysis (and perhaps additional data) is needed to provide confidence in the robustness of the findings and to understand potential mechanisms.

Background and selected literature review
The free primary education movement had its roots in the gap between outcomes and aspirations. In the early 1990s, as primary school enrollment rates were stagnating in many countries, the international community made high-profile commitments to increase them, such as the "World Declaration on Education for All" made following the Jomtien conference in 1990 (UNESCO 1990); the "Dakar Framework for Action" in 2000 following the Dakar Forum on Education for All (UNESCO 2000); and the adoption of the Millennium Development Goals, which grew out of the United Nations' Millennium Declaration in 2000 (United Nations 2000). Many countries, particularly in Sub-Saharan Africa, implemented a Free Primary Education (FPE) reform in order to increase access and completion (Sifuna 2016). These decisions were partly influenced by research identifying school fees as a barrier to enrollment, particularly among the poor (e.g., Bentaouet Kattan and Burnett 2004; Watkins 2000).
In many respects, the changes in outcomes that directly followed this policy were remarkable. The most salient of these is that the primary net enrollment rate across Sub-Saharan Africa increased from 61 to 90 percent between 1990 and 2010, whereas it had taken currently high-income countries 60 years to make a similar improvement (1875 to 1935; World Bank 2018, figure 2.1). This pattern of rapid improvement has been observed in a number of countries (for example, see Avenstrup, Liang, and Nellemann 2004 for Kenya, Lesotho, Malawi, and Uganda).
Studies in selected countries have documented the explicit link between the policy change and increases in access, enrollment, and grade attainment.[2] Uganda was one of the first countries to implement FPE, and early studies established the role it played in lowering age at school entry (Grogan 2009), increasing enrollment (Deininger 2003), and increasing grade attainment (Nishimura, Tamano, and Sasaoka 2008). Analyses of more recent episodes find similar impacts, including in Lesotho (Moshoeshoe, Ardington and Piraino 2019), Kenya (Lucas and Mbiti 2012a), and Tanzania (Hoogeveen and Rossi 2013). In Tanzania impacts were largest for children from poorer families and for girls, while in Kenya impacts on attainment were also largest for children from poorer backgrounds (Lucas and Mbiti 2012a) but the gender difference favored boys (Lucas and Mbiti 2012b).[3]
In addition to analysis of education sector impacts, the "shock" of FPE has also been used to study the impact of education on desired and realized fertility. For example, one study shows (using a regression discontinuity approach) that FPE reduced women's reported desired number of children in Ethiopia, Malawi, and Uganda (Behrman 2015). The study also argues that it was not just the education of the women that mattered, but also that of their partners. Other studies have used similar methods, or other approaches that exploit differences in the timing and geographic deployment of FPE (using difference-in-differences approaches), to document impacts on realized fertility. For example, research on Ethiopia (Chicoine 2021) highlights the complementary role of women's access to markets; and evidence from Ghana (Boahen and Yamauchi 2018), Malawi (Behrman 2015), Nigeria (Okonkwo Osili and Long 2008), Uganda
[2] Also see the discussion in Evans and Acosta (2021).
[3] There is a literature on the political economy of the conceptualization and implementation of FPE. This is beyond the scope of this paper but interested readers could see: Bennel (2021)
At the same time as it has been shown to have these positive effects, FPE has also been cited as a potential contributor to a worsening of the average quality of education delivered in Sub-Saharan Africa (see, for example, the discussion in Wedgwood (2007) for the case of Tanzania).
Many observers have pointed out the low average quality of learning outcomes in Sub-Saharan Africa (e.g., Jones and others 2014; Spaull and Taylor 2015; World Bank 2018), which may be linked to the quality of education service delivery (Bold and others 2017; World Bank 2018). The question then becomes whether the implementation of FPE led to a deterioration in quality. The difficulty of recruiting a large number of skilled teachers, and of building the required physical infrastructure, to match the expected increases in student enrollments was not unforeseen. Munene (2016) includes a number of case study examples in which this was a first-order challenge in the lead-up to, and during the implementation of, FPE in several countries.
Several studies have documented reductions in quality associated with FPE. For example, using data from the Southern and Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ) and school census data, Valente (2019) finds that after fees were removed in Tanzania, student-teacher ratios increased and observed teacher training, experience, and subject-specific knowledge declined. However, the analysis does not find that student test scores declined in a statistically significant way in response to the fee removal. Also using SACMEQ, Atuhurra (2016) finds that FPE in Kenya was associated with large achievement declines (in public schools) and argues that these are linked to lower teacher effort and disengagement of communities. Bold and others (2011) and Bold, Kimenyi, and Sandefur (2013) find that the decline in average quality in public Kenyan schools is mostly the result of selection (with higher socio-economic-status, and potentially higher-achieving, students switching to private schools after FPE) and not a decline in value-added.[5] It is unclear whether this pattern of switching to private schools occurred similarly in other countries where the number of private schools is often much lower than in Kenya.
While not linking changes to FPE per se, Taylor and Spaull (2015) analyze the changes in "access to learning" between 2000 and 2007 in 10 African countries using SACMEQ data on test scores and household surveys (e.g., Demographic and Health Surveys, DHS) on enrollment. The study defines "access to literacy" as the product of the grade 6 completion rate and the proportion of grade 6 students who reach a basic level of literacy, with a corresponding measure of "access to numeracy." The analysis finds that access to learning increased over this period in all the countries studied, despite the fact that the proportion of students who reached the basic literacy/numeracy threshold actually fell in 3 of the countries.
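The "access to literacy" measure described above is a simple product, and a small sketch can show how it can rise even when the share of enrolled students reaching the threshold falls. The function name and all input values below are hypothetical illustrations, not figures from Taylor and Spaull (2015).

```python
def access_to_learning(completion_rate: float, share_reaching_basic: float) -> float:
    """Product of the grade 6 completion rate and the share of grade 6
    students reaching the basic threshold (both expressed as fractions)."""
    for value in (completion_rate, share_reaching_basic):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must be fractions between 0 and 1")
    return completion_rate * share_reaching_basic


# Hypothetical numbers: completion rises from 50% to 70% while the share of
# grade 6 students reaching basic literacy falls from 80% to 70%.
before = access_to_learning(completion_rate=0.50, share_reaching_basic=0.80)
after = access_to_learning(completion_rate=0.70, share_reaching_basic=0.70)
```

In this made-up case the measure rises from 0.40 to 0.49, because the gain in completion outweighs the fall in the share reaching the threshold.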
Le Nestour, Moscovitz, and Sandefur (2022) use data from the DHS and Multiple Indicator Cluster Surveys (MICS) to estimate literacy rates for birth cohorts ranging from the 1950s to the 1990s. The analysis shows that while average literacy has increased in all regions, including in Sub-Saharan Africa, "education quality" (defined as the expected literacy acquired after 5 years of primary schooling in a country at a particular time) has not, and in particular has declined in Sub-Saharan Africa. Using an interrupted-time-series approach, they show that education quality generally declined after FPE reforms were introduced (or more precisely, they show that average declines in school quality accelerated after the reforms).
Notwithstanding the results from Kenya discussed above, the lack of preparation and the need to rapidly scale up teacher recruitment have often been blamed for the potentially negative impacts of FPE (see the case study discussions in Munene 2016). It is against this backdrop that we turn to the focus of this new analysis, namely whether the effects of this rapid scale-up can still be felt in the classroom "today," years after the initial reform period.

Data, empirical approach, and trends in teacher recruitment

Data
The data used here are from nationally representative surveys of public schools in six Sub-Saharan African countries collected as a part of the Service Delivery Indicators (SDI) project. This project (launched in 2010 as a collaboration between the World Bank and the African Economic Research Consortium, and later joined by the William and Flora Hewlett Foundation and the African Development Bank) collected detailed data on teachers, including recruitment date and subject content knowledge, and also administered an assessment of basic literacy and numeracy to grade 4 students. Details on this effort are discussed in Gatti and others (2021). In order to be able to systematically match teachers to student test scores, we focus here only on grade 4 teachers.
These data were collected between 2012 and 2016 in Kenya, Madagascar, Mozambique, Tanzania, Togo, and Uganda (Table 1). The number of students assessed in each country ranges from 1,744 (Mozambique) to 4,236 (Tanzania), and the number of teachers assessed ranges from 310 (Mozambique) to 1,327 (Tanzania). Across these countries, FPE was launched as early as 1997 (Uganda) and as late as 2008 (Togo). The gap in the number of years between the reform and data collection ranges from 5 (Togo) to 16 (Uganda), with a mean of 11.2 and a median of 11.5.[6]
In each country, representative surveys of between 198 (Togo) and 472 (Madagascar) schools were implemented using a multistage cluster-sampling design. Primary schools with at least one fourth-grade class formed the sampling frame.[7] In general, in each school, 10 students were sampled from a randomly selected grade 4 classroom. In addition, the students' current (i.e., at the time of the survey) language and mathematics teachers were selected for testing.[8] However, the exact target number of students and teachers varied across countries.
[6] There is sometimes some uncertainty as to the exact date of the FPE reform as the announcement does not always match the implementation. The FPE years used in this paper are derived from the following sources: Kenya
The student test was designed as a one-on-one evaluation, with enumerators reading instructions aloud to students in their mother tongue. This was done to help ensure that, for example, the results of the mathematics test did not depend on a student's mastery of reading. The language test, which evaluated ability in English (Kenya, Tanzania, and Uganda), French (Togo), or Portuguese (Mozambique), included items ranging from simple tasks that tested letter and word recognition to a more challenging reading comprehension test.[9] The mathematics test items ranged in difficulty from recognizing and ordering numbers, to the addition of one- to three-digit numbers, to the subtraction of one- and two-digit numbers, and to the multiplication and division of single-digit numbers. In both language and mathematics, the tests spanned items from the first four years of the curriculum.[10] The student tests have good reliability, with a reliability ratio (estimated by Cronbach's alpha) above 0.8 in both subjects.[11] This analysis uses Item Response Theory (IRT) derived ability scores for each of the assessments, and the scores are normalized to have a mean of 0 and standard deviation of 1 across all grade 4 students in these six countries.[12] Summary statistics for these and other variables analyzed in this paper (for data on teachers, and students of teachers, who were hired within a seven-year span prior to FPE) are reported in Appendix Tables 1 to 3.
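The normalization step described above (mean 0 and standard deviation 1 across the pooled grade 4 sample) can be sketched as follows. This is a minimal illustration only: the scores are hypothetical, the calculation is unweighted, and the use of the population standard deviation is an assumption, since the text does not say whether survey weights or a sample standard deviation were used.

```python
from statistics import fmean, pstdev


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1 over the sample given."""
    mean = fmean(scores)
    sd = pstdev(scores)  # assumption: population SD, no survey weights
    if sd == 0:
        raise ValueError("scores have no variation")
    return [(s - mean) / sd for s in scores]


# Hypothetical IRT ability scores pooled across countries.
irt_scores = [-1.3, -0.4, 0.1, 0.6, 2.0]
z_scores = standardize(irt_scores)
```

Because the rescaling is done over the pooled sample, a coefficient of 0.2 in the later regressions is read as 0.2 of a pooled standard deviation.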
To assess teachers' subject content knowledge, teachers were asked to mark (or "grade") mock student tests in language and in mathematics. The main reason for using this approach is that it is consistent with teachers' regular teaching activities, namely assessing student work. Both the language and mathematics assessments for teachers included test items starting at grade 1 level (simple spelling or grammar exercises, addition and subtraction) and extended to items at the upper primary level (Cloze passages to assess vocabulary and reading comprehension, interpretation of information in a diagram and/or a graph, and a more advanced math story problem).[13] The teacher tests have good reliability, with a reliability ratio (estimated by Cronbach's alpha) above 0.85 in both subjects. Like the student tests, the teacher test scores used here are based on IRT-derived ability scores for each of the assessments, and they are normalized to have a mean of 0 and standard deviation of 1 across all grade 4 teachers in these six countries.
[8] In some of these countries, teachers in other grades were also assessed, but these are excluded from the analysis here.
[9] In Tanzania some students were tested in Swahili and some in English. We use either in the analysis, but include a dummy variable equal to one in the empirical specifications if the language was Swahili.
[10] The teacher and student subject tests were designed by experts in international pedagogy and validated against 13 Sub-Saharan African primary curricula (Botswana, Ethiopia, The Gambia, Kenya, Madagascar, Mauritius, Namibia, Nigeria, Rwanda, the Seychelles, South Africa, Tanzania, and Uganda). See Johnson, Cunningham, and Dowling (2012) for details. A few items in the tests also measured grade 5 knowledge.
[11] The reliability ratio is the square of the correlation between the measured test score and the underlying metric; Cronbach's alpha is an estimate of it. A value of 1 would indicate that the test is a perfect measure of the underlying metric (though not necessarily of student/teacher knowledge). As a rule of thumb, values between 0.8 and 0.9 are considered good.
[12] The specific measure we use comes from an analysis of pooled data from a number of Sub-Saharan African countries that go beyond the ones studied here. See Gatti and others (2021) for a description of this derivation. That score is renormalized across grade 4 students in these six countries to have a mean of 0 and a standard deviation of 1.
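The reliability ratios reported above are estimated by Cronbach's alpha. As a hedged illustration of how that statistic is computed from item-level responses, the sketch below applies the textbook formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total score), to a tiny made-up response matrix. The data and function name are hypothetical and this is not the SDI project's own code.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """item_scores: one list per test taker, one entry per test item."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    # Variance of each item across test takers.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(n_items)]
    # Variance of the total score across test takers.
    total_var = pvariance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have no variation")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical right/wrong (1/0) answers: 4 test takers, 3 items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(responses)
```

For this toy matrix the item variances sum to 0.625 and the total-score variance is 1.25, so alpha is 1.5 × (1 − 0.5) = 0.75.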
In addition to these tests, we also incorporate data from a teacher roster, a classroom observation of a lesson, and a module that measures teacher absence. The teacher roster collects basic demographic and education background information on teachers: age, gender, highest education level completed, whether the teacher holds a teaching certification,[14] whether they were born in the district, as well as the teacher's contract status.[15] We also use information on whether the school is in an urban or rural area.
In each school, one grade 4 lesson was observed in its entirety. During this classroom observation, enumerators recorded minute-by-minute what the teacher was doing against a set of predefined activities. Activities include descriptors such as "Teacher interacts with all children as a group" or "Teacher supervises pupil(s) writing on the board" or "Teacher in class - not teaching." We use these data to construct a measure of the share of time the teacher is teaching. After the lesson, enumerators recorded what they observed along three main dimensions: the teacher's general demeanor; the use of what are generally thought of as good pedagogical practices; and observations on the classroom environment. For each of these sets of variables, we code responses as a series of indicator (zero/one) variables and average across the variables to create indices for each of these dimensions.[16]
Teacher absence was measured separately from this observation, during an unannounced visit to the school. Enumerators recorded the attendance of teachers who, according to the teaching schedule, were supposed to be teaching at the time of the visit. Each teacher was then recorded as being absent from the school, present in the school but absent from the classroom, or present in the classroom. We use this to define two indicator variables: (1) absent from the school versus present in the school (which we call school absence), and (2) absent from the school or classroom versus present in the classroom (which we call classroom absence).
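The variable construction described above can be sketched in a few lines. The status labels, activity codes, and function names below are hypothetical stand-ins for the survey's actual coding, which is not given in the text.

```python
def absence_indicators(status):
    """Map a recorded status to (school_absence, classroom_absence) 0/1 flags."""
    valid = ("absent_from_school", "in_school_not_in_class", "in_class")
    if status not in valid:
        raise ValueError(f"unknown status: {status}")
    school_absence = int(status == "absent_from_school")
    classroom_absence = int(status != "in_class")
    return school_absence, classroom_absence


def index_from_indicators(indicators):
    """Average a set of zero/one indicators into a single index."""
    return sum(indicators) / len(indicators)


def share_time_teaching(minute_codes, teaching_codes):
    """Share of observed minutes coded as a teaching activity."""
    return sum(code in teaching_codes for code in minute_codes) / len(minute_codes)


# Hypothetical observation: four minutes of a lesson, one of them not teaching.
minutes = ["group_interaction", "board_supervision", "not_teaching", "group_interaction"]
teaching_share = share_time_teaching(minutes, {"group_interaction", "board_supervision"})
pedagogy_index = index_from_indicators([1, 0, 1, 1])
```

Note that a teacher who is in the school but not in the classroom counts as absent under classroom absence but not under school absence, which is why the two indicators can give different rates.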
In addition to the SDI data, we also make use of data from the UNESCO Institute for Statistics (UIS) to describe trends in the number of primary-level teachers in each country (accessed via the World Bank World Development Indicators database).

Empirical approach
The general approach used here is based on a regression discontinuity design (RDD), which assesses whether there is a break in trend between teachers who were hired before and after the implementation of FPE. The models are estimated on the pooled sample (including all countries) with a specification that includes country fixed effects, and on country-specific samples. In the analysis of student test scores, the empirical specification for the pooled model is:

$$T_{itc} = \alpha + \beta \, (\text{Year Hired}_{tc} - \text{FPE year}_{c}) + \delta \, \text{Hired after FPE year}_{tc} + \mu_{c} + \varepsilon_{itc} \qquad (1)$$

where $T_{itc}$ is the test score for student $i$, of teacher $t$, in country $c$; $\text{Year Hired}_{tc}$ is the year that teacher $t$ in country $c$ was hired; $\text{FPE year}_{c}$ is the year that the FPE reform was implemented in country $c$; $\text{Hired after FPE year}_{tc}$ is an indicator variable equal to 1 if the teacher was hired after FPE; $\mu_{c}$ is a country fixed effect; and $\varepsilon_{itc}$ is an error term. The estimated coefficient δ is the main estimate of interest: if it is large, this would indicate a break in trend and a substantial difference in outcomes associated with teachers having been hired before or after FPE.
The coefficient β represents the general time trend of the outcome associated with the year teachers were hired. For this outcome (student test score), it captures the fact that student test scores may vary systematically with, for example, how long a teacher has been teaching. To allow a focus on the years surrounding the implementation of FPE, the sample is restricted to a "bandwidth" of a specific number of years surrounding the FPE year. For most of the models, to investigate the robustness of the estimates, we vary the bandwidth from 2 to 10 years in absolute value. This means, for example, that when the bandwidth is 2 the sample includes teachers hired within 2 years before or after FPE.
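The bandwidth restriction can be illustrated with a short sketch that builds the running variable (years relative to FPE) and the post-FPE indicator, then keeps only teachers inside the bandwidth. The teacher records are hypothetical; the FPE years for Uganda and Togo are those given in the data description. How teachers hired in the FPE year itself are classified is not stated in the text, so treating them as "not hired after FPE" is an assumption made here.

```python
def rdd_sample(teachers, fpe_year, bandwidth):
    """Keep teachers hired within `bandwidth` years of FPE and add the
    running variable and the post-FPE indicator to each record."""
    sample = []
    for teacher in teachers:
        years_from_fpe = teacher["year_hired"] - fpe_year[teacher["country"]]
        if abs(years_from_fpe) <= bandwidth:
            sample.append({
                **teacher,
                "years_from_fpe": years_from_fpe,
                # Assumption: hired strictly after the FPE year counts as post-FPE.
                "hired_after_fpe": int(years_from_fpe > 0),
            })
    return sample


fpe_year = {"Uganda": 1997, "Togo": 2008}
teachers = [  # hypothetical records
    {"country": "Uganda", "year_hired": 1995},
    {"country": "Uganda", "year_hired": 1999},
    {"country": "Uganda", "year_hired": 2003},
    {"country": "Togo", "year_hired": 2009},
    {"country": "Togo", "year_hired": 2005},
]
sample = rdd_sample(teachers, fpe_year, bandwidth=2)
```

With a bandwidth of 2, the Ugandan teacher hired in 2003 and the Togolese teacher hired in 2005 fall outside the window and are dropped; widening the bandwidth would bring them back in.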
For other outcomes the estimated models are adapted from that in equation (1). For example, when estimating the model for teacher test scores or teacher characteristics, the equation is the same although the outcome is at the teacher level. When estimating student-level outcomes, standard errors are clustered at the teacher level; when estimating teacher-level outcomes, standard errors are clustered at the school level.
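To make the role of δ and β concrete, the sketch below estimates a pooled model by ordinary least squares on made-up, noise-free data with a known break at FPE. It assumes a specification with a linear trend in years relative to FPE, a post-FPE indicator, and country dummies; it uses a small hand-written solver to stay self-contained, and it produces point estimates only (the clustered standard errors described above are not computed). None of this is the authors' code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: two countries, hires from 3 years before to 3 years after
# FPE, generated with trend 0.03 per year and a break of -0.2 at FPE.
rows, y = [], []
for country in ("A", "B"):
    for years_from_fpe in range(-3, 4):
        hired_after = int(years_from_fpe > 0)
        country_b = int(country == "B")  # country dummy; country A is the base
        rows.append([1.0, float(years_from_fpe), float(hired_after), float(country_b)])
        y.append(0.5 + 0.03 * years_from_fpe - 0.2 * hired_after + 0.1 * country_b)

intercept, beta, delta, country_b_effect = ols(rows, y)
```

Because the toy data contain no noise, the regression recovers the break (δ = −0.2) and the trend (β = 0.03) essentially exactly; with real data the estimate of δ would depend on the bandwidth chosen, which is why the paper reports results across bandwidths.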

Trends in teacher recruitment
Before turning to the estimates of equation (1) we explore trends in teacher recruitment in two ways: first, by using the number of primary school teachers over time, and second, by using the year that grade 4 teachers in the SDI report that they began teaching.
The number of primary school teachers as reported to UIS has increased consistently over time in five of the six countries (Figure 1). The exception is Kenya, where the reported number of teachers has fluctuated from year to year, albeit within a relatively small band.[17] In three of these countries there is a notable increase in the trend just after FPE (Tanzania, Togo, Uganda). In Madagascar and Mozambique the year-to-year increase was already large before FPE, and appears to continue after it. It is hard to make out a trend in Kenya due to the year-to-year fluctuations.
We next turn to the year that current grade 4 teachers in the SDI surveys report having begun teaching. It is important to note that the sample of teachers in the SDI reflects only those who have remained as teachers, and the results need to be interpreted in that light, in particular because this could introduce "survivorship" bias into the results. If, for example, less-qualified teachers were hired following FPE and teachers with this profile are also more likely to have left the teaching profession by the time of the survey, then we would not be capturing them in the sample. Moreover, it is important to note that the students we test are in grade 4 and so would have been taught by other teachers for 3 years prior to this grade. These teachers may have been hired before or after FPE, meaning that total FPE effects may be hard to detect with knowledge of just the grade 4 teacher's status. These caveats mean that this analysis should be understood as an exploration, through this particular window, of whether the impacts of the FPE scale-up are still present in the long term.
These data do seem to suggest a noticeable change in the density of grade 4 SDI teachers hired after FPE (Figure 2). While it is only in Tanzania and Uganda that this appears to have happened exactly in the FPE year, all the countries display a change around that date (although in Togo there are other spikes in hiring that do not correspond to the FPE year). In the spirit of RDD, we test whether there is a change in the density at the time of FPE by implementing the "test for manipulation" developed by McCrary (2008).[18] The results are presented graphically in Appendix Figure 1: in all cases the density is indeed higher after FPE. However, the difference is only statistically significant in two of the six countries (Kenya and Tanzania), meaning there is not always a clean break relative to the overall year-to-year volatility in the series.
[18] We use the version as implemented by Cattaneo, Jansson and Ma (2020) and Cattaneo, Jansson and Ma (2021). This test is typically implemented in RDD situations in order to test whether the running variable has been "manipulated" in such a way as to move subjects from the non-treated to the treated side of a threshold that determines treatment. As such, a break is usually considered evidence of a problem. In our case, we expect there to be a change in the density if there is a surge in teacher recruitment after FPE.

Learning outcomes for students of teachers hired before and after FPE
We now turn to the results from estimating the model described in equation (1). The first set of results is for student test scores. Because of the large number of results, these are presented graphically to ease interpretation (the full set of results is reported in Appendix Table 4). Figure 3 shows the coefficient estimates of δ in equation (1). Coefficient estimates for language test scores are consistently negative and generally statistically significant (for bandwidths 5 to 10).[19] The magnitudes suggest that having a teacher who was hired after FPE is associated with a 0.2 standard deviation reduction in scores on the test. For math the coefficient estimates are smaller and never statistically significant. They are nevertheless systematically negative and on the order of 0.1 standard deviation.
[19] Increasing the bandwidth increases the sample size and so, all else equal, effects are more likely to be found significant. However, introducing observations away from the cutoff means that it is more likely that other factors may come into play, including whether the shock is persistent and effects cumulate on the one hand, or whether they are transient and impacts regress to the mean on the other.
Figure 4 shows the results from estimating the model country-by-country. For language the results are generally consistent across countries, namely that test scores are lower for students whose teacher was hired after FPE (note that these results are generally less precise than those from the pooled models, and effects are often not statistically significant). This is especially noticeable in Kenya, Madagascar, Tanzania, and Togo. In some cases the coefficients are very large, reaching as much as 0.5 standard deviation in Kenya when using a bandwidth of 7 years around the FPE year. But these disaggregated results also reveal inconsistencies: in Uganda there appears to be no language test score deficit associated with post-FPE teachers, and in fact the coefficient estimates are positive in several cases (albeit not statistically significantly so). For math the results are decidedly more mixed, ranging from positive and sometimes statistically significant for Kenya to negative and sometimes statistically significant for Mozambique and Uganda. The coefficient estimates for math are generally small and not statistically significant in the cases of Madagascar, Tanzania, and Togo.

Teacher subject content knowledge for teachers hired before and after FPE
We next turn to the results from estimating the teacher-level counterpart to equation (1), with the test score from the teacher subject content knowledge assessment as the outcome variable (the full set of results is reported in Appendix Table 5). The results for the pooled model (Figure 5) are not always consistent with those for students: the coefficient estimates on language (left panel) are consistently small and not statistically significant, while those for math are systematically negative, and sometimes statistically significantly different from zero. The size of the estimates suggests that teachers hired after FPE score about 0.15 standard deviation lower on the math test than those hired before.
Country-by-country estimates (Figure 6) once again reveal substantial heterogeneity. In five of the six countries, being a teacher who was hired post-FPE does not appear to be associated with worse language test scores; in fact, coefficient estimates are generally positive, albeit small and not statistically significant. The exception is Tanzania where, for all the bandwidths, the estimates are negative, and for bandwidths of 3 to 5 years the estimated effect is large and statistically significant. Expanding the bandwidth in the Tanzania model reduces the size of the estimate (from about 0.5 standard deviation to about 0.2) and results are not generally significantly different from zero (with the exception of a bandwidth of 7). The results for math are also varied across countries, although in four of the six countries the results suggest a negative effect (Kenya, Mozambique, Tanzania, Togo), but these are generally not statistically significant.
In Tanzania the effect for math is more consistently negative and generally statistically significantly different from zero (for bandwidths of 5 to 10 years). The magnitude of this effect suggests that Tanzanian teachers who were hired just after FPE scored about 0.4 standard deviation lower on the math assessment than those hired just before.

Teacher characteristics
Table 2 reports results from corresponding models of basic teacher characteristics. For succinctness, only the results derived from models using a seven-year bandwidth around the FPE year are shown.[20] In the pooled model, the only variables that are significantly different from zero are age, holding a teaching certification, and being born in the (local) district, with post-FPE teachers being older by 2.3 years, 5 percentage points less likely to hold a certification, and 8 percentage points more likely to be born in the district. The positive age effect is also apparent in three of the six country-specific sets of estimates (with a similar magnitude of close to 3 years in Kenya, Madagascar, and Tanzania). The teaching certification effect has the same negative sign in Madagascar, Mozambique, and Togo, although none of the country-specific estimates are statistically significant. The locally born effect has the same positive sign in Madagascar, Mozambique, Tanzania, and Uganda, although it is only statistically significantly different from zero in the latter. Perhaps surprisingly, given that much of the narrative around the hiring of teachers at the time of FPE suggests a lowering of educational standards, the results do not suggest that post-FPE teachers (still working in grade 4 classrooms "today") are less likely to have at least a secondary education, at least a diploma, or at least a bachelor's degree. On the latter there is some degree of consistency in that post-FPE teachers are slightly less likely to have a bachelor's degree, but the coefficient estimates are small and far from statistically significant at conventional levels. The results do not suggest that there is a difference in the contract status of teachers hired before and after FPE.
The age and education qualification results are somewhat counter to the common narrative that FPE resulted in the recruitment of young and un(der)-educated teachers. It is possible that countries may have sought to also recruit older people, with basic general education qualifications, who were in other professions, and who were local. The certification result is more consistent with the common narrative, although the country estimates suggest that it is only in Mozambique where the effect is substantively large (albeit statistically insignificant).

Classroom observation and teacher absence
The last set of results comes from the analysis of the classroom observation and teacher absence data (see footnote 21). There is no consistent evidence that teachers hired after FPE do any worse on any of these measures than those hired before, either in the pooled model or in the country-specific models (Table 3). If anything, the estimates suggest that there is a positive relationship with being a post-FPE hire. In the pooled model (Column 1 of Table 3) the effect of being a post-FPE hire is statistically significantly positive on the use of good pedagogical practices and negative on school absence (a pattern that is repeated in some of the individual countries, such as Kenya and Tanzania). Perhaps surprisingly, the only somewhat consistent result across countries from this analysis is that teachers hired post-FPE are less likely to have been absent on the day of the unannounced visit.
Footnote 20: A bandwidth of 7 was chosen to report here for parsimony and because it is a relative midpoint. Results for bandwidths of 5 and 10 years are reported in Appendix Table 6.
Footnote 21: Results for bandwidths of 5 and 10 years are reported in Appendix Table 7.

Discussion and conclusion
This analysis set out to explore the extent to which the effects of the rapid scale-up in the hiring of teachers associated with FPE reforms in Sub-Saharan Africa, and the potentially lower "quality" of those teachers due to a possible lowering of the standards required to recruit substantial numbers of teachers rapidly, are still detectable at the classroom level 5 to 16 years (depending on the country) after the reform. Administrative data confirm the rapid increase in the number of primary school teachers across these countries, with inflections at the time of FPE for several. In many of the countries, the years that current grade 4 teachers report as having been their first also reflect an inflection at the time of FPE.
Comparing the learning outcomes of students whose teachers were hired in the years before and after FPE suggests that there is, on average, a learning deficit for the latter, although, depending on the bandwidth used, the effects are not always statistically significantly different from 0. Pooled across all the countries in the study, the deficit is on the order of 0.2 standard deviation on the language test, and 0.1 standard deviation on the math test (test scores have mean 0 and standard deviation 1 across all test-taking students in these countries). Deficits in language are common across almost all the countries (Uganda is the exception), but in math the results are more varied, with strong deficits in Mozambique and Uganda, milder deficits in Madagascar and Togo, and a test-score advantage in Kenya and Tanzania.
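To make the role of the bandwidth concrete, the sketch below standardizes scores to mean 0 and standard deviation 1 and then compares mean scores for teachers hired within a given number of years after versus before the reform. It is a simplified difference in means, not the regression models estimated in this paper. The data are invented for illustration, the function names are hypothetical, and treating the reform year itself as post-FPE is an assumption.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def post_minus_pre_gap(hire_years, z_scores, reform_year, bandwidth):
    """Mean standardized score of post-reform hires minus pre-reform hires,
    using only teachers hired within `bandwidth` years of the reform.

    Assumption: a teacher hired in the reform year counts as post-reform.
    """
    pre = [z for y, z in zip(hire_years, z_scores)
           if reform_year - bandwidth <= y < reform_year]
    post = [z for y, z in zip(hire_years, z_scores)
            if reform_year <= y < reform_year + bandwidth]
    if not pre or not post:
        raise ValueError("no observations on one side of the reform year")
    return mean(post) - mean(pre)


# Invented example data: hire year of the teacher and a raw test score.
hire_years = [1995, 1998, 1999, 2000, 2001, 2004]
raw_scores = [60, 55, 65, 50, 45, 40]
z_scores = standardize(raw_scores)

# The estimated gap can change as the bandwidth widens and more cohorts enter.
gap_2 = post_minus_pre_gap(hire_years, z_scores, reform_year=2000, bandwidth=2)
gap_5 = post_minus_pre_gap(hire_years, z_scores, reform_year=2000, bandwidth=5)
print(gap_2, gap_5)
```

A narrow bandwidth compares cohorts hired close to the reform but uses few observations; a wide bandwidth uses more observations but compares cohorts that may differ for reasons unrelated to the reform.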
While it is hard to draw strong conclusions explaining cross-country variability on the basis of just six countries, one hypothesis could be that where pre-FPE outcomes were "better" there might have been more scope for reductions post-FPE. For student language scores there is indeed a large and statistically significant negative correlation between the FPE coefficient on student test scores and the pre-FPE mean student test score (correlation = -0.767; p-value = 0.075; see Appendix Figure 2): that is, the better the test scores of students of teachers hired pre-FPE, the larger the negative impact of having a teacher hired post-FPE. At the same time, however, the pattern is reversed (although not statistically significant) for student math scores.
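As a rough check on how such a p-value can arise with only six country-level observations, the following sketch computes the two-sided p-value for a Pearson correlation under the null hypothesis of zero correlation. It assumes the reported p-value comes from the standard test for a Pearson correlation with n = 6 (equivalent to a t-test with n - 2 degrees of freedom under bivariate normality); the text does not state which test was used, and the function names are illustrative.

```python
def _simpson(f, a, b, steps):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        x = a + (b - a) * i / steps
        total += (4 if i % 2 else 2) * f(x)
    return total * h / 3


def pearson_p_value(r, n, steps=2000):
    """Two-sided p-value for a sample Pearson correlation r from n pairs.

    Under the null of zero correlation (and bivariate normality), the density
    of r is proportional to (1 - r^2)^((n - 4) / 2) on [-1, 1]. The p-value is
    the probability mass outside [-|r|, |r|].
    """
    if n < 4:
        raise ValueError("this sketch requires n >= 4")
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    power = (n - 4) / 2

    def density(x):
        return max(0.0, 1.0 - x * x) ** power

    inside = _simpson(density, 0.0, abs(r), steps)   # mass in [0, |r|]
    half_total = _simpson(density, 0.0, 1.0, steps)  # mass in [0, 1]
    return 1.0 - inside / half_total


# Student language scores: correlation of -0.767 across six countries.
p = pearson_p_value(-0.767, 6)
print(round(p, 3))  # 0.075
```

With r = -0.767 and n = 6 this returns approximately 0.075, in line with the p-value reported above. It also illustrates that, with six observations, even a correlation of this size is significant only at the 10 percent level.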
While the learning results are intriguing, the story is less clear once other dimensions of teachers are analyzed directly. First, the post-FPE deficit in teacher test scores is generally detectable in math but not in language, the opposite of the pattern for student test scores. At the country level the direction of the subject-specific deficits or advantages is also inconsistent, with the exception being Tanzania, where there are deficits in language for both students and teachers (at least for most bandwidths). There are no detectable deficits in any other countries for language. For math the deficits have the same direction in Mozambique and (for some specifications) Togo, but the results for the other countries do not suggest a simple storyline.
The results do suggest, however, that there is a negative and statistically significant correlation across these six countries between the size of the impact of being a post-FPE teacher for language, and the language test scores of pre-FPE teachers (correlation = -0.587; p-value = 0.093; see Appendix Figure 2). This suggests that in countries where the test scores of teachers hired pre-FPE are higher, there might have been more scope to bring in teachers with lower mastery of the subject. The correlation for math is similar in magnitude but not statistically significant (correlation = -0.555; p-value = 0.253; see Appendix Figure 2).
Second, neither teacher demographic and educational characteristics, nor teaching behaviors, seem to differ consistently for pre- versus post-FPE hired teachers in ways that would strongly suggest worse "quality." Post-FPE teachers are slightly more likely to be "born in the district," less likely to hold a teaching certification, and slightly less likely to have a bachelor's degree, but these results are not consistent across all countries, nor are they always statistically significant. Post-FPE teachers are slightly older than pre-FPE teachers, which goes in the opposite direction from what one might expect if "young and inexperienced" teachers were drafted into the profession at the time of FPE. While also varied across countries, to the extent that there were statistically significant effects for teacher behavior results, these seem to go in the opposite direction from what would be implied if post-FPE teachers were of worse "quality": they tend to use "good pedagogical practices" more, and to be absent less often.
In sum, this analysis has revealed a mixed set of findings and suggests that the processes that played out over the FPE period and beyond were clearly complex. An intriguing initial finding of learning deficits associated with having a teacher who was hired post-FPE is not systematically matched by the expected corresponding trends at the teacher level (including teacher test scores, teacher characteristics, and teacher in-class behaviors). It is likely that there are a variety of countervailing forces at play. For example, recruiting teachers locally might have led to increased commitment and local accountability, perhaps lower absence, and perhaps greater ability to communicate in the local language. At the same time, these might have been offset by lower mastery of subject knowledge or other countervailing forces. Last, for the results in this analysis, it is not just what occurred at the time of the FPE reform that matters, but also what transpired after that, including potentially differential attrition of teachers hired pre- and post-FPE. At the end of the analysis we are therefore left with something of a puzzle that will take more analysis, and perhaps more or different data, to investigate.

Note: Number of teachers includes only those who were assessed and for whom the start date is known.