Policy Research Working Paper                        9647




      Enhancing Human Capital in Children
                        A Case Study on Scaling

                             Francesco Agostinelli
                                Ciro Avitabile
                                Matteo Bobba




Education Global Practice
May 2021


  Abstract
 This paper provides new insights on the science of scaling. The authors study an educational
 mentoring program with a home visit component implemented at scale in Mexico, under different
 modalities (original and enhanced training for mentors) and different situations (field experiment
 and policy implementation). While the program was ineffective when implemented by the government
 in its original modality, the enhanced modality boosts children’s outcomes, both in the field
 experiment and during the government implementation. Higher-quality home visits encourage
 parent/child and parent/community interactions, which in turn are found to promote the scalability
 of the program. The work provides new knowledge on the socially determined nature of scaling
 educational programs.




 This paper is a product of the Education Global Practice. It is part of a larger effort by the World Bank to provide open access
 to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers
 are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at cavitabile@worldbank.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
          Enhancing Human Capital in Children:
                               A Case Study on Scaling∗

            Francesco Agostinelli†               Ciro Avitabile‡            Matteo Bobba§




       Originally published in the Policy Research Working Paper Series in May 2021. This version was
       updated in August 2022.
       To obtain the originally published version, please email prwp@worldbank.org.



        Keywords: Children’s Skills; Parental Investment and Community Engagement; Sci-
        ence of Scaling.
        JEL codes: C90, C93, D02, I3, J1.




   ∗
      We are grateful to the Consejo Nacional de Fomento Educativo (CONAFE) for the generous collabora-
tion throughout this project, Alonso Sanchez for his initial input into the project, and Miguel Angel Monroy
for excellent research assistance. We thank the anonymous referees, as well as Jere Behrman, Horacio Lar-
reguy, John List, Giuseppe Sorrenti, and Stephane Straub for helpful comments and discussions. Avitabile
acknowledges ﬁnancial support for data collection from the Strategic Impact Evaluation Fund (SIEF) of the
World Bank and the Consejo Nacional de Evaluación de la Política de Desarrollo Social (CONEVAL). Bobba
acknowledges ﬁnancial support from the AFD, the H2020-MSCA-RISE project GEMCLIME-2020 GA No
681228, and the ANR under grant ANR-17-EURE-0010 (Investissements d’Avenir program). This study is
registered in the AEA RCT Registry and the unique identifying number is AEARCTR-0001645.
    †
      Economics Department, School of Arts and Sciences, University of Pennsylvania (US). E-mail:
fagostin@sas.upenn.edu
    ‡
      World Bank (US). E-mail: cavitabile@worldbank.org
    §
      Toulouse School of Economics, University of Toulouse Capitole (France).
E-mail: matteo.bobba@tse-fr.eu
1     Introduction

One of the main challenges in using scientiﬁc insights to inform policy decisions comes
from the fact that small diﬀerences in the implementation of any given intervention can
translate into substantial diﬀerences in outcomes. Even when programs display large and
signiﬁcant eﬀect sizes in randomized control trials, their success in diﬀerent situations is far
from guaranteed (List, 2022).

This paper contributes to the recent debate about the challenges of scaling up interventions
aimed at enhancing human capital in children. In particular, we provide a case study of a
mentoring program implemented at scale in Chiapas, the poorest state in Mexico. The pro-
gram assigns recent university graduates to remote and disadvantaged communities. Among
other things, mentors encourage parental involvement in children’s education through home
visits. We evaluate the relative eﬀectiveness of two program modalities that diﬀer both in
terms of content and the intensity of the training provided to the frontline mentors. The
Original modality features a training module focused on curricular knowledge and peda-
gogical notions, which was initially implemented by the government. The Plus modality
embeds a signiﬁcant change in the training module, which was designed and tailored by
our research team to guarantee its operational continuity in the event of a national rollout
of the program. The new training protocol includes periodic peer-to-peer meetings during
which mentors share their experiences regarding the home visits and their interactions with
families.

The evidence on the relative eﬀectiveness of the two program modalities is based on two inde-
pendent ﬁeld experiments. The ﬁrst experiment was directly carried out by the government
during the ongoing national rollout of the Original modality of the mentoring intervention.
Assignment to the program was randomized across 80 program-eligible primary schools, with
40 getting access to mentors. The results show that the program had no discernible eﬀect
on children’s achievement outcomes, as measured by standardized test scores. In the second
experiment we randomly assigned both the Original and the Plus modality as well as a
control group with no mentoring program across 230 primary schools. After two years of
exposure to the mentoring program, the Original program modality displays relatively small
and noisy eﬀects on cognitive and socio-emotional scores, as well as on educational achieve-
ments when compared to the control group with no mentors. The Plus modality delivers
sizable and signiﬁcant gains in children’s reading scores (+0.32 standard deviations), math
scores (+0.24 standard deviations), and socio-emotional scores (+0.20 standard deviations)
as well as a large, albeit marginally significant, effect on the probability of enrolling in
seventh grade (+12.7 percentage points, from a base of 62 percent enrollment in the control
group).

The large diﬀerence in eﬀect sizes between the two training modalities is corroborated by
direct evidence on parental behavior. While both experiments yield inconclusive evidence
on parents’ investment under the Original modality, the Plus modality significantly
increases parental engagement both toward the child’s education activities
and toward the school community—including volunteering activities, as well as in-kind do-
nations. We further show some evidence that mentors with enhanced training engage in
higher-quality interactions with parents during the periodic encounters. In particular, men-
tors with enhanced training are more likely to inform parents about their children’s learning
diﬃculties, to provide concrete advice to parents on how to tackle these diﬃculties, as well
as to promote parenting styles that are centered around communicating with the child and
learning activities. We complement these empirical patterns with qualitative evidence that
conﬁrms the role of the peer-to-peer sessions as the driving factor behind both the enhanced
parent/mentor interactions and the increased parental engagement.

After the release of this evidence, the government autonomously decided to replace the Orig-
inal modality of the program with the Plus modality for all its primary schools throughout
the country, including those that were part of the experimental evaluation. This reform
provides us with a unique opportunity to study the determinants and mechanisms of scal-
ing. One of the key situational diﬀerences between the experimental setting and the policy
implementation comes from the fact that several schools in this context are at risk of closure,
an event that has disruptive consequences for children’s learning and their educational tra-
jectories.1 While the intense monitoring during the experimental evaluation has minimized
the extent of school closures, this high-stakes implementation feature may compromise the
success of the program under the business-as-usual conditions. However, two years after the
rollout, none of the schools that received a mentor during the government implementation
closed, while approximately 10 percent of the other schools did.

1
  The importance of keeping schools open for the development of children has recently gained momentum in
educational studies on the impact of the COVID-19 lockdowns on schooling outcomes (see, e.g., Agostinelli
et al., 2022; Engzell et al., 2020; Maldonado and De Witte, 2020).

We next zoom in on the relationship between exposure to the mentors and school closures
in order to study the sources of scalability of the program. Parents play an important role
in the community-based schooling system under study (Gertler et al., 2012). We sketch a
simple model of parental investment with local spillovers to formalize the idea that parents
have an active role in promoting educational opportunities, and that educational investment
at the community level is a socially determined outcome (List et al., 2020). We show
that the extent to which an educational program preserves or loses impact at scale depends
upon its ability to promote parental coordination and engagement in the local community.
We empirically corroborate these predictions by leveraging the changes in community-level
parental engagement induced by the experiment. Using this variation, we document that
parents prevent schools from closing, and as a consequence promote the effectiveness of the
mentoring intervention during the government implementation. We find that an increase of
half a standard deviation in parental engagement decreases the probability of school closures
by 11 percentage points over the subsequent two years. Our qualitative data from in-depth
surveys of mentors and local instructors further corroborate the role of parents in guaranteeing
the continuation of educational activities in the communities, in a context with poor
school infrastructure and where schooling activities are often disrupted.

Finally, we study the educational impacts of the policy reform across the overall population
of schools in the state of Chiapas. The assignment of the mentoring program under the Plus
modality at scale was done through a rotating scheme with a priority-based mechanism. We
exploit the quasi-experimental variation in the program rollout once we condition on the set
of eligibility criteria oﬃcially used by the government. After providing evidence that this
variation appears conditionally “as-good-as random” via various placebo tests, we show that
the program was successful in the schools that were previously part of the evaluation sample
as well as in the rest of the schools in Chiapas. Within the evaluation schools, the marginal
eﬀect of the Plus modality after one year of government implementation on the probability
of enrolling in seventh grade is +5.4 percentage points. The cumulative eﬀect of continuous
exposure to the program for three years (two years under the experiment and one year under
the government) implies that the enrollment rates in these disadvantaged and rural areas
achieve the secondary school enrollment rates in urban Mexico (95 percent). For the much
larger sample of schools that did not participate in the experimental evaluation, the results
show a positive eﬀect on secondary school enrollment, with an average program impact of
4.5 percentage points under the Plus modality at scale. We further document positive eﬀects
of the program on child literacy, which imply a reduction of illiteracy rates by 20 percent
with respect to the sample mean, as well as a decrease in school closures that is remarkably
similar to the corresponding impact of the program under the experimental assignment.
Taken together, these findings corroborate the effectiveness of the intervention in increasing
schooling opportunities under the new situation created by the policy implementation.

Relationship to Literature. There is a consensus in the literature that gaps in family
investment and parent/child interactions are behind the gaps in children’s achievements
among diﬀerent socio-economic groups (Cunha et al., 2010; Fryer et al., 2015; Agostinelli
and Wiswall, 2016). Moreover, the literature provides ample evidence that successful home
visit and mentoring programs, although implemented in very diﬀerent contexts, share the
common outcome of stimulating parental investment and parent/child interactions. Several
studies in developing countries document increased cognitive and socio-emotional outcomes
for children in the ﬁrst years of life (see Attanasio et al., 2022b, for a complete review).
In the United States, successful interventions such as the Perry Preschool project and the
Carolina Abecedarian project show improvements in the home environment (see Heckman
and Mosso, 2014, for a complete review). The quality of child/home-visitor interactions and
parent/home-visitor interactions are found to be key ingredients for boosting the impact of
home visiting programs (Carneiro et al., 2019; Heckman and Zhou, 2021; Zhou et al., 2021).
However, little is known about the role of family and community interactions in sustaining
program impacts at scale.2 Our study ﬁlls this gap by providing direct evidence on how
enhanced training for home visitors promotes higher parental engagement and interactions
at the community level, which in turn promotes the success of the program at scale. To
the best of our knowledge, we are the first to highlight that parents can act as a means of
scalability, which has implications for the design and the evaluation of scalable educational
interventions that actively include parents in the learning process.

In recent years, scholars and policy makers alike have been increasingly concerned about the
ability of ﬁeld experiments to inform policy decisions, given that experimental interventions
that have been found eﬀective often fail to live up to their promises when implemented at
scale by governments or ﬁrms (Bold et al., 2018; Cameron et al., 2019; Muralidharan and
Singh, 2020). Based on the insights of recent work (Banerjee et al., 2017; Muralidharan and
Niehaus, 2017; Davis et al., 2017; Mobarak and Davis, 2021), our educational intervention is
well-equipped to overcome some of the major threats to scalability. We highlight the informative
features for scalability of our experimental design following the key points analyzed
in Al-Ubaydli et al. (2020). First, we harness the value of replication by drawing joint inference
from two independently run field experiments on different and representative samples
of schools that share one of the two program modalities. Second, the field experiments were
run while the program was already at scale and in close collaboration with the government
agency that was later in charge of the program’s rollout and policy implementation. Third,
the government agency and the research team designed the Plus modality together, bearing
in mind the financial and human resources constraints of the context under study. Finally,
the relatively large units of randomization (schools/communities) take into account possible
local spillover effects that often arise in the context of interventions evaluated at scale
(Miguel and Kremer, 2004; Bobba and Gignoux, 2019; List et al., 2020).

2
  For example, Zhou et al. (2021, p. 90) state: “The body of research discussed above clearly identifies the
key mechanism by which home visiting programs positively impact short-term and long-term outcomes for
children: fostering engagement between caregiver and home visitor to improve the caregiver’s quality and
frequency of caregiver–child interaction, thereby fostering child development. This volume, including this
chapter, seeks to move the field toward understanding how to effectively scale up promising interventions and
inspire more research on the subject.” In their recent review, Attanasio et al. (2022b, p. 886) raise another
important issue: “[S]calability does not only refer to the financial cost of running these interventions but
also to the ownership and acceptability of the intervention by the community that is targeted. How should
interventions be designed and delivered to take account of this important distinction?”



2    Context and Experimental Design

The Consejo Nacional de Fomento Educativo (CONAFE) is a semi-autonomous government
agency responsible for providing schooling services in highly marginalized communities of
Mexico with a population below 2,500 inhabitants. In those communities, CONAFE oﬀers
all education services from pre-school until the end of lower secondary school (hereafter, we
refer to the population of CONAFE schools as schools). In 2013, these schools accounted for
10 percent of the roughly 99,000 primary schools and 7 percent of the 38,000 lower secondary
schools across the 31 Mexican states. About 20 percent of the schools are located in Chiapas,
the Mexican state with the highest incidence of poverty in the country (CONEVAL, 2018).

Primary schools typically have a single multi-grade classroom with 10–15 students. Instruc-
tors are generally community residents between 15 and 29 years old. Only 2.6 percent report
having a college degree, while 19 percent report having only completed lower secondary ed-
ucation. Instructors are supposed to receive between ﬁve and seven weeks of training, but
more than half report four weeks of training or less. They receive a stipend of MXN $1,427
per month (US $95 in 2015). After one year of service in the community school, instructors
become eligible to receive a scholarship of MXN $982 per month for up to 30 months, which
is conditional on enrolling in a higher education institution. As a result of the very low
compensation and extremely challenging conditions, about one quarter of the instructors
drop out before completing the ﬁrst school year (Bando and Uribe, 2016).




2.1    The Mentoring Program

In 2009, the government launched the “Mobile Mentors” (Asesores Pedagogicos Itinerantes,
API henceforth) program as an attempt to improve the quality of education provision in
primary schools located in the most-deprived communities. Initially, the program was im-
plemented in 11 states, but starting in 2012, it was extended to all 31 states in Mexico.
The mentors are selected from recent university graduates (the program was advertised both
through on-campus visits and through media announcements). Preference is given to
applicants with degrees in pedagogy, psychology, sociology, and social services who have
previous experience as community instructors and who speak an indigenous language. They
are usually hired for a two-year period and receive a monthly salary of MXN $5,000 (US
$332 in 2015).

After a week-long training session focused on curricular knowledge and basic notions of
pedagogy, the mentors are assigned to schools through a rotation-based algorithm, which gives
differential priority across the school communities according to four criteria: (i) at least 30
percent of the students are classified as “insufficient” in the national standardized test; (ii)
at least six students are enrolled; (iii) there are high levels of poverty and marginalization
in the respective municipalities; and (iv) the school has not received a mentor in previous
academic cycles. Mentors meet with their supervisors every two months in two-day sessions
throughout the school year. In December 2018, there were 1,928 mentors deployed throughout
Mexico, with the largest share (20 percent) in Chiapas.
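
The four eligibility criteria listed above can be illustrated with a short sketch. The text does not say how the government combines the criteria into a ranking, so the rule used here (count the criteria a school meets and rank schools by that count) is purely an assumption, as are the record layout, field names, and example schools.

```python
from dataclasses import dataclass


@dataclass
class School:
    # Hypothetical record layout; field names are illustrative only.
    school_id: str
    share_insufficient: float   # share of students classified as "insufficient"
    enrollment: int             # number of enrolled students
    high_marginalization: bool  # municipality has high poverty/marginalization
    had_mentor_before: bool     # school received a mentor in a previous cycle


def criteria_met(s: School) -> int:
    """Count how many of the four stated criteria a school satisfies."""
    return sum([
        s.share_insufficient >= 0.30,  # criterion (i)
        s.enrollment >= 6,             # criterion (ii)
        s.high_marginalization,        # criterion (iii)
        not s.had_mentor_before,       # criterion (iv)
    ])


def rank_schools(schools):
    """Assumed rule: more criteria met means higher priority; ties broken by id."""
    return sorted(schools, key=lambda s: (-criteria_met(s), s.school_id))


schools = [
    School("A", 0.45, 12, True, False),
    School("B", 0.20, 8, True, True),
    School("C", 0.35, 9, False, False),
]
ranking = [s.school_id for s in rank_schools(schools)]
print(ranking)
```

Because criterion (iv) favors schools that have not yet had a mentor, any rule of this kind naturally produces rotation across academic cycles, which is the feature the text emphasizes.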

The mentors periodically conduct home visits to provide parents with information on their
children’s progress in school and promote their participation in school activities. In addition
to working on behavioral issues directly with the children, the mentors are supposed to
address them with parents as part of the home visits. Each mentor is assigned to a maximum
of six students for individual (one-on-one) remedial education sessions that take place after
the regular instructional time. Student eligibility for the remedial sessions is determined by
a diagnostic evaluation at the beginning of the school year and an additional exam to assess
the grade to which the student’s knowledge corresponds. During regular school hours, the
mentor is supposed to observe and take notes on the teaching practices of the community
instructor, help her with the students who have learning diﬃculties, and work outside the
classroom with students who are unable to attend the remedial sessions in the afternoon.




2.2    Two Field Experiments

In this section we describe the ﬁeld experiments that we use to evaluate the impact of two
diﬀerent modalities of the API program. A full description of the diﬀerent data sources used
throughout the empirical analysis is provided in Appendix A. Because both experiments
were run at scale within the infrastructure of the existing program, the recruitment process
and the assignment mechanism of the mentors are the same between experimental and non-
experimental schools.

First Experiment. As part of an effort to evaluate a broader set of interventions targeted to
families and schools in disadvantaged communities (Martínez, 2012), in 2010, the government
undertook the first impact evaluation of the API program. Eighty primary schools were
selected among those that met the eligibility criteria for the program across four Mexican
states (72 in Chiapas). Assignment to the API program was randomized at the school level
using a block design, with the strata represented by the Mexican states where schools are
located. Forty schools were assigned to receive the API program starting from the 2011–
2012 school year while the remaining half of the schools were assigned to the control group
without mentors.

A mid-line survey collected after one year of the API assignment recorded parental behaviors
and investments for 208 parents in 73 schools (the enumerators were not able to reach
the parents in seven schools). Student outcomes were measured two years after treatment
assignment through the results in the national standardized test for students in grades three
through six. Due to the incomplete take-up of the test—mainly due to the opposition from
the teachers’ unions in some states—we were able to match 70 schools with 599 test score
records out of the subsample of 73 schools with parental outcomes. Both sources of sample
attrition are orthogonal to treatment assignment. Five of the unmatched schools were in the
treatment group and ﬁve were in the control group. Table B-1 shows the balance for the
original sample as well as the nested samples with parental outcomes and student outcomes
between the treatment and the control groups across mean community-level and school-level
characteristics measured in the year before the start of the ﬁrst experiment.

Second Experiment. In 2014, as part of a World Bank project, we designed and evaluated
an alternative training modality aimed at strengthening the original mentoring program.3
The API Plus modality embeds all the features of the Original modality, with two signif-
icant changes in the training module. First, it entails two weeks rather than one week of
initial training. The extra week is focused on hands-on strategies to improve students’ read-
ing and math competencies. Second, mentors attend an additional day during each one of
the bimonthly meetings throughout the school year. The schedule of the extra day is orga-
nized around peer-to-peer sessions in which mentors share experiences and design common
strategies to better address the most pressing issues. Among the large array of possible
improvements in the design of the program, enhanced training was ex-ante scalable and yet
a promising alternative from the perspective of the government. The Plus modality entails
a cost per child of US $332 as opposed to US $285 per child for the Original modality.
These cost ﬁgures are very much in line with another recent government-run intervention
that targeted both children and parents through home visits in Colombia (Attanasio et al.,
2022a).

We randomly selected 230 schools in rural Chiapas from a set of schools that were not
previously part of the API program. Assignment of the program was carried out using a
randomized block design at the school level, with the strata represented by the deciles of
the 2012 school-average score on the Spanish section of a national standardized achievement test
(see Appendix A.1). As a result, 60 schools were assigned to the API Plus, 70 schools were
assigned to the API Original modality, and the remaining 100 schools were in the control
group with no API intervention.
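
A minimal sketch of a randomized block design of this kind is given below. It assumes ten equal-sized strata of 23 schools each and a fixed allocation of 6 Plus, 7 Original, and 10 control schools per stratum, which reproduces the 60/70/100 totals; the actual stratum sizes and within-stratum allocation are not stated in the text, and the baseline scores are simulated.

```python
import random


def block_randomize(baseline_scores, quotas, n_strata=10, seed=0):
    """Assign treatment arms within strata defined by ranks of a baseline score.

    baseline_scores: dict mapping school id -> baseline score
    quotas: dict mapping arm name -> number of schools per stratum
    Returns a dict mapping school id -> (stratum index, arm).
    """
    rng = random.Random(seed)
    ranked = sorted(baseline_scores, key=baseline_scores.get)
    size = len(ranked) // n_strata
    if size * n_strata != len(ranked) or sum(quotas.values()) != size:
        raise ValueError("quotas must exactly fill equal-sized strata")
    assignment = {}
    for s in range(n_strata):
        block = ranked[s * size:(s + 1) * size]
        # Build the list of arm labels for this stratum and shuffle it.
        arms = [arm for arm, k in quotas.items() for _ in range(k)]
        rng.shuffle(arms)
        for school, arm in zip(block, arms):
            assignment[school] = (s, arm)
    return assignment


# Simulated baseline scores for 230 hypothetical schools.
data_rng = random.Random(42)
scores = {f"school_{i:03d}": data_rng.gauss(0, 1) for i in range(230)}

assignment = block_randomize(scores, {"plus": 6, "original": 7, "control": 10})
counts = {}
for _, arm in assignment.values():
    counts[arm] = counts.get(arm, 0) + 1
print(counts)
```

Shuffling a fixed list of arm labels within each stratum guarantees that every decile of baseline achievement contributes the same number of schools to each arm, which is why the strata indicators enter the regression models later on.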

We rely on both administrative and survey data sources as well as qualitative interviews for
the second ﬁeld experiment. Most of these variables are shown in Table B-2, and they are
balanced with respect to treatment assignment. The data collection took place by the end of
the second school year after the inception of the API program in the evaluation sample. By
that time, two schools out of the original 230 schools in the evaluation sample had closed,
while the program could not be put in place in another four schools due to high political
instability. Within the remaining 224 schools, one quarter of the community instructors
reported eight or fewer months of tenure in the school, and only 56 out of the original 126
mentors were working in the same schools to which they had been originally assigned. All
these outcomes are well balanced across treatment arms. Out of the six schools that dropped
out of the sample, two schools were in the control group, two were in the Original group, and
two in the Plus group. The p-values of the Kolmogorov-Smirnov statistic for the equality
of the distributions of work experience in the school of the community instructors in each
treatment arm and the control group are 0.773 and 0.892, respectively. The p-value of the
Plus-Original difference in the share of mentors who drop out from the program during the
experiment is 0.957. There is no evidence of composition changes between the Original and
Plus groups induced by mentor turnover (see Table B-3).

3
  The Original modality in the second experiment is meant to track the benchmark intervention with two
minor differences. First, the ability to speak the main indigenous language in the community would become
the most important criterion for the assignment of the mentors across program-eligible communities. Second,
the supervisors of the mentors would receive a salary increase in exchange for a mandatory increase in the
frequency of their visits to the targeted communities.
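
The distributional balance check on instructors' work experience relies on the two-sample Kolmogorov-Smirnov statistic, that is, the largest vertical gap between two empirical distribution functions. The sketch below computes only that statistic; it does not compute a p-value, and the tenure values are made up for illustration rather than taken from the study data.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    # Evaluate both empirical CDFs at every distinct observed value.
    for x in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical months of tenure for instructors in two groups.
tenure_treated = [2, 5, 8, 9, 10]
tenure_control = [3, 5, 7, 9, 10]
d_stat = ks_statistic(tenure_treated, tenure_control)
print(round(d_stat, 3))
```

A small statistic, as in this toy example, indicates that the two tenure distributions are close everywhere, which is the pattern the reported p-values of 0.773 and 0.892 are consistent with.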

The eﬀective sample size of the second experiment is 1,045 children/parents and 224 schools
(see Appendix A.2 for further details on the sampling design). We use the Early Grade
Reading Assessment (reading score) and the Early Grade Math Assessment (math score)
as our main measures of children’s cognitive achievement. Those are individually admin-
istered student assessments that have been conducted in more than 40 countries and in
a variety of languages (Dubeck and Gove, 2015; Platas et al., 2016). While these instru-
ments are typically applied to students in ﬁrst, second, or third grade, we administer them
to third through sixth grade students to account for the large learning gaps of the children
in our sample. The school-average standardized scores in math and Spanish as measured
in the school year prior to the introduction of the second experiment are, respectively, 0.5
and 0.7 standard deviations below the national averages.4 To measure the impact of the
intervention on socio-emotional skills, we consider a collection of thirty-two behavioral is-
sues as reported by a caregiver, which resembles the questionnaire in the Children section
of the National Longitudinal Study of Youth (CNLSY-79), such as antisocial behavior, anx-
iety/depression, headstrongness, hyperactivity and peer conﬂicts (for details, see Appendix
A.2). The resulting behavioral problem index is re-scaled in such a way that higher values
are associated with fewer behavioral issues (socio-emotional score). The survey also contains
a module on instructors’ characteristics as well as pedagogical practices collected through an
adapted version of the Stallings Classroom Snapshot (Bruns and Luque, 2015), a module on
parental attitudes and investment toward children’s education, as well as information about
the mentors’ activities in the communities, among others. To better interpret our results, we
standardize most of the survey-based outcome variables using the mean and the standard
deviation observed in the control group.

4
  Only 5 percent of the children in our sample score at the maximum of the scale in two or more subdomains
of the reading score (out of eight subdomains) and in three or more subdomains of the math score (out of a
total of seven subdomains). Unlike the first experiment, we cannot leverage the national standardized test
scores for the second experiment since the test ceased to be universal during the period of interest (after
2014).
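
Standardizing outcomes against the control group, as described above, can be sketched as follows. The function name, the arm labels, and the example values are illustrative assumptions, and the sketch uses the sample standard deviation because the text does not specify which variant is used. The `higher_is_better` flag mirrors the re-scaling of the behavioral problem index so that higher values mean fewer behavioral issues.

```python
from statistics import mean, stdev


def standardize_to_control(values, arms, control_label="control",
                           higher_is_better=True):
    """Express each value in control-group standard deviation units."""
    control = [v for v, a in zip(values, arms) if a == control_label]
    mu, sigma = mean(control), stdev(control)  # sample SD (an assumption)
    sign = 1.0 if higher_is_better else -1.0
    return [sign * (v - mu) / sigma for v in values]


# Hypothetical raw scores for six children.
raw = [10, 12, 14, 15, 18, 11]
arms = ["control", "control", "control", "plus", "plus", "original"]
z_scores = standardize_to_control(raw, arms)
print(z_scores)

# For a count of behavioral problems, flip the sign so higher is better.
problems = [3, 5, 7, 9]
problem_arms = ["control", "control", "control", "plus"]
z_behavior = standardize_to_control(problems, problem_arms,
                                    higher_is_better=False)
```

With this scaling, a coefficient on a treatment indicator reads directly as an effect in control-group standard deviations, which is how effect sizes such as +0.32 for reading are reported.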

In addition, we access separate administrative data on students’ records that we use to
construct an indicator for enrollment in seventh grade, which is the ﬁrst grade in lower
secondary school. The sample reduces to 468 sixth graders in 182 schools, who are deciding
whether to transition to secondary school. This sample reduction is due to the multi-grade
aspect of the schooling system, where student composition among grades in each school
is not homogeneous in size. Missing schools in this analysis are balanced among treatment
arms. The choice of this cohort of students is meant to maintain the same length of exposure
to the API program of the main sample of the analysis.5

Finally, we conducted a series of in-depth interviews in the spring of 2022 for a small and
representative subsample of 16 mentors and 12 community instructors who were part of our
study.6 This qualitative evidence proves useful to complement the quantitative analysis and
to shed further light on the mechanisms through which the mentoring intervention aﬀects
students and parents as discussed in Section 3.3, as well as on the role of parents as means
of scalability as discussed in Section 5.2.



3     Experimental Evidence

In this section, we report OLS estimates from separate regression models for each experiment
on the treatment assignment indicators for the Original and Plus modalities after two years of
exposure to the mentoring program. All models include the strata indicators that account for
the block randomization design (see Section 2.2) as well as a few individual characteristics such
as students’ age and ethnicity. We control for interview week ﬁxed eﬀects, which account
for changes in weather and political conditions, as well as indicators for the diﬀerent teams
of enumerators who administered the survey across the communities in our sample. The
error terms are clustered at the school level, which represents the unit of randomization
of the treatments. We complement the usual asymptotic inference with two alternative
procedures. First, we display p-values based on randomization inference, which are accurate
even with a small number of clusters. This may be especially relevant in the context of the
first experiment, which had fewer schools per treatment arm than the second experiment.
Second, given the large array of hypotheses considered throughout the analysis, we also
provide p-values that are adjusted for multiple hypothesis testing across different families of
outcomes (List et al., 2019).

5
   The distribution of missing schools in the analysis of transition to secondary school is 18 schools in the
control group, 14 in the Original group and 16 in the Plus group. Due to the different individual identifiers,
we are not able to match this dataset to the survey data. The estimates reported in Table B-6 document
no program effects on grade repetition and attrition, which suggests that conditioning on grade attainment
is not problematic in our context.
 6
   Appendix A.3 reports more details about these interviews. Tables B-4 and B-5 show that the characteristics
of these survey respondents are broadly comparable to those of the mentors and the local instructors in our
main sample.
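The idea behind randomization inference can be sketched as follows. This is a simplified illustration and not the authors' procedure: it works on hypothetical school-level mean outcomes, uses a Welch-type t-statistic instead of the regression-based one, and reshuffles treatment across all schools, ignoring the strata and the control variables. All names and data are assumptions.

```python
import random
from statistics import mean, variance

def t_stat(cluster_means, treated):
    """Welch-type t-statistic comparing treated and control cluster means."""
    t = [m for m, d in zip(cluster_means, treated) if d]
    c = [m for m, d in zip(cluster_means, treated) if not d]
    se = (variance(t) / len(t) + variance(c) / len(c)) ** 0.5
    return (mean(t) - mean(c)) / se

def randomization_p_value(cluster_means, treated, draws=2000, seed=0):
    """Share of re-randomized assignments whose |t| is at least the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_stat(cluster_means, treated))
    labels = list(treated)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(labels)  # reassign treatment at the school (cluster) level
        if abs(t_stat(cluster_means, labels)) >= observed:
            extreme += 1
    return extreme / draws

# Hypothetical school-level mean outcomes: four control and four treated schools.
cluster_means = [0.1, -0.2, 0.05, 0.3, 0.9, 1.1, 0.8, 1.3]
treated = [False] * 4 + [True] * 4
p = randomization_p_value(cluster_means, treated)
```

Because the reference distribution is generated by the assignment mechanism itself, the resulting p-value does not rely on having a large number of clusters.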


3.1     Children’s Outcomes

Table 1 and the ﬁrst row of Table 2 display the impacts of the Original modality on children’s
outcomes, as measured by individual test scores collected two years after the introduction
of the mentoring program in each experiment, respectively. For the ﬁrst experiment, the
outcome variables shown in Table 1 are based on administrative records of sixth graders
in a national standardized test. For the second experiment, we collect our own measures
of cognitive and socio-emotional skills (ﬁrst to fourth columns of Table 2), as the national
standardized test was terminated in 2014.7

In spite of the diﬀerences in measurement of the outcome variable, the separate analyses of
the two experiments show consistently inconclusive evidence regarding the eﬀectiveness of
the Original modality of the mentoring intervention. Depending on the outcome, the eﬀect
of the program in the ﬁrst experiment ranges from positive to negative and is not statistically
diﬀerent from zero. The eﬀect size of the estimated treatment eﬀect on the overall index for
student achievement (column 4 of Table 1)—a Generalized Least Squares (GLS)-weighted
average across the three subject tests that increases the power of the analysis (O’Brien,
1984)—is negative, small and imprecise.8 Eﬀect sizes are consistently positive and slightly
more precise in the second experiment, although none of the estimated coeﬃcients gets close
to the conventional signiﬁcance levels. The impact on the GLS-weighted overall index for
student achievement across the two cognitive measures and the socio-emotional score is 0.12

7
   Another national standardized test was administered by the National Institute for the Evaluation of
Education (INEE) starting in 2015, the PLANEA National Plan for Learning Evaluation. While the national
test that we employ in the ﬁrst experiment (ENLACE) was administered to all Mexican students in grades
three through six through the year 2013 (see Appendix A.1), PLANEA scores are collected only on sixth
graders in a random sample of students within schools.
 8
   The GLS weighting procedure increases eﬃciency when compared to other summary indices by ensuring
that outcomes that are highly correlated with each other receive less weight, while outcomes that are
uncorrelated and thus represent new information receive more weight. This procedure is more powerful than
other popular tests in the repeated-measures setting. Also, missing outcomes are ignored when creating the
GLS-weighted score. Thus this procedure uses all the available data, but it weights outcomes with fewer
missing values more heavily.


                      Table 1: Children’s Achievement—First Experiment
                            Reading Score         Math Score         Science Score        Overall Index
API Original                    -0.053              0.083                -0.082              -0.022
                               [0.737]             [0.655]              [0.585]              [0.902]
                               {0.750}             {0.669}              {0.591}             {0.910}
                               (0.779)             (0.739)              (0.717)             (0.878)

Number of clusters                70                      70                70                  70
Observations                      599                    599               599                  599
 Notes : This table shows OLS estimates and the associated p-values on student outcomes measured after
two years of exposure to the mentoring program under the ﬁrst experiment run by the government. For
detailed descriptions of the test scores used in this table, see Appendix A.1. The dependent variables
are standardized with respect to their means and the standard deviations in the control group.
p-values reported in brackets refer to the conventional asymptotic inference. p-values reported in braces are
computed using randomization inference (randomization-t). p-values reported in parentheses are adjusted
for testing the null impact of API Original across the five outcomes shown in the table through the
stepwise procedure described in Romano and Wolf (2005a,b, 2016). All p-values account for clustering at the
school level.


standard deviations, a non-negligible effect size that is nonetheless not statistically different
from zero. The effects of the Original modality of the mentoring program on the transition
rates to lower secondary school are shown in the last two columns of Table 2. Estimated
effect sizes are noisy and relatively small in magnitude, ranging between an increase of seven
and eight percentage points relative to a 62 percent enrollment rate in seventh grade in
the control group.
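The weighting logic behind the GLS-weighted overall index (footnote 8) can be sketched with inverse-covariance weights, one common way to build such a summary index. This is an assumption about the construction, not the authors' exact procedure; it also omits the handling of missing outcomes mentioned in the footnote. Function names and data are hypothetical.

```python
from statistics import mean

def covariance_matrix(columns):
    """Sample covariance matrix of equally long outcome columns."""
    n, k = len(columns[0]), len(columns)
    means = [mean(col) for col in columns]
    return [[sum((columns[a][i] - means[a]) * (columns[b][i] - means[b])
                 for i in range(n)) / (n - 1)
             for b in range(k)] for a in range(k)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def gls_weights(columns):
    """Weights proportional to the row sums of the inverse covariance matrix."""
    row_sums = [sum(row) for row in invert(covariance_matrix(columns))]
    total = sum(row_sums)
    return [w / total for w in row_sums]

def gls_index(columns):
    """Weighted average of the outcomes for each observation."""
    w = gls_weights(columns)
    return [sum(w[j] * columns[j][i] for j in range(len(columns)))
            for i in range(len(columns[0]))]

# Hypothetical outcomes with equal variance: a and b have correlation 0.6,
# while c is uncorrelated with both.
a = [1.5, 1.5, 0.5, 0.5, -0.5, -0.5, -1.5, -1.5]
b = [0.5, 0.5, 1.5, 1.5, -1.5, -1.5, -0.5, -0.5]
c = [1.5, -0.5, 0.5, -1.5, 0.5, -1.5, 1.5, -0.5]
weights = gls_weights([a, b, c])  # roughly 0.28, 0.28, 0.44
index = gls_index([a, b, c])
```

In this toy example the two correlated outcomes share their information and each receives a smaller weight than the uncorrelated one, which is the property the footnote describes.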

The second row of Table 2 displays the estimated coeﬃcients for the average impact of the
Plus modality of the API program when compared to the control group. Children who are
enrolled in a school that received the Plus modality increased their reading scores by 0.32
standard deviations. We can reject the hypothesis of a null eﬀect of API Plus at the 99
percent conﬁdence level across all three diﬀerent inference procedures. Quantitatively, the
API Plus eﬀect is approximately 2.5 times higher than the eﬀect of the API Original. The
diﬀerence between the two modalities is statistically diﬀerent from zero at the conventional
95 percent level in two out of three inference procedures.

We ﬁnd similar patterns when we look at math scores, which show a sizable eﬀect of the
Plus modality with an estimated treatment eﬀect of 0.24 standard deviations. This eﬀect is
precisely estimated, and we can reject the hypothesis that the two treatment arms have the
same eﬀect at the 95 percent conﬁdence level in two out of three cases. The API Plus program
also generates a sizable improvement in the socio-emotional score of 0.2 standard deviations.


              Table 2: Children’s Achievement and Attainment—Second Experiment
                                                     Survey-Based Test Scores                              Enroll 7th Grade (1 = yes)
                                  Reading         Math      Socio-emotional           Overall Index                       age ≥13
API Original                        0.126         0.056          0.071                    0.124             0.073           0.081
                                   [0.104]       [0.455]        [0.418]                  [0.187]           [0.255]         [0.519]
                                  {0.138}        {0.483}        {0.440}                 {0.218}            {0.283}        {0.573}
                                  (0.147)        (0.554)        (0.554)                 (0.234)            (0.312)         (0.469)

API Plus                            0.315         0.237             0.199                 0.366             0.124              0.298
                                   [0.001]       [0.008]           [0.022]               [0.001]           [0.074]            [0.030]
                                   {0.001}       {0.012}           {0.030}               {0.001}           {0.084}            {0.053}
                                   (0.001)       (0.005)           (0.011)               (0.001)           (0.032)            (0.032)

API Original = API Plus            [0.043]       [0.043]           [0.178]               [0.020]           [0.469]            [0.134]
                                   {0.086}       {0.115}           {0.225}               {0.024}           {0.570}            {0.229}
                                   (0.045)       (0.045)           (0.098)               (0.023)           (0.376)            (0.157)

Number of clusters                  224           224                 224                  224               182                76
Observations                        1044          1044               1045                  1045              468                106
 Notes : This table shows OLS estimates and the associated p-values on student outcomes measured after two academic years of exposure to
the API program under the second experiment designed and implemented by the authors in collaboration with the government. For detailed
descriptions of the test scores used in this table, see Appendix A.2. The dependent variables in the ﬁrst four columns are standardized with
respect to their means and the standard deviations in the control group. The dependent variable in the last two columns is computed from
administrative school records (see Appendix A.1). p-values reported in brackets refer to the conventional asymptotic inference. p-values
reported in braces are computed using randomization inference (randomization-t). p-values reported in parentheses are adjusted for testing
each null hypothesis (null impact of API Original, API Plus, and the comparison) for the two diﬀerent families of outcomes (survey-based and
administrative data) through the stepwise procedure described in Romano and Wolf (2005a,b, 2016). All p-values account for clustering at
the school level.



While the diﬀerence with respect to the Original modality is not statistically signiﬁcant, the
larger eﬀect of the Plus modality is consistent with qualitative evidence documenting that
mentors with enhanced training shared more eﬀective strategies to best deal with children’s
emotions during the bimonthly peer-to-peer sessions (see Appendix A.3). The eﬀect size of
the Plus modality on the GLS-weighted index of achievement is very large, 0.37 standard
deviations—precisely estimated, and statistically diﬀerent at the 95 percent level from the
eﬀect of the Original modality.9

The last two columns in Table 2 report the estimated eﬀects on the average transition rate
to secondary school. Less than two-thirds of the sixth graders in the control group enroll in
seventh grade, while the corresponding national average is 95 percent. The Plus modality
increases the probability of a child’s enrolling in seventh grade by 13 percentage points.
Although only marginally significant, the effect is quantitatively sizable, as it represents a 20
percent increase in the share of students who transit to secondary school relative to the mean
in the control group. This effect more than doubles in size when we focus on the subsample
of over-aged sixth graders (13 years old or more, sixth column), and it persists one year after
the second experiment (see Figure B-1). Given the prevalence of child labor in Chiapas, this
result for older children is particularly important in terms of life-cycle opportunities. While
the estimated coefficient of the Plus modality in the last column of Table 2 is significant at
the 95 percent level for two out of three cases, the p-values reported in the third row show
that we cannot reject the hypothesis that it is equal to the corresponding estimate of the Original
modality.

9
  In Table B-7 we report the results by sub-domains of the reading scores (panel A) and math scores (panel B).
While the estimates are erratic and not statistically significant for the Original modality, the Plus modality
is shown to increase students’ proficiency in reading across various domains (familiar-word reading, reading
comprehension, and dictation). There are no improvements in sound-related questions (initial sound and
initial name), which is probably due to the fact that children whose first mother tongue is an indigenous
language might struggle to capture Spanish alphabet pronunciation. For math scores, the Plus modality
seems particularly effective on numbers’ identification and discrimination as well as additions. There are no
improvements in more involved tasks such as problem solving and shape recognition. Similarly, in Table B-8
we report the effects of the two program modalities for each individual component of the socio-emotional
score.
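As a quick check of the relative magnitudes quoted in the paragraph above, the following arithmetic uses the Table 2 point estimates for API Plus (0.124 for all sixth graders, 0.298 for the over-aged subsample) and the 62 percent enrollment rate in the control group:

```latex
\[
\frac{0.124}{0.62} = 0.20,
\qquad
\frac{0.298}{0.124} \approx 2.4 .
\]
```

The first ratio is the 20 percent increase relative to the control-group mean, and the second shows that the estimate for over-aged students is more than twice the full-sample estimate.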


3.2    Parental Investment and Behavior

Home visits are a key component of the mentoring intervention under study. The goal of these
home visits and repeated family/home-visitor interactions is to increase parental awareness
about their children’s educational trajectories. Table 3 presents the average impact of the
program on GLS-weighted indices of parental behavior and investment in their children’s
education (see Appendix A.2). Panel A displays the estimates of the Original modality in
the ﬁrst experiment, while Panel B shows the corresponding ﬁgures for both the Original
and Plus modalities in the second experiment. Under the Original program, consistently
across experiments, the estimates are not statistically diﬀerent from zero, with signs of the
coeﬃcients that range from positive to negative and eﬀect sizes on the overall index of -0.03
and 0.1 standard deviations. Instead, parents appear to be systematically more invested in
their children’s education activities under the Plus modality of the program. The estimates
reported in the second row of Panel B document that mentors with enhanced training are
more eﬀective in boosting parental engagement, both toward the school and directly with
the child. The point estimates are positive throughout; three out of four coeﬃcients are
statistically signiﬁcant at the 95 percent level with a very large eﬀect size for the overall index
of parenting practices of 0.36 standard deviations. While we can reject the null hypothesis
of equal treatment eﬀects on most parental outcomes shown in Table 3, we refer the reader
to Section 4.1 for a more thorough discussion on hypothesis testing under both experiments




                                  Table 3: Parental Investment and Behavior
                                 Engage at School       Manage School Resources Engage With Child Overall Index
                                                               Panel A: First Experiment
     API Original                       0.198                    -0.135                 0.149         0.101
                                       [0.259]                  [0.415]                [0.399]       [0.580]
                                       {0.261}                 {0.422}                 {0.399}      {0.578}
                                       (0.338)                  (0.511)                (0.511)      (0.511)

     Number of clusters                   73                        73                     73                          73
     Observations                        208                        208                   208                          208
                                                                Panel B: Second Experiment
     API Original                       -0.188                    -0.124                 0.167                        -0.034
                                       [0.049]                    [0.176]               [0.015]                      [0.684]
                                       {0.058}                   {0.197}                {0.015}                      {0.630}
                                       (0.067)                   (0.205)                (0.021)                      (0.704)

     API Plus                           0.217                      0.087                        0.353                 0.359
                                       [0.034]                    [0.344]                      [0.001]               [0.001]
                                       {0.037}                    {0.247}                      {0.001}               {0.001}
                                       (0.055)                    (0.388)                      (0.001)               (0.002)

     API Original = API Plus           [0.001]                    [0.056]                      [0.029]               [0.001]
                                       {0.001}                    {0.056}                      {0.158}               {0.001}
                                       (0.002)                    (0.036)                      (0.036)               (0.001)

     Number of clusters                 224                         224                           224                  224
     Observations                       1045                        1045                         1045                  1045
 Notes : This table shows OLS estimates and the associated p-values on survey-based measures of parental behavior measured after
two years of exposure to the API program. Panel A refers to the ﬁrst experiment run by the government. Panel B refers to the
second experiment designed and implemented by the authors in collaboration with the government. For detailed descriptions of the
individual components of the summary measures of parental engagement used in this table, see Appendix A.2. p-values reported
in brackets refer to the conventional asymptotic inference. p-values reported in braces are computed using randomization inference
(randomization-t). p-values reported in parentheses are adjusted for testing each null hypothesis (null impact of API Original, API
Plus, and the comparison) for the two diﬀerent families of outcomes through the stepwise procedure described in Romano and Wolf
(2005a,b, 2016). All p-values account for clustering at the school level.



(see also Table 5).10

Overall, the results presented in these two sections show that the API intervention had
diﬀerential impacts according to the training received by the mentors. While the Original
modality does not significantly boost any of the outcomes of interest across two independently
run field experiments, the Plus modality is shown to generate sizable average effects
on children’s cognitive and socio-emotional scores, on schooling attainment, as well as on
parental engagement toward their children’s education. In the next section, we leverage detailed
survey information to provide direct evidence on the possible channels through which
the API Plus program enhances children’s outcomes.

10
  We also estimate the impacts of both the Original and Plus modalities for each of the individual measures
of the parental behavior collected in the survey that have been aggregated in the summary measures displayed
in Table 3. Table B-9 reports the results, which are broadly comparable to the estimates discussed in the
text. They show large and significant effects for the Plus modality on food donations to the instructors,
the management of the school resources, help with homework, enrolling their children in extra-curricular
activities, expecting their children to complete secondary education or more, and meeting periodically with the
instructor.


3.3     Plus vs. Original : Channels

We start by evaluating the role of the remedial education sessions. The estimates displayed in
Table B-10 suggest that there is no differential effect across the four children’s outcomes
(p-values = 0.766, 0.675, 0.639, and 0.937) in the relative impact of the two training modalities
between children who are more or less likely to be eligible for the remedial sessions (see also
Figure B-2). Although the design of the second experiment does not allow us to
isolate the direct effect of the remedial education sessions within each API modality, this
evidence suggests that such a mediating factor is unlikely to explain the differential impact
between the Plus and the Original modalities documented in Table 2.11

We next consider the role of the pedagogical practices of the community instructors. Table
B-11 reports estimates of the eﬀect of the two API modalities using data at the instructor-
school level (the sample average number of instructors per school is 1.2 in the school year prior
to the start of the second experiment) on four summary measures of pedagogical practices
based on GLS-weighted indices across an array of instructor-student interactions (for details,
see Appendix A.2). The results show erratic patterns of positive and negative signs with
no statistically signiﬁcant eﬀects of either API modality. The overall index of pedagogical
practices reveals a non-negligible negative eﬀect of 0.18 standard deviations for the Plus
modality, indicating, if anything, a crowding-out eﬀect of the presence of the mentors on
instructors’ job eﬀort. The eﬀects are quantitatively and statistically similar between API
modalities across the diﬀerent pedagogical practices, as shown in the third row of Table B-11.

Finally, we study the role of the mentor/parent interactions during the home visits as a
potential mechanism behind the large and positive eﬀect of the Plus modality. Panel A in
Table 4 displays the estimated diﬀerences across the two API modalities on selected survey
variables when 591 parents were asked about the frequency and content of their interactions
with the mentors over a period of two months prior to the survey. The number of observations
varies across the columns in Panel A due to some of the parents not responding to the survey
questions. Missing values for each outcome are balanced with respect to the assignment of
the Plus modality (p-values = 0.746, 0.183, 0.442, 0.517, 0.539, and 0.575). In spite of quite
noisy estimates due to the sample attrition and the reduced sample size (parents in the
control group cannot be part of this analysis by design), the evidence does show a systematic
pattern. Over a two-month period, mentors in the Plus modality met one time more with
parents at school and 0.7 times more at home compared to those in the Original modality
(sample means in the Original group are five and three, respectively). The GLS-weighted
index shown in the third column documents that the quantity of parent-mentor interactions
increased by 0.36 standard deviations under the Plus modality, which is significant at the
10 percent level. Columns 4 and 5 of Panel A show marginally significant estimates on
two measures of the quality of the interactions between parents and the mentors: (i) an
indicator variable for whether the mentors have informed parents about their children’s
learning difficulties, and (ii) an indicator for whether the mentors provide concrete advice to
the parent on how to tackle these difficulties. The effect sizes are large for both outcomes, implying a
14 percent increase in the probability of informing parents relative to the respective sample
mean in the Original group (70 percent). The estimated coefficient for the GLS-weighted
quality index is 0.25 standard deviations, which is significant at the 90–95 percent level
depending on the inference procedure.

11
  The correlation between the school-level rankings as implied by the average diagnostic test and the math
and reading scores is 0.51 and 0.52, respectively. In the absence of randomization across the different
components of the intervention within each modality, the direct effect of the remedial education sessions
cannot be separately identified from heterogeneous treatment effects by academic achievement.

Panel B in Table 4 shows the eﬀect of the Plus modality on diﬀerent competencies, or
“parenting style,” that the mentors report to have promoted during their encounters with
parents. This information was collected during a follow-up interview at the end of the
ﬁeld experiment. Of a total of 126 mentors between the Original and Plus modalities,
enumerators were able to interview 107 of them. The attrition of survey participation of
mentors is unrelated to the treatment assignment (p-value = 0.514). For further details on
the survey of mentors, please refer to Appendix A.2. Mentors with enhanced training are
more inclined to foster attitudes that are centered on educative parenting styles, such as
communicating with the child (ﬁrst column), as well as learning activities (second column).
The overall educative style GLS-weighted index (third column) shows a sizable and signiﬁcant
eﬀect (across the three inference procedures) of the Plus modality, with an increase of 0.49
standard deviations in the promotion of educative parenting styles to parents during the
home visits. Other aspects of the parent-child relationship that are focused on emotional
practices do not seem to systematically vary across the two program modalities.

The evidence presented in this section points toward cross-modality variation in the quality of
both the parent/mentor interactions and parent/child interactions as a potential mechanism
behind the observed diﬀerence in children’s outcomes. While we cannot separately quantify

     Table 4: The Role of Mentors in Fostering Parental Attitudes—Second Experiment
                       Panel A: Parents and Mentors Interactions (as reported by the parents)
                             Quantity (Last 60 Days)                               Quality
                        Meetings          Visits      Index        Inform            Advise                             Index
                                                               About Child        About Child
API Plus                  1.039           0.726       0.362         0.102             0.100                             0.251
                         [0.147]         [0.125]     [0.062]       [0.057]           [0.034]                           [0.040]
                        {0.194}          {0.171}     {0.094}      {0.097}           {0.056}                            {0.070}
                        (0.194)          (0.194)     (0.100)       (0.078)           (0.078)                           (0.078)

Observations               482                 491            504              354                   353                  357
Clusters                   123                 124            124              113                   112                  113

                         Panel B: Parenting Styles that Are Promoted by the Mentors (as reported by the mentors)
                               Educative Style                                    Emotional Style
                    Communication     Learning       Index        Share      Self-Knowledge        Manage         Index
                                                                Feelings                         Transitions
API Plus                0.178           0.168        0.494        0.049            0.030            0.142         0.194
                       [0.038]         [0.077]      [0.018]      [0.627]          [0.756]          [0.123]       [0.312]
                       {0.043}         {0.091}      {0.029}     {0.635}          {0.753}           {0.134}      {0.321}
                       (0.074)         (0.075)      (0.043)     (0.843)           (0.843)          (0.308)      (0.558)

Observations               107                 107            107              107                   107                  107             107
 Notes : This table shows OLS estimates and the associated p-values of the API Plus modality on survey-based measures of interactions between
parents and mentors (Panel A) and the different parenting styles that are promoted by the mentors during their interactions with the parents (Panel B).
For a detailed description of the outcome variables used in this table, see Appendix A.2. p-values reported in brackets refer to the conventional
asymptotic inference. p-values reported in braces are computed using randomization inference (randomization-t). All p-values account for
clustering at the school level. p-values reported in parentheses are adjusted for testing the eﬀect of API Plus for the diﬀerent families of
outcomes (quantity and quality of interactions, parenting styles) through the stepwise procedure described in Romano and Wolf (2005a,b,
2016).



the relative contribution of each additional training module, the increase in the quality of the
home visits is likely to originate from the mentors’ peer-to-peer sessions, which had the exact
role of helping the mentors with enhanced training to communicate more eﬀectively with
parents. As mentioned in Section 2.2, these workshops enable interactions and information
sharing among the participants, while the extra week of initial training is instead focused on
pedagogical practices targeted to children at school. Qualitative evidence seems indeed to
corroborate this hypothesis, as summarized by the following quotes from mentors who have
participated in the peer-to-peer meetings (see Appendix A.3 for more details):

            • “During the workshops I was told that I should be able to adapt to the context
                of the community and understand the local living arrangements in order to
                establish a dialog with the parents without modifying what they conceive as
                their environment.”

            • “It was recommended that we pay frequent home visits so as to establish a
                relationship with the parents and gain their trust.”

            • “[The workshops] exposed us to effective strategies of other mentors [for
                dealing with parents] that we could try and implement in our community.”



4    Threats to Scalability

Over the summer of 2016, after learning about the results of the second experiment, the
government decided to replace the Original program with the enhanced training modality.
All its primary schools, including the 224 schools that were part of the evaluation sample,
were deemed eligible to receive the Plus program modality. The overall scale of the operation
of the mentoring intervention—including the total number of mentors that participated in
the program—remained constant in the periods before, during, and after the experiments.
This single policy change creates two interesting circumstances that are informative for our
case study on scaling. On the one hand, schools that were part of our second experiment
experienced a change in the situation—from the research setting to the government imple-
mentation. On the other hand, the rest of the schools, that were not part of the experiment
but received the mentoring program under the Original modality, underwent a reform in
program design within the same government situation. In this section and the next one,
we focus on the sample of the experimental schools in order to zoom into the threats and
mechanisms of scaling brought about by the new situation. In Section 6, we discuss policy
impacts on education outcomes for experimental schools as well as for the overall population
of schools in the state of Chiapas.

To study the scale-up problem in our context, we analyze three key aspects outlined in
Al-Ubaydli et al. (2020), namely inference, representativeness of the population, and repre-
sentativeness of the situation. Although it will not be part of our discussion, we also want
to mention additional features of the experimental design that may speak to other threats
to scaling. First, the relatively large units of randomization (school community) are robust
to local general equilibrium/spillover eﬀects, which are often relevant in ﬁeld experiments
that are implemented at scale. Second, our ﬁeld experiment has been implemented by the
research team in close collaboration with the government agency that was later in charge of
its policy implementation. Hence, the design of the two program modalities bears in mind
the supply-side considerations of scaling, as well as various ﬁnancial and local institutional
constraints.




         Table 5: Joint Test of Significance Within and Across Experiments (p-values)
                                 First Experiment   Second Experiment   Both Experiments
     API Original = Control           0.828               0.411              0.707
     API Plus = Control                 .                 0.001                .
     API Original = API Plus          0.114               0.001              0.002
  Notes : This table reports randomization-inference (Randomization-t) p-values for the omnibus test of
 overall experimental signiﬁcance of each separate hypothesis (Young, 2019). An asymptotic p-value is
 reported for the hypothesis that API Original = API Plus in the ﬁrst column, which is tested across
 experiments. All p-values account for clustering at the school level.


4.1      Inference

Inference problems arise when researchers and practitioners want to learn to what extent
existing evidence advocates for policy decisions. We focus on whether (i) the lack of eﬀec-
tiveness of the Original modality is indicative of a null result, and (ii) the large impact of
the Plus modality on schooling outcomes for children and engagement outcomes for parents
is a false positive. We jointly consider two key outcomes (the overall index of student
achievement and the overall index of parental engagement) and compute p-values of overall
statistical signiﬁcance (Westfall and Young, 1993). Following the insights in Maniadis et al.
(2014), we harness the value of the two experiments to bolster the credibility of our empirical
evidence. We test hypotheses across experiments using Fisher’s combined probability test,
which is akin to the joint statistical signiﬁcance test usually invoked in meta-analyses.12
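
As an illustration of the combination step, the sketch below implements Fisher's combined probability test in plain Python. The function name is hypothetical, and the closed-form tail probability relies on the chi-squared distribution having an even number of degrees of freedom (2k), which always holds for Fisher's statistic. Using the two p-values reported in the first row of Table 5 as inputs, it returns the value shown in the "Both Experiments" column for that row.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(log p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the joint null, where k = len(p_values).
    For even degrees of freedom the survival function has the closed form
    exp(-X/2) * sum_{j<k} (X/2)^j / j!, so no external library is needed.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    tail = sum(half_stat ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_stat) * tail


# First row of Table 5 (API Original = Control): 0.828 and 0.411.
combined = fisher_combined_pvalue([0.828, 0.411])
print(round(combined, 3))  # 0.707
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the formula.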

Table 5 shows the results. Consistently within and across experiments, the Original modality
does not generate actionable evidence (and yet, the government implemented this program
modality at scale). The evidence displayed in the first row documents a lack of significance of
this variant of the mentoring program on children’s achievements and parental investment.
The results in the second and third rows of Table 5 document a highly signiﬁcant impact of
the mentoring program under the Plus modality, both when compared to the control group
with no mentors (p-value = 0.001) and the Original modality. The relatively noisier estimates
of the ﬁrst experiment do not allow us to reject the hypothesis of equal treatment eﬀects
between the two modalities when tested across experiments (p-value = 0.114). This result
highlights the importance of the design of the second experiment, in which we replicate the
12 The Westfall-Young procedure uses the joint distribution of p-values across all equations so as to minimize
the loss of power brought about by the multiple testing adjustment within a given experiment. Combined
p-values across experiments are obtained using Fisher’s formula: $-2 \sum_{i=1}^{k} \log(p_i) \sim \chi^2_{2k}$, where $p_i \sim U[0, 1]$
is the p-value for the ith hypothesis test and k = 2 is the number of independent experiments being combined.


Original modality along with the new Plus modality. In the second experiment, we strongly
reject the hypothesis that the Plus modality is as effective as the Original modality (p-value = 0.001).
A very similar result holds through a combined probability test across both experiments
(p-value = 0.002), which is reported in the third column of Table 5.

The joint inference drawn from the two experiments seems to convincingly point toward the
relative eﬀectiveness of the Plus modality when compared to both the Original modality and
the control group with no mentors. Given that the policy reform under study is a change
from the Original to the Plus modality, we discard the ﬁrst experiment in the rest of this
section and focus our analysis on the schools that participated in the second experiment.


4.2    Representativeness of the Population

Research ﬁndings from ﬁeld experiments may sometimes be diﬃcult to generalize because,
in the language of Al-Ubaydli et al. (2020), the properties of the study population may diﬀer
from the population of interest to policy makers. Heckman (1992) discusses selection into
ﬁeld experiments and ﬁnds that the characteristics of subjects who participate can be dis-
tinctly diﬀerent from those of subjects who do not participate. In Table 6 we compare means
in observable characteristics between our experimental sample and the overall population of
schools in the state of Chiapas. The descriptive statistics for the sample of experimental
schools are shown in column two, and they appear remarkably balanced when compared
to the respective statistics in the overall population that are displayed in the ﬁrst column.
As shown in the third column, we cannot reject equal means across the several variables
assessed. There is a very small imbalance in the number of local instructors, which is only
marginally signiﬁcant.

We next study whether the average impacts of the Plus modality in the second experiment
have ex-ante external validity with respect to the impact of the mentoring intervention at
scale in the broader population of schools in Chiapas. To do this, we evaluate whether
program impacts vary along the program eligibility criteria used by the government during
the policy implementation (see Section 2.1). The idea behind this exercise is that any
variation in treatment eﬀects along those dimensions may be indicative of the extent to
which program eﬀects may change because of the underlying diﬀerences across populations.
Table 7 displays heterogeneous treatment eﬀects of both Original and Plus modalities along
two criteria that are time invariant and hence plausibly unaﬀected by the intervention:
whether the community where the school is located is categorized as having high or very

                                   Table 6: Diﬀerences Across Populations
                                          All Chiapas      Second Experiment Chiapas vs. Second Experiment
                                          Mean (SD)            Mean (SD)                     p-value
                                                            Panel A: Community Characteristics
     Number of households                     34.625              29.329                      0.486
                                            (109.863)            (50.234)
     Total population                        140.293             121.389                      0.494
                                            (394.371)           (240.562)
     Share economically active                0.298                0.303                      0.361
                                             (0.073)              (0.070)
     Water connection (Y/N)                   0.033                0.023                      0.454
                                             (0.178)              (0.151)
     Sewer system (Y/N)                       0.018                0.009                      0.346
                                             (0.134)              (0.096)
     Share of illiterates                     0.268                0.270                      0.832
                                             (0.175)              (0.167)
     Share of dwellings with dirt ﬂoor        0.328                0.363                      0.126
                                             (0.319)              (0.322)
     Garbage collection (Y/N)                 0.025                0.023                      0.842
                                             (0.158)              (0.151)
                                                               Panel B: School Characteristics
     Average test score (Spanish) 2010      425.173              431.340                      0.158
                                            (57.245)             (60.810)
     Average test score (Math)              415.998              421.333                      0.363
                                            (76.967)             (80.895)
     Number of students                      14.023               14.770                      0.205
                                             (8.403)              (7.069)
     Number of local instructors              1.216                1.279                      0.054
                                             (0.449)              (0.514)
     Share students over-age                  3.125                3.620                      0.264
                                             (6.285)              (5.629)

     Observations                             1,475                 230
 Notes : Means and standard deviations in parentheses for various characteristics collected before the introduction of the
API program. The last column shows asymptotic p-values for mean diﬀerences between the overall population and the
experimental sample. Panel A shows community-level characteristics from the population census (2010), whereas Panel B
displays school-level variables from the school census (2010). See Appendix A.1 for more details on the data sources.



high “marginality,” as deﬁned according to the National Population Council or CONAPO
(Poverty 1), and whether the community was targeted by an anti-poverty program (Poverty
2).13 In the sample of schools in the second experiment, approximately one-third satisfy the
Poverty 1 criterion, 70 percent satisfy the Poverty 2 criterion, and 25 percent satisfy both
criteria. We run separate regression models for three summary outcomes of the intervention
on both students and parents: the overall index of student achievement, the indicator for
enrollment in seventh grade, and the overall index of parental engagement. Estimation

13
  For details on the Poverty 1 index, refer to https://www.gob.mx/cms/uploads/attachment/file/685308/Nota_t_cnica_IML_2020.pdf, accessed in August 2022.


         Table 7: Heterogeneity in the Impact of the Program by Eligibility Criteria
                                                      Children’s Outcomes                      Parental Outcome
                                              Overall Score    Enrolled Secondary              Engagement Index
API Original                                     0.088                 0.134                         -0.018
                                                [0.419]              [0.205]                         [0.887]
API Original× Poverty 1                          0.090                -0.104                         -0.046
                                                [0.637]              [0.420]                         [0.816]
API Original× Poverty 2                          0.091                -0.033                         -0.009
                                                [0.490]              [0.767]                         [0.949]

API Plus                                           0.320                   0.273                        0.464
                                                  [0.047]                 [0.024]                      [0.000]
API Plus× Poverty 1                                0.094                   0.015                        0.100
                                                  [0.638]                 [0.902]                      [0.643]
API Plus× Poverty 2                                0.010                  -0.216                       -0.173
                                                  [0.963]                 [0.128]                      [0.386]

Original(Pov. 1)=Original(Pov. 2)                 [0.995]                 [0.681]                      [0.873]
Plus(Pov. 1)=Plus(Pov. 2)                         [0.816]                 [0.297]                      [0.444]

Number of clusters                                 224                      182                         224
Observations                                       1045                     468                         1045
 Notes : This table shows OLS estimates and the associated p-values (in brackets) on student and parental outcomes
measured after two academic years of exposure to the API program under the second experiment designed and implemented
by the authors in collaboration with the government. For a detailed description of the test scores used in this
table, see Appendix A.2. The dependent variables in the ﬁrst and third columns are standardized with respect to their
means and the standard deviations in the control group. The dependent variable in the second column is computed
from administrative school records (see Appendix A.1). All p-values account for clustering at the school level.


results reveal limited variation in program impacts along both poverty measures, with eﬀect
sizes for the interaction terms with the indicator variables for the program modalities that
are not statistically diﬀerent from zero, and not statistically diﬀerent from each other.
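
For concreteness, one specification consistent with the rows reported in Table 7 can be written as follows. The notation, the inclusion of main effects for the two poverty indicators, and the absence of further controls are assumptions made for illustration, not details taken from the estimation itself. Here i indexes students or parents and s indexes schools.

```latex
y_{is} = \alpha + \beta^{O} \, \text{Original}_s + \beta^{P} \, \text{Plus}_s
       + \sum_{k=1}^{2} \left( \delta^{O}_{k} \, \text{Original}_s \times \text{Poverty}^{k}_{s}
       + \delta^{P}_{k} \, \text{Plus}_s \times \text{Poverty}^{k}_{s}
       + \lambda_{k} \, \text{Poverty}^{k}_{s} \right) + \varepsilon_{is}
```

In this notation, the last two rows of Table 7 would correspond to tests of $\delta^{O}_{1} = \delta^{O}_{2}$ and $\delta^{P}_{1} = \delta^{P}_{2}$, with standard errors clustered at the school level.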

Taken together, the evidence shown in this section documents that diﬀerences across popu-
lations (if any) are unlikely to represent a meaningful threat to scalability in this context.
There is a very high degree of similarity in observable characteristics between the experimen-
tal sample and the overall population of schools in Chiapas. Program impacts are also not
heterogeneous along the determinants of the rollout of the program under the government
implementation, which is indicative of the fact that the community/school targeting process
is unlikely to play a role for the scalability of the program.




4.3       Possible Threats from the New Situation

Notwithstanding sizable and signiﬁcant eﬀects on a sample that is representative of the
population of interest, the success of the mentoring intervention at scale is not guaranteed.
The implementation protocol in our research setting, where we had an
active role in guaranteeing the smooth progress of the field experiment, differed from the government
operations, and this difference can lead to divergent program outcomes. For example, the criteria
for closing schools were very diﬀerent in the two situations. Although the oﬃcial enrollment
threshold for closing schools is six students, schools in the second experiment were allowed
to remain open if they had at least three enrolled students in either of the two school years
when the experiment took place. In addition, children in schools with more than 29 enrolled
students were required to transfer to schools in the regular public school system. As a result,
only two schools closed in the sample of 230 schools in the second experiment (see Section
2.2).14

Hence, a threat to the success of the program in the new policy situation may come from
the possible school closures as a result of the mentoring intervention under the government
implementation. If instead the presence of a mentor increases the probability that schools
remain open, this channel offers an opportunity to learn about the
mechanisms behind the scalability of the program in our context. Figure 1 shows the relationship
between school closures, as measured by the year-to-year presence in the school
census two years after the second experiment, and school size, as measured by the number of
enrolled students in the school year before the second experiment. The green (lighter) line
shows how the probability of school closures varies by school size for schools that did not
receive the program during the government intervention. The probability of school closure
is bimodal, and its two modes are positioned around the two critical enrollment thresholds.

14
  While school closures are obviously a “non-negotiable” aspect (List, 2022) for the success of the API
program, there may be other “negotiable” diﬀerences in the program implementation across the experimental
and the policy settings that we cannot directly study due to a lack of monitoring data outside of the
experimental sample/period. First, to avoid refusal of the assigned mentor among the communities of
the evaluation schools, each mentor in the experimental sample was provided with two baskets of food,
throughout the school year, as donations to the community leaders as well as for personal consumption.
Second, as a way to attenuate the potentially detrimental consequences of mentors’ dropping out of the
program during the evaluation period, the government delegates in Chiapas arranged for a replacement
within two weeks from the day of a mentor’s departure from a community. If the dropout was part of the
Plus group, the replacement would receive an additional three-day training session that would make up for
the content covered during the extra week of the initial training session. Third, there might be diﬀerences
between the experiment and the policy rollout in the implementation of the training module of the Plus
modality, such as the number of extra days and the content of the curriculum of the initial training as well
as the frequency of the peer-to-peer sessions.


                            Figure 1: School Closures and School Size




 Notes : The ﬁgure shows a histogram with the distribution of the size of the 224 schools that participated
 in the second experiment (as measured by the number of enrolled students during the ﬁrst two years of the
 government implementation). Overlaid on the histogram are kernel-weighted local mean estimates
 of the relationship between school size and the probability of closure, separately for schools that receive (blue, darker line) and do
 not receive a mentor (green, lighter line) during the first two years of the government implementation of
 the Plus modality. The red vertical dashed lines represent the statutory enrollment thresholds for school
 closures in rural Chiapas.


School closures also occur among medium-sized schools, with between six and 29 students. In
total, twelve of the 122 schools without mentors (about 10 percent) closed during the first two years
of the government implementation of the program under the Plus modality. Over the same
time period, none of the 102 schools that received a mentor during the government imple-
mentation closed (blue, darker line).



5     Pathways to Scale

What plausible mechanism can explain the strong and positive correlation between the men-
tors’ presence and the probability that schools remain open? Parents organize local associ-
ations aimed at promoting community education, to which they contribute by maintaining
the school’s facilities and distributing school materials. The parents’ association also plays
a role in the decision to keep the school open as well as whether to require children enrolled


in schools with more than 29 students to transfer to schools that are part of the regular
public school system. Given the evidence on the eﬀect of the mentoring intervention on
parental engagement (see Section 3), in this section we analyze the role of parents, and more
broadly, of the community-level parental engagement, as a potential mechanism behind the
scalability of the program. We ﬁrst lay out a simple model of skill formation and parental
investment, where the individual incentives for parents to invest in educational activities
are jointly determined at the community level. The model provides us with a framework
to study the threat to scalability from changes in situations. We then document empirical
evidence that is consistent with the key model predictions and that is diﬃcult to reconcile
with alternative, more direct, channels of inﬂuence of the mentors on school closures.


5.1     Theory

There are various local communities (c), each composed of a number of families Nc . Each
family i ∈ {1, . . . , Nc } decides whether to engage in parenting, Ii ∈ {0, 1}. The returns to
parenting are deﬁned by the following technology of skill formation:
                                                                                      1
                                                                                      φ
                                                                                  φ
                                                                 1
(1)                          θi = θi,0 + A · Iiφ +                          Ij        ,
                                                            Nc   −1   j =i


                                            1
where θi is a child’s skills, while      N c −1   j =i Ij   represents the average parental engagement
among other parents in the same community. The parameter φ characterizes the degree
of substitutability between individual parental engagement and community-level parental
engagement in the process of child development. Parameter A is total factor productivity,
while θi,0 represents the child’s initial skills.15

We model parents as paternalistic over their children’s skills, and we assume that there is
a cost of parenting µ(Ii), so that the parental utility function is Ui = −µ(Ii) + θi. The
parental decision problem is deﬁned as follows:
                                                                                                           1
                                                                                                           φ
                                                                                                       φ
                                                                                1
               max Ui (Ii , I−i ) := − µ(Ii ) + θi,0 + A · Iiφ +                                 Ij        .
              Ii ∈{0,1}                                                       Nc − 1       j =i




15
  We omit in equation (1) the shares of the parental investment since the total factor productivity term, A,
allows us to appropriately re-scale the constant elasticity of substitution function without loss of generality.


In this framework, the incentives for parents to engage with the education of their children
depend upon the community-average parental engagement as well as on the productivity of
those investments. An equilibrium in this economy is defined as the optimal choices of families
($I_i^*$) that are consistent with the endogenously determined community-level engagement:
$\{ I_i^*(I_{-i}^*) \}_{i=1}^{N_c}$.


Proposition 1 This economy exhibits two types of equilibria:

  1. Free Riding Equilibrium: parents free ride on each other, and in equilibrium there is
       no community engagement (Ii∗ = 0 ∀i).

  2. Collaborative Equilibrium: all parents in the community are engaged in the process of
       children’s learning (Ii∗ = 1 ∀i).
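
To make the two equilibrium types concrete, the following sketch checks by brute force whether the all-disengaged and all-engaged profiles are Nash equilibria under the technology A · [Ii^φ + (average engagement of others)^φ]^(1/φ). Several elements are assumptions introduced only for illustration: the function names, a cost of parenting equal to a constant `cost` when a parent engages and zero otherwise, the restriction φ > 0, and the parameter values in the usage lines. It is not the calibration behind the model's figures.

```python
def utility(own, others_avg, A, phi, cost):
    """Parental payoff, net of the constant initial skill theta_{i,0}.

    Assumes phi > 0 and a cost of `cost` paid only when own == 1.
    """
    skill_gain = A * (own ** phi + others_avg ** phi) ** (1.0 / phi)
    return skill_gain - cost * own


def is_equilibrium(profile, A, phi, cost):
    """Return True if no parent gains by unilaterally switching her choice."""
    n = len(profile)
    total = sum(profile)
    for own in profile:
        others_avg = (total - own) / (n - 1)  # average engagement of the others
        deviation = 1 - own
        stay = utility(own, others_avg, A, phi, cost)
        switch = utility(deviation, others_avg, A, phi, cost)
        if switch > stay + 1e-12:
            return False
    return True


free_riding = [0] * 10
collaborative = [1] * 10

# Low productivity: only the free-riding profile is self-enforcing.
print(is_equilibrium(free_riding, A=1.0, phi=1.0, cost=2.0))    # True
print(is_equilibrium(collaborative, A=1.0, phi=1.0, cost=2.0))  # False

# Higher productivity: only the collaborative profile is self-enforcing.
print(is_equilibrium(free_riding, A=2.5, phi=1.0, cost=2.0))    # False
print(is_equilibrium(collaborative, A=2.5, phi=1.0, cost=2.0))  # True
```

Under these illustrative values, raising A from 1 to 2.5 moves the community from the free-riding outcome to the collaborative one, which is the kind of shift an intervention that raises the productivity of parental investment could induce.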


We introduce in this framework an educational intervention τ ∈ {0, 1}—such as the API
program—that directly aﬀects both the process of skill formation of children and the incen-
tives of parents as follows:

(2)                             θi,0 (τ ; γ0 ) = θi,0 · (1 − τ ) + (θi,0 + γ0 ) · τ
(3)                              A(τ ; γ1 ) = 1 · (1 − τ ) + γ1 · τ.

The impact of the intervention not only hinges on the direct eﬀect on children’s skill γ0 , but
also on its ability to shift the incentives of parents through higher productivity of investment
(γ1 ), which in turn aﬀects the community-level parental engagement. The total eﬀect on
children’s outcomes ∆θi = θi (τ = 1) − θi (τ = 0) can be written as follows:

       $\Delta\theta_i = \begin{cases} \gamma_0 & \text{in the Free Riding Equilibrium} \\ \gamma_0 + \gamma_1 & \text{if the intervention induces a new Collaborative Equilibrium.} \end{cases}$


We deﬁne a threat to scalability in this framework as a deterioration of the direct impact of
the program. More precisely, for two given situations s and s′ (such as the field experiment
and the government implementation in our context), we posit that γ0(s′) < γ0(s). The
extent to which this translates into a deterioration of the overall eﬀectiveness of the program
depends on the endogenous response of parents. In one scenario, the change in the situation
induces the program to fail at scale because of the lack of parental responses. In the second
scenario, the program is able to promote coordination and engagement of parents in the local

                            Figure 2: Pathways To Scale

 [Figure: minimum treatment effect for scalability (γ1, vertical axis, from 1 to 2) plotted against the incentive to free ride (φ, horizontal axis, from 0.3 to 1).]


 Notes : The ﬁgure shows the relationship between the incentive to free ride among parents (φ) and the
 program’s impact on the productivity of parental investment as predicted by a calibrated version of the
 model (µ(Ii ) = −2 · Ii , Nc = 1000).


community, therefore possibly oﬀsetting the threat to scalability with respect to changes in
the situations.

The parameter φ, which captures the degree of complementarity between parents in the
technology of skill formation of children, lies at the core of the equilibrium selection in
the model (Free Riding vs. Collaborative, see Proposition 1). This parameter pins down
the incentives for parental cooperation in the local educational activities. Low values of φ
represent a high incentive for cooperation, while positive levels of φ induce free-riding among
parents. Figure 2 shows how the impact of a program at scale across diﬀerent situations
depends upon the degree of parental collaboration (φ) and its ability to trigger local spillover
eﬀects. The ﬁgure shows that in economies with a high degree of parental collaboration (small
φ), a relatively small impact on the productivity of investment (γ1 ) can induce a shift toward
the Collaborative Equilibrium. Economies with a lower degree of parental cooperation (high
φ) need larger impacts of the program on the productivity of parental investment to trigger
the social determination of human capital investment at the community level.
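
The comparative static described above can be illustrated with a short calculation. Assume, purely for illustration, that engaging costs a constant `cost` and that φ > 0. When all other parents engage, a parent who also engages obtains A · 2^(1/φ) − cost, while one who does not obtains A, so the collaborative profile is self-enforcing once A ≥ cost / (2^(1/φ) − 1). The function name and the cost value of 2 are hypothetical, and the sketch is not claimed to reproduce Figure 2; it only shows that the required productivity rises with φ.

```python
def min_productivity_for_collaboration(phi, cost):
    """Smallest A at which engaging is a best response when all others engage.

    With everyone else engaged, engaging yields A * 2**(1/phi) - cost and
    not engaging yields A, so the condition is A * (2**(1/phi) - 1) >= cost.
    """
    if phi <= 0:
        raise ValueError("this sketch assumes phi > 0")
    return cost / (2.0 ** (1.0 / phi) - 1.0)


grid = [0.3, 0.5, 0.7, 0.9, 1.0]
thresholds = [min_productivity_for_collaboration(phi, cost=2.0) for phi in grid]
for phi, a_min in zip(grid, thresholds):
    print(f"phi = {phi:.1f}: minimum A = {a_min:.3f}")
```

Since A equals γ1 under the intervention, the threshold can be read as the smallest productivity impact that sustains collaboration under these assumptions.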




5.2     Evidence on Parents as Means of Scalability

We test the model’s key prediction using variations across situations (the ﬁrst experiment
was run by the government, while the second experiment was run by the research team) and
across program modalities (API Original and API Plus ). In particular, we study whether
a diﬀerent response of parents in terms of their engagement both at the local school and
with their children (see Table 3) subsequently triggers a diﬀerential impact of the mentoring
intervention on school closures. School closures, which in our context represent a major
disruption to the program’s continuity and eﬀectiveness, should be interpreted as an outcome
of the deterioration eﬀect of the program once it is implemented at scale.

The ﬁrst two columns of Table 8 show the reduced-form eﬀects of the two randomized
program modalities—in both the ﬁrst experiment (ﬁrst column) and the second experiment
(second column)—on the probability that schools close in the second year of the national
rollout of both programs. The Original modality displays small and noisy eﬀects on school
closures in both experiments, which are not statistically diﬀerent from zero. This result
suggests that situations that do not promote parental engagement do not diﬀer from the
status quo rates of school closures, which range between 5 percent (ﬁrst experiment) and 8
percent (second experiment) in the experimental control groups with no mentors.

The second column of Table 8 shows that the Plus modality, which significantly boosts parental
engagement (see Table 3), has a significant impact on school closures. Schools that were
assigned to the enhanced modality during the second experiment are less likely to close
permanently (−8.3 percentage points) two years after the Plus modality was adopted by the
government, an effect that is statistically different from zero at the 95 percent confidence
level. This result echoes the previous evidence, shown in Figure 1, on the relationship between
school closures and the receipt of a mentor during the government implementation of the Plus
modality. Given that the probability of receiving a mentor
during the government implementation is orthogonal with respect to the randomized API
assignment of the second experiment, this evidence rules out channels other than parental
engagement through which the presence of the mentors can keep the schools open.16

The IV estimates shown in the third column of Table 8 go a step further and quantify the
extent to which parental engagement aﬀects the probability of school closures. Because of
16 Approximately half of the schools in any of the treatment arms and the control group of the second
experiment received a mentor by the second year of the national rollout of the Plus modality. This share
is balanced across treatment arms after controlling for the program eligibility criteria (see Section 2.1):
p-value(Original) = 0.367, p-value(Plus) = 0.660.


                         Table 8: School Closures and Parental Engagement

                                               Outcome: School Closures
                                 First Experiment   Second Experiment   Second Experiment, IV
API Original                          0.031              -0.031                -0.031
                                     [0.549]             [0.396]               [0.410]

API Plus                                                 -0.083
                                                         [0.030]

Overall Parental Engagement                                                    -0.217
                                                                               [0.021]

Observations                            80                 224                  1045
Clusters                                 .                  .                    224
F-Stat (Excl. Instruments)                                                     13.833
 Notes: This table reports the estimates for the reduced-form effects of the API modalities during the two experiments
(columns 1 and 2) on the probability of school closures, as well as the instrumental variable estimates of the impact of
parental engagement on school closures (column 3). In the third column, the randomized API Plus modality during the
second experiment is used as an instrumental variable, while the randomized API Original modality is included as a
control variable. The dependent variable is an indicator variable for whether the school is closed in the fall of 2014
(column 1) or in the fall of 2018 (columns 2 and 3). The variable “Overall Parental Engagement” is the same variable
used in the last column of Table 3. p-values reported in brackets account for clustering at the school level.


the contextual information on the role of the parental association in deciding school closures
discussed previously in this section, we posit that parents are the main channel through which
the Plus modality of the API program aﬀects school closures. The diﬀerential impacts of
the two program modalities on both parental investment and school closures shown in the
ﬁrst two columns of Table 8 are consistent with this exclusion restriction. We ﬁnd that an
increase of half a standard deviation in the overall parental engagement index is causally
associated with a reduction of 11 percentage points in the probability that their children
experience a school closure. This eﬀect is both statistically and quantitatively signiﬁcant.
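
The arithmetic behind this statement, and the link between the reduced-form and IV columns of Table 8, can be sketched as follows. This is a minimal illustration on simulated data, not the estimation code used in the paper: the data-generating values, the variable names, and the helper functions `half_sd_effect`, `ols`, `wald_iv`, and `simulate` are all assumptions made for the example. It relies on a standard property of just-identified IV: with one instrument (the Plus assignment) and one endogenous regressor (parental engagement), the 2SLS coefficient equals the reduced-form coefficient divided by the first-stage coefficient, provided both use the same sample and controls.

```python
import numpy as np

IV_COEF = -0.217  # Table 8, column 3: effect of a 1 SD increase in engagement


def half_sd_effect(iv_coef=IV_COEF):
    """Change in the closure probability for a 0.5 SD increase in engagement."""
    return 0.5 * iv_coef


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def wald_iv(original, plus, engagement, closure):
    """Return (first stage, reduced form, 2SLS) coefficients tied to the Plus arm."""
    const = np.ones_like(plus)
    Z = np.column_stack([const, original, plus])
    first_stage = ols(Z, engagement)
    reduced_form = ols(Z, closure)
    fitted = Z @ first_stage  # engagement predicted from the random assignment
    second_stage = ols(np.column_stack([const, original, fitted]), closure)
    return first_stage[2], reduced_form[2], second_stage[2]


def simulate(n=3000, seed=0):
    """Hypothetical data: an unobserved factor drives engagement and closures."""
    rng = np.random.default_rng(seed)
    arm = rng.integers(0, 3, size=n)  # 0 = control, 1 = Original, 2 = Plus
    original = (arm == 1).astype(float)
    plus = (arm == 2).astype(float)
    unobserved = rng.normal(size=n)
    engagement = 0.4 * plus + 0.5 * unobserved + rng.normal(size=n)
    closure = 0.1 - 0.2 * engagement + 0.3 * unobserved
    closure = closure + rng.normal(scale=0.3, size=n)
    return original, plus, engagement, closure


fs, rf, iv = wald_iv(*simulate())
print("half-SD effect from Table 8:", round(half_sd_effect(), 4))
print("2SLS equals reduced form / first stage:", np.isclose(iv, rf / fs))
```

The simulated closure variable is continuous for simplicity, whereas the outcome in Table 8 is an indicator; the ratio identity does not depend on that choice.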

We complement these ﬁndings with qualitative evidence on the role of parents in ensuring
continuity in schooling activities (see Appendix A.3). As reported by the community instruc-
tors, parents may have more at stake in keeping the schools open as they invest in durable
goods for the local school:


          • “[Parents] help manage the school and contribute by improving the fencing,
             painting the walls, ﬁxing the toilets, as well as buying school materials.”
          • “[Parents] serve the needs of the school with construction works and they
             provide food to the local instructor.”

        Table 9: Parental Investment by Proxies of Community-Level Collaboration

                              Engage at School   Manage School   Engage With Child   Engagement Index
 API Plus × No Conflict            0.161             0.159             0.370              0.364
                                  [0.176]           [0.103]           [0.000]            [0.001]

 API Plus × Conflict               0.926             0.281             0.592              0.914
                                  [0.000]           [0.183]           [0.005]            [0.000]

 Conflict = No Conflict           [0.000]           [0.596]           [0.329]            [0.020]

 Mean Control (Conflict)          -0.006            -0.059             0.152              0.011
 Mean Control (No Conflict)       -0.023             0.013            -0.013              0.003

 Number of clusters                 224               224               224                224
 Observations                      1045              1045              1045               1045
 Notes: This table shows OLS estimates on parental outcomes measured after two academic years of exposure
to the API program under the second experiment designed and implemented by the authors in collaboration
with the government. The variable Conflict takes the value of one if at least one hostile event related to land
property, religion, elections, crime, or drug addiction is reported at the locality level in the population census
(2010). For detailed descriptions of the variables used in this table, see Appendices A.1–A.2. The dependent
variables are standardized with respect to their means and standard deviations in the control group.
Asymptotic p-values reported in brackets are clustered at the school level.


As reported by the mentors, parents follow up with their children on homework and other
pedagogical material whenever the mentor is busy attending to tasks outside of the community:


      “Parents used to provide support with homework whenever mentors are visiting
      other communities ensuring pedagogical support, so that upon the return of the
      mentors they are able to make progress in the schooling activities without set-
      backs.”


We next empirically investigate the second prediction of the model, as depicted in Figure 2
and discussed above. We take advantage of information from the 2010 locality-level census
on the degree of social hostility in the community (see Appendix A.1), which is based on
hostile events related to land property, religion, elections, crime, or drug addiction. This
measure proxies the degree of collaboration in the community. In the 224 communities that
form part of the second experiment, we construct an indicator variable for the presence of
a conﬂict in the community if at least one of these hostile events is reported. We then
interact this variable with the experimental API Plus assignment and regress the various
survey-based measures of parental engagement collected during the second experiment on
these interaction terms.
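
A regression of this kind can be sketched as follows. The snippet is a minimal illustration on simulated data, not the specification behind the paper's estimates: the inclusion of a conflict main effect, the omission of the Original arm, the placeholder effect sizes, and the helper names `interaction_design` and `ols_cluster` are assumptions. The covariance matrix uses the basic cluster-robust (CR0) sandwich form with schools as clusters, without small-sample corrections.

```python
import numpy as np


def interaction_design(plus, conflict):
    """Columns: constant, Conflict, Plus x No Conflict, Plus x Conflict."""
    plus = np.asarray(plus, dtype=float)
    conflict = np.asarray(conflict, dtype=float)
    return np.column_stack([
        np.ones_like(plus),
        conflict,
        plus * (1.0 - conflict),
        plus * conflict,
    ])


def ols_cluster(X, y, clusters):
    """OLS coefficients and a basic (CR0) cluster-robust covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        rows = clusters == g
        score = X[rows].T @ resid[rows]  # sum of x_i * e_i within the school
        meat += np.outer(score, score)
    return beta, XtX_inv @ meat @ XtX_inv


# Simulated example: 224 schools with 5 surveyed parents each (hypothetical).
rng = np.random.default_rng(0)
n_schools, per_school = 224, 5
school_plus = rng.integers(0, 2, size=n_schools)
school_conflict = rng.integers(0, 2, size=n_schools)
school_shock = rng.normal(scale=0.5, size=n_schools)

school_id = np.repeat(np.arange(n_schools), per_school)
plus = school_plus[school_id]
conflict = school_conflict[school_id]
X = interaction_design(plus, conflict)
true_beta = np.array([0.0, 0.0, 0.3, 0.9])  # placeholder effects, in SD units
y = X @ true_beta + school_shock[school_id] + rng.normal(size=len(school_id))

beta, cov = ols_cluster(X, y, school_id)
gap = beta[3] - beta[2]
gap_se = np.sqrt(cov[3, 3] + cov[2, 2] - 2.0 * cov[2, 3])
print("Plus x No Conflict:", round(beta[2], 3))
print("Plus x Conflict:   ", round(beta[3], 3))
print("gap / se:          ", round(gap / gap_se, 2))
```

The last line is the statistic one would use to test whether the two interaction coefficients are equal, which is the comparison a row such as "Conflict = No Conflict" summarizes.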


Table 9 displays the estimation results. In line with the prediction from the model, com-
munities with higher hostility display a higher parental response to the Plus program when
compared to communities with no conﬂicts, as parents need to overcome the higher incentive
to free ride.17 The impacts on parental engagement at school are shown in the first column;
they are roughly six times larger in communities with conflicts (0.926 standard deviations) than
in communities without conflicts (0.161 standard deviations). The impacts are twice as large
when considering activities related
to managing school resources and engaging directly with children in educational activities,
although in this case they are not statistically diﬀerent from the corresponding eﬀects with-
out conﬂicts (second and third columns). In the last column we look at the overall parental
engagement index. Communities with conﬂicts exhibit impacts on parental behavior (+0.91
standard deviations) that are 2.5 times larger than communities with no conﬂicts (+0.36
standard deviations)—this diﬀerence is statistically signiﬁcant at the 95 percent conﬁdence
level.

Previous literature has highlighted how parental investments and parenting styles are re-
sponsive to the environments that families face (Doepke and Zilibotti, 2017; Agostinelli,
2018; Agostinelli et al., 2020). Our results shed light on how the success of an educational
program depends upon the local engagement of parents in educational activities and how its
scalability is eﬀectively a socially determined outcome.



6        Policy Impacts at Scale

In this last part of the analysis, we discuss the impacts of the government-run program on
various educational outcomes. After 2016, the government fully converted the mentoring
program into the Plus modality, and all schools were in principle eligible to receive the men-
tors. Schools were assigned a score between one and four, with one denoting the highest
priority level. The scores are based on a combination of criteria that includes school per-
formance in the national learning assessment, whether the school has six or more primary
students enrolled, whether the school received the API program in the period between 2009
and 2015, the level of marginalization of the community where the school was based, and
whether the community was targeted by an anti-poverty program (see Section 2.1). The size
of the new program did not change relative to the previous implementation of the Original
modality, including the number of available mentors. In the fall of 2016 there were 535
17 Notice that few to no school closures were detected in villages that were part of the Plus modality,
independently of whether the community experienced any conflict.


mentors in Chiapas, and given these constraints, mentors are allocated across communities
on a rotating basis.


6.1     The Exposure Eﬀect of the Program

We start by analyzing the initial transition of the schools in the evaluation sample from
the experimental situation to the policy at scale. We do this by estimating the following
regression model:

(4)    $Y_j = \beta_0 + \sum_{k=1}^{K} \beta_{1,k} \, \mathbb{1}\{\mathit{ExpPlus}_j = k\} + \gamma X_j^{Criteria} + u_j,$

where $\mathbb{1}\{\mathit{ExpPlus}_j = k\} \in \{0, 1\}$ represents an indicator variable for whether school $j$
is exposed to $k \in \{0, 1, \ldots, K\}$ years of the API Plus program, while $X_j^{Criteria}$ is a vector of
indicator variables for the program eligibility criteria. Our outcome of interest ($Y_j$) represents
the 2016–2017 school-level transition rates to seventh grade. This variable is constructed from
the same administrative source we used for our experimental evaluation. The sample includes
207 schools of the 224 that were part of the experiment. Beyond a school that permanently
closed, the sample attrition is caused by schools not having sixth graders during that school
year. This fact is consistent with the multi-grade nature of the CONAFE system. Attrition
is balanced with respect to the total years of exposure to API Plus (p-values = 0.467, 0.812,
and 0.568).
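
The structure of equation (4) can be illustrated with a short sketch. It is a simplified example on simulated data rather than the estimation code used here: the two criteria indicators, the placeholder coefficients, and the function names `exposure_design` and `exposure_effects` are assumptions. Zero years of exposure is the omitted category, so each estimated coefficient is measured relative to schools that were never exposed to API Plus.

```python
import numpy as np


def exposure_design(years, criteria, max_years=3):
    """Constant, one dummy per k = 1..max_years (k = 0 omitted), then criteria."""
    years = np.asarray(years)
    dummies = [(years == k).astype(float) for k in range(1, max_years + 1)]
    return np.column_stack(
        [np.ones(len(years)), *dummies, np.asarray(criteria, dtype=float)]
    )


def exposure_effects(y, years, criteria, max_years=3):
    """OLS estimates of beta_{1,1}, ..., beta_{1,K} in equation (4)."""
    X = exposure_design(years, criteria, max_years)
    coef = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)[0]
    return coef[1:max_years + 1]


# Simulated example with 207 schools and two hypothetical criteria indicators.
rng = np.random.default_rng(0)
n = 207
years = rng.integers(0, 4, size=n)
criteria = rng.integers(0, 2, size=(n, 2))
placeholder_effects = np.array([0.0, 0.05, 0.15, 0.30])  # indexed by years 0..3
y = 0.55 + placeholder_effects[years] + criteria @ np.array([0.04, -0.03])
y = y + rng.normal(scale=0.10, size=n)

print(np.round(exposure_effects(y, years, criteria), 3))
```

With only about two hundred schools split across four exposure groups, the simulated estimates are noisy, which is the same reason the confidence intervals matter when reading the actual estimates.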

We exploit the change in situation to analyze the exposure eﬀect to the program one year
after the government rollout, when school closures were still minimal (only one school closed
among our 224 schools by the fall of 2017). After the ﬁrst year of the rollout, schools
started to close more intensively (11 additional schools closed during the second year); hence,
outcome variables based on survey or administrative school data during this period would
suﬀer from endogenous censoring.18 By the spring of 2017, the total years of exposure to API
Plus range from zero to three depending on whether a school received the Plus modality for
two years during the second experiment, as well as on the API Plus government assignment
in the year after the experiment. The underlying assumption of this approach is that, once
controlling for the oﬃcial government criteria, the remaining variation in the program’s

18 For this reason, in Section 6.2 we take advantage of the 2019 Census information that is not subject to
this issue. The Mexican Census is available only every ten years, and hence we cannot use this information
to estimate the model in equation (4).


                Figure 3: The New Situation in the Experimental Sample of Schools

     [Figure: bar chart. Vertical axis: Treat Effect on 7th Grade Enrollment. Horizontal axis: Total
     Years of Exposure to API Plus (1 Year, 2 Years, 3 Years). Reference line: Urban Secondary School
     Enrollment. Legend: Point Estimate, 90% CI, 95% CI.]


     Notes : This ﬁgure shows OLS estimates of the years of exposure to the mentoring program on the prob-
     ability of enrolling in seventh grade during the transition from the second experiment to the government
     implementation of the Plus modality. Vertical lines overlaid on each bar display the 95 percent and 90
     percent conﬁdence intervals, respectively. Conﬁdence intervals are based on asymptotic inference.


assignment across localities in the year after the experiment is as good as random:

(5)    $E[u_j \mid X_j^{Criteria}, \mathit{ExpPlus}_j = k] = E[u_j \mid X_j^{Criteria}], \quad \forall k \in \{0, 1, 2, 3\}.$

While this assumption is obviously not testable, we examine whether the API Plus assign-
ment is conditionally balanced with respect to predetermined educational outcomes. Table
B-12 shows the results of this placebo test. The outcomes we use are the 2013 scores in the
national standardized test (Spanish, math, and science, see Appendix A.1).19 The estimated
coeﬃcients of the years of exposure to API Plus are not statistically diﬀerent from zero. The
point estimates are relatively small, especially after controlling for the assignment criteria,
and their signs do not suggest any pre-existing positive trend. For all these reasons, we
believe this evidence provides some support to the plausibility of (5) in our setting.

Figure 3 plots the estimated β1,k coeﬃcients shown in equation (4), where zero years of
exposure represents the reference category. The results show a positive exposure eﬀect to
the program one year after the government rollout. The eﬀect of the Plus modality goes
from 3 percentage points after one year to more than 35 percentage points after three years
19 The year 2013 is the last year in which the national standardized test was applied universally in Mexico.


of exposure. The average eﬀect of three years of exposure on the probability of enrolling
in seventh grade is very large and precisely estimated (p-value < 0.001). The magnitude of
this eﬀect implies that the enrollment rates in these disadvantaged and rural areas achieve
the secondary school enrollment rates in urban Mexico (95 percent). The average marginal
eﬀect of an extra year of the program is +10 percentage points (p-value = 0.006) in the
probability of enrolling in seventh grade. The effect of two years of exposure is larger than
the experimental estimate displayed in Table 2 (fifth column), although the two are not
statistically different, suggesting that the impact does not fade out after one year.


6.2     The Eﬀect of the Program Beyond the Evaluation Sample

We now broaden our analysis to the entire population of schools in the state of Chiapas.
We examine whether the Plus modality of the program at scale has promoted educational
opportunities for children in these disadvantaged communities, possibly resembling the re-
sults from the second ﬁeld experiment (see Section 3). To do so, we match administrative
records on the government rollout of the program during the fall of 2017 with village-level
educational outcomes from the population census data (data collection in the fall of 2019)
for the majority of the schools and communities, which include those that were part of the
second experiment.20 Our first outcome is the village-level lower-secondary enrollment rate
among children between 12 and 14 years old. This variable is available in the census for
1,417 communities in Chiapas. It is not immediately comparable with our previous measure
of enrollment in seventh grade for two reasons. First, the census-based information repre-
sents the stock (rates) of children enrolled in secondary school in a given year, while our
previous measure represents the ﬂow of new students enrolling in secondary schools. Second,
the census-based variable includes children in the village who are enrolled in primary schools
that are not eligible for the API program, which converts the analysis of the program at scale
into an intent-to-treat analysis. Another educational outcome from the population census
that we use is the rate of child literacy for children between eight and fourteen years old,
which represents an available measure of children’s achievement. This variable is available
20 The match between the universe of schools and the localities of the population census is one to one, as
each village has at most one primary school. The coverage of the census data is not universal, and
we were not able to match nearly one-third of the 2,063 schools in Chiapas in 2019. For both educational
outcomes in the census data we cannot reject the hypothesis that the probability of missing observations
is balanced with respect to the program assignment (p-values for secondary school enrollment and child
literacy are 0.728 and 0.430, respectively). For further details on the census sampling design, please
refer to: https://www.inegi.org.mx/contenidos/productos/prod_serv/contenidos/espanol/bvinegi/productos/nueva_estruc/702825197629.pdf,
accessed in August 2022.


in the census for 1,440 communities in Chiapas. Finally, we match information about school
closures from the school census in the same year (2019).

This set of educational outcomes is particularly well suited to our analysis. Secondary school
is a critical period for the educational outcomes of the disadvantaged population under study,
as more than a quarter of the 12- to 14-year-olds in Chiapas are out of school. Likewise, 13
percent of school-aged children are still illiterate. The year of the data collection in the census
(2019, two full school years after the rollout of the Plus modality at scale) is consistent with
the length of exposure to the API program in the second experiment. These census-based
outcomes cover the quasi-universe of the localities in Mexico and, unlike other survey-based
or administrative test score measures, they are not subject to any censoring during the data
collection due to school closures. This allows us to avoid the concerns about selection bias
due to diﬀerential school closures induced by the program at scale (see Section 4.3).

We analyze the impact of the policy implementation of the program using the following
linear regression model:

(6)    $Y_j = \alpha_0 + \alpha_1 \mathit{Plus}_j + \delta X_j^{Criteria} + \varepsilon_j,$


where $Y_j$ is a school-level outcome for school $j$, while $\mathit{Plus}_j$ takes a value of one if school
$j$ receives a mentor during the government implementation of the Plus modality, and zero
otherwise. The vector $X_j^{Criteria}$ consists of all the criteria used for the assignment of the
program. The parameter of interest, $\alpha_1$, represents the effect of the program during the
government implementation on the outcome of interest. As was the case for the exposure
analysis discussed in the previous section and formalized by equations (4)-(5), to causally
interpret the estimates we need the assignment of the program across communities to be
conditionally as good as random. In other words, conditional on the assignment criteria,
schools that receive and do not receive the program at scale are similar in terms of unob-
served characteristics. As before, we run some placebo tests to bolster the credibility of this
identification assumption in our setting. Table B-13 shows the results. The 2017 government
assignment is not unconditionally random (odd columns of the table), as priority is
given to more disadvantaged communities. Instead, when we control for the vector $X_j^{Criteria}$,
the estimated coeﬃcients displayed in the even columns of Table B-13 are very small and
statistically insigniﬁcant.
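
The logic of conditioning on the assignment criteria can be illustrated with a stylized simulation. Everything in it is assumed for the sake of the example: the criteria vector is collapsed into dummies for a single priority score from one (highest priority) to four, and the mentor probabilities, baseline outcomes, true effect, and the function names `program_effect` and `simulate` are hypothetical. Because higher-priority schools are both more likely to receive a mentor and worse off at baseline, the raw comparison understates the effect, while the regression that controls for the criteria recovers it.

```python
import numpy as np


def program_effect(y, plus, criteria=None):
    """OLS coefficient on the Plus indicator, optionally controlling for criteria."""
    plus = np.asarray(plus, dtype=float)
    cols = [np.ones_like(plus), plus]
    if criteria is not None:
        cols.append(np.asarray(criteria, dtype=float))
    X = np.column_stack(cols)
    return np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)[0][1]


def simulate(n=4000, true_effect=0.05, seed=1):
    """Hypothetical schools whose mentor probability depends on a priority score."""
    rng = np.random.default_rng(seed)
    priority = rng.integers(1, 5, size=n)  # score 1 (highest priority) to 4
    criteria = np.column_stack([(priority == s) for s in (1, 2, 3)]).astype(float)
    p_mentor = np.array([0.8, 0.6, 0.4, 0.2])[priority - 1]
    plus = (rng.random(n) < p_mentor).astype(float)
    baseline = np.array([0.50, 0.60, 0.70, 0.80])[priority - 1]
    y = baseline + true_effect * plus + rng.normal(scale=0.05, size=n)
    return y, plus, criteria


y, plus, criteria = simulate()
print("unconditional:", round(program_effect(y, plus), 3))
print("conditional:  ", round(program_effect(y, plus, criteria), 3))
```

The simulation builds in conditional random assignment by construction, so it shows what the assumption delivers when it holds; it cannot speak to whether the assumption holds in the actual data, which is what the placebo tests are meant to probe.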

Panel A of Figure 4 shows the results for secondary school enrollment two years after



                      Figure 4: Policy Impact on Education Outcomes

   [Figure: two bar charts, each with bars for Non-Experimental Schools, Experimental Schools, and
   All Schools. Panel A: Secondary School Enrollment (vertical axis: Treat Effect on Secondary School
   Enrollment). Panel B: Child Literacy (vertical axis: Treat Effect on Rates of Child Literacy).]

   Notes : The bars in the ﬁgure represent the OLS estimates of the assignment to the API program during
   the government implementation of the Plus modality on school-level secondary school enrollment rates
   (Panel A) and child literacy rates (Panel B). Vertical lines overlaid on each bar display the 95 percent
   and 90 percent conﬁdence intervals, respectively. Conﬁdence intervals are based on asymptotic inference.


the assignment of the mentors under the Plus modality at scale. Each bar represents the
treatment eﬀect of the policy (α1 in equation (6)) for a diﬀerent sample of schools in the state
of Chiapas. For the sample of schools that did not participate in the second experiment,
we ﬁnd that the program increases the fraction of children who enroll in secondary schools
by 4.5 percentage points (p-value = 0.039), which represents an increase of 6.5 percent with
respect to the sample mean. For the schools that were part of the experiment, the impact
of receiving the program during the government implementation is larger (+8.5 percentage points,
p-value = 0.031), although the two estimates are statistically similar. These effects on secondary
school enrollment are in line with the experimental ﬁndings on the enrollment in seventh
grade (+12.4 percentage points, see Table 2). We interpret this result as evidence that the
program at scale is eﬀective in increasing schooling opportunities despite the change created
by the policy implementation. Finally, the pooled estimates of the entire population of
schools in Chiapas (+5.5 percentage points, p-value = 0.004) are in line with the estimated
eﬀects of each subpopulation of schools.

The results for child literacy are shown in Panel B of Figure 4. After two years of rollout of the
program, we ﬁnd that villages that received mentors under the Plus modality at scale display
a 2.3 percentage point (p-value = 0.026) increase in child literacy rates when compared to
villages without mentors. The magnitude of this eﬀect implies a reduction of illiteracy rates
by 20 percent with respect to the sample average. The estimated program eﬀect for the


                        Figure 5: Policy Impact on School Closures

 [Figure: bar chart with bars for Non-Experimental Schools, Experimental Schools, and All Schools.
 Vertical axis: Treat Effect on School Closure.]

 Notes: The bars in the figure represent the OLS estimates of the assignment to the API program during
 the government implementation of the Plus modality on the rate of school closures as measured over the
 subsequent two years. Vertical lines overlaid on each bar display the 95 percent and 90 percent conﬁdence
 intervals, respectively. Conﬁdence intervals are based on asymptotic inference.


subsample of experimental schools is quantitatively similar, although a bit noisier (+3.0
percentage points, p-value = 0.122). The pooled result for all schools (+2.5 percentage
points, p-value = 0.006) mirrors the analysis for the two subpopulations. Overall, our results
conﬁrm the conclusion that the program is scalable and it enhanced an achievement outcome
for children in these disadvantaged communities.
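
The relative magnitudes quoted above imply the sample means against which the effects are measured. The following back-of-the-envelope sketch recovers them for the non-experimental enrollment estimate and the literacy estimate; the function name `implied_base_rate` is an assumption, and the resulting base rates are implied by the reported figures rather than reported directly in this section.

```python
def implied_base_rate(effect_pp, relative_change_pct):
    """Base rate (in percent) implied by an absolute effect and its relative size."""
    return 100.0 * effect_pp / relative_change_pct


# +4.5 percentage points is described as a 6.5 percent increase in enrollment.
enrollment_mean = implied_base_rate(4.5, 6.5)    # about 69 percent enrolled

# +2.3 percentage points is described as a 20 percent reduction in illiteracy.
illiteracy_mean = implied_base_rate(2.3, 20.0)   # about 11.5 percent illiterate

print(round(enrollment_mean, 1), round(illiteracy_mean, 1))
```

These implied means are of the same order as the statewide figures cited earlier in this section (more than a quarter of 12- to 14-year-olds out of school and 13 percent of school-aged children illiterate), although the populations are not identical, so the comparison is only approximate.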

We conclude the analysis of the impact of the mentoring program at scale by looking at school
closures. As previously described, school closures represent a major threat to the scalability
of the program as schools are the means through which the mentors reach the children and
their families. At the same time, we also show that parents act as a channel of scalability by
preventing schools from closing through increased parental engagement. To gauge whether
this mechanism persists during the government implementation of the Plus modality, it is
important to test whether the impact of the policy on school closures is consistent across
diﬀerent situations for the experimental sample, as well as across diﬀerent subpopulations
within the state of Chiapas.

Figure 5 shows the results. The government implementation of the Plus modality induces
a signiﬁcant and sizable eﬀect on school closures across all the subpopulations considered.


When we focus on the set of schools outside of the experimental sample, we ﬁnd that the
program reduces the probability of a school closing by 7.3 percentage points (p-value <
0.001). Schools that were part of the experimental sample also experience a sizable decrease
in school closures during the policy implementation at scale, with an average impact of the
mentoring program of −8.3 percentage points (p-value = 0.004). The magnitude of this
eﬀect is remarkably similar to the corresponding impact of the API Plus program on school
closures after two years under the experimental situation for the same set of schools (−8.3
percentage points, see Table 8). The pooled effect in the overall population of schools in
Chiapas is −7.3 percentage points (p-value < 0.001). This last piece of evidence strongly suggests that the
underlying mechanism through which schools are more likely to remain open with the API
program persists across diﬀerent situations. By enhancing parental engagement, the policy
implementation of the Plus modality prevented disruptions in the school environment, which
is a necessary condition for the success of the program at scale.



7        Discussion and Conclusion

We study a school mentoring program with a home visit component in the state of Chiapas,
Mexico. By exploiting two independently run ﬁeld experiments, we show that relatively
small diﬀerences in program design (the training module in our case) can spur substantial
diﬀerences in ﬁnal outcomes. We conﬁrm that the program as it was originally implemented
by the government is largely ineﬀective. One alternative modality of the program (Plus )
that features enhanced training for the mentors/home visitors is successful in enhancing
test scores and educational attainment for the students in our sample. Parents not only
increased their interactions with and investment in their children (a shared result among past
successful interventions) but also intensified their engagement at the school and community
level.

The enhanced program modality is found to be eﬀective after the government scale-up. The
national rollout of the mentoring program, which fully converts the Original modality into
the Plus modality with enhanced training for all the schools, provides us with an opportunity
to study the mechanisms through which education interventions can be successfully scaled
up. Even when the evaluation sample is broadly representative of the targeted population,
changes in the implementation protocol during the transition between the ﬁeld experiment
and the policy rollout can threaten the success of the program at scale. In our context, school


closures represent a major concern during this transition. We document that the exposure
to the mentoring program at scale practically eliminates this issue. Parental responses are
shown to be the key mechanism through which schools remain open, thereby ensuring the
viability of the mentoring program as implemented by the government. The magnitudes of
the estimated impacts are remarkably comparable across situations (ﬁeld experiment versus
government implementation) for our experimental sample as well as for the rest of the schools
in Chiapas that experienced a change in program modality (from Original to Plus) during
the government rollout.

Beyond the speciﬁc context of the analysis, we believe our case study can provide broader
lessons for scholars who are interested in designing and evaluating scalable interventions.
We reiterate the importance of evaluating programs “at scale” whenever possible. This
represents both an opportunity to exploit the existing infrastructure of the program and a
restriction in terms of program design, since researchers have to consider various institutional
constraints and supply-side issues. While we do not necessarily advocate sampling the entire
population of beneﬁciaries, the composition of the evaluation sample needs to reﬂect the
targeting criteria of the intervention, and the units of analysis should be large enough so as
to encompass local spillover/general equilibrium eﬀects that likely arise in those situations.
It is also crucial to consider the joint impact of the intervention on the targeted actors. In
the context of mentoring and educational programs, parental responses need to be taken
into account jointly with children’s outcomes both ex ante (e.g., for power calculations) and
ex post (e.g., when adjusting inference procedures for multiple hypothesis testing and when
conducting omnibus statistical signiﬁcance tests). This is key for scalability since we have
shown that it is precisely the interplay in the behavioral responses between these actors that
determines the success of the programs at scale.

Finally, our work stresses the importance for scalability of the local parental incentives nec-
essary to achieve a “Collaborative Equilibrium” in the community, as discussed in Section
5.1. Our model highlights a key trade-oﬀ between the degree of complementarity among
parents, their incentive to cooperate, and the minimal actionable impact on parents that
favors scaling. This result sheds light on the current debate on how to design mentoring
interventions aimed at promoting better educational opportunities for children in disadvan-
taged contexts, including poor neighborhoods in the United States. While every parent is
certainly unique to her own child (and hence not scalable), engaged communities of parents
are potentially available at scale to promote the success of educational programs.



References
Agostinelli, Francesco, “Investing in Children’s Skills: An Equilibrium Analysis of Social
  Interactions and Parental Investments,” 2018.

Agostinelli, Francesco and Matthew Wiswall, “Estimating the Technology of Children’s Skill Formation,”
  Working Paper 22442, National Bureau of Economic Research July 2016.

Agostinelli, Francesco, Matthias Doepke, Giuseppe Sorrenti, and Fabrizio Zilibotti, “It Takes a Village:
  The Economics of Parenting with Neighborhood and Peer Effects,” Working Paper 27050,
  National Bureau of Economic Research April 2020.

Agostinelli, Francesco, Matthias Doepke, Giuseppe Sorrenti, and Fabrizio Zilibotti, “When the Great
  Equalizer Shuts Down: Schools, Peers, and Parents in Pandemic Times,” Journal of Public
  Economics, 2022, 206, 104574.

Al-Ubaydli, Omar, John A. List, and Dana Suskind, “2017 Klein Lecture: The
  Science of Using Science: Toward an Understanding of the Threats to Scalability,” Inter-
  national Economic Review, 2020, 61 (4), 1387–1409.

Anderson, Michael L., “Multiple Inference and Gender Diﬀerences in the Eﬀects of Early
  Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training
  Projects,” Journal of the American Statistical Association, 2008, 103 (484), 1481–1495.

Attanasio, Orazio, Helen Baker-Henningham, Raquel Bernal, Costas Meghir,
  Diana Pineda, and Marta Rubio-Codina, “Early Stimulation and Nutrition: The
  Impacts of a Scalable Intervention,” Journal of the European Economic Association,
  January 2022.

Attanasio, Orazio, Sarah Cattan, and Costas Meghir, “Early Childhood Development, Human Capital,
  and Poverty,” Annual Review of Economics, 2022, 14 (1).

Bando, Rosangela and Claudia Uribe, “Experimental Evidence on Credit Constraints,”
  Working Paper 670, Inter-American Development Bank February 2016.

Banerjee, Abhijit V., Rukmini Banerji, James Berry, Esther Duﬂo, Harini Kan-
  nan, Shobhini Mukerji, Marc Shotland, and Michael Walton, “From Proof of
  Concept to Scalable Policies: Challenges and Solutions, with an Application,” Journal of
  Economic Perspectives, November 2017, 31 (4), 73–102.



Bobba, Matteo and Jérémie Gignoux, “Neighborhood Effects in Integrated Social Policies,”
  World Bank Economic Review, 2019, 33 (1), 116–139.

Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a, and Justin
  Sandefur, “Experimental Evidence on Scaling Up Education Reforms in Kenya,” Journal
  of Public Economics, 2018, 168 (C), 1–20.

Bruns, Barbara and Javier Luque, Great Teachers: How to Raise Student Learning in
  Latin America and the Caribbean, Washington, DC: World Bank, 2015.

Cameron, Lisa, Susan Olivia, and Manisha Shah, “Scaling Up Sanitation: Evidence
  from an RCT in Indonesia,” Journal of Development Economics, 2019, 138, 1–16.

Carneiro, Pedro, Emanuela Galasso, Italo Lopez Garcia, Paula Bedregal, and
  Miguel Cordero, “Parental Beliefs, Investments, and Child Development: Evidence from
  a Large-Scale Experiment,” IZA Discussion Papers 12506, Institute of Labor Economics
  (IZA) July 2019.

CONEVAL, “Medición de Pobreza 2008-2018, Estados Unidos Mexicanos,” 2018.

Cunha, Flavio, James J. Heckman, and Susanne M. Schennach, “Estimating the
  Technology of Cognitive and Noncognitive Skill Formation,” Econometrica, 2010, 78 (3),
  883–931.

Davis, Jonathan M.V., Jonathan Guryan, Kelly Hallberg, and Jens Ludwig, “The
  Economics of Scale-Up,” Working Paper 23925, National Bureau of Economic Research
  October 2017.

Doepke, Matthias and Fabrizio Zilibotti, “Parenting With Style: Altruism and Pater-
  nalism in Intergenerational Preference Transmission,” Econometrica, September 2017, 85,
  1331–1371.

Dubeck, Margaret M. and Amber Gove, “The Early Grade Reading Assessment
  (EGRA): Its Theoretical Foundation, Purpose, and Limitations,” International Journal
  of Educational Development, 2015, 40, 315–322.

Engzell, Per, Arun Frey, and Mark Verhagen, “Learning Inequality During the Covid-
  19 Pandemic,” 2020. Mimeo.



Fryer, Roland G. Jr., Steven D. Levitt, and John A. List, “Parental Incentives and
  Early Childhood Achievement: A Field Experiment in Chicago Heights,” Working Paper
  21477, National Bureau of Economic Research August 2015.

Gertler, Paul J., Harry Anthony Patrinos, and Marta Rubio-Codina, “Empowering
  Parents to Improve Education: Evidence from Rural Mexico,” Journal of Development
  Economics, 2012, 99 (1), 68–79.

Heckman, James, “Randomization and Social Policy Evaluation,” in “Evaluating Welfare
  and Training Programs. Edited by C. F. Manski and I. Garﬁnkel,” Harvard University
  Press, 1992.

Heckman, James and Jin Zhou, “Interactions as Investments: The Microdynamics and Measurement of
  Early Childhood Learning,” Working Paper, Center for the Economics of Human Devel-
  opment, University of Chicago 2021.

Heckman, James J. and Stefano Mosso, “The Economics of Human Development and
  Social Mobility,” Annual Review of Economics, 2014, 6, 689–733.

List, John A., The Voltage Eﬀect: How to Make Good Ideas Great and Great Ideas Scale,
  Penguin Books, 2022.

List, John A., Azeem M. Shaikh, and Yang Xu, “Multiple Hypothesis Testing in Experimental
  Economics,” Experimental Economics, December 2019, 22 (4), 773–793.

List, John A., Fatemeh Momeni, and Yves Zenou, “The Social Side of Early Human Capital
  Formation: Using a Field Experiment to Estimate the Causal Impact of Neighborhoods,”
  Working Paper 28283, National Bureau of Economic Research December 2020.

Maldonado, Joana E. and Kristof De Witte, “The Eﬀect of School Closures on Stan-
  dardised Student Test Outcomes,” 2020. KU Leuven Discussion Paper Series 20.17.

Maniadis, Zacharias, Fabio Tufano, and John A. List, “One Swallow Doesn’t Make
  a Summer: New Evidence on Anchoring Eﬀects,” American Economic Review, January
  2014, 104 (1), 277–90.

Martínez, Edmundo Ramírez, “Supporting Education in Families and Schools,” Unpublished
  manuscript, Impact Evaluation Report (in Spanish), November 2012.



Miguel, Edward and Michael Kremer, “Worms: Identifying Impacts on Education and
  Health in the Presence of Treatment Externalities,” Econometrica, January 2004, 72 (1),
  159–217.

Mobarak, A. Mushﬁq and Austin C. Davis, “A Research Agenda Built for Scale,” in
  “The Scale-up Eﬀect in Early Childhood and Public Policy. Edited by John A. List, Dana
  Suskind, and Lauren H. Supplee,” Routledge, 2021.

Muralidharan, Karthik and Abhijeet Singh, “Improving Public Sector Management
  at Scale? Experimental Evidence on School Governance in India,” Working Paper 28129,
  National Bureau of Economic Research November 2020.

Muralidharan, Karthik and Paul Niehaus, “Experimentation at Scale,” Journal of Economic Perspectives,
  2017, 31 (4), 103–124.

O’Brien, Peter C., “Procedures for Comparing Samples with Multiple Endpoints,” Bio-
  metrics, 1984, 40 (4), 1079–1087.

Platas, Linda M., Leanne R. Ketterlin-Geller, and Yasmin Sitabkhan, “Using an
  Assessment of Early Mathematical Knowledge and Skills to Inform Policy and Practice:
  Examples from the Early Grade Mathematics Assessment,” International Journal of Ed-
  ucation in Mathematics, Science and Technology, 2016, 4(3), 163–173.

Romano, Joseph P. and Michael Wolf, “Exact and Approximate Stepdown Methods
  for Multiple Hypothesis Testing,” Journal of the American Statistical Association, March
  2005, 100, 94–108.

Romano, Joseph P. and Michael Wolf, “Stepwise Multiple Testing as Formalized Data Snooping,”
  Econometrica, July 2005, 73 (4), 1237–1282.

Romano, Joseph P. and Michael Wolf, “Efficient Computation of Adjusted p-Values for Resampling-Based Stepdown
  Multiple Testing,” Statistics & Probability Letters, 2016, 113 (C), 38–40.

Westfall, Peter H. and S. Stanley Young, Resampling-Based Multiple Testing: Examples
  and Methods for p-Value Adjustment, 1993.

Young, Alwyn, “Channeling Fisher: Randomization Tests and the Statistical Insigniﬁcance
  of Seemingly Signiﬁcant Experimental Results,” The Quarterly Journal of Economics,
  2019, 134 (2), 557–598.


Zhou, Jin, Alison Baulos, James J. Heckman, and Bei Liu, “The Economics of Child
  Development with an Application to Home Visiting at Scale,” in “The Scale-up Eﬀect in
  Early Childhood and Public Policy. Edited by John A. List, Dana Suskind, and Lauren
  H. Supplee,” Routledge, 2021.




Appendices

A      Data Description

A.1     Administrative Data

School census. The Ministry of Education runs a school census (Formato 911) at the
beginning and at the end of each school cycle that covers all public schools in Mexico. The
census asks the school representative about the number of students enrolled in every grade
and whether they are new students or repeaters. Additional information includes the number
of instructors and the number of classrooms per school. Information from the 2013 Census
is used to construct the baseline school variables that are displayed in Table B-1 and in
Panel A of Table B-2. School census data for the years 2015–2020 are used to track the
school closures during the government implementation of both the API Original and Plus
modalities, as shown in Table 8 and Figure 5.

Locality-level population census. The National Institute of Statistics and Geography
(INEGI) compiles a population census every decade, with detailed information on
sociodemographics, poverty, and education, among other topics. Census data
are made available at the individual level for a small random sample of the population, as
well as at the locality level for the universe of localities in Mexico. We use the locality-level
information collected in the census rounds of 2010 and 2020 for our analysis. In particular,
we use information from the 2010 population census in Table B-1 and in Panel B of Table B-2, as
well as to construct the indicator variable for the presence of conflicts in the community that
is shown in Table 7. We leverage information on schooling outcomes in the 2020 census for
all the localities in the state of Chiapas (including those that were part of the experimental
sample), which is shown in Figure 4.

Standardized test scores. Between 2007 and 2013, all Mexican students in third
through ninth grade were required to take a standardized test, the ENLACE (Evaluación
Nacional de Logro Académico en Centros Escolares). The test was administered by external
proctors at the end of each academic year, and it assessed student knowledge in three
areas: math, Spanish, and, starting in 2008, a third subject that rotated among science,
ethics/civics, history, and geography. We use the school-level average of the Spanish scores in
2012 to construct the strata for the school-level randomization of the second experiment. In
the first experiment, we use individual scores for sixth graders in each pedagogical area in
2013 as our main measures of academic achievement. The Overall Score displayed in Table
1 is computed as a GLS-weighted average of the three scores (O’Brien, 1984). Last, we use
the 2013 ENLACE scores at the school level for the placebo tests displayed in Tables B-12
and B-13.
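
To make the GLS-weighted aggregation concrete, the sketch below builds an inverse-covariance-weighted index in the spirit of Anderson (2008), which this appendix cites for related indices. It is an illustration under stated assumptions rather than the authors' code: the function name, the use of the full-sample covariance matrix, and the final re-standardization to the control group are all assumptions, and the data are synthetic.

```python
import numpy as np


def gls_weighted_index(outcomes, is_control):
    """Combine several outcomes into one inverse-covariance-weighted index.

    outcomes   : (n_obs, n_outcomes) array, e.g. math, Spanish, third subject
    is_control : (n_obs,) boolean array marking control-group observations
    """
    outcomes = np.asarray(outcomes, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)

    # Step 1: standardize each outcome with the control-group mean and SD.
    ctrl = outcomes[is_control]
    z = (outcomes - ctrl.mean(axis=0)) / ctrl.std(axis=0, ddof=1)

    # Step 2: weight each outcome by the row sum of the inverse covariance
    # matrix, so that highly correlated outcomes receive less weight.
    # Assumption: the covariance is taken over the full sample.
    sigma = np.cov(z, rowvar=False)
    weights = np.linalg.inv(sigma).sum(axis=1)

    # Step 3: weighted average, re-standardized to the control group
    # (assumption) so that effects read in control-group SD units.
    index = z @ weights / weights.sum()
    ctrl_index = index[is_control]
    return (index - ctrl_index.mean()) / ctrl_index.std(ddof=1)


# Illustrative synthetic data only; these are not the paper's data.
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 3))
control = np.arange(200) < 100
overall = gls_weighted_index(scores, control)
```

By construction the index has mean zero and unit standard deviation in the control group, so a treatment coefficient on it can be read in control-group standard deviations.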

Transitions to Secondary Schools. We link the enrollment records of the sixth graders
in the sample of the second experiment to the population of seventh graders in Chiapas
during the following academic year. Individual transitions computed in the school year
2016–2017 (i.e., by the end of the second experiment) are reported in Table 2, while
transitions computed in the school year 2017–2018 (i.e., after the first year of the government
implementation of the API Plus modality) are reported in Figure 3.

Other administrative records. All students in Chiapas schools, irrespective of whether
they received the API program, must undergo a diagnostic test at the beginning of each
school year. The test covers three subjects: math, Spanish, and natural science. The score
for each subject ranges between 5 and 10. We use the individual-level average across the three
subjects in the diagnostic tests at the beginning of the 2014–2015 school year to construct
the within-school student rankings displayed in Figure B-2 and Table B-10, which proxy for
the individual eligibility for the one-on-one remedial education sessions. We also use the
oﬃcial assessments assigned to the students based on those tests (level 1, level 2, and level
3) in Table B-3.
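
The within-school ranking described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the field names are hypothetical, ties are broken by student identifier as an assumption, and the example records are made up. Rank 1 denotes the student with the lowest average diagnostic score in the school.

```python
from collections import defaultdict


def inverted_ranks(records):
    """Return {(school, student): rank}, where rank 1 is the lowest average.

    records: iterable of dicts with keys 'school', 'student', 'math',
             'spanish', 'science' (hypothetical field names).
    """
    by_school = defaultdict(list)
    for r in records:
        # Individual-level average across the three diagnostic subjects.
        average = (r["math"] + r["spanish"] + r["science"]) / 3
        by_school[r["school"]].append((average, r["student"]))

    ranks = {}
    for school, rows in by_school.items():
        # Sort ascending so the worst-performing student is ranked first.
        # Assumption: ties are broken by student identifier.
        for position, (_, student) in enumerate(sorted(rows), start=1):
            ranks[(school, student)] = position
    return ranks


# Made-up example; scores lie on the 5-10 scale described above.
example = [
    {"school": "A", "student": "s1", "math": 6, "spanish": 7, "science": 8},
    {"school": "A", "student": "s2", "math": 9, "spanish": 9, "science": 10},
    {"school": "A", "student": "s3", "math": 5, "spanish": 5, "science": 6},
    {"school": "B", "student": "s4", "math": 8, "spanish": 8, "science": 8},
]
ranks = inverted_ranks(example)
```

In this example student s3 has the lowest average in school A and therefore receives rank 1.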

We use student-level longitudinal information for the population of primary schools to con-
struct various measures of school-level changes in student composition reported in Table
B-6: whether the student must repeat a grade in school year 2015–2016, attrition from the
school system in Chiapas between the school years 2014–2015 and 2015–2016, and whether
in 2015–2016 the student attends the same school as in 2014–2015.
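
One way such student-level flags could be derived from two consecutive enrollment files is sketched below. The record layout (a mapping from student identifier to school and grade) and the function names are hypothetical, and defining repetition as enrolling in the same grade again is an assumption. The sketch also does not treat sixth graders who leave primary school separately, which the actual construction may handle differently.

```python
def composition_flags(year1, year2):
    """Flag attrition, repetition, and same-school attendance per student.

    year1, year2: dicts mapping a student identifier to a (school, grade)
    tuple for 2014-2015 and 2015-2016 respectively (hypothetical layout).
    """
    flags = {}
    for student, (school1, grade1) in year1.items():
        if student not in year2:
            # Not found in the following year's records: counted as attrition.
            flags[student] = {"attrited": True, "repeats": None, "same_school": None}
            continue
        school2, grade2 = year2[student]
        flags[student] = {
            "attrited": False,
            # Assumption: repetition means enrolling in the same grade again.
            "repeats": grade2 == grade1,
            "same_school": school2 == school1,
        }
    return flags


def school_attrition_share(year1, flags):
    """Share of each school's first-year students who leave the records."""
    totals, attrited = {}, {}
    for student, (school, _) in year1.items():
        totals[school] = totals.get(school, 0) + 1
        attrited[school] = attrited.get(school, 0) + int(flags[student]["attrited"])
    return {school: attrited[school] / totals[school] for school in totals}


# Made-up records for illustration only.
y1 = {"a": ("S1", 3), "b": ("S1", 4), "c": ("S2", 5), "d": ("S1", 2)}
y2 = {"a": ("S1", 4), "b": ("S1", 4), "c": ("S3", 6)}
flags = composition_flags(y1, y2)
shares = school_attrition_share(y1, flags)
```

The same averaging step can be applied to the repetition and same-school flags to obtain the other school-level measures.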


A.2     Survey Data

Data collection took place in the spring of 2016 in the 224 schools and the surrounding
communities that form part of the second experiment. The household module of the survey
was administered to a random sample of five households within a five-kilometer radius of
each school. The information is linked at the child-parent level with the student test scores
through unique student identifiers. The survey comprises the following modules and
measurement tools.

Measures of Children’s Achievement. The reading scores reported in Tables 2 and
B-10 are given by the latent factor of an exploratory factor analysis of the following eight
domains: 1) letter name, 2) initial name, 3) initial sound, 4) word recognition, 5) word
reading, 6) reading comprehension, 7) listening, 8) dictation. The math scores reported in
Tables 2 and B-10 are given by the latent factor of an exploratory factor analysis of the following
seven domains: 1) number identiﬁcation, 2) number discrimination, 3) missing number, 4)
addition, 5) subtraction, 6) problem solving, 7) shape recognition. An orthogonal rotation is
applied before standardizing each factor with respect to the mean and the standard deviation
in the control group. The individual components of the math and reading scores are reported
in Table B-7.

The household survey contains a set of measures of behavioral problems reported by the
caregivers of the children in our sample. The socio-emotional scores reported in Tables 2
and B-10 are the sum of the following thirty-two items on how often the child displays a given
emotion/behavior: 1) has sudden mood changes, 2) feels or complains that nobody
loves him/her, 3) is tense or nervous, 4) lies or cheats, 5) is scared or anxious, 6) talks and
argues too much, 7) has difficulty focusing on a specific activity for an extended amount of
time, 8) gets easily confused, 9) has his/her head in the clouds, 10) threatens or is mean
to other children, 11) tends to challenge parental authority, 12) does not feel guilty after a
bad deed, 13) does not get along with other children, 14) is impulsive or acts “fast” without
thinking, 15) has inferiority issues, 16) has no friends, 17) has difficulty letting go of certain
thoughts, 18) is hyperactive, 19) has a bad temper or is irascible, 20) easily loses his/her
temper, 21) feels unhappy, sad, or depressed, 22) is shy, does not socialize with others, 23)
breaks objects on purpose, 24) is too attached to adults, 25) cries too much, 26) demands a
lot of attention, 27) is too dependent on others, 28) is afraid of other people’s judgment,
29) tends to be in bad company, 30) is reserved, keeps things to himself/herself, 31) worries
about everything, 32) misbehaves at school and does not respect the instructor.

The Overall Score of students’ achievement displayed in Table 2 is computed using GLS-
weighted averages over the two cognitive measures and the socio-emotional score.

Parenting Practices. The household survey collects information on parents’ behavior and
investment in their children’s education. The same information was collected during the
mid-line survey of the first experiment. The parental engagement outcomes reported in Table
3 are computed using GLS-weighted averages (Anderson, 2008) over different indicators of
parental behavior. For Engage at School: whether or not parents (i) volunteer at the school,
(ii) donate money to the school, (iii) donate in kind to the school, and (iv) offer food to the
instructor. For Manage School Resources: whether or not parents (i) directly manage the
school budget, (ii) propose some materials to the school, (iii) decide to use some materials
for the school, (iv) decide on how to allocate money for some school activities, and (v)
define the pedagogical targets of the school. For Engage with Child: whether (i) parents
help with their child’s homework, (ii) meet with the instructor, (iii) expect their child to
complete secondary education or more, and (iv) children participate in other academically
related activities outside school hours. The Engagement Index is the same GLS-weighted
average over each of the individual components described above, which are reported in Table
B-9.

Parent-Mentor Interactions. The household module includes several questions on both
the quantity and the quality of parents’ interactions with the mentors for those households
that were assigned to either the API Original group or the API Plus group. This information
is used to construct the four variables reported in Panel A of Table 4. Basic information on
both the household module respondent and household characteristics is reported in Panel C
of Table B-2.

Parenting Styles. The mentors’ questionnaire included a battery of questions on the
speciﬁc competencies they promote during their interactions with parents. The indicator
variables for each competency are used as outcome variables in Panel B of Table 4. Since
the mentors were not located in the communities on a continuous basis, the survey firm
interviewed them during an end-of-year evaluation session. Some of their characteristics are
reported in Panel D of Table B-2, as well as in Table B-3 for the subset of the mentors who
reported working in diﬀerent schools from those they were initially assigned to.

Teaching Practices. We measure time use and diﬀerent learning activities of community
instructors as well as their ability to keep students engaged using an adapted version of the
Stallings classroom snapshot, which is a rubric for timed observations that has been used
previously in Mexico (Bruns and Luque, 2015). An observer scores the instructor’s eﬀective
use of 15 diﬀerent activities over the course of a full one-hour lesson, with snapshots every
three minutes. Each activity was scored between 1 and 4. In every snapshot, the external
observer reports whether the instructor is present in the classroom. Given the nature of
the API intervention and the multi-grade context, the tool was adapted to capture the
instructor’s ability to use materials and keep the rhythm of the class.


The information included in this survey module is used to construct GLS-weighted averages
over the diﬀerent types of teacher behavior, which are displayed in Table B-11. Learning
Activities is the sum of the amount of time children spend on (i) reading aloud alone,
(ii) reading aloud in a group, (iii) questions and answers, (iv) memorizing, (vi) individual
homework, and (viii) verbal tasks. Engage with Students is the sum of the amount of time
the instructor spends on (i) elaborating on a given concept, (ii) activities in which students
were not involved, and (iii) keeping discipline. Manage Time combines the amount of time the
instructor spends (i) out of the classroom and (ii) effectively administering some tasks in the
classroom with indicators for (iii) whether or not the instructor complies with the start and
end time of each class, (iv) whether or not the instructor keeps the rhythm of the class as well
as of the individual students according to their age and their mother tongue, and (v) whether
or not the students were grouped according to their respective academic levels. Use of Material
is the sum of four
indicator variables: (i) whether the instructor uses any book to explain a given topic, (ii)
whether the instructor uses any material from the community to explain a given topic, (iii)
whether drawings and other students’ artworks are displayed in the classroom, and (iv)
whether charts and maps are displayed in the classroom. The Overall Index is the same
GLS-weighted average of the individual components of teacher behavior described above.

Local instructors were also asked standard questions on their socio-demographic characteris-
tics, education, experience and, if they were in the treatment group, their relationship with
the mentors. Those are reported in Panel B of Table B-2.


A.3     In-Depth Interviews

In the spring of 2022 we implemented a series of semi-structured phone interviews with a
small sample of local instructors and mentors who participated in the program. In total,
we were able to locate and contact 104 local instructors and 68 mentors. Of those, 12
instructors and 16 mentors agreed to complete the phone interview. More than half of the
survey respondents continued working as mentors after the 2016 government implementation
of the Plus modality. The characteristics of the survey respondents in comparison with the
overall sample are shown in Tables B-4 and B-5.

The survey contains a series of open questions related to the experiences of the mentors/local
instructors with the parents in the communities. Below, we report the original Spanish
versions of the quotes that appear in the main body of the paper in the authors’ translation.
In particular, these quotes from the mentors about the peer-to-peer sessions of the training
are reported in Section 3.3:

           “Fue un momento de la capacitación en donde me dijeron que debía
           adaptarme al contexto de su centro del trabajo, de comprender las necesidades y
           de entender situaciones que se vivían en la misma comunidad, para poder
           dialogar con los padres y atender a los niños sin afectar o modificar lo que
           ellos conciben como su medio.”

           “Recomendaban hacer las visitas domiciliarias con frecuencia y ayudarle en
           algo a los papás o salían con ellos a visitas y les daba más confianza.”

           “[Las sesiones de orientación me permitieron] escuchar las diferentes
           estrategias que ellos tenían para poder probarlas e implementarlas.”


These quotes from the local instructors about the role of parents in the day-to-day routine
of the school are reported in Section 5.2:

           “La gestión de la escuela y se le hicieron mejoras de cercado, pintaron la
           escuela arreglaron los baños y se compraron materiales.”

           “Eran participativos, estaban pendientes del bienestar de la escuela por
           ejemplo la construcción, de materiales e incluso de los desayunos y alimentación
           del instructor.”

           “Los padres apoyaban en el seguimiento al bloc de tareas y trabajaban en
           equipo cuando los API que no podían estar presentes por apoyar a otra
           comunidad, los mantenían al corriente o, incluso un poco más avanzados,
           por lo que cuando los APIs regresaban podían dar continuidad a sus clases
           sin ningún atraso.”




B      Additional Figures and Tables

Figure B-1: Treatment Eﬀects on Secondary School Enrollment During the Transition Be-
tween the Second Experiment and the Government Implementation


[Figure: bar chart. Vertical axis: treatment effect on secondary school enrollment (ticks at 0, .1, .2, .3). Horizontal axis: treatment assignment in the second experiment (API Original, API Plus). Legend: point estimate, 90% CI, 95% CI.]


 Notes: The bars depicted in this figure show the OLS estimates of the effect of the original
 treatment assignments in our experiment on the probability of enrolling in seventh grade in the
 year after the end of the second experiment (2017). The vertical lines overlaid on the bars
 represent asymptotic confidence intervals at the 90 percent and the 95 percent confidence levels.
 The sample includes 207 of the 224 schools that were part of the experiment. Apart from one
 school that permanently closed, the sample attrition is caused by schools not having sixth
 graders during that school year. This fact is consistent with the multi-grade nature of the
 CONAFE system. Attrition is balanced among schools that were part of the two treatment arms
 (p-values = 0.914 and 0.768).




 Figure B-2: Probability of Being in Remedial Sessions by Inverted Achievement Rank

[Figure: coefficient plot. Rows: inverted achievement rank indicators, Rank=2 through Rank=13. Horizontal axis ticks at −.8, −.6, −.4, −.2, 0.]


Notes: The dots in this figure are estimated marginal effects from Probit regression models of
the probability of participating in the one-on-one remedial education sessions with the mentors
on indicator variables for the inverted within-school student rank, which is based on the average
score on the diagnostic tests in math, Spanish, and natural science. The indicator variable for
whether the student is ranked first (i.e., the worst-performing student in the class) is the
omitted category. The horizontal lines around each dot represent 90 percent confidence intervals.
Confidence intervals are based on asymptotic inference.




Table B-1: Baseline Characteristics and Covariate Balance – First Experiment
                                                       Panel A: Original Sample of Schools
                                                  Treatment (40)           Control (40)     Diﬀ
                                                Mean Std. Dev. Mean Std. Dev. P-Value
     Number of households                       43.050     117.302     48.100     89.883   0.832
     Total Population                          205.250 575.294 227.275 480.694             0.855
     Share Economically Active Pop               0.287       0.066      0.287      0.077   0.970
     Water connection (Y/N)                     0.025       0.158      0.050       0.221   0.567
     Sewer system (Y/N)                         0.025       0.158       0.025      0.158   1.000
     Share of analphabet population             0.320       0.180       0.319      0.171   0.984
     Share of dwellings with dirt ﬂoor          0.334       0.356      0.334       0.254   0.999
     Garbage collection (Y/N)                   0.025       0.158       0.050      0.221   0.564
     ENLACE Spanish 2010                       399.476      38.545    398.560     27.853   0.901
     ENLACE Math 2010                          375.065      42.763    386.407     50.357   0.279
     Enrollment                                 16.231      9.192     15.550       8.246   0.712
     Number of Teachers                         1.385       0.544      1.450       0.597   0.606
     Share over-aged students                   3.135       9.832      1.884       3.941   0.464
                                                     Panel B: Schools in Mid-Line 2012 Survey
                                                  Treatment (37)          Control (36)      Diff
                                                Mean   Std. Dev.       Mean   Std. Dev.   P-Value
     Number of households                       45.297     121.712     49.667     94.636   0.863
     Total Population                          217.054 597.061 234.778 506.694             0.888
     Share Economically Active Pop               0.286       0.064      0.276      0.069   0.553
     Water connection (Y/N)                     0.027       0.164      0.056       0.232   0.547
     Sewer system (Y/N)                         0.027       0.164       0.028      0.167   0.990
     Share of illiterate population             0.321       0.170       0.333      0.173   0.745
     Share of dwellings with dirt floor         0.307       0.330      0.349       0.261   0.544
     Garbage collection (Y/N)                   0.027       0.164       0.056      0.232   0.551
     ENLACE Spanish 2010                       401.971      38.973    399.036     28.974   0.703
     ENLACE Math 2010                          377.916      43.159    388.422     51.038   0.351
     Enrollment                                 15.917      8.334     14.917       7.987   0.597
     Number of Teachers                         1.389       0.549      1.417       0.604   0.827
     Share over-aged students                   2.134       7.225      1.961       4.094   0.900
                                                    Panel C: Schools with Test Score 2013 Data
                                                  Treatment (35)          Control (35)      Diff
                                                Mean   Std. Dev.       Mean   Std. Dev.   P-Value
     Number of households                       46.971     124.974     48.686     95.832   0.950
     Total Population                          225.857 612.996 227.543 512.201             0.990
     Share Economically Active Pop               0.287       0.065      0.278      0.069   0.579
     Water connection (Y/N)                     0.029       0.169      0.057       0.236   0.568
     Sewer system (Y/N)                         0.029       0.169       0.029      0.169   1.000
     Share of illiterate population             0.327       0.165       0.335      0.175   0.823
     Share of dwellings with dirt floor         0.321       0.334      0.345       0.264   0.733
     Garbage collection (Y/N)                   0.029       0.169       0.057      0.236   0.566
     ENLACE Spanish 2010                       401.869      40.034    399.206     29.378   0.748
     ENLACE Math 2010                          377.168      44.284    390.561     50.120   0.242
     Enrollment                                 15.971      8.449     14.743       8.034   0.527
     Number of Teachers                         1.400       0.553      1.400       0.604   1.000
     Share over-aged students                   2.195       7.321      2.017       4.140   0.900
     Notes: This table shows means and standard deviations for community and school characteristics drawn
    from the 2010 population census and the 2010 school census. See Appendix A.1 for more details on these
    data sources. The fifth column reports the p-values for the differences in means between the
    treatment and control groups.
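
The p-values in the fifth column can be approximated from the reported summary statistics alone. The sketch below is illustrative only: the notes do not say which test was used, so the unequal-variance standard error and the normal approximation are assumptions, and the function name `diff_in_means_pvalue` is hypothetical.

```python
import math


def diff_in_means_pvalue(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Two-sided p-value for a difference in means.

    Assumes independent groups, an unequal-variance standard error,
    and a normal approximation to the test statistic.
    """
    se = math.sqrt(sd_t ** 2 / n_t + sd_c ** 2 / n_c)
    z = (mean_t - mean_c) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# "Number of households", Panel A: 40 treatment and 40 control schools.
p = diff_in_means_pvalue(43.050, 117.302, 40, 48.100, 89.883, 40)
print(round(p, 3))
```

Under these assumptions the result is roughly 0.83, close to the 0.832 reported in Panel A; small gaps are expected if the original computation used a t distribution or a regression-based test.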




Table B-2: Baseline Characteristics and Covariate Balance – Second Experiment
      Sample                               Control     API Original API Plus                All Evaluation
      (Number of Schools)                   (100)           (60)         (70)                    (230)
      Statistic                             Mean           Mean         Mean        Original-Control Plus-Control
                                            (SD)           (SD)          (SD)             (SE)            (SE)
                                             Panel A: School Characteristics
      Average Test score Spanish           431.88         431.65        431.36            -0.223           -0.516
                                           (64.43)        (66.87)      (66.60)           (2.581)          (2.783)
      Average Test score Math              455.75         454.85        451.68            -0.902           -4.076
                                           (80.71)        (83.50)      (81.06)           (5.790)          (6.911)
      Average Test score Science           440.15         441.24        441.27             1.095            1.120
                                           (52.52)        (48.66)      (50.89)           (4.273)          (4.784)
      Community Instructors                 1.220          1.300         1.200             0.080           -0.020
                                           (0.416)        (0.462)      (0.403)           (0.066)          (0.067)
      Number of Enrolled Students          15.160         15.314        14.233             0.154           -0.927
                                           (5.839)        (5.714)      (5.782)           (0.901)          (0.946)
                                    Panel B: Community Instructors’ Characteristics
      Lower than upper second.             0.067            0.062         0.066           -0.002           0.009
                                          (0.251)          (0.242)       (0.250)         (0.035)          (0.033)
      Lower than higher ed.                0.918            0.901         0.908           -0.000           0.002
                                          (0.276)          (0.300)       (0.291)         (0.044)          (0.040)
      Training weeks at baseline           4.515            4.704         4.500            0.128           -0.042
                                          (1.322)          (1.259)       (1.426)         (0.196)          (0.253)
      3rd and 4th grade students           3.655            3.986         3.716            0.346            0.137
                                          (2.434)          (2.286)       (2.230)         (0.349)          (0.356)
      5th and 6th grade students           3.517            3.838         3.507            0.325            0.054
                                          (2.408)          (2.507)       (2.298)         (0.354)          (0.352)
                                           Panel C: Household Characteristics
      Indigenous Language                   0.326         0.366        0.476               0.049           0.142
                                           (0.469)       (0.483)      (0.500)            (0.065)          (0.077)
      Read                                  0.715         0.686        0.734              -0.031            0.022
                                           (0.452)       (0.465)      (0.443)            (0.041)          (0.042)
      Less than Primary                     0.615         0.587        0.584              -0.028           -0.030
                                           (0.487)       (0.493)      (0.494)            (0.043)          (0.041)
      Upper Sec. or Higher                  0.015         0.016        0.019              -0.001            0.003
                                           (0.123)       (0.124)      (0.135)            (0.009)          (0.009)
      Oportunidades                         0.813         0.807        0.829              -0.003            0.015
                                           (0.391)       (0.395)      (0.377)            (0.033)          (0.031)
      Refrigerator                          0.397         0.387        0.373              -0.010           -0.019
                                           (0.490)       (0.488)      (0.485)            (0.047)          (0.055)
      Television                            0.692         0.738        0.651               0.048           -0.040
                                           (0.462)       (0.440)      (0.478)            (0.047)          (0.051)
      Car                                   0.084         0.081        0.063              -0.003           -0.019
                                           (0.277)       (0.273)      (0.244)            (0.027)          (0.024)
      Sewage                                0.254         0.253        0.320              -0.003            0.068
                                           (0.436)       (0.435)      (0.467)            (0.042)          (0.052)
      Phone                                 0.220         0.233        0.204               0.014           -0.014
                                           (0.414)       (0.423)      (0.404)            (0.037)          (0.038)
      Light                                 0.863         0.916        0.873               0.054           0.006
                                           (0.344)       (0.278)      (0.333)            (0.040)          (0.040)
                                          Panel D: Mentors’ Characteristics
                                        API Original  API Plus     Difference
      Variable                             Mean         Mean     Plus-Original
                                           (SD)         (SD)           (SE)
      Age                                 28.491       28.543         -0.135
                                          (3.760)      (3.075)       (0.650)
      Male                                 0.566        0.587         -0.064
                                          (0.500)      (0.498)       (0.097)
      High Edu Complete                    0.887        0.891         0.014
                                          (0.320)      (0.315)       (0.066)
      Previously Instructor                0.792        0.848         -0.079
                                          (0.409)      (0.363)       (0.072)
      Previously Education Assistant       0.075        0.065         0.014
                                          (0.267)      (0.250)       (0.049)
      Notes: The first three columns of the table report means, with standard deviations in parentheses, for various
     characteristics collected before the assignment of the API program in the evaluation sample. The school variables
     in Panel A are computed from the 2013 national standardized tests and from the 2013 school census. The other
     characteristics reported in Panels B-D are collected in the survey data. The differences reported in the last
     two columns of the table are based on OLS estimates of regression models that control for stratification
     dummies. Standard errors of the mean differences for the student characteristics are reported in parentheses
     in the last two columns and are clustered at the school level. See Appendix A for more details on the data
     sources.
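
The regression-based differences described in these notes can be illustrated with a minimal sketch. It compares a single treatment arm with the control group, absorbs the stratification dummies by demeaning within strata (the Frisch-Waugh-Lovell result), and clusters the standard error by school. The function name, the toy data, and the absence of a small-sample correction are assumptions made for illustration; the paper does not report these implementation details.

```python
import math
from collections import defaultdict


def strata_adjusted_diff(y, treat, strata, cluster):
    """Coefficient on `treat` from an OLS of y on treat plus strata dummies,
    with a cluster-robust standard error (no small-sample correction)."""

    def demean(values):
        # Subtract the stratum mean; equivalent to including strata dummies.
        sums, counts = defaultdict(float), defaultdict(int)
        for v, s in zip(values, strata):
            sums[s] += v
            counts[s] += 1
        return [v - sums[s] / counts[s] for v, s in zip(values, strata)]

    y_dm, t_dm = demean(y), demean(treat)
    sxx = sum(t * t for t in t_dm)
    beta = sum(t * v for t, v in zip(t_dm, y_dm)) / sxx

    # Sum the scores x_i * e_i within each cluster, then square and add up.
    scores = defaultdict(float)
    for t, v, g in zip(t_dm, y_dm, cluster):
        scores[g] += t * (v - beta * t)
    se = math.sqrt(sum(s * s for s in scores.values())) / sxx
    return beta, se


# Hypothetical toy data: 2 strata, 2 schools per stratum, 2 students per school.
y = [1.2, 0.8, 0.1, -0.1, 3.5, 3.1, 2.0, 2.2]
treat = [1, 1, 0, 0, 1, 1, 0, 0]
strata = ["A", "A", "A", "A", "B", "B", "B", "B"]
school = [1, 1, 2, 2, 3, 3, 4, 4]
beta, se = strata_adjusted_diff(y, treat, strata, school)
print(beta, se)
```

With this toy data the estimate is approximately 1.1 with a standard error of approximately 0.05. Handling both API arms in one regression, as the table does, would need a full multivariate OLS rather than this single-regressor shortcut.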
                        Table B-3: Characteristics of Dropout Mentors

                                                Original                  Plus   Plus - Original
 Former CONAFE facilitator                       0.689                    0.703       0.012
                                                (0.468)                  (0.463)    (0.102)
 At least 5 days of training                      0.467                   0.514       0.061
                                                (0.505)                  (0.507)    (0.111)
 Sleeps in community (y/n)                       0.711                    0.757       0.052
                                                (0.458)                  (0.435)    (0.097)
 Number of nights in community last week         3.022                    2.757      -0.301
                                                (2.061)                  (1.978)    (0.442)
 Number of students with personalized attention   6.049                   5.767      -0.284
                                                (0.835)                  (1.104)    (0.264)
 Days spent in community during last month       10.220                  10.200       0.063
                                                (4.613)                  (4.715)    (1.148)
 Number of students below Level 2                 3.450                   3.560       0.079
                                                (1.679)                  (1.660)    (0.440)
 Number of students below Level 3                 2.727                   2.731      -0.020
                                                (1.773)                  (1.845)    (0.488)
 Notes: This table reports means and standard deviations for the characteristics of the mentors who
dropped out of the schools to which they were originally assigned, across the API Original and API Plus
modalities. The differences reported in the last column of the table are based on OLS estimates of
regression models that control for stratification dummies. Standard errors of the mean differences for
the student characteristics are reported in parentheses in the last column and are clustered at the
school level. For detailed descriptions of the survey variables used in this table, see Appendix A.2.




             Table B-4: Characteristics of Mentors—Sample vs Phone Survey

                                          Original Sample   2022 Survey   Difference
 Age                                           28.443        27.556     0.888
                                               (3.260)      (3.941)    (1.150)
 Male                                           0.585        0.778      -0.193
                                               (0.495)      (0.441)    (0.171)
 High School Completed                          0.868        1.000      -0.132
                                               (0.340)      (0.000)    (0.114)
 Training Weeks                                 2.858         2.667     0.192
                                               (2.035)      (1.871)    (0.703)
 Experience as API                             21.274        13.444     7.829
                                              (10.058)      (6.803)    (3.425)
 Previously Local Instructor                    0.840         0.778      0.062
                                               (0.369)      (0.441)    (0.130)
 Previously Education Assistant                 0.085         0.000     0.085
                                               (0.280)      (0.000)    (0.094)
 Days Spent in the Community                   13.528        13.556     -0.027
                                               (5.331)      (4.876)    (1.840)
 Students Lagging Behind                        5.698         5.889     -0.191
                                               (1.657)      (3.018)    (0.621)
 Notes: This table reports means and standard deviations for the characteristics of the mentors
in the main sample of the analysis and those of the mentors who participated in the in-depth
phone interviews (2022). The differences reported in the last column of the table are based
on OLS estimates of regression models that control for stratification dummies. Standard
errors of the mean differences for the student characteristics are reported in parentheses in
the last column and are clustered at the school level. For detailed descriptions of the survey
variables used in this table, see Appendix A.2.




        Table B-5: Characteristics of Local Instructors—Sample vs. Phone Survey

                                          Original Sample        2022 Survey Difference
 Age                                           21.284               21.157     0.127
                                              (2.585)              (2.034)    (0.702)
 Male                                           0.560                0.786     -0.226
                                              (0.497)              (0.426)    (0.135)
 Lower than Upper Second                        0.062                0.071     -0.010
                                              (0.241)              (0.267)    (0.066)
 Upper Second Complete                          0.800                0.643     0.157
                                              (0.401)              (0.497)    (0.111)
 Above Upper Second                             0.138               0.286      -0.148
                                              (0.346)              (0.469)    (0.097)
 Experience in Months                          13.545               13.429      0.117
                                              (9.408)              (9.362)    (2.577)
 Training Weeks at Baseline                     4.768               5.500      -0.732
                                              (4.114)              (5.019)    (1.140)
 Time spent in the School                       9.509                9.071     0.438
                                              (4.220)              (3.269)    (1.146)
 Sleeps in the Community                        0.651                0.857     -0.206
                                              (0.478)              (0.363)    (0.130)
 Nights spent in the Community                  3.204                3.071      0.132
                                              (2.065)              (2.093)    (0.566)
 Notes: This table reports means and standard deviations for the characteristics of the local
instructors in the main sample of the analysis and those of the local instructors who participated
in the in-depth phone interviews (2022). The differences reported in the last column of the table
are based on OLS estimates of regression models that control for stratification dummies. Standard
errors of the mean differences for the student characteristics are reported in parentheses in the
last column and are clustered at the school level. For detailed descriptions of the survey
variables used in this table, see Appendix A.2.




           Table B-6: Treatment Assignment and School-Level Student Composition

                                       Repeat     Attrition     Outside CONAFE in t − 1           Same school in t − 1
 API Original                          -0.011      -0.018                -0.002                          0.019
                                       [0.116]     [0.322]               [0.895]                        [0.295]
 API Plus                              -0.010      -0.006                -0.003                          0.011
                                       [0.153]     [0.751]               [0.861]                        [0.574]

 H0: API Original = API Plus            [0.834]     [0.491]                 [0.911]                       [0.620]

 Observations                            1019        1019                    1019                          1019
 Number of Clusters                      224         224                     224                            224
 Notes: This table shows the estimated effects of the two API modalities on various measures of school-level changes in student
composition. The number of observations drops from 1045 to 1019 due to incomplete school identifiers (CURP) for 26
students. Asymptotic p-values, reported in brackets, account for clustering at the school level. For detailed descriptions
of the survey variables used in this table, see Appendix A.1.




Table B-7: Average Program Impacts by Subdomains of the Reading and the Math Scores
                                                   Panel A: Share of Correct Reading Answers by Subdomain
                                   Letter       Initial     Initial     Word         Word           Read           Listening     Dictation
                                   Name         Name        Sound       Recogn.      Reading        Comprehen.

 API Original                      0.103        0.006       0.122       0.129        0.075          0.118          -0.004        0.129
                                  [0.232]      [0.941]     [0.156]     [0.091]      [0.300]        [0.107]        [0.963]       [0.120]
                                  {0.285}      {0.949}     {0.194}     {0.124}      {0.341}        {0.138}        {0.968}       {0.173}
                                  (0.449)      (0.996)     (0.365)     (0.255)      (0.510)        (0.290)        (0.996)       (0.314)

 API Plus                          0.240        -0.019      0.042       0.318        0.197          0.321          0.123         0.378
                                  [0.005]      [0.816]     [0.565]     [0.000]      [0.014]        [0.000]        [0.145]       [0.000]
                                  {0.010}      {0.824}     {0.584}     {0.000}      {0.026}        {0.001}        {0.185}       {0.000}
                                  (0.005)      (0.789)     (0.728)     (0.000)      (0.021)        (0.000)        (0.226)       (0.000)

 API Original = API Plus          [0.180]      [0.771]     [0.343]     [0.039]      [0.183]        [0.023]        [0.094]       [0.005]
                                  {0.174}      {0.799}     {0.479}     {0.062}      {0.229}        {0.059}        {0.220}       {0.003}
                                  (0.328)      (0.727)     (0.421)     (0.077)      (0.328)        (0.045)        (0.194)       (0.010)

 Observations                       1044    1044       1044      1044       1044      1044        1044                            1044
 Clusters                           224      224       224        224        224       224        224                             224
                                                Panel B: Share of Correct Math Answers by Subdomain
                                   Number       Number      Missing     Add          Subtract       Problem        Shape
                                   Identif.     Discrim.    Number                                  Solving        Recogn.

 API Original                      0.094        0.036       0.099       0.011        0.061          -0.051         0.022
                                  [0.252]      [0.661]     [0.192]     [0.874]      [0.402]        [0.481]        [0.789]
                                  {0.301}      {0.681}     {0.226}     {0.882}      {0.447}        {0.511}        {0.800}
                                  (0.576)      (0.919)     (0.483)     (0.923)      (0.789)        (0.817)        (0.923)

 API Plus                          0.259        0.201       0.204       0.215        0.111          0.116          0.099
                                  [0.005]      [0.026]     [0.022]     [0.003]      [0.103]        [0.156]        [0.316]
                                  {0.011}      {0.036}     {0.035}     {0.008}      {0.130}        {0.200}        {0.365}
                                  (0.007)      (0.033)     (0.033)     (0.007)      (0.137)        (0.163)        (0.247)

 API Original = API Plus          [0.095]      [0.103]     [0.218]     [0.008]      [0.500]        [0.046]        [0.396]
                                  {0.163}      {0.129}     {0.420}     {0.020}      {0.514}        {0.080}        {0.550}
                                  (0.191)      (0.191)     (0.361)     (0.008)      (0.516)        (0.090)        (0.516)

 Observations                       1044        1044         1044        1044         1044           1044           1044
 Clusters                           224          224         224         224           224            224           224

 Notes : This table shows OLS estimates and the associated p-values of the two API modalities: API Original and API Plus for 1,044
students enrolled in third to sixth grade by the end of the second school year since treatment assignment. For detailed descriptions of the
sub-components of the reading and math scores used in this table, see Appendix A.2. The outcome variables are standardized using the
mean and the standard deviation of the control group. The inference procedures take into account clustering of the error terms at
the school level and the block randomization design at the strata level. p-values reported in brackets refer to the conventional asymptotic
inference. p-values reported in braces are computed using randomization inference (randomization-t). All p-values account for clustering at
the school level. p-values reported in parentheses are adjusted for testing each null hypothesis (null impact of API Original, API Plus, and
the comparison) on multiple outcomes through the step-wise procedure described in Romano and Wolf (2005a,b, 2016).
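
Two of the steps described in these notes, standardizing outcomes against the control group and computing randomization-inference p-values, can be sketched as follows. This is a simplified illustration, not the authors' procedure: it uses the raw difference in means as the test statistic instead of the studentized randomization-t, it re-draws assignments by shuffling treatment labels across schools within each stratum, and all function names and data are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def standardize_by_control(y, treated):
    """Express outcomes in control-group standard deviations."""
    control = [v for v, t in zip(y, treated) if t == 0]
    mu, sd = mean(control), stdev(control)
    return [(v - mu) / sd for v in y]


def randomization_pvalue(y, school, assignment, stratum, draws=2000, seed=0):
    """Two-sided randomization p-value for a difference in means.

    y and school hold one entry per student; assignment and stratum are
    dicts keyed by school, because treatment varies at the school level.
    """
    def diff(assign):
        t = [v for v, s in zip(y, school) if assign[s] == 1]
        c = [v for v, s in zip(y, school) if assign[s] == 0]
        return mean(t) - mean(c)

    observed = diff(assignment)
    schools_in = defaultdict(list)
    for s, k in stratum.items():
        schools_in[k].append(s)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(draws):
        placebo = {}
        for members in schools_in.values():
            # Shuffle labels within the stratum to mimic block randomization.
            labels = [assignment[s] for s in members]
            rng.shuffle(labels)
            placebo.update(zip(members, labels))
        if abs(diff(placebo)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / draws


# Hypothetical toy data: 8 schools in 2 strata, 2 students per school.
stratum = {f"s{i}": ("A" if i <= 4 else "B") for i in range(1, 9)}
assignment = {"s1": 1, "s2": 1, "s3": 0, "s4": 0,
              "s5": 1, "s6": 1, "s7": 0, "s8": 0}
raw = {"s1": [1.1, 0.9], "s2": [1.3, 1.0], "s3": [0.1, -0.2], "s4": [0.2, 0.0],
       "s5": [0.8, 1.2], "s6": [1.0, 1.1], "s7": [-0.1, 0.1], "s8": [0.3, 0.0]}
school = [s for s, scores in raw.items() for _ in scores]
y = [v for scores in raw.values() for v in scores]
z = standardize_by_control(y, [assignment[s] for s in school])
effect, p_value = randomization_pvalue(z, school, assignment, stratum)
print(effect, p_value)
```

Shuffling whole schools rather than individual students keeps the placebo assignments consistent with clustering at the school level. The Romano-Wolf adjustment for multiple outcomes is a separate step and is not covered by this sketch.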




                              Table B-8: Average Program Impacts by the Individual Components of the Socio-Emotional Score

                                                                                                              Panel A: First 16 Components
                                           (1)         (2)         (3)         (4)         (5)         (6)       (7)        (8)     (9)               (10)        (11)        (12)        (13)        (14)        (15)        (16)

       API Original                       0.040       -0.068  0.074   0.003              -0.008      0.026       0.072       -0.009      0.006       0.015       0.017       0.042       -0.013      -0.024      0.030       -0.020
                                         [0.293]     [0.041] [0.049] [0.943]            [0.835]     [0.477]     [0.047]     [0.818]     [0.863]     [0.679]     [0.646]     [0.205]     [0.737]     [0.410]     [0.348]     [0.563]
                                         {0.340}     {0.052} {0.065} {0.945}            {0.849}     {0.507}     {0.062}     {0.826}     {0.868}     {0.700}     {0.654}     {0.246}     {0.748}     {0.447}     {0.386}     {0.588}
                                         (0.989)     (0.370) (0.409) (1.000)            (1.000)     (0.999)     (0.393)     (1.000)     (1.000)     (1.000)     (1.000)     (0.934)     (1.000)     (0.997)     (0.994)     (0.999)

       API Plus                           0.125       0.058       0.057      -0.012      -0.014      0.038       0.096       -0.023  0.021           -0.007      0.055       0.056       0.047       0.061       0.040       0.003
                                         [0.001]     [0.136]     [0.158]    [0.773]     [0.720]     [0.317]     [0.019]     [0.584] [0.510]         [0.870]     [0.150]     [0.113]     [0.205]     [0.057]     [0.216]     [0.937]
                                         {0.002}     {0.168}     {0.204}    {0.798}     {0.748}     {0.352}     {0.035}     {0.607} {0.533}         {0.889}     {0.173}     {0.149}     {0.249}     {0.078}     {0.251}     {0.939}
                                         (0.010)     (0.775)     (0.813)    (0.999)     (0.999)     (0.972)     (0.157)     (0.997) (0.995)         (0.999)     (0.809)     (0.710)     (0.901)     (0.421)     (0.908)     (0.999)

       API Original = API Plus           [0.044]     [0.002]     [0.690]    [0.721]     [0.863] [0.777] [0.560] [0.739]                 [0.696]     [0.595]     [0.380]     [0.706]     [0.141]     [0.014]     [0.759]     [0.532]
                                         {0.073}     {0.003}     {0.641}    {0.758}     {0.894} {0.812} {0.772} {0.795}                 {0.680}     {0.637}     {0.413}     {0.796}     {0.174}     {0.024}     {0.789}     {0.580}
                                         (0.367)     (0.013)     (1.000)    (1.000)     (1.000) (1.000) (1.000) (1.000)                 (1.000)     (1.000)     (0.998)     (1.000)     (0.843)     (0.119)     (1.000)     (0.999)

       Observations                       1045        1045        1045        1045        1045        1045    1045      1045     1045    1045                     1045        1045        1045        1045        1045        1045
       Clusters                           224          224        224         224         224          224     224       224     224      224                     224          224        224         224         224          224
                                                                                                           Panel B: Second 16 Components
                                           (17)        (18)        (19)        (20)        (21)       (22)    (23)      (24)     (25)    (26)                     (27)        (28)        (29)        (30)        (31)        (32)

       API Original                       -0.005      -0.050  0.015   -0.030             0.044       -0.034      0.085       -0.026      0.040       0.026       0.060       0.010       0.075       0.002       0.024       0.033
                                         [0.882]     [0.138] [0.677] [0.405]            [0.178]     [0.116]     [0.020]     [0.450]     [0.328]     [0.519]     [0.054]     [0.720]     [0.044]     [0.956]     [0.553]     [0.301]
                                         {0.894}     {0.159} {0.707} {0.448}            {0.192}     {0.143}     {0.038}     {0.491}     {0.370}     {0.564}     {0.076}     {0.730}     {0.067}     {0.967}     {0.564}     {0.345}
                                         (1.000)     (0.823) (1.000) (0.997)            (0.905)     (0.757)     (0.189)     (0.998)     (0.991)     (0.999)     (0.436)     (1.000)     (0.381)     (1.000)     (0.999)     (0.989)

       API Plus                           0.073       -0.009      0.091      0.021       0.040       -0.013      0.077       0.071   0.045           0.037       0.100       0.053       0.020       0.036       0.037       0.007
                                         [0.018]     [0.807]     [0.014]    [0.559]     [0.214]     [0.547]     [0.031]     [0.048] [0.305]         [0.336]     [0.005]     [0.049]     [0.613]     [0.344]     [0.327]     [0.838]
                                         {0.028}     {0.817}     {0.028}    {0.586}     {0.245}     {0.608}     {0.045}     {0.065} {0.353}         {0.379}     {0.009}     {0.071}     {0.647}     {0.366}     {0.383}     {0.846}
                                         (0.154)     (0.999)     (0.117)    (0.997)     (0.908)     (0.997)     (0.258)     (0.371) (0.972)         (0.972)     (0.037)     (0.379)     (0.997)     (0.972)     (0.972)     (0.999)

       API Original = API Plus           [0.018]     [0.246]     [0.055]    [0.191]     [0.923] [0.350] [0.848] [0.012]                 [0.925]     [0.796]     [0.301]     [0.193]     [0.203]     [0.422]     [0.735]     [0.494]
                                         {0.037}     {0.298}     {0.092}    {0.233}     {0.933} {0.408} {0.896} {0.027}                 {0.960}     {0.775}     {0.444}     {0.175}     {0.210}     {0.463}     {0.742}     {0.493}
                                         (0.146)     (0.966)     (0.432)    (0.935)     (1.000) (0.996) (1.000) (0.102)                 (1.000)     (1.000)     (0.989)     (0.935)     (0.937)     (0.998)     (1.000)     (0.999)

       Observations                       1045        1045        1045        1045        1045        1045        1045        1045        1045        1045        1045        1045        1045        1045        1045        1044
       Clusters                           224          224        224         224         224          224         224         224        224          224        224          224        224         224         224          224
       Notes: This table shows OLS estimates and the associated p-values of the two API modalities (API Original and API Plus) for 1,044 students enrolled in third to sixth grade by the end of the second school year since
      treatment assignment. The individual components of the socio-emotional score are indicator variables for whether the child displays one of the following emotions/behaviors: 1) has sudden mood changes, 2) feels
      or complains that nobody loves him/her, 3) is tense or nervous, 4) lies or cheats, 5) is scared or anxious, 6) talks and argues too much, 7) has difficulty focusing on a specific activity for an extended period of time,
      8) gets easily confused, 9) seems to have his/her head in the clouds, 10) threatens or is mean to other children, 11) tends to challenge parental authority, 12) does not feel guilty after a bad deed, 13) does not get
      along with other children, 14) is impulsive or acts “fast” without thinking, 15) feels inferior, 16) has no friends, 17) has difficulty letting go of certain thoughts, 18) is hyperactive, 19) has a bad temper or is
      irascible, 20) loses his/her temper easily, 21) feels unhappy, sad, or depressed, 22) is shy, does not socialize with others, 23) breaks objects on purpose, 24) is too attached to adults, 25) cries too much, 26) demands
      a lot of attention, 27) is too dependent on others, 28) is afraid of other people’s judgement, 29) tends to be in bad company, 30) is reserved, keeps things to himself/herself, 31) worries about everything, 32)
      misbehaves at school and does not respect the instructor (see Appendix A.2). The inference procedures take into account clustering of the error terms at the school level and the block randomization design at the strata
      level. p-values reported in brackets refer to the conventional asymptotic inference. p-values reported in braces are computed using randomization inference (randomization-t). All p-values account for clustering at the
      school level. p-values reported in parentheses are adjusted for testing each null hypothesis (null impact of API Original, API Plus, and the comparison) on multiple outcomes through the stepwise procedure described in
      Romano and Wolf (2005a,b, 2016).
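
The randomization-inference p-values reported in braces can be illustrated with a minimal sketch. It assumes a two-arm comparison on school-level mean outcomes, reshuffles the treatment labels across schools within each randomization stratum, and uses a Welch-type t-statistic as the studentized statistic. The function names, the toy data, the choice of statistic, and the "+1" finite-sample correction are all illustrative assumptions, not the estimation code behind the tables.

```python
import math
import random
from statistics import mean, variance


def t_stat(y, d):
    """Welch-type t-statistic for the treated-minus-control difference in means."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    diff = mean(treated) - mean(control)
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    if se == 0:  # degenerate draw: no variation within either group
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def randomization_p_value(y, d, strata, draws=2000, seed=0):
    """Two-sided p-value from reshuffling treatment labels within strata.

    y, d, strata hold one entry per school (hypothetical school-level data).
    """
    rng = random.Random(seed)
    observed = abs(t_stat(y, d))

    # Group school indices by stratum so that labels are permuted within blocks.
    by_stratum = {}
    for i, s in enumerate(strata):
        by_stratum.setdefault(s, []).append(i)

    exceed = 0
    for _ in range(draws):
        d_perm = list(d)
        for idx in by_stratum.values():
            labels = [d[i] for i in idx]
            rng.shuffle(labels)
            for i, label in zip(idx, labels):
                d_perm[i] = label
        if abs(t_stat(y, d_perm)) >= observed:
            exceed += 1

    # Assumed convention: count the observed assignment as one of the draws.
    return (exceed + 1) / (draws + 1)


# Hypothetical toy data: 12 schools in two strata, half of them treated.
y = [0.0, 0.1, -0.1, 1.0, 1.1, 0.9, 0.2, 0.0, 0.1, 1.2, 1.0, 1.1]
d = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
strata = ["A"] * 6 + ["B"] * 6
print(randomization_p_value(y, d, strata))
```

Working with school-level means is one simple way to respect the clustering of the design, because treatment varies only across schools.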
                                          Table B-9: Average Program Impacts by the Individual Components of Parental Investments
                      Engage with School                                              Manage School Resources                                                         Engage with Child
                      Volunteering    Donate          Donate          Food            Manage School   Propose School  Decide School   Decide Money    Evaluate School Help With       Extra-Academic  Meeting         Expect Upper
                                      Cash            In-Kind         Instructor      Resources       Material        Material        Allocation      Targets         Homework        Activities      Teachers        Secondary
                      Panel A: First Experiment
API Original           0.042           0.118           0.063           0.046          -0.042           0.026          -0.009           0.002          -0.040           0.210           0.055           0.203           0.025
                      [0.417]         [0.126]         [0.478]         [0.560]         [0.579]         [0.726]         [0.912]         [0.974]         [0.487]         [0.358]         [0.528]         [0.291]         [0.608]
                      {0.435}         {0.147}         {0.494}         {0.566}         {0.597}         {0.734}         {0.916}         {0.971}         {0.512}         {0.382}         {0.524}         {0.322}         {0.626}
                      (0.955)         (0.475)         (0.969)         (0.969)         (0.969)         (0.969)         (0.983)         (0.983)         (0.969)         (0.928)         (0.969)         (0.872)         (0.969)

Clusters               73              73              73              73              73              73              73              73              73              73              73              73              73
Observations           208             208             207             208             208             208             208             208             208             208             207             208             199
                      Panel B: Second Experiment
API Original          -0.031          -0.004          -0.058          -0.058          -0.029          -0.070          -0.062          -0.010          -0.027           0.222           0.074           0.043           0.010
                      [0.356]         [0.894]         [0.130]         [0.042]         [0.471]         [0.095]         [0.122]         [0.772]         [0.389]         [0.027]         [0.082]         [0.568]         [0.781]
                      {0.884}         {0.981}         {0.452}         {0.194}         {0.917}         {0.369}         {0.452}         {0.981}         {0.888}         {0.137}         {0.350}         {0.942}         {0.981}
                      (0.377)         (0.902)         (0.155)         (0.057)         (0.488)         (0.123)         (0.153)         (0.783)         (0.422)         (0.048)         (0.117)         (0.598)         (0.791)

API Plus               0.036           0.018           0.044           0.071           0.069           0.001           0.006           0.010           0.018           0.221           0.108           0.192           0.094
                      [0.289]         [0.625]         [0.329]         [0.013]         [0.095]         [0.978]         [0.890]         [0.776]         [0.570]         [0.066]         [0.015]         [0.020]         [0.019]
                      {0.765}         {0.953}         {0.778}         {0.062}         {0.323}         {0.977}         {0.977}         {0.977}         {0.953}         {0.245}         {0.063}         {0.072}         {0.072}
                      (0.341)         (0.666)         (0.364)         (0.024)         (0.128)         (0.977)         (0.901)         (0.791)         (0.598)         (0.105)         (0.025)         (0.037)         (0.034)

Clusters               224             224             224             224             224             224             224             223             224             224             224             223             224
Observations           1042            1042            1039            1042            1033            1036            1027            1031            1029            1044            1033            974             1017
        Notes: This table shows OLS estimates and the associated p-values of the two API modalities (API Original and API Plus) for 1,044 students enrolled in third to sixth grade by the end of the second school year since treatment assignment. For detailed descriptions of the
       sub-components of the reading and math scores used in this table, see Appendix A.2. The outcome variables are standardized with respect to their means and standard deviations in the control group. The inference procedures take into account clustering of the error terms
       at the school level and the block randomization design at the strata level. p-values reported in brackets refer to the conventional asymptotic inference. p-values reported in braces are computed using randomization inference (randomization-t). All p-values account for clustering
       at the school level. p-values reported in parentheses are adjusted for testing each null hypothesis (null impact of API Original, API Plus, and the comparison) on multiple outcomes through the stepwise procedure described in Romano and Wolf (2005a,b, 2016).
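
The standardization mentioned in the notes (outcomes expressed relative to the control-group mean and standard deviation) can be sketched as follows. The function name and the toy values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the notes do not say which variant is used.

```python
from statistics import mean, stdev


def standardize_to_control(values, is_control):
    """Express every outcome in control-group standard-deviation units.

    values: raw outcomes for all units; is_control: parallel list of booleans.
    """
    control = [v for v, c in zip(values, is_control) if c]
    mu = mean(control)
    sigma = stdev(control)  # assumption: sample standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: three control units and one treated unit.
z = standardize_to_control([1.0, 2.0, 3.0, 10.0], [True, True, True, False])
print(z)
```

With this scaling, a coefficient of 0.1 reads as one-tenth of a control-group standard deviation.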
                  Table B-10: Remedial Education Sessions and Student Test Scores
                                                  Reading Score   Math Score      Socio-Emotional Score   Overall Index
API Original × Rank≥7                              0.193           0.023           0.147                   0.192
                                                  [0.105]         [0.844]         [0.313]                 [0.177]

API Plus × Rank≥7                                  0.423           0.274           0.206                   0.430
                                                  [0.001]         [0.055]         [0.140]                 [0.003]

API Original × Rank<7                              0.078           0.045           0.034                   0.074
                                                  [0.431]         [0.641]         [0.728]                 [0.487]

API Plus × Rank<7                                  0.261           0.224           0.183                   0.327
                                                  [0.011]         [0.042]         [0.082]                 [0.003]

H0: Original=Plus (<7)                            [0.104]         [0.095]         [0.192]                 [0.039]
H0: Original=Plus (≥7)                            [0.072]         [0.081]         [0.721]                 [0.144]
H0: [Original-Plus (<7)]=[Original-Plus (≥7)]     [0.766]         [0.675]         [0.639]                 [0.937]

Observations                                       1044            1044            1045                    1045
Clusters                                           224             224             224                     224
 Notes: This table shows the estimates for the API program once we interact the treatment assignment dummies with indicators of whether
a child is among the six lowest-performing children in the class on the diagnostic test (Rank<7) or not (Rank≥7), which is one of the
main determinants of participation in the one-on-one remedial sessions with the mentors (see Appendix Figure B-2). Reading, math, and
socio-emotional scores are standardized with respect to the mean and the standard deviation of the control group. See Appendix A.2 for a
detailed description of the outcome variables. Asymptotic p-values reported in brackets are clustered at the school level.
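
As an illustration of how the interaction terms in this table could be built, the sketch below ranks children within a class on the diagnostic test (rank 1 is the lowest score), flags those with rank below 7, and forms the four treatment-by-rank dummies. The function names, the arm labels, the tie-breaking rule (ties keep their original order), and the toy scores are assumptions made for the example only.

```python
def low_rank_flags(scores, cutoff=7):
    """Return True for children whose within-class rank is below `cutoff`.

    Rank 1 is the lowest diagnostic score; ties are broken by list order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flags = [False] * len(scores)
    for rank, i in enumerate(order, start=1):
        flags[i] = rank < cutoff
    return flags


def interaction_dummies(arm, low_rank):
    """Treatment-by-rank indicators; `arm` is 'control', 'original', or 'plus'."""
    return {
        "original_x_rank_ge7": int(arm == "original" and not low_rank),
        "plus_x_rank_ge7": int(arm == "plus" and not low_rank),
        "original_x_rank_lt7": int(arm == "original" and low_rank),
        "plus_x_rank_lt7": int(arm == "plus" and low_rank),
    }


# Hypothetical class of eight children in an API Plus school.
scores = [55, 90, 40, 70, 65, 80, 30, 85]
flags = low_rank_flags(scores)
rows = [interaction_dummies("plus", flag) for flag in flags]
print(flags)
print(rows[0])
```

Children in control schools get zeros on all four dummies, so each coefficient compares a treatment-by-rank cell with the control group under whatever additional controls the regression includes.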




                                      Table B-11: Teacher Pedagogical Practices
                          Learning Activities     Engage With Students    Manage Time             Use of Material         Overall Index
API Original               0.006                  -0.019                   0.178                  -0.142                  -0.040
                          [0.960]                 [0.903]                 [0.264]                 [0.388]                 [0.755]
                          {0.962}                 {0.911}                 {0.292}                 {0.399}                 {0.765}
                          (0.982)                 (0.982)                 (0.556)                 (0.726)                 (0.969)

API Plus                  -0.081                   0.064                  -0.030                  -0.029                  -0.180
                          [0.555]                 [0.651]                 [0.843]                 [0.845]                 [0.169]
                          {0.569}                 {0.654}                 {0.848}                 {0.858}                 {0.168}
                          (0.919)                 (0.919)                 (0.960)                 (0.960)                 (0.357)

API Original = API Plus   [0.566]                 [0.622]                 [0.206]                 [0.528]                 [0.318]
                          {0.583}                 {0.600}                 {0.248}                 {0.567}                 {0.348}
                          (0.847)                 (0.847)                 (0.470)                 (0.847)                 (0.616)

Observations               265                     265                     265                     265                     265
Clusters                   209                     209                     209                     209                     209
 Notes: This table shows OLS estimates and the associated p-values of the API Original and the API Plus modalities on teachers’ pedagogical
practices (Stallings Classroom Snapshot). The outcome variables are standardized with respect to their means and standard deviations in the
control group. The inference procedures take into account clustering of the error terms at the school level and the block randomization design
at the strata level. p-values reported in brackets refer to the conventional asymptotic inference. p-values reported in braces are computed using
randomization inference (randomization-t). All p-values account for clustering at the school level. p-values reported in parentheses are adjusted
for testing each null hypothesis (null impact of API Original, API Plus, and the comparison) on multiple outcomes through the stepwise procedure
described in Romano and Wolf (2005a,b, 2016).
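
The multiplicity adjustment behind the p-values in parentheses can be illustrated with a minimal sketch of a stepdown max-statistic procedure in the spirit of Romano and Wolf (2016). It assumes that the observed test statistics and a set of resampled statistics generated under the null are already available; the function name, the input layout, and the toy numbers are hypothetical, and how the null draws are produced (bootstrap or permutation) is left outside the sketch.

```python
def stepdown_adjusted_p_values(observed, null_draws):
    """Stepdown max-statistic adjusted p-values.

    observed: one test statistic per hypothesis.
    null_draws: list of resampling draws, each a list of null statistics
                in the same hypothesis order as `observed`.
    """
    # Visit hypotheses from the largest to the smallest absolute statistic.
    order = sorted(range(len(observed)), key=lambda s: -abs(observed[s]))
    adjusted = [None] * len(observed)
    previous = 0.0
    for position, s in enumerate(order):
        remaining = order[position:]  # hypotheses not yet stepped past
        exceed = sum(
            1
            for draw in null_draws
            if max(abs(draw[r]) for r in remaining) >= abs(observed[s])
        )
        p = (exceed + 1) / (len(null_draws) + 1)
        previous = max(previous, p)  # enforce monotonicity along the ordering
        adjusted[s] = previous
    return adjusted


# Hypothetical toy input: two outcomes and four resampling draws.
observed = [3.0, 1.0]
null_draws = [[0.5, 0.2], [1.5, 0.1], [0.3, 2.0], [0.1, 0.1]]
print(stepdown_adjusted_p_values(observed, null_draws))
```

Because each step takes the maximum only over the hypotheses that remain, the adjustment is less conservative than comparing every statistic with the maximum over all outcomes.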




Table B-12: Placebo Test for Years of API Plus Exposure Within Experimental Schools
                          Spanish                 Math                    Science
1 Year                    -0.049      -0.019       0.074       0.025       0.098       0.155
                          [0.773]     [0.920]     [0.665]     [0.890]     [0.584]     [0.444]

2 Years                   -0.020      -0.007      -0.021      -0.029       0.003      -0.073
                          [0.913]     [0.971]     [0.920]     [0.896]     [0.985]     [0.715]

3 Years                   -0.235      -0.131      -0.218      -0.123      -0.371      -0.328
                          [0.339]     [0.618]     [0.368]     [0.626]     [0.092]     [0.176]

Observations               207         207         207         207         207         207
Controls for Criteria      No          Yes         No          Yes         No          Yes
 Notes: This table shows OLS estimates and the associated p-values of the years of exposure
to the API program during the transition between the second experiment and the government
implementation of the Plus modality. For detailed descriptions of the 2013 school-average test
scores used in this table as outcome variables, see Appendix A.1. Asymptotic p-values reported
in brackets are clustered at the school level.


   Table B-13: Placebo Test for API Plus Assignment During Policy Implementation
                          Spanish                 Math                    Science
API Plus                  -0.246      -0.045      -0.231      -0.053      -0.205      -0.004
                          [0.000]     [0.409]     [0.000]     [0.330]     [0.000]     [0.945]

Observations               1702        1702        1702        1702        1702        1702
Controls for Criteria      No          Yes         No          Yes         No          Yes
 Notes: This table shows OLS estimates and the associated p-values of assignment to API
Plus in the fall of 2017. For detailed descriptions of the 2013 school-average test scores used
in this table as outcome variables, see Appendix A.1. Asymptotic p-values reported in brackets
are clustered at the school level.



