A Randomized Evaluation of a Low-Cost and Highly Scripted Teaching Method to Improve Basic Early Grade Reading Skills in Papua New Guinea

Early grade literacy skills are crucial for children's future education and ultimately their contribution to human capital formation and economic development. A significant challenge in development is identifying low-cost interventions to improve early literacy skills in contexts characterized by varying teacher ability and severe budget constraints. This paper evaluates the impact of Papua New Guinea's randomized Reading Booster Programme, which was conducted in Madang and Western Highlands Province in 2013 and 2014, respectively. The program provided teachers with training on a highly structured teaching method that they could apply one hour per day within the teaching time allocated to reading. Using the randomized assignment of schools into the program, the paper shows that it had a substantial impact on the reading skills targeted by the program for third grade students, ranging from 0.6 to 0.7 standard deviation. Large effects on other reading skills were found for girls but not boys. The program's cost per student was approximately US$60.


Introduction
The relationship between young children's literacy skills, future human capital formation, and subsequent economic development is an increasingly researched theme in economic development.
The importance of literacy to individual productivity, including the diffusion of technology, is well established in developing countries (Basu and Foster 1998; Rosenzweig 1995), and literacy has been described as a threshold for economic development (Azariadis and Drazen 1990). Gaps in reading skills persist as children age (Butler et al. 1985); as a result, early literacy skills are an important determinant of a child's future education outcomes, including future literacy skills (Marteleto et al. 2008; Entwisle et al. 2005; Jimerson et al. 2000; Alexander et al. 1997).
Despite the importance of these skills for development, developing countries struggle to provide children with basic literacy skills, even after substantial progress towards the 1990 Education for All goals. For example, in a recent regional assessment of 10 countries in Francophone Africa, the Programme d'analyse des systèmes éducatifs de la CONFEMEN (PASEC), 71.4 percent of 2nd grade students and 57.3 percent of 5th grade students on average do not achieve minimum proficiency in literacy (PASEC 2015:36,50). In the Pacific, assessments of early age literacy in Tonga in 2009 and in Vanuatu in 2011 found that after three years of schooling, only 30 percent of students in Tonga and 25 percent of students in Vanuatu were able to read fluently for comprehension (World Bank 2012a, 2012b, 2012c). In Kiribati and Tuvalu, 20 percent of 3rd grade children achieved minimum reading comprehension proficiency (World Bank 2017a, b). In Papua New Guinea, early grade reading assessments conducted in four provinces between 2011 and 2013 found that students lagged two years behind curriculum targets for fundamental pre-reading skills (World Bank 2014a, 2014b, 2014c, 2014d).
International research has identified basic skills that young children need in order to read alphabetic languages (Linan-Thompson and Vaughn 2007; Wolf 2007; Sprenger-Charolles 2004; Chiappe et al. 2002; see also Gove and Cvelich 2011 and National Reading Panel 2000). Among these are an understanding of the relationship between printed letters and sounds (Scarborough 2002), the speed at which a child can read (Abadzi 2006), and oral reading fluency (Fuchs et al. 2001).
An important challenge is how to ensure children in the early grades of school acquire these skills in a context where teachers have varying and often few formal qualifications, implementation capacity is weak, and budgets allow for no or limited expenditure beyond teachers' salaries. In this context, several randomized controlled trials of interventions involving heavily scripted and systematic instructional approaches have been shown to be successful at improving early grade reading skills.
This paper evaluates a similar though smaller scale approach of scripted and systematic instruction piloted as a randomized controlled trial in two provinces in Papua New Guinea. In this approach, teachers use one hour-long lesson per day from the time allocated to language instruction to follow highly scripted lessons on teaching reading skills. Unlike other approaches, this is not a comprehensive reading instruction program, but rather provides remedial lessons on reading that are low cost to implement. As a result, this paper contributes to the literature by showing that low-cost, scripted instructional approaches in a remedial course format can have significant effects on early grade reading skills in a developing country context.

Papua New Guinea and the Reader Booster Programme
Papua New Guinea is classified by the World Bank as a lower-middle-income country. Its per capita gross national income was 2,240 USD in 2014. Economic growth has been around 5 percent per annum over the last 16 years and is estimated at 6.6 percent per year between 2012 and 2016. The projected population in 2015 is 7.6 million people, and the adult literacy rate is projected to be 63.4 percent in 2015 (World Bank 2016a).

Its primary education sector has experienced significant growth in participation: from 2008 to 2012, during which time its universal basic implementation plan was implemented, primary enrollment increased from 600,000 to 1.4 million students. The primary gross enrollment ratio increased from 60.5 to 114.7, and the primary net enrollment rate was 86 percent according to the latest available data from 2012 (World Bank 2016b). Children begin school with three years of elementary school followed by six years of primary school. The language of instruction in elementary school is a local language, while English is the official language of instruction for primary schools.
Despite the significant increase in enrollment, learning outcomes remain poor. Eighth grade exam results reveal poor outcomes for literacy and numeracy (World Bank 2011). Early grade reading assessments conducted in the National Capital District, Madang Province, Western Highlands Province and East New Britain from 2011 to 2013 found that children's crucial pre-reading skills, including "alphabetic principle and phonetics", were two years behind the curriculum target. They also found that students took five years to attain some reading skill objectives required by the first-
Time for these lessons was scheduled during the curriculum time allocated for language instruction; the national curriculum allows teachers flexibility in which materials they use and how they teach. In addition to training, teachers were provided with mentoring and coaching.

The intervention targeted three key pre-reading skills: initial sound identification, letter sound knowledge, and word reading. These domains in addition to several others were tested in the series of early grade reading assessments conducted before and after the interventions were implemented.
In both provinces, schools were randomly sampled and assigned to either a treatment group which received the intervention or a control group which did not. In Madang Province, 15 schools were assigned to the treatment group while 16 schools were assigned to a control group. In Western Highlands Province, 23 schools were assigned to the treatment group and 23 to the control group.
The intervention was implemented in the treatment schools in Madang province in 2013.
However, an unpublished government report indicates that the intervention was delayed until late in the school year due to various logistical issues; consequently, the Madang students were not fully exposed to the intervention (Government of Papua New Guinea 2016). The intervention in the Western Highlands Province was implemented in 2014.

A. Early grade reading assessment and timeline
In order to measure the impacts of the interventions, an Early Grade Reading Assessment (World Bank 2014b,c) was administered before and after each intervention in Madang and Western Highlands Provinces. The assessment measures basic reading skill domains: letter recognition, phonemic awareness, phonics, word reading, oral reading fluency, reading comprehension, listening comprehension and alphabetic principle. These skills are measured on nine sub-tests: letter name knowledge, initial sounds of words, letter sound knowledge, familiar word reading, unfamiliar word reading, reading comprehension, listening comprehension, oral reading fluency and dictation (World Bank 2014a:23). The intervention aims to improve three reading skill domains measured in these data: word initial sound knowledge, letter sound knowledge and familiar word knowledge. However, the letter sound domain is excluded from this analysis, as the government report as well as an unpublished reliability analysis found the domain to be unreliable in the end-line Madang dataset.
The Early Grade Reading Assessment was conducted four times, both before and after the interventions in the two provinces. In Madang, the intervention occurred in 2013, and the assessments were conducted in 2011 and at the end of the school year in 2013. In Western Highlands Province, the intervention occurred in 2014, and the assessments were conducted in 2013 and at the end of the school year in 2014. The assessments were not implemented as a panel: a new sample of students was drawn in each round.
The World Bank provided four data sets, one for each of the four rounds of the assessment. These data included scores for each of the 9 reading skill domains, sample weights, and several variables about the schools and students, which are described below. The reading domain scores contain a proportion of zero scores, which varies depending on the domain, suggesting a truncated distribution. As a result, the reading domain scores are standardized using the mean and standard deviation of baseline control-group students, estimated using a Tobit model.
Table 1 presents the number of schools sampled in each round of EGRA. In Madang province, the baseline sample included 11 of the 15 control schools and 10 of the 16 treatment schools. End-line assessment data are missing for 5 of the sampled control schools and 3 of the sampled treatment schools. The unpublished government report states that this is due to logistical, financial and weather issues; it is unlikely that the interventions had any effect on school attrition in Madang.
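The standardization step described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the Tobit model used for standardization is an intercept-only model with scores left-censored at zero, and it swaps in a simple EM algorithm to obtain the maximum likelihood estimates. The function names and the example scores are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def censored_normal_fit(scores, lower=0.0, iterations=500):
    """Estimate the mean and SD of a latent normal score that is
    left-censored at `lower` (an intercept-only Tobit), using EM.

    Assumes at least some scores lie above `lower` and are not all equal.
    """
    n = len(scores)
    observed = [s for s in scores if s > lower]
    n_cens = n - len(observed)

    # Start from the naive mean and SD of the censored data.
    mu = sum(scores) / n
    sigma = (sum((s - mu) ** 2 for s in scores) / n) ** 0.5

    for _ in range(iterations):
        # E-step: moments of the latent score given that it is <= lower.
        a = (lower - mu) / sigma
        lam = STD_NORMAL.pdf(a) / STD_NORMAL.cdf(a)
        cens_mean = mu - sigma * lam
        cens_var = sigma ** 2 * (1 - a * lam - lam ** 2)

        # M-step: update the mean and SD using the expected latent values.
        mu_new = (sum(observed) + n_cens * cens_mean) / n
        ss = sum((s - mu_new) ** 2 for s in observed)
        ss += n_cens * (cens_var + (cens_mean - mu_new) ** 2)
        mu, sigma = mu_new, (ss / n) ** 0.5
    return mu, sigma


def standardize(scores, mu, sigma):
    """Express scores in SD units of the reference (control, baseline) group."""
    return [(s - mu) / sigma for s in scores]


# Hypothetical baseline control-group scores with a mass of zeros.
control_baseline = [0, 0, 0, 2, 4, 5, 7, 9]
mu_hat, sigma_hat = censored_normal_fit(control_baseline)
z_scores = standardize([0, 5, 9], mu_hat, sigma_hat)
```

Because the zeros are treated as censored rather than as true values, the estimated mean is lower and the estimated standard deviation larger than the naive sample statistics. Sample weights, which the paper's data include, are ignored in this sketch.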

B. Sample sizes and attrition
Four of the remaining control schools and 5 of the remaining treatment schools were added to the sample, but there are no baseline data for these additional schools. In Western Highlands Province (WHP), 10 each of the 23 treatment and control schools were sampled at baseline and all 23 treatment and control schools were sampled at end-line. For the additional 13 control and 13 treatment schools included in the end-line data, there are no baseline data.
Sample sizes by grade, province and school treatment status are described in Table 2. Second grade students were sampled only in the Madang province baseline round, and the samples for the Western Highlands Province include 4th grade students only at baseline. Each round of EGRA is sampled as a repeated cross-section, and, because of the timing, no cohort was sampled more than once except for the 2nd grade students in the Madang baseline; they were in 4th grade at the time of the Madang end-line sampling. The data used in this evaluation are those for grade 3 only. The 4th grade sample at end-line does not include any students from Western Highlands Province, and while the 4th grade sample in Madang province could be used with the 2nd grade sample, the sample of students in schools that were included in both baseline and end-line is small.
Table 3 compares baseline reading achievement scores between students in schools that appeared in the baseline and not the end-line ("attrition schools") and in schools that appeared in both the baseline and end-line ("non-attrition schools"). None of the reading domains' differences are statistically significant; however, a difference as large as 0.3 standard deviation cannot be rejected for the letter names domain. With the exception of this domain, the data suggest little difference in reading achievement between the attrition schools and schools appearing in both rounds of the survey.
Note: Standard errors presented in parentheses. Statistical significance at the 10, 5 and 1 percent levels denoted by *, **, and ***, respectively.
Students in attrition and non-attrition schools differ in terms of their background characteristics, as compared in Table 4, but neither has a clear advantage. Students in attrition schools are less likely to have printed materials at home to read and more likely to be absent for more than two weeks from school in the previous year but, at the same time, are more likely to have someone read to them at home and are in smaller classes. They are also less likely to be in multi-grade classes and to be tested in the national language, Tok Pisin, rather than English. Neither attrition nor non-attrition school students have a consistent advantage in terms of background characteristics.
Note: Table presents estimates of the difference in differences in background variables at baseline between students in treatment and control schools and in schools included and not included in the end-line sample. Standard errors presented in parentheses. Statistical significance at the 10, 5 and 1 percent levels denoted by *, **, and ***, respectively.

C. Baseline balance
Table 5 presents the difference in baseline reading achievement scores between treatment and control group students for those in all schools sampled at baseline and for those in schools appearing in both the baseline and end-line samples; positive values indicate that the treatment group has a higher estimate than the control group. For all baseline schools, large and statistically significant differences exist between the treatment and control groups in three reading domains: familiar words, dictation and oral reading fluency. Several other domains have differences that, while not statistically different from zero, are also not statistically different from 0.2 standard deviation. In other words, moderate differences between the treatment and control groups cannot be ruled out. In the sample of students in non-attrition schools, the differences between treatment and control groups tend to be lower. None of the reading domain differences for these students are statistically different from zero, but several, including dictation and oral reading fluency, are not statistically different from 0.3 standard deviation either.
Note: Table presents estimates of the difference in reading score between treatment and control groups. Positive differences imply that the treatment group has a higher score than the control group. Standard errors presented in parentheses. Statistical significance at the 10, 5 and 1 percent levels denoted by *, **, and ***, respectively.
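The statements above that a difference is "not statistically different from" zero, 0.2 or 0.3 standard deviation amount to checking whether that value lies inside a confidence interval around the estimate. A minimal sketch, assuming a normal approximation with a 95 percent interval; the function name and the numbers are hypothetical and are not taken from Tables 3 or 5:

```python
def cannot_reject(estimate, std_error, hypothesized, z_crit=1.96):
    """Return True if `hypothesized` lies inside the interval
    estimate +/- z_crit * std_error (95 percent by default)."""
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return lower <= hypothesized <= upper


# Hypothetical baseline difference of 0.12 SD with a standard error of 0.11.
zero_not_rejected = cannot_reject(0.12, 0.11, 0.0)
moderate_gap_not_rejected = cannot_reject(0.12, 0.11, 0.3)
```

In this hypothetical case both checks are true: the difference is statistically indistinguishable from zero, yet a gap as large as 0.3 standard deviation cannot be ruled out either, which is why an insignificant difference alone does not establish balance. The paper's standard errors account for the sampling design, which this sketch does not attempt to reproduce.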
Differences in the available background variables between treatment and control groups are estimated in Table 6. Statistically significant differences exist for the proportion of females, availability of printed materials at home, whether someone reads to the child at home, class size, and whether a majority of the tests at the school are in Tok Pisin. The differences between treatment and control groups are roughly the same whether comparing students in the baseline schools or students in the non-attrition schools. This suggests that any imbalance is a result of the randomized assignment of treatment rather than the attrition of schools.
Note: Table presents estimates of the difference in background variables between treatment and control groups. Positive differences imply that the treatment group has a higher value than the control group. Standard errors presented in parentheses. Statistical significance at the 10, 5 and 1 percent levels denoted by *, **, and ***, respectively.

A. Estimation model
The empirical strategy is to estimate the school-level effect of the Reader Booster Programme using a difference-in-differences approach with covariates. The school-level effect is estimated because students in the baseline and end-line samples are different and represent different cohorts. The difference-in-differences approach and the inclusion of student and school background variables as controls are motivated by the imbalance detected between treatment and control groups in some achievement scores and background variables. The effect is also assumed to vary by gender.
Reading scores for the ith student in school j are modeled as a linear function of being in a program school, being sampled at end-line, being female, the interactions of these indicators, other student and school characteristics, and a disturbance term.
The coefficient on the interaction of end-line and treatment is the impact of the program on male test scores, and the sum of this coefficient and the coefficient on the interaction of end-line, treatment and female is the impact on female test scores. Because achievement scores in some domains may have a high proportion of zero scores, a Tobit model is used to estimate the model. Baseline sample weights are adjusted to reflect attrition of schools in the end-line data, and standard errors are estimated to be robust to the two-stage sampling method (schools, then students) and a finite population correction based on the number of schools in each province.
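One way to write the specification described above is sketched below. The symbols are assumed notation rather than the paper's own: T is the program-school indicator, E the end-line indicator, F the female indicator, X the vector of other student and school characteristics, and c the censoring point of the score. Only the end-line x treatment and end-line x treatment x female terms are confirmed by the notes to Table 7; including the remaining two-way interactions is an assumption of this sketch.

```latex
y^{*}_{ij} = \beta_0 + \beta_1 T_j + \beta_2 E_{ij} + \beta_3 F_{ij}
           + \beta_4 (T_j \times E_{ij})
           + \beta_5 (T_j \times F_{ij})
           + \beta_6 (E_{ij} \times F_{ij})
           + \beta_7 (T_j \times E_{ij} \times F_{ij})
           + X_{ij}'\gamma + \varepsilon_{ij},
\qquad
y_{ij} = \max\left(c,\; y^{*}_{ij}\right)
```

Under this notation, the impact on males is \beta_4 and the impact on females is \beta_4 + \beta_7.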

B. Impact of the Reader Booster Programme
Estimates of the model are presented in Table 7. For males, the intervention has a statistically significant and large effect on one of the reading domains targeted by the intervention, of 0.63 standard deviation for initial sounds. The effect on males' familiar words achievement is not statistically different from zero. For females, the effect is large and statistically significant for both domains, and statistically higher than for males in the familiar words domain.
The program also had positive effects on reading domains that are not targeted by the intervention.
For males, a positive effect is found only for oral reading comprehension; for females, positive and large effect sizes are found for five of the six other domains. The effect size in the other domains is statistically higher for females than males in three of the six domains.
Note: Table presents estimates of nine Tobit regression models. Standard errors denoted in parentheses. Statistical significance at the 1, 5, and 10 percent levels denoted by ***, ** and *, respectively. Impact on females is the estimated sum of time x treatment + time x treatment x female. Number of observations is 893 students. Average impact (for both genders) is estimated using a separate Tobit model including time, treatment, time x treatment (average impact) and other control variables as regressors.
An average impact is also estimated and presented in Table 7. This average impact is estimated using a Tobit model but excluding the gender variables. Overall, the program had a strong positive impact in the three reading domains targeted by the intervention and four of the six other reading domains.
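The way the male and female impacts relate to the time x treatment and time x treatment x female terms can be illustrated with cell means. The sketch below is a simplification under stated assumptions: it uses plain means rather than the Tobit model, omits covariates and sample weights, and the function name and record layout are hypothetical.

```python
from statistics import mean


def did_by_gender(records):
    """Difference-in-differences by gender from cell means.

    `records` is an iterable of (score, treated, endline, female) tuples,
    where the last three entries are 0/1 flags. Returns a tuple of
    (male impact, additional impact for females, female impact).
    """
    cells = {}
    for score, treated, endline, female in records:
        cells.setdefault((treated, endline, female), []).append(score)
    m = {key: mean(values) for key, values in cells.items()}

    def did(female):
        # Change over time in treatment schools minus change in control schools.
        treated_change = m[(1, 1, female)] - m[(1, 0, female)]
        control_change = m[(0, 1, female)] - m[(0, 0, female)]
        return treated_change - control_change

    male_impact = did(0)      # analogue of time x treatment
    female_impact = did(1)    # analogue of the sum of both terms
    return male_impact, female_impact - male_impact, female_impact
```

In a regression with all interactions and no covariates, the first returned value corresponds to the time x treatment coefficient and the second to the time x treatment x female coefficient, so the female impact is their sum.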

C. Internal validity and robustness checks
The effect size was estimated using five other methods to test the robustness of the model's estimates given the imbalance and school attrition described above; these results are not presented in this paper but are available from the author on request. In the first method, the effect sizes were estimated using no control variables. In the second, the end-line data alone are used to estimate impact, as this provides a larger sample size.
One reason for poor balance in the baseline may be the relatively small population of schools to draw on. Recent studies in the medical research field have dealt with this source of poor balancing (van Marwijk et al. 2008; Xu and Kalbfleisch 2010; Ravaud et al. 2009; Roux et al. 2011; Taft et al. 2011; Schwartz et al. 2015; Leyrat et al. 2016). Leyrat et al. (2013) use a Monte Carlo simulation to assess the accuracy of several different methods, including the use of covariates, weighting observations by the inverse of the probability of being selected into their respective treatment or control group (e.g., Seaman and White 2013), and a direct adjustment by including this probability as a covariate. These latter two methods are the third and fourth methods used in this paper to test for robustness. Finally, Lee bounds (Lee 2009) are estimated using data aggregated at the school level to test whether school attrition may affect the results.
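The inverse probability weighting idea mentioned above can be sketched as follows. This is an illustration only: it assumes the probability of being in the treatment group is already known or estimated for each observation, uses normalized weights, and compares simple weighted means rather than the paper's Tobit difference-in-differences model. The function name and arguments are hypothetical.

```python
def ipw_difference(outcomes, treated, propensity):
    """Treatment-control difference in weighted mean outcomes.

    Treated observations are weighted by 1 / p and control observations
    by 1 / (1 - p), where p is the probability of being in the treatment
    group. Weights are normalized within each group.
    """
    t_num = t_den = c_num = c_den = 0.0
    for y, t, p in zip(outcomes, treated, propensity):
        if t:
            w = 1.0 / p
            t_num += w * y
            t_den += w
        else:
            w = 1.0 / (1.0 - p)
            c_num += w * y
            c_den += w
    return t_num / t_den - c_num / c_den
```

Observations that were unlikely to end up in the group they are in receive more weight, which rebalances the two groups on the characteristics that drive the probability.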
Across the five methods, the results are broadly similar to the estimates of the model. Only the estimates from the first method, the Tobit model without covariates, and the fifth method, school-level Lee bounds (without covariates), yielded notably smaller effect sizes. The remaining methods produced effect sizes similar to those presented in Table 7.
Two other issues may affect the internal validity of the estimates of impact. First, some contamination of control schools was reported, as Catholic schools in Madang Province received some special training in phonics. Second, because the data are repeated cross-sections of different cohorts, there are no data on student dropout or non-response. If the intervention affects whether students are present for the end-line data collection, then effect sizes may be biased.

D. External validity
Schools were randomly selected from a pre-defined population of schools that excluded very small schools, schools in highly remote areas, and those in dangerous areas. Attrition of schools from the Madang Province sample was a result of financial, logistical and weather issues. While these issues were unrelated to the treatment, if they reflect underlying characteristics of the schools that, in turn, affect the impact of the treatment, then this would introduce some bias. More generally, the effect sizes estimated in this paper may not be replicable in the more remote areas of the country or in those prone to the issues that led to the attrition of the schools in Madang.

Cost effectiveness
Benchmarking the impact of the Reader Booster Programme against other interventions helps assess the efficiency of the intervention and the benefits of scaling it up versus other types of interventions. The Abdul Latif Jameel Poverty Action Lab at the Massachusetts Institute of Technology compiles data on costs and impacts of several randomized impact evaluations in education. Table 8 presents the cost effectiveness of these programs measured as the impact on test scores in standard deviations per 100 USD cost. In their data, 2.278 standard deviations is the median impact per 100 dollars. The figures are not perfectly comparable, as tests differ in grade level and psychometric properties, but they provide a general range of cost effectiveness data with which to benchmark the Reading Booster Programme.
The total cost for each year of the Reading Booster Programme was 794,243 PGK (250,549 USD) based on data from the World Bank project which supported the program. Because this program was implemented during regular teaching hours, there is no additional cost of teachers; these costs reflect training and distribution of materials as well as monitoring and evaluation. The program benefitted 4,272 students, yielding a cost of 186 PGK (59 USD) per student (World Bank 2016c). Table 9 presents the average impact of the program in standard deviations per 100 USD, which is calculated by dividing the impacts presented in Table 7 by 0.59, the cost per student in hundreds of USD. Per 100 USD, the impact of this program on the two targeted reading domains ranges from 1.04 to 1.16 standard deviations. For the other reading domains where a statistically significant effect was found, effect sizes range from 0.61 to 1.07 standard deviations per 100 USD. The Reader Booster Programme is most cost effective for girls; cost effectiveness, like the effect sizes presented in Table 7, is higher in the five domains that have statistically higher effect sizes than males.
Table 7 and dividing by the cost per student in 100s USD (58.65 USD). Statistical significance at the 1, 5, and 10 percent levels denoted by ***, ** and *, respectively.
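The cost-effectiveness arithmetic described above can be reproduced in a few lines. The totals and the number of students are the figures reported in the text; the function names are hypothetical, and the 0.63 standard deviation effect used in the example is the male initial-sounds estimate quoted earlier.

```python
TOTAL_COST_PGK = 794_243   # annual program cost in PGK, as reported
TOTAL_COST_USD = 250_549   # annual program cost in USD, as reported
STUDENTS = 4_272           # students who benefitted, as reported


def cost_per_student(total_cost, students):
    """Average program cost per student."""
    return total_cost / students


def impact_per_100_usd(effect_sd, cost_per_student_usd):
    """Effect size in standard deviations per 100 USD spent per student."""
    return effect_sd / (cost_per_student_usd / 100)


usd_per_student = cost_per_student(TOTAL_COST_USD, STUDENTS)  # about 58.65 USD
pgk_per_student = cost_per_student(TOTAL_COST_PGK, STUDENTS)  # about 186 PGK

# Example: the 0.63 SD effect on initial sounds for males.
initial_sounds_per_100 = impact_per_100_usd(0.63, usd_per_student)
```

Dividing an effect size by 0.5865 (the cost per student in hundreds of USD) gives essentially the same result as the rounded divisor of 0.59 mentioned in the text; the small difference affects only the second decimal place.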
Compared to the data compiled by the Poverty Action Lab, this intervention's cost effectiveness lies towards the bottom end of the distribution. However, it is not clear how this intervention will affect test scores later in the students' schooling. The impact of the program may be amplified over time, as early reading skills are crucial to a child's literacy and future learning.

Conclusions
These findings provide evidence that a teacher training approach providing highly scripted lesson plans can improve basic reading skills in a low-cost, remedial course format, especially for girls.
The Papua New Guinea curriculum provides teachers with flexibility over how they use their instructional time for language; this flexibility permitted the piloting and evaluation of the program. The Reader Booster Programme diverges from the curriculum's approach by providing teachers with a very specific teaching method and scripted lesson plans that they apply within the time allocated to language instruction.
A natural question is how much flexibility teachers should have within the curriculum in a developing country context. Highly structured approaches are appealing in contexts where teacher qualification and ability vary considerably. While the use of highly scripted lesson plans has received a mixed reception in developed countries, evaluations of interventions providing teachers with specific teaching methods and lesson plans have shown them to be successful at improving early reading skills in developing countries. The Reader Booster Programme adds to this evidence base by demonstrating an intervention that is formatted as a remedial course aimed at improving specific reading skills. The remedial course format is advantageous because it can be implemented without changes to the national curriculum and can be targeted to schools most in need. Its low cost is also important given the education budget constraints faced by Papua New Guinea and other developing countries.
One limitation of the Reader Booster Programme is the weaker effect on boys. It is not clear from the data collected in this study why this may be the case. Qualitative work would be beneficial to better understand this outcome; however, the program has already been completed. If this approach is replicated in other countries, gender differences in the effects should be studied closely.