Policy Research Working Paper 10348

Improving Estimates of Mean Welfare and Uncertainty in Developing Countries

Joshua D. Merfeld
David Newhouse

Development Economics, Development Data Group
March 2023

Abstract

Reliable estimates of economic welfare for small areas are valuable inputs into the design and evaluation of development policies. This paper compares the accuracy of point estimates and confidence intervals for small area estimates of wealth and poverty derived from four different prediction methods: linear mixed models, Cubist regression, extreme gradient boosting, and boosted regression forests. The evaluation draws samples from unit-level household census data from four developing countries, combines them with publicly and globally available geospatial indicators to generate small area estimates, and evaluates these estimates against aggregates calculated using the full census. Predictions of wealth are evaluated in four countries and poverty in one. All three machine learning methods outperform the traditional linear mixed model, with extreme gradient boosting and boosted regression forests generally outperforming the other alternatives. The proposed residual bootstrap procedure reliably estimates confidence intervals for the machine learning estimators, with estimated coverage rates across simulations falling between 94 and 97 percent. These results demonstrate that predictions obtained using tree-based gradient boosting with a random effect block bootstrap generate more accurate point and uncertainty estimates than prevailing methods for generating small area welfare estimates.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dnewhouse@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Improving Estimates of Mean Welfare and Uncertainty in Developing Countries*

Joshua D. Merfeld†   David Newhouse‡

Keywords: poverty, welfare, prediction, machine learning, geospatial
JEL Codes: C53, C80, I32, O10

* We thank Jed Friedman, Haishan Fu, Keith Garrett, Dean Jolliffe, Talip Kilic, and Bo Pieter Johannes Andree for comments and support. We thank Lina Cardona, Dilhanie Deepawansa, Carlos De Maia, Francis Mulangu, and Mario Negre for help obtaining data. This project was partially funded by the Knowledge for Change Program's Phase IV-funded programmatic research project "Understanding Trends in Sub-National Differences in Economic Well-Being in Low- and Middle-Income Countries".
† KDI School of Public Policy and Management and IZA; merfeld@kdis.ac.kr
‡ World Bank and IZA; dnewhouse@worldbank.org

1 Introduction

Accurate measures of welfare for spatially disaggregated areas are valuable inputs into the design and evaluation of effective development policies (Atkinson 2019; Blumenstock 2016; Ravallion 2015; Merfeld and Morduch 2022). Yet most estimates of welfare are derived from household surveys that can only produce reliable statistics at higher levels of aggregation, mostly because of the high cost of data collection (Fujii and Weide 2020; Kilic et al. 2017). More granular estimates can improve geographic targeting (Elbers et al. 2007) as well as program and policy evaluation (Ratledge et al. 2021). While some countries can draw on rich administrative data such as income tax records to serve as auxiliary data, developing countries do not typically maintain accurate and up-to-date administrative data sources. It is common to predict welfare by combining census data with a contemporaneous survey – following Elbers, Lanjouw, and Lanjouw (2003) or Molina and Rao (2010) – but in poorer countries, census data are usually collected infrequently. As a result, official statistics on welfare in small areas tend to be dated.

Against this backdrop, recent advances in machine learning and the growing availability of non-traditional data sources have led to a proliferation of new options for small area estimation. For example, Blumenstock, Cadamuro, and On (2015) use mobile phone records to infer the socioeconomic status of phone owners in Rwanda, and Aiken et al. (2022) use mobile phone call detail records to predict the targeting performance of programs in Togo. One drawback of mobile phone data, however, is that – as with banking records – the population of mobile-phone owners may be systematically different from the population of those without phones. Satellite-derived geospatial data do not suffer from this selection bias and have become increasingly popular in economics (Donaldson and Storeygard 2016). Previous research has demonstrated that geospatial data are a promising source for estimating economic growth (Henderson, Storeygard, and Weil 2012), labor force participation (Merfeld et al. 2022), and welfare more generally (Jean et al. 2016; Yeh et al. 2020; Chi et al. 2022; Engstrom, Hersh, and Newhouse 2022; McBride et al. 2022; Newhouse et al. 2022; Masaki et al. 2022).

In this paper, we evaluate four different methods to estimate welfare at low levels of aggregation in developing countries. We validate the performance of these methods using unit-level census data from four developing countries: Madagascar, Malawi, Mozambique, and Sri Lanka. In Malawi, we are able to extend the evaluation to include a headcount poverty measure in addition to the asset index, while in the other three countries we evaluate prediction of an asset index, similar to Chi et al. (2022) and Masaki et al. (2022). The four methods we evaluate are linear empirical best predictor (EBP) models (Battese, Harter, and Fuller 1988; Jiang and Lahiri 2006; Molina and Rao 2010; Masaki et al. 2022), which come from a long history of small area estimation in statistics, and three newer machine learning methods: Cubist regression models (Wang and Witten 1997; Quinlan 1992), extreme gradient boosting (Chen and Guestrin 2016) – more commonly known as XGBoost – and boosted regression forests, or BRF (Friedberg et al. 2020; J. Tibshirani et al. 2022).
These methods differ in their level of parsimony and transparency on the one hand, and their predictive accuracy on the other, and a key goal of this exercise is to better understand the terms of this trade-off in the context of welfare prediction. While we specify the EBP model at the household level, the others are specified at the sub-area level, which in these contexts refers to highly disaggregated administrative areas akin to groups of villages. We then aggregate predictions to obtain estimates at the target area for each country. For predictors, we use satellite-derived geospatial data that are available across much of the globe, meaning that the methods and data we evaluate here are widely applicable wherever geolocated survey data are available. We use shapefiles from all four countries to pull geospatial data from multiple sources, which we then combine with the unit-level census data. In each country, we simulate 100 two-stage samples – first randomly selecting 500 enumeration areas and then randomly selecting 8 households in each – and compare overall performance across simulations, meaning our results are derived from one hundred possible samples rather than a single sample.

Under these conditions – which are similar to those of many household surveys in developing countries – all three machine learning models substantially outperform EBP in terms of accuracy, as seen clearly in Table 1. XGBoost and BRF perform best, with XGBoost yielding slightly more accurate predictions on average. XGBoost is notably more accurate than BRF in Sri Lanka, especially among the poorest in the left tail of the distribution. Cubist regression, which estimates a set of linear models on different subsets of the data, performs slightly less well than XGBoost and BRF, even when a boosting procedure is utilized. All three machine learning methods greatly outperform EBP. The average (Pearson) correlation is 7.5 percent higher for XGBoost than for EBP, which is the current workhorse for small area estimation among practitioners who require accurate estimates of uncertainty. We also examine additional accuracy measures related to deviation from truth. XGBoost and BRF also outperform EBP in terms of squared deviation, and in this case the difference is even larger: squared deviation for XGBoost is 37.1 percent lower than for EBP, indicating large improvements in accuracy. The other two ML estimators also outperform EBP by large amounts in terms of both correlation and squared deviation.

Table 1: Summary of results

                         EBP     Cubist   XGBoost   BRF
correlation (Pearson)    0.841   0.883    0.904     0.895
squared deviation        0.114   0.080    0.072     0.082
width of CI              1.212   1.316    1.006     1.381
coverage                 0.950   0.966    0.940     0.968

Note: Measures of accuracy and uncertainty are simple averages for the asset index predictions across the four hundred independent samples (100 samples for each country).

A key contribution of the paper is the evaluation of the random effect block bootstrap procedure, proposed by Chambers and Chandra (2013) and applied by Krennmair and Schmid (2022), to estimate uncertainty for the three machine learning methods. The presentation of uncertainty statistics has traditionally been less common for machine learning methods (Chi et al. 2022). The random effect block bootstrap accounts for the hierarchical nature of the data and proceeds in two steps, sampling residuals separately at both the target area level and the sub-area level, which is the unit of analysis in the machine learning estimation.
This procedure estimates accurate confidence intervals, with average coverage rates of 96.6 percent for Cubist, 94.0 percent for XGBoost, and 96.8 percent for BRF. These coverage rates are in line with those for EBP, which are derived from the parametric bootstrap procedure typically used to generate uncertainty estimates for EBP predictions. XGBoost achieves approximately correct coverage rates but is more likely than BRF or Cubist to understate coverage, as coverage rates range from 0.894 to 0.979 for XGBoost, from 0.934 to 0.987 for BRF, and from 0.925 to 0.983 for Cubist. It appears that uncertainty estimates based on the random effect block bootstrap slightly understate uncertainty for XGBoost and slightly overstate uncertainty for Cubist and BRF.

We also document differences in performance across in-sample and out-of-sample areas, given recent evidence that out-of-sample predictions generated by EBP models can be significantly less accurate than in-sample predictions when using geospatial data (Newhouse et al. 2022). On average, all four estimators predict better in sample than out of sample. EBP shows the biggest drop in performance out of sample, at least in terms of correlation. While the three ML estimators also show drops in accuracy, these drops tend to be relatively smaller. For example, both XGBoost and BRF have Pearson and rank correlations above 0.8 in out-of-sample areas across the five cases that we examine. The more flexible nature of the XGBoost and BRF models makes them better suited for predicting out of sample. In addition, out-of-sample EBP predictions suffer from the unavailability of sample data to serve as a prior when estimating the conditional random area effect.

There are also noticeable differences in precision across in-sample and out-of-sample areas. We do not include a conditional random effect in the machine learning models, and the bootstrap procedure therefore estimates confidence intervals that are roughly the same size for in- and out-of-sample areas. Since out-of-sample accuracy is lower than in-sample accuracy, out-of-sample coverage rates tend to be lower as well; coverage drops to just 83 percent for XGBoost in Malawi assets, for example. This does not result from overfitting the model to the sample, given that LASSO is used for model selection in the EBP model and that regularization methods are built into the machine learning algorithms. Instead, because villages are sampled proportional to their population size, out-of-sample areas tend to be less populated and are therefore systematically different from in-sample areas. Although sample weights are included, out-of-sample predictions suffer from the relative paucity of training data from rural, less populated areas.

Our final set of results takes the four estimators and applies them in the context of an actual publicly available household survey with offset GPS coordinates: the 2019 Integrated Household Survey (IHS) in Malawi. We show that the main findings are robust when using the IHS instead of 100 simulated surveys drawn from the census. Accuracy measures show that the machine learning methods continue to outperform the EBP model. Confidence intervals for the machine learning models are all slightly conservative, leading to coverage rates above the 95-percent target.
Compared with EBP, XGBoost achieves this with smaller confidence intervals for poverty both in and out of sample, similar-sized confidence intervals for the asset index in sample, and confidence intervals just half the size of EBP's out of sample (though the latter still understate uncertainty). These findings indicate that, in these settings, the machine learning methods generate estimates that are more accurate and more precise than EBP, regardless of whether we use simulated surveys drawn from the censuses or the actual IHS in the case of Malawi.

These findings primarily contribute to a newer literature using new types of data to estimate economic statistics of interest, especially welfare. In the past decade, there has been a proliferation in the use of satellite imagery to estimate poverty and welfare (Jean et al. 2016; Yeh et al. 2020; Engstrom, Hersh, and Newhouse 2022; Newhouse et al. 2022; Chi et al. 2022). However, image processing is computationally intensive and unwieldy. Other types of data are easier to use, like mobile phone call detail records (Aiken et al. 2022; Blumenstock, Cadamuro, and On 2015), but these can be more difficult to access due to privacy concerns and also raise issues related to representativeness. The satellite indicators we use can be obtained from publicly available sources relatively easily and are much smaller in size.

We also contribute to a related literature on small area estimation, which grew out of the statistics literature in the 1970s (Efron and Morris 1973; Carter and Rolph 1974; Fay and Herriot 1979; Battese, Harter, and Fuller 1988; Rao and Molina 2015). Earlier work proposed the use of census data for prediction (Elbers, Lanjouw, and Lanjouw 2003), and the empirical best predictor (Jiang and Lahiri 2006; Molina and Rao 2010; Tzavidis et al. 2018) is now one of the most common implementations of small area estimation. One reason the EBP model is preferred in many applications is its transparency; a nested error regression model allows for straightforward estimation of linear coefficients with (conditional) random effects specified at the target area level. A simple table of coefficients indicates exactly how each variable is related to the measure of household welfare. The machine learning methods, on the other hand, while generating more accurate predictions, suffer from a lack of parsimony and transparency (Efron 2020). While analytical techniques can determine which features are most commonly used in a machine learning model, it is not straightforward to understand the relationship between the set of predictors and the prediction. In addition, much of the formal statistical theory related to measuring the uncertainty associated with predictions from tree-based machine learning is new (Athey, Tibshirani, and Wager 2019). To the best of our knowledge, this is the first paper to show rigorously that the random effect block bootstrap estimates accurate confidence intervals for predictions from these types of machine learning methods, when validated against unit-level census data as ground truth.[1] In cases where economists and statisticians are willing to sacrifice parsimony and transparency to achieve more accurate predictions, and the sample data are sufficiently rich to train accurate machine learning models, the availability of a simple and accurate bootstrap method for estimating uncertainty surmounts a crucial barrier to the use of tree-based machine learning algorithms.
The rest of the paper is organized as follows. In section 2, we provide a brief overview of the data, the estimation methods, and the approach we use to validate the accuracy and uncertainty of the estimates. In section 3, we review detailed results, including the simulation results, both in and out of sample. We discuss estimates from the 2019 Malawi IHS in section 4 before concluding in section 5.

2 Methods

This paper evaluates four different methods for generating predictions of district-level poverty rates: linear empirical best predictor (EBP) models (Battese, Harter, and Fuller 1988; Jiang and Lahiri 2006; Molina and Rao 2010), Cubist regression models (Wang and Witten 1997; Quinlan 1992), extreme gradient boosting (Chen and Guestrin 2016), and boosted regression forest (BRF) models (Friedberg et al. 2020). Importantly, we evaluate these methods in the context of developing countries, where census data are typically collected only rarely. One of our goals is to improve the estimation of key development outcomes in such contexts. As such, we propose using data that are widely available across the globe: remote sensing and geospatial data. In addition, we adapt and apply a random effect block bootstrap procedure (Chambers and Chandra 2013) to estimate uncertainty for the machine learning models. These techniques improve on existing methods but require rigorous evaluation of their accuracy and precision in multiple contexts before they can be applied. To do so, we compare estimates from these models to ground-based "truth" derived from unit-level census data in four countries: Madagascar, Malawi, Mozambique, and Sri Lanka. These countries were selected because of the availability of census data with either enumeration area geocoordinates or sub-area identifiers with corresponding shapefiles. We present information on the censuses in Table 2.

[1] A contemporaneous paper by Krennmair and Schmid (2022) applies the bootstrap with random forests in Mexico.

Table 2: Census data statistics

                     Madagascar   Malawi                  Mozambique   Sri Lanka
Area                 Commune      Traditional Authority   Locality     DS Division
  (count)            1,515        420                     1,258        331
Subarea              Fokontany    Enumeration area        Bairro       GN Division
  (count)            14,412       18,700                  65,707       13,984
Sample or full?      Full         20%                     Full         Full
Households (count)   5,007,602    796,925                 5,992,349    4,842,300
Year                 2017         2018                    2017         2012

The official administrative boundaries available in shapefiles differ quite substantially across countries. In general, we pull geospatial data at the lowest administrative level possible. In Madagascar, this is the Fokontany; in Malawi, the enumeration area; in Mozambique, the Bairro; and in Sri Lanka, the GN Division.[2] We refer to these levels as "subareas" throughout this paper. While we pull geospatial data at the subarea level, the target areas for prediction are one level above this in all cases. In Madagascar, this corresponds to the Commune; in Malawi, it is the Traditional Authority; in Mozambique, the Locality (Localidade); and in Sri Lanka, the Divisional Secretary's Division (DS Division). These are below the levels at which the household survey is considered to be representative. There are large differences across countries in how administrative units are allocated across space, with some countries having relatively larger units and others relatively smaller ones. For Madagascar, Mozambique, and Sri Lanka, we have access to the full set of unit-level census data for all households.
In Malawi, we have a 20-percent extract, which is a random sample of the entire population.

2.1 Outcomes

We focus on two separate measures of welfare, an asset index and poverty rates, although we only estimate poverty rates in Malawi. In Malawi, in addition to the census data, we utilize a near-contemporaneous household survey – the 2019 IHS – that collected expenditure data and is made publicly available by the World Bank's Living Standards Measurement Study (LSMS) program.[3] Using those data, we predict per capita expenditures for all households in the census and classify the bottom half of the distribution as poor. Poverty is thus based on predicted per capita consumption. Appendix B provides more detail on the imputation procedure as well as the calculation of the asset index.

[2] "GN Divisions" stands for "Grama Niladhari" Divisions.
[3] The IHS is part of the LSMS project at the World Bank. Surveys under this project are generally implemented by country-specific national statistics offices, with support from the World Bank. More information on the 2019 IHS and the LSMS program more generally is available on the World Bank website: https://microdata.worldbank.org/index.php/catalog/3818.

2.2 Geospatial Features

The use of unit-record census data remains the preferred gold-standard option when recent census data are available. However, census data tend to be collected infrequently in most developing countries, and small area estimates based on satellite indicators are a preferred alternative to reporting direct survey estimates when census data are old or there have been rapid changes in spatial welfare patterns. We focus on satellite and other remotely sensed data because they are widely available and predictive of spatial variation in welfare. We pull geospatial features from Google Earth Engine using the rgee package in R. Table A1 in the appendix lists the geospatial features used. Importantly, we often derive several additional statistics from different indicators. For example, data on temperature are used to construct average temperature, maximum temperature during the year, and minimum temperature during the year, while data on pollution are used to generate many distinct indicators.[4] In addition, we also create features by aggregating to higher levels, taking means, medians, maximums, or minimums. By combining these features in different ways and across different levels of aggregation, we end up with more than 130 different predictive features. While we include all of these features in some methods, we use LASSO to select features for the EBP model, following others (Engstrom, Hersh, and Newhouse 2022; Newhouse et al. 2022; Masaki et al. 2022).[5] We return to these points below.

[4] For example, the pollution data include carbon monoxide, carbon dioxide, ozone, etc.
[5] We generally pull these data from the same year as the census. This is not possible for all indicators in Sri Lanka – whose census was conducted several years before the other censuses – so we instead pull the most recent available year wherever necessary.
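To make this kind of feature extraction concrete, the following is a minimal sketch using rgee. The shapefile path, object names, and the choice of indicator (MODIS land surface temperature) are illustrative assumptions rather than the paper's actual extraction code.

```r
# A sketch of subarea-level feature extraction from Google Earth Engine.
# File names and the chosen indicator are hypothetical.
library(rgee)
library(sf)

ee_Initialize()  # authenticate and start an Earth Engine session

# Hypothetical subarea polygons, one row per subarea
subareas <- st_read("subarea_boundaries.shp")

# Mean land surface temperature over the census year (illustrative indicator)
lst <- ee$ImageCollection("MODIS/061/MOD11A2")$
  filterDate("2017-01-01", "2017-12-31")$
  select("LST_Day_1km")$
  mean()

# Reduce the image over each polygon to get one value per subarea
lst_by_subarea <- ee_extract(
  x = lst,
  y = subareas,
  fun = ee$Reducer$mean(),
  scale = 1000  # approximate native resolution of the band, in meters
)
```

Analogous calls over other collections, reducers (max, min, median), and levels of aggregation would generate the full set of candidate features.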
2.3 Linear Empirical Best Predictor Models

We utilize the EMDIplus package in R to estimate the linear empirical best predictor model.[6] This is an updated version of the emdi package (Kreutzmann et al. 2019), which implements the models described in Molina and Rao (2010) with additional features. We use a household-level model, which models household-level expenditures as a function of the chosen covariates.[7] While we include descriptions of the other estimators in the appendix, we spend some time here on EBP since it is the current workhorse method for small area estimation of welfare. The household-level model takes the form:

G(y_{hsar}) = \beta_1 X_{sar} + \beta_2 X_{ar} + \gamma_r + \eta_{ar} + \varepsilon_{hsar},   (1)

where G(y_{hsar}) is some transformation of outcome y for household h in sub-area s in area a in region r,[8] X_{sar} is a vector of sub-area-specific geospatial features, X_{ar} is a vector of area-specific geospatial features, \gamma_r is a set of region fixed effects, \eta_{ar} is an area-level random effect, and \varepsilon_{hsar} is a classical error term. When estimating Malawian poverty rates, we use the rank order transformation proposed by Peterson and Cavanaugh (2019) and implemented in Masaki et al. (2022), though the results are robust to using the log-shift transformation recommended in Tzavidis et al. (2018). We set a specific threshold for the poverty line to match the poverty rate in the survey data, which is around 50 percent. We estimate the mean of the asset index directly. In both cases, the software estimates Equation 1 at the household level, with the conditional random effect effectively using the survey data as a prior estimate that is updated using predictions from the model. Once the model is estimated, poverty estimates for areas are generated by repeatedly drawing random area effects and idiosyncratic error terms from their estimated normal distributions, generating simulated poverty estimates 100 times for each sub-area, and aggregating across sub-areas to the target area level. We include both survey weights and population weights, with the latter taken from WorldPop estimates rather than the censuses. The software calculates measures of uncertainty using 100 parametric bootstrap replications. For more details on the estimation process, we refer readers to Tzavidis et al. (2018) and Molina and Rao (2010).

[6] The package is a spin-off of the emdi package, developed by Ifeanyi Edochie and available for download from his GitHub page: https://github.com/SSA-Statistical-Team-Projects/SAEplus.
[7] The use of aggregate predictors/features in a household-level model is sometimes referred to as a "unit-context model." Unit-context models tend to generate more accurate and precise predictions than area-level models due to their ability to use sub-area-level predictors (Masaki et al. 2022; Newhouse et al. 2022).
[8] The level at which we are interested in predicting outcomes is the area, described above.

As with any regression model whose main goal is prediction, EBP models can be prone to overfitting. To avoid this, we select features using LASSO (R. Tibshirani 1996), implemented using the R package glmnet (Friedman, Hastie, and Tibshirani 2010). We select the optimal lambda using cross-validation. However, we are particularly concerned about the hierarchical structure of the data leading to information leakage across folds in the cross-validation routine. We therefore modify the LASSO routine to assign areas, rather than individual households, to cross-validation folds. We also allow the regional fixed effects to enter unpenalized; in other words, we force LASSO to select all regional dummies. Because the surveys are considered to be representative at the regional level, this improves model fit by prioritizing variables that explain within-region variation for selection. We implement this routine on each of the 100 survey draws, as sketched below.
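The selection step can be sketched as follows, assuming hypothetical objects X (a matrix of candidate features plus regional dummies), y (the transformed outcome), and area (a vector of household area identifiers); this is illustrative rather than the paper's exact code.

```r
# Area-blocked LASSO selection with unpenalized region dummies (a sketch).
library(glmnet)

# Assign whole areas, not households, to folds to prevent leakage across folds
set.seed(1)
area_ids <- unique(area)
fold_of_area <- sample(rep(1:10, length.out = length(area_ids)))
foldid <- fold_of_area[match(area, area_ids)]

# A penalty factor of 0 forces the regional dummies into every model
pf <- ifelse(grepl("^region_", colnames(X)), 0, 1)  # hypothetical naming

cv_fit <- cv.glmnet(x = X, y = y, foldid = foldid, penalty.factor = pf)

# Features retained at the cross-validated lambda
cf <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- rownames(cf)[cf != 0]
```

The selected features would then enter the EBP model in Equation 1.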
2.4 Cubist

The second prediction method that we evaluate is Cubist regression, which is closely related to M5 regression model trees and is derived from the work of Kuhn and Johnson (2013), Wang and Witten (1997), Quinlan (2014), and Quinlan (1992). We implement it in R with the Cubist package (Kuhn et al. 2022), using a procedure described in detail in Kuhn and Johnson (2013) and in the publicly available source code. The input is a set of training data with a dependent variable and a set of candidate independent variables. The output is a set of piecewise linear models. The procedure uses tree-based prediction methods to develop "rules", which correspond to leaves of the tree, and a linear model is estimated for every rule. The user can set the number of rules or allow the algorithm to determine the optimal number of rules based on cross-validation. In short, the procedure estimates a set of linear models on various subsets of the data, with the subsets selected to maximize the accuracy of the predictions. Further details on the Cubist algorithm can be found in Appendix C.

2.5 XGBoost

The third estimator – extreme gradient boosting, commonly called XGBoost – is a popular implementation of gradient boosted trees (Chen and Guestrin 2016). XGBoost develops a set of regression trees which, like the committees in the Cubist model, sequentially predict the residuals from the previous round. Appendix C contains further details on the estimation of the algorithm; it summarizes the material found in the online XGBoost documentation[9] as well as in the original paper by Chen and Guestrin (2016).

[9] https://xgboost.readthedocs.io/en/stable/tutorials/model.html

2.6 Boosted Regression Forests

The last method we compare is boosted regression forests (BRF), implemented in the grf package for R and described in the online documentation for that package as well as in Athey, Tibshirani, and Wager (2019). Boosted regression forests are very similar to XGBoost, in that both estimate a series of regression forests that successively predict the residuals from the previous round. However, BRF differs from XGBoost by using one subsample of the data to grow trees and another to generate predictions at the leaves of the tree, a procedure which is more theoretically sound. Each regression forest consists of a set of decision trees that the algorithm grows on randomly selected subsets of the data. Further details on BRF are in Appendix C.

2.7 Uncertainty Estimates for ML Estimators

For EBP, we model assets and monetary welfare at the household level and aggregate up to the target area level. Because the model specification contains a random effect at the target area level, this procedure accounts for the hierarchical nature of the data. The procedure uses a parametric bootstrap to estimate uncertainty, and we use the canned implementation of this procedure from the EMDIplus package, following González-Manteiga et al. (2008). For the three ML estimators, we estimate the models at the subarea level.[10] In other words, we aggregate each sample to the subarea level, estimate the model at the subarea level, and then manually aggregate the predictions to the area level using estimated population from WorldPop as aggregation weights; this workflow is sketched below.

[10] While predicting at the area level is also an option, this would discard important variation in the predictor variables. As such, we opt to estimate at the subarea level instead.
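As an illustration of this workflow, the following sketch fits an XGBoost model at the subarea level and aggregates its predictions to areas with WorldPop weights. The data frames train and all_subareas, and the hyperparameter values, are hypothetical.

```r
# Subarea-level estimation and population-weighted aggregation (a sketch).
# Assumes: train (sampled subareas; column y plus features) and all_subareas
# (features plus columns pop and area for every subarea in the country).
library(xgboost)

feature_cols <- setdiff(names(train), "y")  # assumed layout

fit <- xgboost(
  data = as.matrix(train[, feature_cols]),
  label = train$y,
  nrounds = 500,              # placeholder hyperparameters
  objective = "reg:squarederror",
  eta = 0.05,
  max_depth = 4,
  verbose = 0
)

# Predict for every subarea, then take the population-weighted mean by area
all_subareas$pred <- predict(fit, as.matrix(all_subareas[, feature_cols]))
area_est <- aggregate(
  cbind(w_pred = pred * pop, pop) ~ area, data = all_subareas, FUN = sum
)
area_est$y_hat <- area_est$w_pred / area_est$pop  # WorldPop-weighted area mean
```

The same pattern applies to Cubist and BRF, substituting the corresponding model-fitting call.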
While Friedberg et al. (2020) prove a central limit theorem for local linear forests that allows for the construction of uncertainty estimates, their theory, and the implementation of it in the accompanying R package, only allow for uncertainty estimates at the same level as that of the estimation itself. As an alternative, we propose a non-parametric bootstrap procedure that draws from Chambers and Chandra (2013) and Krennmair and Schmid (2022).

For subarea sa, consider the sample direct estimate of the outcome, \hat{y}_{sa}^{direct}. In addition, there is the prediction from the machine learning algorithm, \hat{y}_{sa}^{ML}. With these two estimates, we calculate subarea-specific "residuals" as:

\hat{R}_{sa} = \hat{y}_{sa}^{ML} - \hat{y}_{sa}^{direct}.   (2)

We can likewise calculate residuals at the area level, by aggregating \hat{y}_{sa}^{ML} and \hat{y}_{sa}^{direct} to the area:

\hat{R}_{a} = \hat{y}_{a}^{ML} - \hat{y}_{a}^{direct}.   (3)

The proposed bootstrap proceeds in two steps. Note that we can only calculate these residuals for in-sample subareas and in-sample areas. After estimating predictions, we first calculate the residuals in Equation 2 for each in-sample subarea. Then, for every subarea, we randomly draw one residual from the vector \hat{R}_{sa} and add that residual to the prediction: \hat{y}_{sa}^{ML} + \hat{R}_{sa}. We do this sampling with replacement, for both in-sample and out-of-sample subareas, which yields adjusted subarea predictions for all subareas. We aggregate all of these predictions to the area level, using estimated population from WorldPop as weights. At the area level, we pursue a similar strategy, but this time we draw area-level residuals, with replacement, for all areas, regardless of sample status. We repeat this residual bootstrap 1,000 times and calculate the 2.5th percentile (the 25th value), the 97.5th percentile (the 975th value), and the standard deviation. We use the 2.5th and 97.5th percentiles to calculate confidence intervals, not the standard deviation of the estimate. The proposed bootstrap consists of the following steps, sketched in code below:

1. Predict an outcome (asset index or poverty) using XGBoost, BRF, or Cubist.
2. Calculate subarea residuals for in-sample subareas by differencing the prediction and the direct estimate from the survey. Call this vector of residuals \hat{R}_{sa} = \hat{y}_{sa}^{ML} - \hat{y}_{sa}^{direct}.
3. Aggregate predictions to the area level, using estimated population from WorldPop as weights.
4. Calculate area residuals for in-sample areas by differencing the prediction and the direct estimate from the survey. Call this vector of residuals \hat{R}_{a} = \hat{y}_{a}^{ML} - \hat{y}_{a}^{direct}.
5. Starting from the original subarea predictions, bootstrap with replacement from \hat{R}_{sa} for all subareas.
6. Aggregate these new predictions to the area level.
7. Bootstrap with replacement from \hat{R}_{a} for all areas.
8. Repeat steps 5 through 7 1,000 times, saving the new area estimates in each replication.
9. Calculate percentiles and the standard deviation across the 1,000 replications.

Although residuals can only be calculated for sampled sub-areas and areas, they are used to generate uncertainty estimates for both sampled and non-sampled areas. Importantly, the poverty rate is bounded below by zero and above by one. In order to respect these bounds throughout the process, we do not estimate the poverty rate in levels. Instead, we estimate an arcsine (square root) transformed poverty rate: p_{sa}^{transformed} = \sin^{-1}(\sqrt{p_{sa}}).
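The following is a minimal sketch of steps 1 through 9, assuming a hypothetical data frame sub with one row per subarea and columns pred (the ML prediction from step 1), direct (the survey direct estimate, NA for out-of-sample subareas), pop (WorldPop population), and area (the target area identifier); it is not the paper's exact code.

```r
# Residual block bootstrap for ML-based small area estimates (a sketch).
set.seed(1)

# Population-weighted aggregation of a subarea variable to the area level
agg <- function(x, d) tapply(x * d$pop, d$area, sum) / tapply(d$pop, d$area, sum)

# Steps 2-4: residuals for in-sample subareas and in-sample areas
in_samp <- !is.na(sub$direct)
R_sa <- sub$pred[in_samp] - sub$direct[in_samp]
area_pred <- agg(sub$pred, sub)
s <- sub[in_samp, ]
area_direct <- agg(s$direct, s)
R_a <- area_pred[names(area_direct)] - area_direct

# Steps 5-8: perturb subarea predictions, aggregate, perturb area estimates
B <- 1000
boot <- replicate(B, {
  sub_star  <- sub$pred + sample(R_sa, nrow(sub), replace = TRUE)  # step 5
  area_star <- agg(sub_star, sub)                                  # step 6
  area_star + sample(R_a, length(area_star), replace = TRUE)       # step 7
})

# Step 9: percentile confidence intervals for each area (for poverty, pred and
# direct would hold arcsine-square-root transformed rates, back-transformed here)
ci <- t(apply(boot, 1, quantile, probs = c(0.025, 0.975)))
```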
We carry over this transformation throughout the entirety of the bootstrap procedure, only back-transforming at the end, in step 9.[11]

[11] We of course also back-transform the original point estimate.

2.8 Evaluating Performance

We are fortunate to have access to unit-record census data to evaluate the performance of each model. This allows us to calculate true sampling distributions with the unit-level census data by simulating separate surveys and saving the results from each iteration. In all countries, we treat subareas as enumeration areas. We draw 500 separate subareas with probability proportional to population size, using the census to define population sizes. We then randomly draw eight households from within each selected subarea. This results in a sample of approximately 4,000 households in each iteration of the survey.[12] We independently draw 100 separate surveys, predict our outcomes of interest with each method, and then evaluate the performance of the methods against the ground truth derived from the full unit-level census data.

[12] In the rare cases in which a subarea contains fewer than eight households, we draw all households. This can lead some sampling iterations to have slightly fewer than 4,000 households.

We calculate the following statistics, where i indexes areas, \hat{y}_i refers to the predicted outcome for area i, and y_i^{truth} refers to the true value for area i (a sketch of these calculations in code follows the list):

- Correlation: We calculate both the Pearson correlation coefficient, r, and the Spearman (rank) correlation coefficient, \rho. We present the means across all 100 independent samples:

  \frac{1}{100} \sum_{s=1}^{100} r_s \quad \text{and} \quad \frac{1}{100} \sum_{s=1}^{100} \rho_s   (4)

- Absolute deviation: This is defined as |\hat{y}_i - y_i^{truth}|. We present the average across all areas and all simulations:

  \frac{1}{100N} \sum_{s=1}^{100} \sum_{i=1}^{N} |\hat{y}_i - y_i^{truth}|   (6)

- Squared deviation: This is defined as (\hat{y}_i - y_i^{truth})^2. Similarly, we present the average across all areas and simulations:

  \frac{1}{100N} \sum_{s=1}^{100} \sum_{i=1}^{N} (\hat{y}_i - y_i^{truth})^2   (7)

- Width of confidence interval: For the EBP estimates, this is derived from the estimated MSE. For the three ML estimators, the upper and lower bounds of the CI are derived using percentiles from the bootstrap distribution.

- Coverage rate: For each area, coverage is defined as I(y_i^{truth} \in [CI_i^{lower}, CI_i^{upper}]), where I(\cdot) is the indicator function and CI refers to the confidence interval for a given area. We calculate the proportion of areas whose true values fall within the confidence interval:

  \frac{1}{100N} \sum_{s=1}^{100} \sum_{i=1}^{N} I\left(y_i^{truth} \in [CI_i^{lower}, CI_i^{upper}]\right)   (8)
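For a single simulated survey, these statistics can be computed as in the following sketch, assuming a hypothetical data frame res with one row per area and columns est, truth, lower, and upper; averaging across the 100 simulations yields the reported values.

```r
# Evaluation statistics for one simulated survey (a sketch).
pearson  <- cor(res$est, res$truth, method = "pearson")
spearman <- cor(res$est, res$truth, method = "spearman")
abs_dev  <- mean(abs(res$est - res$truth))
sq_dev   <- mean((res$est - res$truth)^2)
ci_width <- mean(res$upper - res$lower)
coverage <- mean(res$truth >= res$lower & res$truth <= res$upper)
```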
3 Results

We first examine the accuracy of the four candidate estimators. We present average accuracy statistics across all one hundred samples in Table 3. The table presents four separate indicators of accuracy – Pearson correlation, Spearman (rank) correlation, absolute deviation, and squared deviation – with values averaged across all areas and the 100 samples. There are large differences across estimators. When looking at correlations, there is a noticeable difference between traditional EBP – which has been the workhorse of small area prediction for decades – and the three machine learning estimators; the latter consistently outperform traditional EBP across all countries and outcomes. In fact, there is not a single case where traditional EBP is more highly correlated with the reference measure than any of the ML methods, for any of the outcomes. Among the three ML methods, BRF and XGBoost perform slightly better in terms of correlations, though the differences are not large and the ranking of the two is not consistent.

Correlations are an important measure of accuracy because they reflect targeting accuracy, or the ability to discern the poorest areas. However, correlations do not capture bias, in the sense that correlations are unchanged when a constant is added to all predictions. Because of this, we also examine deviations from truth, which are sensitive to bias. When looking at absolute and squared deviations, the three ML estimators again substantially outperform EBP, at least on average. XGBoost is generally the most accurate across the five outcomes, although Cubist and BRF perform better in Madagascar.

Table 3: Accuracy statistics across simulations

                       EBP     Cubist   XGBoost   BRF
Madagascar (assets)
  corr. (Pearson)      0.830   0.881    0.897     0.893
  corr. (Spearman)     0.750   0.808    0.824     0.836
  absolute dev.        0.259   0.209    0.231     0.230
  squared dev.         0.109   0.070    0.078     0.072
Malawi (assets)
  corr. (Pearson)      0.801   0.870    0.889     0.881
  corr. (Spearman)     0.810   0.863    0.879     0.882
  absolute dev.        0.289   0.230    0.215     0.237
  squared dev.         0.220   0.147    0.125     0.152
Malawi (poverty)
  corr. (Pearson)      0.834   0.902    0.915     0.923
  corr. (Spearman)     0.797   0.862    0.877     0.887
  absolute dev.        0.147   0.079    0.076     0.103
  squared dev.         0.036   0.015    0.013     0.018
Mozambique (assets)
  corr. (Pearson)      0.867   0.886    0.920     0.922
  corr. (Spearman)     0.758   0.786    0.816     0.841
  absolute dev.        0.204   0.174    0.168     0.168
  squared dev.         0.074   0.061    0.047     0.046
Sri Lanka (assets)
  corr. (Pearson)      0.867   0.895    0.912     0.884
  corr. (Spearman)     0.850   0.880    0.903     0.896
  absolute dev.        0.167   0.146    0.135     0.162
  squared dev.         0.053   0.042    0.036     0.058

Note: Measures of accuracy are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. EBP refers to small area estimation and BRF refers to local linear forests.

Some of the differences are quite large. Relative to traditional EBP, for example, XGBoost's absolute deviation is 10.9 percent lower in Madagascar and 18.9 percent lower in Sri Lanka. The largest differences are in Malawi, particularly for poverty: traditional EBP's absolute deviation is 94.9 percent larger than XGBoost's. Squared deviation is even starker: traditional EBP's squared deviation is 183.6 percent larger than XGBoost's. All three ML methods vastly outperform traditional EBP when predicting poverty in Malawi. For assets, the average squared deviation across all four countries is 0.114 for EBP, 0.080 for Cubist, 0.072 for XGBoost, and 0.082 for BRF. Overall, XGBoost appears to be the most accurate method in terms of deviations, and traditional EBP the least accurate.

However, accuracy is not the only measure of concern. In particular, it is important to accurately estimate measures of uncertainty, since these are often used to determine whether estimates are sufficiently precise to be published.
While machine learning methods have consistently been shown to be good predictors of a number of outcomes, they have generally had less success when it comes to measures of uncertainty. In Table 4, we present two key statistics of uncertainty, with the uncertainty measures calculated using the random effect block residual bootstrap described in the methodology section. The first statistic is the coverage rate, which shows how often the true value (from the census) falls within the confidence interval for a given prediction. Since we calculate 95-percent confidence intervals, we expect the coverage rates to be around 0.95.

Table 4: Uncertainty statistics across simulations

                       EBP     Cubist   XGBoost   BRF
Madagascar (assets)
  coverage             0.953   0.974    0.894     0.978
  CI width             1.334   1.224    0.761     1.147
Malawi (assets)
  coverage             0.925   0.925    0.915     0.936
  CI width             1.312   1.365    1.122     1.646
Malawi (poverty)
  coverage             0.780   0.931    0.928     0.934
  CI width             0.516   0.591    0.518     0.641
Mozambique (assets)
  coverage             0.957   0.983    0.979     0.987
  CI width             1.259   1.496    1.101     1.345
Sri Lanka (assets)
  coverage             0.964   0.980    0.973     0.971
  CI width             0.941   1.177    1.039     1.387

Note: Measures of uncertainty are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. EBP refers to small area estimation and BRF refers to local linear forests. We do not use actual standard errors when calculating coverage rates for Cubist, XGBoost, and BRF. Instead, we use the appropriate percentiles of the bootstrapped distribution. As such, we present the width of the 95-percent confidence interval instead of the standard errors.

Overall, coverage rates for all four estimators are quite good, with all rates around 95 percent. Average coverage rates for assets are 0.950 for EBP, 0.966 for Cubist, 0.940 for XGBoost, and 0.968 for BRF, while overall coverage rates including poverty in Malawi are 0.916 for EBP, 0.959 for Cubist, 0.938 for XGBoost, and 0.961 for BRF. The large decrease in the overall rate for EBP comes from a coverage rate of just 0.780 for poverty, which may be partly due to the less accurate predictions for this particular outcome. There is, however, variation in coverage rates: coverage for XGBoost is closer to 0.9 for assets in Madagascar, while BRF goes as high as almost 0.99 for assets in Mozambique. Some of the lower coverage rates are due to out-of-sample estimates, a point to which we return below. Nonetheless, average coverage rates are quite good for all estimators. This is particularly notable for the three ML estimators, since, unlike for the older and more established estimation methods for EBP, no widespread consensus exists on how best to measure their uncertainty.

In addition to coverage rates, we also present the average width of confidence intervals ("CI width") as an additional metric. For EBP, the upper and lower confidence bounds are estimated by adding and subtracting 1.96 times the square root of the estimated mean squared error to the point estimate. For the three machine learning methods, the CIs are calculated using bootstrap percentiles; as such, we present CI widths instead of standard errors in the table. In line with the (generally) improved accuracy of XGBoost over the other methods, XGBoost also tends to have the narrowest confidence intervals.[13] XGBoost has the smallest intervals except for Malawi poverty and Sri Lanka assets.
However, the smallest intervals for Malawi poverty belong to traditional EBP, which achieves a coverage rate of just 0.780. Of the other cases, only for Sri Lanka assets is XGBoost more uncertain than the others. As noted, though, XGBoost does slightly underestimate uncertainty in Madagascar. Not surprisingly, the results show a trade-off between estimated precision and coverage rates. While XGBoost achieves accurate coverage on average with the smallest confidence intervals, it also clearly underestimates uncertainty slightly in at least one situation. Nonetheless, its coverage rates are still quite respectable in Madagascar, at around 0.9, nowhere near the worst performance of EBP (0.780 for poverty in Malawi). On the other hand, BRF and Cubist never underestimate uncertainty to the same extent as XGBoost does in Madagascar, but they achieve this with much larger confidence intervals.

[13] Since we use a residual bootstrap, higher accuracy leads to smaller confidence intervals by construction. However, it is not a one-to-one improvement, since the bootstrap consists of two steps – using residuals of both areas and subareas – while we only present results for areas.

3.1 In-Sample and Out-of-Sample Estimates

We next look at accuracy and precision based on whether an area is included in the sample or not. In each of the 100 samples per country, we randomly select subareas, with probability proportional to size, and then randomly select households from within each selected subarea. Not all areas appear in every sample, with less populous areas more likely to be excluded from the sample on a given draw. For this section, we define an area as "in sample" if at least one subarea from within that area is selected.

We start with accuracy statistics in Table 5. The first five columns include in-sample areas only, while the last four columns include out-of-sample areas. The extra column for in-sample areas contains the direct survey estimates, which we include as a benchmark against which to gauge the performance of the other estimators. Statistics are means based on all 100 independent samples. The direct estimates are in the first column. The direct estimate is quite accurate for four of the five outcomes, although its Pearson correlation is just 0.794 for assets in Sri Lanka (and 0.735 for the rank correlation). Direct estimates consistently have the lowest correlation, followed by EBP, while the three machine learning methods tend to perform better.

The main pattern in the table is that out-of-sample estimates are worse than in-sample estimates, which is true for all outcomes and all estimators. This is a result of calibrating the model on a sample that under-represents sparsely populated sub-areas. Since these areas are systematically different from larger areas, the accuracy of the out-of-sample estimates suffers. The difference in performance in terms of correlations across the five outcomes is sometimes substantial. For example, the (Pearson) correlation for Malawi assets drops by 17.7 percent for EBP, by 15.2 percent for Cubist, by 11.9 percent for XGBoost, and by 11.3 percent for BRF. The smallest drop in performance is in Madagascar, where correlation drops by 13.7 percent for EBP, by 7.3 percent for Cubist, by 5.9 percent for XGBoost, and by 6.5 percent for BRF. We also see variation in accuracy as measured by deviations from truth. In Sri Lanka, squared deviation increases by 369.1 percent for EBP, by 676.2 percent for Cubist, by 601.2 percent for XGBoost, and by 541.9 percent for BRF.
Malawi assets again show the largest increases in deviation, across all estimators and for both absolute and squared deviation. Despite these differences, a consistent pattern emerges both in and out of sample. Of the four candidate methods, EBP is consistently the least accurate – sometimes by large margins – both in and out of sample. The three ML estimators vary across countries and outcomes. For example, BRF is slightly better than XGBoost in Madagascar, the two are approximately the same in Mozambique, and XGBoost is slightly better in Malawi and Sri Lanka.

Table 5: Accuracy statistics across simulations by sample status

                              In sample                                  Out of sample
                    direct   EBP     Cubist  XGB     BRF      EBP     Cubist  XGB     BRF
Madagascar (assets)
  corr. (Pearson)   0.874    0.911   0.926   0.933   0.934    0.786   0.858   0.878   0.873
  corr. (Spearman)  0.784    0.818   0.846   0.865   0.864    0.727   0.794   0.811   0.827
  absolute dev.     0.252    0.243   0.199   0.217   0.218    0.265   0.212   0.236   0.234
  squared dev.      0.126    0.100   0.062   0.073   0.065    0.112   0.072   0.080   0.075
Malawi (assets)
  corr. (Pearson)   0.870    0.891   0.953   0.955   0.945    0.734   0.808   0.841   0.838
  corr. (Spearman)  0.794    0.841   0.908   0.909   0.915    0.767   0.816   0.846   0.846
  absolute dev.     0.255    0.194   0.137   0.133   0.143    0.410   0.346   0.319   0.357
  squared dev.      0.124    0.083   0.037   0.034   0.045    0.391   0.284   0.241   0.288
Malawi (poverty)
  corr. (Pearson)   0.805    0.867   0.947   0.945   0.955    0.795   0.852   0.879   0.886
  corr. (Spearman)  0.766    0.796   0.885   0.887   0.904    0.779   0.830   0.851   0.861
  absolute dev.     0.129    0.150   0.063   0.066   0.095    0.143   0.098   0.087   0.113
  squared dev.      0.030    0.034   0.007   0.008   0.014    0.038   0.024   0.019   0.024
Mozambique (assets)
  corr. (Pearson)   0.870    0.919   0.942   0.956   0.957    0.838   0.853   0.900   0.903
  corr. (Spearman)  0.740    0.782   0.831   0.858   0.881    0.753   0.768   0.802   0.827
  absolute dev.     0.237    0.192   0.149   0.144   0.145    0.208   0.183   0.177   0.177
  squared dev.      0.115    0.065   0.042   0.035   0.034    0.078   0.069   0.052   0.051
Sri Lanka (assets)
  corr. (Pearson)   0.794    0.895   0.915   0.913   0.897    0.799   0.843   0.901   0.863
  corr. (Spearman)  0.735    0.862   0.893   0.901   0.899    0.796   0.826   0.896   0.863
  absolute dev.     0.233    0.142   0.125   0.125   0.137    0.221   0.192   0.158   0.216
  squared dev.      0.095    0.034   0.027   0.030   0.038    0.092   0.073   0.050   0.102

Note: Measures of accuracy are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. EBP refers to small area estimation and BRF refers to local linear forests. Sample status is defined separately in each independent sample, such that a single area can be in sample in one draw and out of sample in another. In-sample areas are areas in which at least one subarea is randomly selected to be included in the sample. Out-of-sample areas are areas in which not a single subarea is randomly selected.
Table 6: Uncertainty statistics across simulations by sample status

                        In sample                                  Out of sample
              direct   EBP     Cubist  XGB     BRF      EBP     Cubist  XGB     BRF
Madagascar (assets)
  coverage    0.818    0.885   0.979   0.909   0.986    0.976   0.972   0.890   0.976
  CI width    0.859    1.009   1.216   0.760   1.169    1.444   1.227   0.762   1.140
Malawi (assets)
  coverage    0.895    0.941   0.987   0.983   0.987    0.905   0.850   0.829   0.871
  CI width    1.068    0.982   1.328   1.106   1.605    1.727   1.410   1.143   1.698
Malawi (poverty)
  coverage    0.756    0.700   0.987   0.984   0.988    0.881   0.862   0.856   0.866
  CI width    0.595    0.423   0.605   0.532   0.670    0.632   0.573   0.500   0.603
Mozambique (assets)
  coverage    0.850    0.865   0.990   0.988   0.994    0.991   0.980   0.975   0.984
  CI width    0.950    0.772   1.477   1.092   1.349    1.439   1.504   1.105   1.343
Sri Lanka (assets)
  coverage    0.884    0.972   0.994   0.982   0.988    0.946   0.952   0.955   0.935
  CI width    0.939    0.833   1.174   1.037   1.383    1.175   1.186   1.041   1.396

Note: Measures of uncertainty are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. EBP refers to small area estimation and BRF refers to local linear forests. We do not use actual standard errors when calculating coverage rates for Cubist, XGBoost, and BRF. Instead, we use the appropriate percentiles of the bootstrapped distribution. As such, we present the width of the 95-percent confidence interval instead of the standard errors. Sample status is defined separately in each independent sample, such that a single area can be in sample in one draw and out of sample in another. In-sample areas are areas in which at least one subarea is randomly selected to be included in the sample. Out-of-sample areas are areas in which not a single subarea is randomly selected.

For the three machine learning models with the random effect block bootstrap, in-sample coverage rates tend to be higher than 0.95 across all outcomes and estimators. Madagascar, where the coverage rate is 0.909 for XGBoost, is the only exception. EBP, on the other hand, underestimates uncertainty for three of the five outcomes, with a coverage rate of 0.700 for Malawi poverty, which is slightly below results from other contexts. The direct estimates also consistently underestimate uncertainty, with an average coverage rate of 0.841.

As noted above, out-of-sample estimates tend to be less accurate than in-sample estimates, because the sample under-represents sparsely populated areas. The lower out-of-sample accuracy in turn leads to lower coverage rates out of sample. This is particularly noticeable in Malawi, where coverage rates for assets are 0.905 for EBP, 0.850 for Cubist, 0.829 for XGBoost, and 0.871 for BRF. Uncertainty is similarly underestimated for poverty in Malawi, with coverage rates of 0.881 for EBP, 0.862 for Cubist, 0.856 for XGBoost, and 0.866 for BRF. Meanwhile, in Madagascar, XGBoost slightly underestimates uncertainty out of sample while the other three estimators slightly overestimate it.

One disadvantage of the random effect block bootstrap is that it can only utilize data from sampled sub-areas and areas, and therefore cannot effectively distinguish between sampled and non-sampled areas when estimating uncertainty. Despite the presence of sample weights, the sample systematically under-represents less populous areas, and extrapolations into non-sampled areas are less accurate than estimates for sampled areas. Yet this additional source of model error is not reflected in the uncertainty estimates.
There are possible alternatives, such as using a parametric bootstrap – which is how EBP calculates uncertainty – or modeling heteroskedasticity, as in Elbers, Lanjouw, and Lanjouw (2003). However, the random effect block bootstrap does quite well in the simulations reported above, with coverage rates always exceeding 85 percent out of sample. In these cases, the tendency for the bootstrap procedure to understate uncertainty out of sample may not be a major issue.

There are two possible reasons why out-of-sample predictions are less accurate than in-sample predictions. The first is that the model is overfit. However, this seems very unlikely given that regularization methods are used to avoid overfitting to the particular sample. For example, when implementing EBP, we select a model using LASSO to avoid overfitting. The ML estimators, meanwhile, employ different methods to help avoid overfitting, such as using a random subset of data and/or variables across trees and splits, and estimating regularized objective functions. The more likely explanation is that smaller areas – which are much less likely to appear in the sample – differ systematically from sampled areas and are therefore related to the predictors in a different way.

To shed light on this, we examine differences in accuracy between estimates for sampled and non-sampled areas, and how these change when area fixed effects are included. Including area fixed effects controls for fixed characteristics of areas, meaning that the remaining difference in accuracy is attributable solely to these areas being excluded from the sample.

Table 7: Difference in accuracy within areas: in-sample vs. out-of-sample

                     EBP                    cubist                 XGB                    BRF
                 (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
madagascar       -0.122***  -0.102***  -0.051***  -0.041***  -0.105***  -0.065***  -0.067***  -0.038***
                 (0.020)    (0.016)    (0.018)    (0.006)    (0.018)    (0.011)    (0.019)    (0.006)
malawi (assets)  -0.624***  -0.005     -0.776***  -0.209***  -0.743***  -0.162***  -0.792***  -0.096***
                 (0.053)    (0.015)    (0.054)    (0.016)    (0.058)    (0.024)    (0.065)    (0.011)
malawi (poor)    0.247***   0.020**    -0.020     -0.142***  -0.009     0.002      0.040      -0.057***
                 (0.062)    (0.009)    (0.085)    (0.019)    (0.065)    (0.025)    (0.069)    (0.011)
mozambique       -0.123***  -0.028     -0.182***  -0.082***  -0.216***  -0.095***  -0.212***  -0.092***
                 (0.022)    (0.021)    (0.020)    (0.008)    (0.020)    (0.012)    (0.024)    (0.007)
sri lanka        -0.387***  -0.172***  -0.379***  -0.178***  -0.270***  -0.002     -0.366***  -0.112***
                 (0.039)    (0.023)    (0.038)    (0.016)    (0.037)    (0.015)    (0.055)    (0.013)

Fixed effects:
simulation       Yes        Yes        Yes        Yes        Yes        Yes        Yes        Yes
area             No         Yes        No         Yes        No         Yes        No         Yes

Note: Standard errors are clustered at the simulation and area level. Each row-and-column combination is a separate regression, where the dependent variable is the (log) absolute deviation from truth for a given area/estimator and the independent variable is an indicator for whether the area appears in the sample in a given simulation. *p<0.1 **p<0.05 ***p<0.01

Table 7 presents a set of regressions in which we estimate the difference in (log) absolute deviation between areas that were included in and excluded from the sample. In each pair of columns, the first column includes only simulation fixed effects, while the second column also includes area fixed effects, which restrict identification to within-area changes in sample status across simulations. The first column in each pair shows that in-sample areas have much higher accuracy (lower absolute deviations) than out-of-sample areas. The differences range from approximately 12 to 39 percent. The specification is sketched below.
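A specification of this form could be estimated as in the following sketch, which assumes a hypothetical long data frame acc (one row per area-simulation for a given estimator, with columns abs_dev, in_sample, simulation, and area) and uses the fixest package; the paper does not specify its estimation routine, so this is purely illustrative.

```r
# In-sample vs. out-of-sample accuracy regressions (a sketch).
library(fixest)

# Column (1)-style: simulation fixed effects only
m1 <- feols(log(abs_dev) ~ in_sample | simulation,
            data = acc, cluster = ~ simulation + area)

# Column (2)-style: add area fixed effects, so identification comes from
# within-area changes in sample status across simulations
m2 <- feols(log(abs_dev) ~ in_sample | simulation + area,
            data = acc, cluster = ~ simulation + area)

etable(m1, m2)  # compare the two specifications side by side
```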
There are several important differences in the second column of each pair, when area fixed effects are included. First, focusing on XGBoost, the inclusion of fixed effects substantially decreases the estimated difference in accuracy by sample status. The coefficient with area fixed effects is only 61.9 percent as large as the coefficient without them in Madagascar, only 21.8 percent as large for Malawi assets, and only 44.0 percent as large in Mozambique. In Sri Lanka, the coefficient essentially falls to zero, while both coefficients are indistinguishable from zero for poverty in Malawi. In other words, once we take into account differences in characteristics across areas, the difference in accuracy by sample status shrinks substantially. For two of the outcomes, in fact, sample status does not predict any difference in accuracy. This suggests that systematic differences between sampled and non-sampled areas explain a large share of the difference in accuracy by sample status.

3.2 Accuracy for Poorer and Richer Areas

Figure 1 shows how accuracy varies for poorer and wealthier areas. The first y-axis plots the mean squared deviation across samples, the x-axis plots the true value for each area, and the second y-axis plots the kernel density estimate of the true values. Several patterns emerge. First, accuracy tends to be lowest where density is lowest; in other words, predictions are most accurate in areas of common support. All prediction methods are far more accurate when interpolating within the sample than when extrapolating where there is little common support. Second, areas of low density in the sample do not always correspond to smaller areas. In Madagascar, for example, the fewest observations are in the upper tail of the distribution, where the asset index is highest. These are larger, more urban areas with high populations, which are therefore much more likely to appear in the sample (since the probability of inclusion is related to size). Nonetheless, we see the worst predictions in these areas, because they are systematically different from most areas in the sample. In other words, the problem is more complex than simply whether an area is included in the sample. This raises the possibility of revising sampling strategies to oversample the tails of the distribution of the outcome of interest, in addition to more populous areas. The results also highlight the importance of accurately reporting large confidence intervals at the tails of the distribution.

Figure A2 in the appendix shows predicted-true plots for all five outcomes, using local polynomial smoothing to better trace out average predictions across the range of truth. In general, average predictions hew relatively closely to truth, represented by the 45-degree dashed line. The most notable exception is assets in Malawi, where the tails are rather poorly estimated by all methods. However, Figure A3 in the appendix presents the accompanying scatterplot, which makes clear that the Malawi result is driven by just a few areas in the tails. In the lower tail, this is a single area. While the upper tail has more areas further from the 45-degree line, the majority are not as extreme as the polynomial smoothing suggests. Consistent with previous results, XGBoost and BRF are again usually closest to the 45-degree line. Importantly, EBP performs poorly in the upper tail for poverty in Malawi, clearly worse than the other estimators.
These are arguably the areas that are most important to predict accurately in order to guide interventions to reduce poverty.

Figure 1: Deviations from truth vs. welfare
[Five panels: A. Madagascar assets; B. Malawi assets; C. Malawi poverty; D. Mozambique assets; E. Sri Lanka assets. Each panel plots the squared deviation (first y-axis) and the kernel density of truth (second y-axis) against the true value (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: All panels are smoothed conditional means of the mean squared deviation across 100 independent samples (first y-axis) on truth (x-axis). The kernel density estimate refers to the density of truth, which is on the x-axis.

3.3 Simple Cubist Regression

Two big advantages of traditional EBP are its parsimony and transparency. Models can be summarized in a simple regression table that indicates how each predictor relates to the outcome. The ML methods considered here, by contrast, are opaque, because they aggregate large numbers of "weak learners" that are estimated using different subsets of candidate variables and observations. Because it is difficult to display a large number of weak learners in a tractable way, these prediction methods appear to be a black box.

Cubist regression, when the estimation options are purposefully restricted, offers a middle ground between the transparency and parsimony of a linear model and the black-box nature of more sophisticated machine learning methods. In the main results above, we tune the hyperparameters for Cubist in the same way as for the other methods, to maximize prediction accuracy. However, we can set the parameters to produce a more transparent model by allowing only a small number of rules and discarding "committees" (essentially multiple Cubist regression models). This option is not available for XGBoost and BRF, given how they are designed. We therefore consider a particularly simple Cubist model, with three rules and no committees; a sketch of such a specification appears below. Compared with EBP, this simple Cubist model benefits from being able to estimate three distinct linear models on different subsets of the data, with the subsets themselves selected to maximize predictive accuracy. The model can be reported quite simply as three regression tables, with an additional row defining the relevant subset of the data for each. Unlike EBP, however, Cubist regression does not include a mixed effect that treats the direct survey estimate as a prior, and it therefore discards some useful information in sampled areas.
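As an illustration, the following R snippet shows how such a restricted specification can be requested from the Cubist package (Kuhn et al. 2022); the feature data frame X and outcome vector y are placeholders.

```r
library(Cubist)

# X: data frame of geospatial predictors; y: outcome aggregated to the
# estimation level. rules = 3 caps the number of rules; committees = 1
# disables boosting-style committees, so the fitted model stays readable
# as three regression tables with their defining conditions.
fit <- cubist(x = X, y = y,
              committees = 1,
              control = cubistControl(rules = 3))

summary(fit)  # prints each rule's conditions and its linear model
pred <- predict(fit, newdata = X, neighbors = 0)  # no instance-based correction
```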
We report the results from the simplified three-rule Cubist model in Table 8. In column one, we present the EBP results (shown earlier) in order to compare them directly with the more transparent Cubist model. We include all four accuracy statistics and both uncertainty statistics for each outcome. The average Pearson correlation across all five outcomes is 0.840 for EBP and 0.848 for the simplified Cubist, while the respective rank correlations are 0.793 and 0.810. On average, the correlations of the simple Cubist model are higher than those for EBP. We can similarly compare deviations for the two estimators. In terms of absolute deviation, the average across the four asset outcomes is 0.230 for EBP and 0.213 for Cubist, while the corresponding squared deviations are 0.114 and 0.107. For poverty in Malawi, the simple Cubist model vastly outperforms EBP in terms of deviations: its absolute deviation is 25.5 percent smaller and its squared deviation is 30.0 percent smaller.

Table 8: Accuracy statistics across simulations

                        EBP      Cubist (simple)
Madagascar (assets)
  corr. (Pearson)       0.830    0.856
  corr. (Spearman)      0.750    0.793
  absolute dev.         0.259    0.217
  squared dev.          0.109    0.085
  coverage              0.953    0.969
  CI width              1.334    1.244
Malawi (assets)
  corr. (Pearson)       0.801    0.801
  corr. (Spearman)      0.810    0.810
  absolute dev.         0.289    0.289
  squared dev.          0.220    0.220
  coverage              0.925    0.914
  CI width              1.312    1.467
Malawi (poverty)
  corr. (Pearson)       0.834    0.848
  corr. (Spearman)      0.797    0.821
  absolute dev.         0.147    0.110
  squared dev.          0.036    0.025
  coverage              0.780    0.909
  CI width              0.516    0.584
Mozambique (assets)
  corr. (Pearson)       0.867    0.882
  corr. (Spearman)      0.758    0.784
  absolute dev.         0.204    0.175
  squared dev.          0.074    0.063
  coverage              0.957    0.979
  CI width              1.259    1.429
Sri Lanka (assets)
  corr. (Pearson)       0.867    0.853
  corr. (Spearman)      0.850    0.843
  absolute dev.         0.167    0.172
  squared dev.          0.053    0.061
  coverage              0.964    0.971
  CI width              0.941    1.256

Note: Measures of accuracy are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. The first column is for EBP; the second is for a simple version of Cubist with just three rules and no committees.

Overall, simple Cubist regression tends to outperform EBP in more cases than not, and thus may be a preferred option to EBP models, although the difference is not large. Furthermore, the simple Cubist model gives less accurate predictions in Sri Lanka. Another option, which we leave for future research, would be to estimate a version of the simple Cubist model that includes a conditional random effect, building on Krennmair and Schmid (2022). This would combine the additional model flexibility that Cubist offers with the mixed-effect framework offered by EBP. For now, there appears to be a clear trade-off between accuracy (XGBoost and BRF) and transparency (EBP and simplified Cubist).

4 An Application: The 2019 Malawi IHS

The results in the previous sections demonstrated that the three ML methods, particularly XGBoost and BRF, predict more accurately than EBP, and that the random effects block bootstrap generates accurate estimates of uncertainty. However, the evidence presented so far relies on simulated samples. Though the samples come from real data, the sampling structure is somewhat simplified relative to most household surveys. To see whether these results carry over to an actual survey, we turn to the 2019 Malawi Fifth Integrated Household Survey (IHS). The IHS took place the year following the census used above and is therefore well timed to check whether the main results hold when using a real-world survey. The IHS collected information on both assets and expenditures. However, there are two minor concerns. First, the IHS took place the year after the census. While a year might not lead to large changes in a developed country, this is not necessarily true in a country like Malawi, where rain-fed agriculture predominates and slightly different weather can lead to large changes in poverty. Second, we use the IHS to impute expenditures and poverty into the census.
This could lead to information leakage for poverty, and thus to inflated measures of accuracy. This is not a concern for assets, however.

The 2019 IHS is a large survey. It consists of 11,249 households spread across 697 subareas (enumeration areas) and 321 areas (Traditional Authorities, or TAs). The country as a whole has 420 TAs, meaning that 99 areas are out of sample in the survey. We can compare results in areas covered by the survey (in-sample areas) with these out-of-sample areas to see whether the prediction methods differ greatly in predictive power depending on whether an area is in the sample.

Table 9 presents accuracy statistics for in-sample areas (the first five columns) and out-of-sample areas (the last four columns). As before, we include direct estimates as a point of comparison. All four alternative estimators perform better than the direct estimates, across all accuracy statistics and for both assets and poverty.

Table 9: Accuracy and uncertainty statistics for the 2019 Malawi IHS

                           In sample                              Out of sample
               Direct   EBP     cubist  XGB     BRF       EBP     cubist  XGB     BRF
Assets
  corr. (Pearson)   0.906   0.927   0.952   0.953   0.957     0.705   0.765   0.795   0.806
  corr. (Spearman)  0.859   0.871   0.900   0.904   0.906     0.725   0.777   0.813   0.817
  abs. deviation    0.227   0.202   0.172   0.151   0.145     0.550   0.502   0.481   0.476
  squared dev.      0.083   0.063   0.047   0.038   0.034     0.615   0.488   0.469   0.452
  coverage          0.776   0.869   1.000   1.000   1.000     0.828   1.000   1.000   1.000
  CI width          0.736   0.765   0.991   0.796   0.985     1.947   1.162   0.905   1.124
Poverty
  corr. (Pearson)   0.804   0.886   0.895   0.895   0.913     0.799   0.815   0.832   0.828
  corr. (Spearman)  0.771   0.828   0.826   0.838   0.854     0.799   0.792   0.804   0.801
  abs. deviation    0.119   0.089   0.086   0.086   0.080     0.140   0.128   0.129   0.134
  squared dev.      0.023   0.013   0.012   0.012   0.010     0.036   0.032   0.029   0.031
  coverage          0.882   0.882   1.000   1.000   1.000     0.990   1.000   1.000   1.000
  CI width          0.574   0.393   0.432   0.335   0.427     0.861   0.462   0.351   0.455

Note: Measures of accuracy and uncertainty are averages across all areas.

Since the IHS is a relatively large survey, the direct estimates are quite accurate; the in-sample Pearson correlation is 0.906 for the mean asset index. However, as before, direct estimates are less accurate for poverty, at 0.804, because poverty rates are the mean of a discrete rather than a continuous variable. The overall patterns of accuracy remain consistent with the previous results. All four candidate small area estimation methods outperform the direct estimate, sometimes substantially. For both in-sample and out-of-sample areas, XGBoost and BRF perform noticeably better than EBP, while Cubist also improves on EBP, but by less. For assets in in-sample areas, the Pearson correlation is 2.8 percent higher for XGBoost than for EBP, while the absolute and squared deviations are 25.0 percent and 40.4 percent lower, respectively. For poverty, the differences are less pronounced. Differences are relatively larger in out-of-sample areas, though the overall rank ordering is the same. For poverty, the correlation is 12.7 percent higher for XGBoost than for EBP, while the absolute and squared deviations are 12.6 percent and 23.8 percent lower, respectively. Accuracy decreases markedly for all estimators when moving out of sample for assets, while the drop-off is much less pronounced for poverty, especially for XGBoost.
Moving from in-sample to out-of-sample areas, XGBoost sees a drop of 16.6 percent in the Pearson correlation, along with increases in absolute deviation and a 68.6 percent increase in squared deviation. This leads XGBoost to outperform all other estimators in out-of-sample areas for poverty, though BRF appears very slightly more accurate for assets. Given that we have only one sample, however, and given how close some of the statistics are, it is difficult to determine conclusively whether XGBoost or BRF is more accurate.

Table 9 also presents the two statistics related to uncertainty: the coverage rate and the mean width of the confidence intervals. Both the direct estimates and EBP underestimate uncertainty, with EBP doing so for three of the four subsamples: only out-of-sample poverty has coverage above 0.95, while the rest are all below 0.9. The three ML estimators, on the other hand, overestimate uncertainty; not a single area falls outside the confidence intervals. Importantly, XGBoost accomplishes this with the narrowest CIs, on average, among the estimators, with only EBP for in-sample assets being narrower. Out of sample, the widths of the CIs for XGBoost are 53.5 percent lower for assets and 59.3 percent lower for poverty than for EBP, despite XGBoost attaining higher coverage rates in both instances.

Comparing the out-of-sample accuracy penalty for assets and poverty gives some insight into the importance of different reference measures. In general, the out-of-sample penalty tends to be higher for assets than for poverty. This is surprising given the difference in how the reference benchmark is constructed. For assets, the reference benchmark is measured directly in the census. For poverty, by contrast, the reference benchmark is derived from a model estimated using the 2019 IHS. By construction, that model is estimated using data that exclude non-sampled areas, which are systematically different from sampled areas. One might expect this to lead to particularly large discrepancies for poverty out of sample, compared both with in-sample estimates and with assets. That the opposite is true suggests that, in this context, the non-representative nature of the IHS has limited impact on the comparison of accuracy across methods in out-of-sample areas.

Finally, Figure A6 in the appendix shows the predicted-true plots for assets (left panel) and poverty (right panel), again using conditional-mean smoothed lines. The patterns are generally similar to those in the previous section for Malawi. The asset predictions are accurate in the middle of the distribution but poor in the upper and lower tails, though Figure A5 in the appendix shows that the lower tail, in particular, is driven by just one or two areas.

Although these results are estimated using an independent survey, they generally echo the findings of the previous section. XGBoost and BRF, on average, perform best on measures of both accuracy and precision. EBP is the least accurate of the four prediction methods, though it still offers a significant improvement on direct estimates in sample. In all cases, in-sample estimates are more accurate than out-of-sample estimates by a substantial margin.

5 Conclusion

This paper evaluates the use of tree-based machine learning techniques for small area estimation of wealth and poverty using household census data from four countries.
In addition to evaluating accuracy, it proposes and implements a bootstrap routine for the machine learning estimators and validates its performance using real-world data. The three tree-based machine learning methods evaluated – Cubist regression, extreme gradient boosting (XGBoost), and boosted regression forests (BRF) – all significantly outperformed the linear mixed model traditionally used for small area estimation (Molina and Rao 2010; Tzavidis et al. 2018). Of these three methods, XGBoost tended to perform on par with BRF, except in the left tail of the Sri Lanka wealth distribution, where it was far more accurate. The rank ordering of XGBoost and BRF is not consistent across statistics, however, and BRF rests on firmer theoretical ground than XGBoost. Meanwhile, predictions generated using Cubist regression models are generally somewhat less accurate.

The results make a strong case for using XGBoost or BRF when the sample includes a sufficiently large number of subareas, those subareas comprise a small share of subareas in the population, and transparency and parsimony are not first-order concerns. In other cases, where transparency and parsimony are important or the number of sampled subareas is small, linear mixed EBP models or Cubist regression with a small number of rules are viable options. The Cubist option allows for more flexible models, while the EBP approach contains a mixed effect that allows the survey data to serve as a prior. In cases where the sample contains a large share of the population's subareas, Bayesian or empirical Bayesian methods such as EBP become more attractive. Because predicted values are estimated fairly accurately and precisely relative to direct estimates in this context, the inclusion of the mixed effect does not appear to be a crucial element in generating accurate estimates; its beneficial effect is generally outweighed by the more flexible set of linear models offered by a simple Cubist regression with three rules. There are also methods that combine machine learning with mixed effects (Krennmair and Schmid 2022), though these cannot yet be applied to gradient boosting using readily available software.

One notable finding, which has also been observed in other contexts, is the significantly greater accuracy of estimates in sampled areas than in non-sampled areas. This difference occurs despite the use of sampling weights and of regularization methods to prevent overfitting. The differences are much diminished when comparing accuracy within the same area across simulations. Nonetheless, the tendency for samples to under-represent less densely populated areas can lead to less accurate predictions out of sample. This finding underlines the benefits of including all target areas in the sample. Further research could explore whether and how re-weighting the sample can improve the accuracy of out-of-sample predictions. Related to this is the difficulty of predicting accurately at the tails of the welfare distribution, which suggests potential benefits from oversampling the tails. Further research can also explore how the relative accuracy of different methods depends on the size and structure of the sample, as this analysis considers only one type of sample.

A key contribution of the paper is to evaluate the accuracy of the uncertainty estimates generated by a random effect block residual bootstrap, as proposed by Chambers and Chandra (2013) and applied by Krennmair and Schmid (2022).
These estimates perform well across the board, as coverage rates never fall below 98 percent in sample and 85 percent out of sample. Overall, the results indicate that the combination of extreme gradient boosting or boosted regression forests, the random effect block residual bootstrap, and publicly available geospatial data offers a practical way to significantly improve on both direct survey estimates and EBP estimates when geolocated survey data are available.

References

Aiken, Emily, Guadalupe Bedoya, Joshua Blumenstock, and Aidan Coville. 2022. "Program targeting with machine learning and mobile phone data: Evidence from an anti-poverty intervention in Afghanistan." arXiv Preprint arXiv:2206.11400.
Athey, Susan, Julie Tibshirani, and Stefan Wager. 2019. "Generalized Random Forests." The Annals of Statistics 47 (2): 1148–78.
Atkinson, Anthony B. 2019. Measuring Poverty Around the World. Princeton University Press.
Battese, George E, Rachel M Harter, and Wayne A Fuller. 1988. "An error-components model for prediction of county crop areas using survey and satellite data." Journal of the American Statistical Association 83 (401): 28–36.
Blumenstock, Joshua. 2016. "Fighting Poverty with Data." Science 353 (6301): 753–54.
Blumenstock, Joshua, Gabriel Cadamuro, and Robert On. 2015. "Predicting Poverty and Wealth from Mobile Phone Metadata." Science 350 (6264): 1073–76.
Carter, Grace M, and John E Rolph. 1974. "Empirical Bayes Methods Applied to Estimating Fire Alarm Probabilities." Journal of the American Statistical Association 69 (348): 880–85.
Chambers, Raymond, and Hukum Chandra. 2013. "A random effect block bootstrap for clustered data." Journal of Computational and Graphical Statistics 22 (2): 452–70.
Chen, Tianqi, and Carlos Guestrin. 2016. "XGBoost: A scalable tree boosting system." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94.
Chi, Guanghua, Han Fang, Sourav Chatterjee, and Joshua Blumenstock. 2022. "Microestimates of Wealth for All Low- and Middle-Income Countries." Proceedings of the National Academy of Sciences 119 (3): e2113658119.
Donaldson, Dave, and Adam Storeygard. 2016. "The view from above: Applications of satellite data in economics." Journal of Economic Perspectives 30 (4): 171–98.
Efron, Bradley. 2020. "Prediction, Estimation, and Attribution." International Statistical Review 88: S28–59.
Efron, Bradley, and Carl Morris. 1973. "Stein's estimation rule and its competitors – an empirical Bayes approach." Journal of the American Statistical Association 68 (341): 117–30.
Elbers, Chris, Tomoki Fujii, Peter Lanjouw, Berk Özler, and Wesley Yin. 2007. "Poverty alleviation through geographic targeting: How much does disaggregation help?" Journal of Development Economics 83 (1): 198–213.
Elbers, Chris, Jean O Lanjouw, and Peter Lanjouw. 2003. "Micro-Level Estimation of Poverty and Inequality." Econometrica 71 (1): 355–64.
Engstrom, Ryan, Jonathan Hersh, and David Newhouse. 2022. "Poverty from space: Using high resolution satellite imagery for estimating economic well-being." The World Bank Economic Review 36 (2): 382–412.
Fay, Robert E, and Roger A Herriot. 1979. "Estimates of income for small places: an application of James-Stein procedures to census data." Journal of the American Statistical Association 74 (366a): 269–77.
Friedberg, Rina, Julie Tibshirani, Susan Athey, and Stefan Wager. 2020. "Local Linear Forests." Journal of Computational and Graphical Statistics 30 (2): 503–17.
Friedman, Jerome. 2001.
"Greedy Function Approximation: A Gradient Boosting Machine." The Annals of Statistics 29 (5): 1189–1232.
Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33 (1): 1.
Fujii, Tomoki, and Roy van der Weide. 2020. "Is Predicted Data a Viable Alternative to Real Data?" The World Bank Economic Review 34 (2): 485–508.
González-Manteiga, Wenceslao, Maria J Lombardía, Isabel Molina, Domingo Morales, and Laureano Santamaría. 2008. "Bootstrap Mean Squared Error of a Small-Area EBLUP." Journal of Statistical Computation and Simulation 78 (5): 443–62.
Hastie, Trevor, and Daryl Pregibon. 1990. "Shrinking Trees." AT&T Bell Laboratories Technical Report.
Henderson, J Vernon, Adam Storeygard, and David N Weil. 2012. "Measuring Economic Growth from Outer Space." American Economic Review 102 (2): 994–1028.
Jean, Neal, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. 2016. "Combining satellite imagery and machine learning to predict poverty." Science 353 (6301): 790–94.
Jiang, Jiming, and Partha Lahiri. 2006. "Mixed Model Prediction and Small Area Estimation." TEST 15 (1): 1–96.
Kilic, Talip, Umar Serajuddin, Hiroki Uematsu, and Nobuo Yoshida. 2017. "Costing Household Surveys for Monitoring Progress Toward Ending Extreme Poverty and Boosting Shared Prosperity." World Bank Policy Research Working Paper, no. 7951.
Krennmair, Patrick, and Timo Schmid. 2022. "Flexible domain prediction using mixed effects random forests." arXiv Preprint arXiv:2201.10933.
Kreutzmann, Ann-Kristin, Sören Pannier, Natalia Rojas-Perilla, Timo Schmid, Matthias Templ, and Nikos Tzavidis. 2019. "The R package emdi for estimating and mapping regionally disaggregated indicators." Journal of Statistical Software 91.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Kuhn, Max, Steve Weston, and Chris Keefer. 2022. "Cubist: Rule- and Instance-Based Regression Modeling." R package.
Masaki, Takaaki, David Newhouse, Ani Rudra Silwal, Adane Bedada, and Ryan Engstrom. 2022. "Small area estimation of non-monetary poverty with geospatial data." Statistical Journal of the IAOS 38 (3): 1035–51.
McBride, Linden, Christopher B Barrett, Christopher Browne, Leiqiu Hu, Yanyan Liu, David S Matteson, Ying Sun, and Jiaming Wen. 2022. "Predicting Poverty and Malnutrition for Targeting, Mapping, Monitoring, and Early Warning." Applied Economic Perspectives and Policy 44 (2): 879–92.
Merfeld, Joshua D, and Jonathan Morduch. 2022. "Poverty at Higher Frequency." Working Paper.
Merfeld, Joshua D, David Locke Newhouse, Michael Weber, and Partha Lahiri. 2022. "Combining Survey and Geospatial Data Can Significantly Improve Gender-Disaggregated Estimates of Labor Market Outcomes." World Bank Working Paper 10077.
Molina, Isabel, and JNK Rao. 2010. "Small Area Estimation of Poverty Indicators." Canadian Journal of Statistics 38 (3): 369–85.
Newhouse, David Locke, Joshua D Merfeld, Anusha Ramakrishnan, Tom Swartz, and Partha Lahiri. 2022. "Small Area Estimation of Monetary Poverty in Mexico Using Satellite Imagery and Machine Learning." World Bank Working Paper 10175.
Peterson, Ryan A, and Joseph E Cavanaugh. 2019. "Ordered Quantile Normalization: A Semiparametric Transformation Built for the Cross-Validation Era." Journal of Applied Statistics.
Quinlan, John R. 1992. "Learning with Continuous Classes." In 5th Australian Joint Conference on Artificial Intelligence, 92:343–48. World Scientific.
———. 2014. C4.5: Programs for Machine Learning. Elsevier.
Rao, John NK, and Isabel Molina. 2015. Small Area Estimation. John Wiley & Sons.
Ratledge, Nathan, Gabriel Cadamuro, Brandon De la Cuesta, Matthieu Stigler, and Marshall Burke. 2021. "Using Satellite Imagery and Machine Learning to Estimate the Livelihood Impact of Electricity Access." National Bureau of Economic Research.
Ravallion, Martin. 2015. The Economics of Poverty: History, Measurement, and Policy. Oxford University Press.
Tibshirani, Julie, Susan Athey, Rina Friedberg, Vitor Hadad, David Hirshberg, Luke Miner, Erik Sverdrup, Stefan Wager, and Marvin Wright. 2022. "grf: Generalized Random Forests." R package.
Tibshirani, Robert. 1996. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.
Tzavidis, Nikos, Li-Chun Zhang, Angela Luna, Timo Schmid, and Natalia Rojas-Perilla. 2018. "From Start to Finish: A Framework for the Production of Small Area Official Statistics." Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (4): 927–79.
Wang, Yong, and Ian H Witten. 1997. "Inducing Model Trees for Continuous Classes." In Proceedings of the Ninth European Conference on Machine Learning, 9:128–37. Citeseer.
Yeh, Christopher, Anthony Perez, Anne Driscoll, George Azzari, Zhongyi Tang, David Lobell, Stefano Ermon, and Marshall Burke. 2020. "Using publicly available satellite imagery and deep learning to understand economic well-being in Africa." Nature Communications 11 (1): 1–11.

Appendix A

Table A1: Geospatial features

Indicator                 Source
Population                WorldPop
Precipitation             TerraClimate
Temperature               TerraClimate
Nightlights               NOAA VIIRS
Land cover                EU Copernicus
Elevation                 Conservation Science Partners
NDVI                      MODIS
Pollution measures        EU Copernicus
Distance to key cities    Collected by authors

Figure A1: Deviations from truth and truth, out-of-sample areas only
[Five panels: A. Madagascar assets; B. Malawi assets; C. Malawi poverty; D. Mozambique assets; E. Sri Lanka assets. Each panel plots the squared deviation (first y-axis) and the kernel density of truth (second y-axis) against the true value (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: All figures are smoothed conditional means of the mean squared deviation across 100 independent samples (first y-axis) on truth (x-axis), with means restricted to only samples in which an area appears (in-sample areas). The kernel density estimate refers to the density of truth, which is on the x-axis.

Figure A2: Predicted-true plots across areas
[Five panels: A. Madagascar assets; B. Malawi assets; C. Malawi poverty; D. Mozambique assets; E. Sri Lanka assets. Each panel plots the average prediction (y-axis) against truth (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: All figures are smoothed conditional means of the estimated value (y-axis) on truth (x-axis).

Figure A3: Predicted-true plots across areas, scatter
[Five panels: A. Madagascar assets; B. Malawi assets; C. Malawi poverty; D. Mozambique assets; E. Sri Lanka assets. Each panel is a scatterplot of the estimated value (y-axis) against truth (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: In all figures, areas are ordered by true value, which is displayed on the x-axis. The y-axis presents the estimated value for each given area.
Figure A4: MSE and probability of appearing in the sample
[Five panels: A. Madagascar assets; B. Malawi assets; C. Malawi poverty; D. Mozambique assets; E. Sri Lanka assets. Each panel plots the log of the mean squared deviation (y-axis) against the log of the sum of inclusion probabilities (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: The y-axis presents the log of the mean squared deviation for each area.

Figure A5: Predicted-true plot, 2019 Malawi IHS
[Panel A: Assets; Panel B: Poverty. Each panel is a scatterplot of the estimated value (y-axis) against truth (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: In both figures, the true value is displayed on the x-axis and the estimated value on the y-axis.

Table A2: Accuracy statistics across simulations

                        EBP      cubist   XGBoost  BRF
Madagascar (assets)
  corr. (Pearson)       0.830    0.881    0.897    0.893
  corr. (Spearman)      0.750    0.808    0.824    0.836
  absolute dev.         0.259    0.209    0.231    0.230
  squared dev.          0.109    0.070    0.078    0.072
Malawi (assets)
  corr. (Pearson)       0.801    0.870    0.889    0.881
  corr. (Spearman)      0.810    0.863    0.879    0.882
  absolute dev.         0.289    0.230    0.215    0.237
  squared dev.          0.220    0.147    0.125    0.152
Malawi (poverty)
  corr. (Pearson)       0.834    0.902    0.915    0.923
  corr. (Spearman)      0.797    0.862    0.877    0.887
  absolute dev.         0.147    0.079    0.076    0.103
  squared dev.          0.036    0.015    0.013    0.018
Mozambique (assets)
  corr. (Pearson)       0.867    0.886    0.920    0.922
  corr. (Spearman)      0.758    0.786    0.816    0.841
  absolute dev.         0.204    0.174    0.168    0.168
  squared dev.          0.074    0.061    0.047    0.046
Sri Lanka (assets)
  corr. (Pearson)       0.867    0.895    0.912    0.884
  corr. (Spearman)      0.850    0.880    0.903    0.896
  absolute dev.         0.167    0.146    0.135    0.162
  squared dev.          0.053    0.042    0.036    0.058

Note: Measures of accuracy are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. EBP refers to small area estimation and BRF refers to local linear forests.

Table A3: Precision statistics across simulations

                        EBP      cubist   XGBoost  BRF
Madagascar (assets)
  CI width              1.147    1.224    0.761    1.147
  coverage              0.953    0.974    0.894    0.978
Malawi (assets)
  CI width              1.646    1.365    1.122    1.646
  coverage              0.925    0.925    0.915    0.936
Malawi (poverty)
  CI width              0.641    0.591    0.518    0.641
  coverage              0.780    0.931    0.928    0.934
Mozambique (assets)
  CI width              1.345    1.496    1.101    1.345
  coverage              0.957    0.983    0.979    0.987
Sri Lanka (assets)
  CI width              1.387    1.177    1.039    1.387
  coverage              0.964    0.980    0.973    0.971

Note: Measures of precision are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. EBP refers to small area estimation and BRF refers to local linear forests. We do not use actual standard errors when calculating coverage rates for Cubist, XGBoost, and BRF; instead, we use the appropriate percentiles of the bootstrapped distribution.
As such, we present the width of the 95-percent confidence interval instead of the standard errors.

Table A4: Accuracy statistics across simulations, LASSO variables

                        XGBoost  BRF
Madagascar (assets)
  corr. (Pearson)       0.892    NA
  corr. (Spearman)      0.808    NA
  absolute dev.         0.235    NA
  squared dev.          0.077    NA
Malawi (assets)
  corr. (Pearson)       0.881    NA
  corr. (Spearman)      0.862    NA
  absolute dev.         0.227    NA
  squared dev.          0.137    NA
Malawi (poverty)
  corr. (Pearson)       0.900    NA
  corr. (Spearman)      0.858    NA
  absolute dev.         0.101    NA
  squared dev.          0.019    NA
Mozambique (assets)
  corr. (Pearson)       0.900    NA
  corr. (Spearman)      0.800    NA
  absolute dev.         0.181    NA
  squared dev.          0.055    NA
Sri Lanka (assets)
  corr. (Pearson)       0.904    NA
  corr. (Spearman)      0.896    NA
  absolute dev.         0.139    NA
  squared dev.          0.039    NA

Table A5: Optimal hyperparameters from cross-validation

                    Madagascar  Malawi   Malawi    Mozambique  Sri Lanka
                                assets   poverty
cubist
  rules             3           3        3         3           3
  committees        3           5        5         1           5
  sample            0.4         0.4      0.4       0.4         0.4
  neighbors         0           0        0         0           0
XGB
  max depth         8           6        6         6           6
  eta               0.02        0.01     0.01      0.01        0.01
  col sample        0.6         0.6      0.4       0.6         0.6
  subsample         0.6         0.6      0.4       0.4         0.4
BRF
  sample fraction   0.5         0.5      0.5       0.5         0.5
  mtry              30          30       25        30          30
  min. node size    4           5        5         5           4
  honesty fraction  0.6         0.6      0.6       0.6         0.6
  alpha             0.05        0.025    0.025     0.05        0.05

Figure A6: Predicted-true plot, 2019 Malawi IHS
[Panel A: Assets; Panel B: Poverty. Each panel plots the estimated value (y-axis) against truth (x-axis) for EBP, Cubist, XGBoost, and BRF.]
Note: Both figures are smoothed conditional means of the estimated value (y-axis) on truth (x-axis).

Table A6: Accuracy statistics across simulations

                        orderNorm  logShift
Madagascar (assets)
  corr. (Pearson)       0.830      0.848
  corr. (Spearman)      0.750      0.771
  absolute dev.         0.259      0.242
  squared dev.          0.109      0.087
Malawi (assets)
  corr. (Pearson)       0.801      0.800
  corr. (Spearman)      0.810      0.811
  absolute dev.         0.289      0.304
  squared dev.          0.220      0.253
Malawi (poverty)
  corr. (Pearson)       0.834      0.834
  corr. (Spearman)      0.797      0.797
  absolute dev.         0.147      0.147
  squared dev.          0.036      0.036
Mozambique (assets)
  corr. (Pearson)       0.867      0.876
  corr. (Spearman)      0.758      0.763
  absolute dev.         0.204      0.195
  squared dev.          0.074      0.069
Sri Lanka (assets)
  corr. (Pearson)       0.867      0.854
  corr. (Spearman)      0.850      0.835
  absolute dev.         0.167      0.182
  squared dev.          0.053      0.059

Note: Measures of accuracy are averages across all 100 independent samples drawn from the respective country census. These are simple, unweighted averages across areas. orderNorm refers to an ordered quantile normalization and logShift refers to a log-shift transformation.

Appendix B - Imputing Poverty in the Malawi Census

We impute welfare into the 2018 Malawi census using the 2019 Integrated Household Survey (IHS). Both the census and the IHS include information on household assets and key household demographics, while the survey also has information on expenditures/consumption. We select the variables that are common to both datasets and then use LASSO to select the most predictive variables for the imputation procedure. The post-LASSO regression results are in Table B1. We predict welfare directly into the census using the results in Table B1. Figure B1 shows the resulting densities of welfare (expenditures) in the survey and the census. The census-imputed results show much less variation, as we would expect. The estimated poverty rate in the IHS is 0.507. Given the difference in variation between the two distributions, we set the poverty threshold at the median of imputed expenditures in the census, leading to a poverty rate of 0.500, almost equal to that in the survey.
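A hedged sketch of this selection-and-imputation workflow in R, using glmnet for the LASSO step: the survey and census data frames, the welfare variable name, and the post-LASSO refit are illustrative assumptions rather than the exact code behind Table B1.

```r
library(glmnet)

# Predictors common to the 2019 IHS and the 2018 census (placeholder names)
xvars <- setdiff(intersect(names(survey), names(census)), "log_expenditure")
X <- as.matrix(survey[, xvars])
y <- survey$log_expenditure

# LASSO with cross-validated penalty to select the most predictive variables
cv <- cv.glmnet(X, y, alpha = 1)
cf <- coef(cv, s = "lambda.min")
sel <- setdiff(rownames(cf)[as.numeric(cf) != 0], "(Intercept)")

# Post-LASSO: refit OLS on the selected variables, then predict into the census
post <- lm(reformulate(sel, response = "log_expenditure"), data = survey)
census$log_expenditure_hat <- predict(post, newdata = census)

# Poverty line set at the census median of imputed welfare, as in the text,
# which yields an imputed poverty rate of roughly 0.5 by construction
poverty_line <- median(census$log_expenditure_hat)
census$poor <- as.integer(census$log_expenditure_hat < poverty_line)
```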
Figure B1: Distribution of welfare in the 2019 IHS and the 2018 census
[Kernel density estimates of log(expenditures), comparing the 2019 IHS and the 2018 census.]

Table B1: Welfare imputation results

Variable                    Model 1
headAge                     −0.0002 (0.0004)
headMale                    −0.069 (0.012)***
headEducPrimary              0.103 (0.013)***
headEducSecondary            0.136 (0.018)***
headEducUniversity           0.217 (0.038)***
hhOwned                     −0.137 (0.014)***
hhRoofGrass                 −0.095 (0.097)
hhRoofIron                   0.004 (0.097)
hhRoofCement                 0.719 (0.553)
hhWallMud                   −0.050 (0.059)
hhWallConcrete              −0.063 (0.063)
hhWallBricksBurnt           −0.069 (0.057)
hhWallBricksUnburnt         −0.051 (0.056)
hhFloorEarth                 0.166 (0.223)
hhFloorCement                0.308 (0.222)
hhFloorWood                  0.843 (0.444)*
hhFloorTile                  0.608 (0.236)**
hhDwellingTypePerm           0.105 (0.028)***
hhDwellingTypeSemiperm       0.077 (0.020)***
hhDwellingRooms             −0.065 (0.005)***
hhAssetsCell                 0.125 (0.012)***
hhAssetsRadio                0.063 (0.012)***
hhAssetsTV                   0.078 (0.025)***
hhAssetsComputer             0.364 (0.039)***
hhAssetsFridge               0.197 (0.031)***
hhAssetsBike                −0.010 (0.012)
hhAssetsTable                0.050 (0.013)***
hhAssetsBed                  0.219 (0.014)***
hhAssetsIron                 0.122 (0.017)***
hhAssetsSolarPanel           0.028 (0.014)**
hhAssetsCDDVD                0.070 (0.026)***
hhAssetsCar                  0.375 (0.042)***
R-squared                    0.407
Observations                 11,425
* p<0.1, ** p<0.05, *** p<0.01

Appendix C - Methods

C1 Cubist

The Cubist algorithm proceeds as follows:

1. Form a decision tree by conducting an exhaustive search over the predictor space and training set samples. Splits are determined by minimizing the standard deviation of the dependent variable within groups. In mathematical terms, splits are chosen recursively to maximize the reduction in a measure of error. Defining S as the entire set of data and S_1, ..., S_P as the P subsets of the data after splitting, the algorithm maximizes

$$\text{reduction} = SD(S) - \sum_{i=1}^{P} \frac{n_i}{n}\, SD(S_i), \qquad (10)$$

where SD is the standard deviation, n is the number of sample observations considered, and n_i is the number of sample observations in partition i. In other words, the algorithm identifies the set of splits that maximizes the reduction in the weighted average, across child nodes, of the standard deviations within the nodes. Splitting ceases, and the node becomes a leaf, when the maximum residual falls below a minimum tolerance level or when the number of training cases falls below a minimum threshold.14

14 The minimum tolerance level is set at five percent of the standard deviation of the dependent variable in the full training data (Wang and Witten 1997). The minimum number of observations is set to 10 percent of the sample if the sample has fewer than 2,000 observations, and to 20 otherwise.

2. Estimate and simplify linear models at each node. At each node of the tree, a linear model is estimated using only the attributes used to split the sub-tree above the node. In other words, the model for the first split from the top will be a bivariate regression with a single predictor. At subsequent nodes further down the tree, the set of candidate variables expands to include all variables used for splitting up to that point. Not all candidate variables are actually used in the models. In particular, the resulting linear models are simplified to avoid overfitting by greedily dropping variables to minimize the "adjusted error rate". The adjusted error rate is the mean absolute error multiplied by a term that penalizes models with many variables, defined as:
$$\text{adjusted error rate} = \frac{n^* + p}{n^* - p} \cdot \frac{1}{n^*} \sum_{i=1}^{n^*} \left| y_i - \hat{y}_i(X_i) \right|, \qquad (11)$$

where n^* is the number of observations in the training data at the node used to build the model; p is the number of parameters, equal to the number of independent variables plus one; and ŷ_i(X_i) is the predicted value from the model given a set of predictor variables X_i. The variable that leads to the largest reduction in the adjusted error rate is removed, sequentially, until the adjusted error rate increases when removing any of the remaining predictors. Removing attributes inevitably increases mean absolute error but also reduces the multiplication factor (n^* + p)/(n^* − p), which may reduce the adjusted error rate. Finally, the procedure performs an outlier check, defining outliers as cases where residuals are more than five times the average absolute value of the model residuals for that node. At each node, before finalizing the model, outliers are eliminated from the estimation sample and the model is re-estimated and re-simplified.

3. Prune the rules. Each leaf of the tree is translated into a set of "rules" based on the sequence of splitting conditions that lead to the leaf. For example, a rule based on a leaf with two branches above it would consist of three conditions, such as X_1 > 10, X_2 < 2, and X_3 = 1. These rules are then "pruned", a process that eliminates conditions that are harmful or not useful for predicting the full set of training data. To measure prediction accuracy, the algorithm uses the adjusted error rate defined in Equation (11), applied to the full set of training data. As a first step, the algorithm calculates smoothed predictions across the conditions of a rule, which correspond to particular nodes along the tree that lead to a leaf, using the following formula (Hastie and Pregibon 1990):

$$\hat{Y}_{par} = a\,\hat{Y}_{kid} + (1 - a)\,\hat{Y}_{par}, \qquad (12)$$

where Ŷ_par is the prediction of the model estimated at the parent node, Ŷ_kid is the prediction of the model estimated at the current node, and a is equal to

$$a = \frac{\operatorname{Var}(e_{par}) - \operatorname{Cov}(e_{kid}, e_{par})}{\operatorname{Var}(e_{par} - e_{kid})}, \qquad (13)$$

where e_par are the model residuals from the parent node and e_kid are the model residuals from the current node, for training cases under consideration at the current node. All rule pruning is based on these smoothed predictions. The second step is to eliminate conditions (nodes) that increase the adjusted error rate, defined as in Equation (11) except taken over the full training sample. The program identifies the condition (node) that, when removed, leads to the largest decline in the adjusted error rate. If removing that condition does not increase the adjusted error rate, it is removed. This proceeds sequentially until no condition can be removed without increasing the adjusted error rate. The third step repeats the second, except that conditions are removed as long as they do not raise the adjusted error rate by more than 0.5 percent. This additional step is implemented to further simplify the tree structure. Finally, if necessary, conditions are further pruned until the number of remaining rules equals the maximum number of rules specified by the user.

4. Generate smoothed models for each rule. For each rule, a model is created by combining the coefficients at the leaf with those of all the models above it on the path to the initial split, similar to Equation (12). The model coefficients for each rule (leaf) are averaged according to the following formula:

$$\hat{\beta}_{par} = a\,\hat{\beta}_{kid} + (1 - a)\,\hat{\beta}_{par}. \qquad (14)$$

This procedure smooths the model coefficients by collapsing the sequence of linear models along each path into a single, smoothed representation of the models.

5. Adjust the final model so that all continuous cutpoints match values present in the data, by changing each cutpoint to the closest value observed in the data.
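The following R sketch illustrates the two error criteria above. The function names are ours, and the inclusion of the 1/n* averaging term in the adjusted error rate follows the "mean absolute error multiplied by a penalty term" description in the text rather than the Cubist source code, so this should be read as an illustration.

```r
# Standard-deviation reduction from splitting the outcome vector y into the
# groups labeled by idx (Equation 10)
sd_reduction <- function(y, idx) {
  groups <- split(y, idx)
  n <- length(y)
  weighted_sd <- sum(vapply(groups,
                            function(g) (length(g) / n) * sd(g),
                            numeric(1)))
  sd(y) - weighted_sd  # larger values indicate a better split
}

# Adjusted error rate (Equation 11): mean absolute error inflated by a
# factor that grows with the number of parameters p
adjusted_error_rate <- function(y, yhat, p) {
  n_star <- length(y)
  ((n_star + p) / (n_star - p)) * mean(abs(y - yhat))
}

# Example: a candidate binary split, and the penalty's effect on model size
set.seed(1)
y <- rnorm(200); x <- runif(200)
sd_reduction(y, ifelse(x > 0.5, "right", "left"))
yhat <- y + rnorm(200, sd = 0.2)
adjusted_error_rate(y, yhat, p = 2)  # smaller penalty factor
adjusted_error_rate(y, yhat, p = 5)  # same MAE, larger penalty factor
```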
The Cubist software also allows an option to estimate "committees", which are sets of Cubist models that successively correct errors in the previous estimates, similar to boosting. In other words, each committee produces a series of rules and associated models that iteratively predict the errors from the previous committee's prediction. The package uses cross-validation to determine the optimal number of rules and committees.

The Cubist method, like the other machine learning methods, is associated with several hyperparameters. We opt to tune the hyperparameters just once, using a random survey sample drawn in exactly the same way as described above, in order to reduce total computation time. We then keep this set of hyperparameters constant through all survey draws. In addition, we tune the hyperparameters via cross-validation, hand-coding the folds so that all subareas in a given area fall in the same fold, to help preserve the hierarchical nature of the data and prevent information leakage across folds. Table A5 in the appendix shows the optimal hyperparameters for Cubist that we use in all sample iterations. There are three rules in each country, and the number of committees varies from 1 in Mozambique to 5 in Sri Lanka and Malawi; we use default values for any unlisted hyperparameters.

C2 XGBoost

Extreme gradient boosting estimates a function that predicts the dependent variable y_i as a function of the set of independent variables x_i. This function is defined as the sum of a series of individual decision tree functions. In mathematical terms, for a single observation i with a set of predictors x_i,

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \qquad (15)$$

where K is the number of trees estimated in the model and f_k is a decision tree function mapping x_i to ŷ_i in the functional space F, which is the set of all possible decision trees. f_1, ..., f_K are defined as the minimizers of the following objective function of φ(x_i):

$$\text{obj}(\phi) = \sum_{i=1}^{n} l\left(y_i, \phi(x_i)\right) + \sum_{k=1}^{K} \omega(f_k), \qquad (16)$$

where l is a differentiable convex loss function that measures the distance between the predicted value and the training value, and ω(f_k) is a regularization term that penalizes model complexity, defined below. Because the algorithm optimizes over a set of feasible functions f_k, instead of parameters, it is not possible to use standard optimization tools. Instead, the algorithm proceeds by estimating each individual tree function f_k(x_i) in a "greedy" manner (Friedman 2001). Specifically, the algorithm identifies a tree f_t(x_i) at step t to minimize the following objective function:

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \omega(f_t). \qquad (17)$$

This sequentially adds the f_t that provides the largest improvement in performance according to the objective function, Equation (16), given the previous round's prediction ŷ_i^{(t-1)}. ŷ_i^{(0)} is set to zero, so the first iteration generates the tree f_1(x_i) that minimizes $\sum_{i=1}^{n} l(y_i, f_1(x_i)) + \omega(f_1)$.
Because the mean value of the asset index is continuous when aggregated to the subarea level – the level at which we estimate welfare – we use mean squared error as the loss function:

$$l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) = \left(\hat{y}_i^{(t-1)} + f_t(x_i) - y_i\right)^2. \qquad (18)$$

The resulting objective function at step t, after removing constants, becomes

$$\text{obj}^{(t)} = \sum_{i=1}^{n} \left[ 2\left(\hat{y}_i^{(t-1)} - y_i\right) f_t(x_i) + f_t(x_i)^2 \right] + \omega(f_t), \qquad (19)$$

which the algorithm minimizes at each step by choosing f_t(x_i). Regularization prevents overfitting and, in the general case, is defined as

$$\omega(f_k) = \gamma T + \frac{\lambda}{2} \sum_{j=1}^{T} w_j^2, \qquad (20)$$

where T represents the number of leaves on tree f_k and w_j is the score assigned to leaf j. λ and γ are tuning parameters controlling the extent of regularization. We follow the defaults and set λ = 1 and γ = 0 for estimation. Table A5 in the appendix shows the optimal hyperparameters for XGBoost that we use in all sample iterations; we use default values for any unlisted hyperparameters.

C3 Boosted Regression Forests (BRF)

To grow a tree, the algorithm first takes a random sample of the data. The share of the data selected is a parameter determined by cross-validation. The algorithm then begins the tree at the root node with this random sample and recursively splits the data to create child nodes. At each split, the algorithm randomly selects a subset of the predictor variables as splitting candidates. For each splitting candidate, the algorithm considers all the possible values the variable takes on in the data. For all values taken on by all the splitting candidates, the algorithm first checks whether the split meets three basic eligibility criteria:

1. The resulting children each contain a number of observations that exceeds a minimum absolute node size parameter.
2. Each child contains more than a minimum threshold fraction of the parent's observations, to prevent splits that are too imbalanced.
3. The split improves heterogeneity in outcomes as defined in Equation (21) below.

The minimum node size and balance thresholds are parameters estimated through cross-validation, as described below. Among the remaining candidate splits, the algorithm selects the threshold that maximizes heterogeneity in the average outcome across the child nodes. All observations with splitting-variable values below the threshold are assigned to child 1, and all observations with values above the threshold are assigned to child 2. For boosted regression forests, the heterogeneity of a split, denoted H, is defined as:

$$H = \frac{N_{C1}\, N_{C2}}{N_P^2} \left(\bar{y}_{C1} - \bar{y}_{C2}\right)^2 - IP\left(\frac{1}{N_{C1}} + \frac{1}{N_{C2}}\right), \qquad (21)$$

where N_{C1}, N_{C2}, and N_P are the number of observations in child 1, child 2, and the parent node, respectively, and ȳ_{C1} and ȳ_{C2} are the average values of the predicted outcome in the children. IP is an imbalance penalty parameter, selected through cross-validation, that favors more balanced splits. To simplify the process, BRF automates the tuning of hyperparameters using cross-validation. However, for computational reasons, the hyperparameters are estimated once, in the first boosting step.
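As a rough illustration, the R sketch below fits the two boosted estimators with the Table A5 hyperparameters (here, the Malawi assets column) using the xgboost and grf packages. The data objects X and y and the number of boosting rounds are placeholders, and unlisted settings are left at package defaults, as in the text.

```r
library(xgboost)
library(grf)

# X: numeric matrix of geospatial features at the estimation level; y: outcome
dtrain <- xgb.DMatrix(data = X, label = y)

# XGBoost with squared-error loss and the regularization defaults
# lambda = 1, gamma = 0 noted in the text; nrounds is illustrative
xgb_fit <- xgb.train(
  params = list(objective = "reg:squarederror",
                eta = 0.01, max_depth = 6,
                colsample_bytree = 0.6, subsample = 0.6,
                lambda = 1, gamma = 0),
  data = dtrain, nrounds = 500
)

# Boosted regression forest with the corresponding Table A5 settings;
# grf tunes any remaining parameters internally
brf_fit <- boosted_regression_forest(
  X, y,
  sample.fraction = 0.5, mtry = 30, min.node.size = 5,
  honesty.fraction = 0.6, alpha = 0.025
)

xgb_pred <- predict(xgb_fit, newdata = dtrain)
brf_pred <- predict(brf_fit, X)$predictions
```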