Policy Research Working Paper 10429

Poverty Mapping in the Age of Machine Learning

Paul Corral,* Heath Henderson,† and Sandra Segovia‡§

Poverty and Equity Global Practice
May 2023

Abstract

Recent years have witnessed considerable methodological advances in poverty mapping, much of which has focused on the application of modern machine-learning approaches to remotely sensed data. Poverty maps produced with these methods generally share a common validation procedure, which assesses model performance by comparing subnational machine-learning-based poverty estimates with survey-based, direct estimates. Although unbiased, survey-based estimates at a granular level can be imprecise measures of true poverty rates, meaning that it is unclear whether the validation procedures used in machine-learning approaches are informative of actual model performance. This paper examines the credibility of existing approaches to model validation by constructing a pseudo-census from the Mexican Intercensal Survey of 2015, which is used to conduct several design-based simulation experiments. The findings show that the validation procedure often used for machine-learning approaches can be misleading in terms of model assessment since it yields incorrect information for choosing what may be the best set of estimates across different methods and scenarios. Using alternative validation methods, the paper shows that machine-learning-based estimates can rival traditional, more data-intensive poverty mapping approaches. Further, the closest approximation to existing machine-learning approaches, using publicly available geo-referenced data, performs poorly when evaluated against "true" poverty rates and fails to outperform traditional poverty mapping methods in targeting simulations.

Key words: Small area estimation, Poverty mapping, Machine learning, Satellite imagery
JEL classification: C13, C55, C87, C15

* The World Bank Group - Poverty and Equity Global Practice (pcorralrodas@worldbank.org)
† Department of Economics and Finance, Drake University, United States of America
‡ The World Bank Group
§ The authors acknowledge financial support from the World Bank. We thank Roy van der Weide and Isabel Molina for comments.
Additionally, we thank Benu Bidani, Carlos Rodriguez-Castelan, and Johan Mistiaen for providing support and space to pursue this work. Finally, we thank the Global Solutions Group on Data for Policy. Any error or omission is the authors' responsibility alone.

1 Introduction

Poverty maps provide granular estimates of poverty at the sub-national level in order to deepen the understanding of poverty in a given country, better inform the targeting of resources, and support the design of interventions tailored to local needs (Bedi et al., 2007; Elbers et al., 2007). Household surveys provide estimates that are sufficiently reliable for large areas in a given country but often lack the precision and coverage needed to properly inform targeting interventions at a granular geographic level. Thus, it is necessary to rely on small area estimation.

Small area estimation is a branch of statistics focused on obtaining estimates of higher quality than those obtained directly from the household survey. These techniques often combine data from household surveys with auxiliary information from censuses, registers, or other sources to produce estimates of higher quality than what is possible from survey data alone for areas or groups with small sample sizes. Since there is no such thing as a free lunch, achieving these gains in quality requires model assumptions that must be thoroughly checked.

The literature on small area estimation is rich, and several variations on this basic procedure have been proposed. Unit-level models conduct estimation and prediction at the household level, assuming a linear relationship between the welfare measure and covariates (Hentschel et al., 1998; Elbers et al., 2003; Molina and Rao, 2010). Unit-level models are not well-suited to situations where the survey and census data correspond to different years, which is often the case in developing countries where censuses are conducted infrequently. Area-level models represent a feasible alternative that similarly relies on linear functional forms, but conduct estimation and prediction using only aggregate data for the geographical entities of interest (Fay and Herriot, 1979; Torabi and Rao, 2014).[1] Unit-context models represent another alternative and are characterized by an estimation stage wherein household-level measures are modeled exclusively as a linear function of area-level characteristics (Nguyen, 2012; Lange et al., 2018; Masaki et al., 2020).

Recent developments in poverty mapping have focused on the application of machine learning to remotely-sensed data, largely in response to the issue of out-of-date census information. These approaches also use survey-derived welfare measures but are distinguished by a first stage where a machine-learning model is fit to remotely-sensed covariates (e.g., data derived from satellite imagery or call detail records) rather than census-based covariates. For example, Chi et al. (2022) matched data from several remotely-sensed data sources to survey-based measures of "village" wealth for 56 countries, and then fit a prediction model using gradient boosting machines.[2] They then applied the model to the populated surface of all 135 low- and middle-income countries to develop granular poverty maps for each country.
Similar approaches have been used for individual countries, including Rwanda (Blumenstock et al., 2015), Senegal (Pokhriyal and Jacques, 2017), Bangladesh (Steele et al., 2017), and Belize (Hersh et al., 2021), among others.[3]

[1] Despite their desirable features, the gains in precision offered by area-level models are often quite limited (see Molina and Morales (2009)).
[2] Their remotely-sensed data include high-resolution satellite imagery, mobile phone data, topographic maps, and connectivity data from Facebook.

While often subsumed under the term "machine learning," this modern approach to poverty mapping actually consists of several practices that differ from more traditional approaches. Most obviously, the modern approach substitutes non-parametric statistical methods for the parametric approaches traditionally used for poverty mapping. In addition, the modern approach relies heavily on remotely-sensed covariates rather than covariates derived from census data. Finally, poverty maps produced with machine-learning methods generally share a common validation procedure. The key feature of this procedure is that model performance is assessed by calculating the R² from a regression of the observed poverty measures on the estimated poverty measures for the same geographical units.[4] The poverty measures used in this procedure are most often direct, sample-based estimates rather than the true poverty measures for the regions of interest. While direct estimates are unbiased, they can be imprecise estimates of true poverty measures, meaning that it is unclear whether the validation procedure is informative of actual model performance (Corral et al., 2021b).

In this paper, we provide a more rigorous assessment of the performance of the modern approach to poverty mapping. Our approach consists of constructing a pseudo-census (hereafter "census") that we use to conduct design-based simulation experiments. The census provides the true values we wish to estimate. Our experiments entail repeatedly drawing survey samples from the census, where each sample produces direct poverty estimates for the sampled geographic areas. For each direct estimate, the presence of the census means that we observe the corresponding true poverty measure, and we use these true poverty measures to gain insight not only into the credibility of model validation based on direct estimates, but also into the performance of machine-learning methods when validated against the true poverty measures. Given that population censuses rarely collect detailed data on income or expenditures, we construct our census from a large-scale household survey conducted in Mexico: the Mexican Intercensal Survey of 2015.

Prior to presenting our simulation results, we discuss three ways in which the R² can be misleading in terms of model assessment. Specifically, we show analytically that the R² based on direct estimates is biased downward, that it is insensitive to differences in location and scale between the poverty estimates and the reference measures, and that it is context-dependent in the sense that it is influenced by the variance of the true poverty rates. We argue that the mean squared error is a better measure for understanding the strength of the association between the poverty estimates and reference measures. In addition, we argue that the ultimate concern is to understand how different methodological choices affect poverty by influencing the efficiency of targeting.
We there- fore propose a targeting experiment through which we examine the relative ability of alternative 3 See Jean et al. (2016), Yeh et al. (2020), Lee and Braithwaite (2020), and Aiken et al. (2022) for examples of other recent applications. 4 For examples of the above procedure, see Jean et al. (2016), Pokhriyal and Jacques (2017), Steele et al. (2017), Lee and Braithwaite (2020), Yeh et al. (2020), or Chi et al. (2022). Note that a few studies use a similar procedure, but fit the machine-learning model to household-level data and then aggregate the model’s predictions to the geographic unit of interest (Blumenstock et al., 2015; Hersh et al., 2021; Aiken et al., 2022). Others rely on reporting correlation coefficients between survey based estimates and model predictions, which is similar (see Smythe and Blumenstock 2022). 3 mapping methods to alleviate poverty in the context of the Mexican data. The results of our simulations can be summarized as follows. First, we find that the magnitude of the downward bias of the R2 based on direct estimates is large, with the true R2 being around 35 to 50 percent higher depending on the model considered. We further find that this bias has important implications for model selection in that the direct R2 incorrectly identifies the appro- priate level of estimation in the vast majority of our simulations. Second, when assessing model performance based on the true poverty rates using the mean squared error, we find that our closest approximation to the standard machine-learning implementation performs poorly relative to several benchmark models. We find that these performance issues are largely due to the limited predic- tive power of remotely-sensed covariates relative to census-based covariates. Finally, our targeting simulations show that none of our machine-learning implementations outperform more traditional poverty mapping methods that are feasible with the same data. Our paper builds on the work of Corral et al. (2021a) and Corral et al. (2022), who similarly used the Mexican Intercensal Survey to examine the performance of several poverty mapping methods. Corral et al. (2021a) focused exclusively on traditional poverty mapping approaches and did not consider the performance of machine-learning methods. While Corral et al. (2022) consider the performance of machine-learning approaches, their implementation only uses census-based covari- ates and thus does not examine the performance of the standard implementation that relies on remotely-sensed covariates. Importantly, we are not the first paper to evaluate the performance of machine-learning methods relative to census data. While Yeh et al. (2020) and Chi et al. (2022) validate their models relative to censuses, both papers use the R2 as a performance metric and focus on wealth estimates rather than poverty. Finally, van der Weide et al. (2022) examine how well machine-learning methods using remotely-sensed data can reproduce poverty maps estimated using census data. They do not, however, compare their estimates to a ground truth. This paper then represents the first attempt to rigorously assess the performance of modern poverty mapping methods relative to true poverty rates. In what follows, Section 2 discusses the data we use for our experiments and Section 3 discusses the various methods we examine, including both machine-learning methods and some traditional poverty mapping methods that we use as performance benchmarks. 
Section 4 then considers the issue of model validation, where we critically assess the R² as a validation metric and then propose some alternative metrics. Finally, Section 5 presents the results of our experiments and Section 6 provides some concluding remarks.

2 Data

The Mexican Intercensal Survey was carried out by the Mexican National Institute of Statistics and Geography (INEGI). The sample consists of 5.9 million households and is representative at the national, state (32 states), and municipality (2,457 municipalities) levels. It is also representative for localities with populations of 50,000 or more inhabitants. The administered questionnaire gathered information on household income, geographic location, household demographics, dwelling characteristics, and economic activities, among other topics. The size of the dataset is especially important, as it allows us to draw repeated samples that are sufficiently large and diverse. In addition, the fact that the survey gathered detailed income information on all households allows us to reliably calculate the required poverty measures that serve as the basis for our simulation experiments.[5]

Prior to sampling from the Intercensal Survey, we make three modifications to the data. First, as a large number of households reported earning no income, we randomly remove 90 percent of these households.[6] Second, to ensure that all municipalities are sufficiently large, all municipalities with fewer than 500 households are removed. Finally, to ensure that all primary sampling units (PSUs) are also sufficiently large for sampling purposes, we combine several neighboring PSUs such that all include at least 300 households. Our final census dataset consists of 3.9 million households in 1,865 municipalities and 16,297 PSUs.

The resulting census is used to draw 500 survey samples that serve as the basis for our simulation experiments (Tzavidis et al., 2018). In what follows, we describe our approach to constructing these samples. Our sampling procedure is intended to reflect standard design elements used in many face-to-face household surveys conducted in the developing world (Grosh and Muñoz, 1996). Mexico's 32 states comprise the intended domains of the sample, and the indicator of interest is household income per capita. For each sample, we seek to achieve a relative standard error (RSE) of 7.5 percent for average household income in each state.[7] We use a two-stage clustered design wherein PSUs or clusters are selected within each domain and then a sample of 10 households is selected within each cluster. Our decision to select 10 households within each cluster is consistent with the design of other large-scale household surveys conducted in developing countries. Under simple random sampling, the minimum sample size required for each state is:

$$n_s = \left( \frac{\sigma_s}{\bar{w}_s \times RSE} \right)^2 \qquad (1)$$

where $\bar{w}_s$ and $\sigma_s$ represent average household income per capita and the standard deviation of per capita income for state s, respectively.

The minimum sample size under simple random sampling must be adjusted to account for the clustered design. The design effect due to clustering is accounted for by estimating the intra-cluster correlation of per capita income within each state. The correlation estimates can then be used to adjust the above sample sizes for the clustered design as follows:

$$n_s \times \left[ 1 + \rho_s (n_{sc} - 1) \right] \qquad (2)$$

[5] Income is defined as money received from work performed during the course of the reference period by individuals aged 12 or older.
where $n_{sc}$ denotes the number of households selected in cluster c within state s (10 in this case) and $\rho_s$ is the intra-cluster correlation of per capita income. Given the above, we can calculate the number of clusters needed to achieve the minimum sample size for each state and then multiply this number by 10 to find the final household sample size.

[6] This is done to ensure that some missing values are present in the data used, but not as many as in the original data.
[7] The desired precision of 7.5 percent is somewhat arbitrary, but corresponds to precision targets in similar surveys and yields samples of reasonable sizes.

Taking these sample size requirements as given, each of our samples is then drawn in accordance with the two-stage design. Specifically, the clusters within each state are selected with probability proportional to size (without replacement), where the measure of size is the number of households within each cluster. The households within each cluster are then selected via simple random sampling. According to this design, the inclusion probability for a given household is approximately:

$$\frac{N_s \tilde{n}_{sc}}{\tilde{n}_s} \times \frac{n_{sc}}{\tilde{n}_{sc}} \qquad (3)$$

where $\tilde{n}_{sc}$ is the total number of households in cluster c within state s, $N_s$ is the number of clusters selected in state s, and $\tilde{n}_s$ is the total number of households in state s.[8] The sample size across the 500 samples is roughly 23,500 households. Under this sampling scenario, not all municipalities are included, and the number of municipalities included varies from sample to sample, ranging between 951 and 1,020 municipalities. The median municipality included in a given sample is represented by a single cluster, and thus its sample size is 10 households.

[8] The equation used to calculate inclusion probabilities assumes sampling with replacement, but is used here as an approximation of inclusion probabilities under proportional selection without replacement. This should provide a reasonable approximation in this case since there are a relatively large number of clusters present in the frame. The design weight for each household is simply the inverse of the inclusion probability. In a typical survey, the design weights would be further adjusted for nonresponse and calibrated to known population characteristics. However, since the sampling is only a simulation exercise, there is no nonresponse and thus no nonresponse adjustment is required. Calibration or post-stratification could be performed, but was not implemented to simplify the process.

In addition to the Intercensal Survey, INEGI has also released geo-referenced data that are similar to the remotely-sensed data often used in machine-learning applications. These geographic variables were calculated by INEGI for the year 2015 (the same year as the Intercensal Survey) and are only available at the municipality level. The geo-referenced data include 105 different indicators. Specifically, the dataset includes information on 21 different dimensions, and each dimension is associated with five different summary measures. For example, we summarize the atmospherically resistant vegetation index for each municipality using the minimum, maximum, mean, sum, and standard deviation of the index. The same procedure is applied to all dimensions.[9] We then have two alternative sets of covariates that we use in our simulations: the geo-referenced data and the standard set of sociodemographic characteristics from the census.

[9] The 21 dimensions are as follows: enhanced vegetation index, normalized difference vegetation index, normalized difference built-up index, built-up index, simple ratio, atmospherically resistant vegetation index, urban index, index of biological integrity, normalized difference water index, modified normalized difference water index, new built-up index, band ratio for built-up area, normalized built-up area index, built-up area extraction index, normalized difference snow index, visible atmospherically resistant index, soil-adjusted vegetation index, optimized soil-adjusted vegetation index, normalized difference moisture index, digital altitude model, and digital slope model.
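To make the sampling design described above concrete, the following is a minimal sketch of the two-stage draw: Eqs. (1)-(2) determine the number of clusters per state, clusters are selected with probability proportional to size, and 10 households are drawn by simple random sampling within each selected cluster. The DataFrame `census` with columns `state`, `cluster`, and `income_pc`, and the dictionary `rho_by_state` of intra-cluster correlations, are hypothetical names of our own; this is an illustration of the design, not the authors' code.

```python
import numpy as np
import pandas as pd

RSE_TARGET = 0.075   # relative standard error target per state
HH_PER_CLUSTER = 10  # households selected per cluster

def clusters_needed(state_df: pd.DataFrame, rho: float) -> int:
    """Eqs. (1)-(2): SRS sample size inflated by the design effect, in clusters."""
    w_bar, sigma = state_df["income_pc"].mean(), state_df["income_pc"].std()
    n_srs = (sigma / (w_bar * RSE_TARGET)) ** 2
    n_clustered = n_srs * (1 + rho * (HH_PER_CLUSTER - 1))
    return int(np.ceil(n_clustered / HH_PER_CLUSTER))

def draw_sample(census: pd.DataFrame, rho_by_state: dict, seed: int) -> pd.DataFrame:
    """Two-stage draw: PPS selection of clusters, then SRS of 10 households each."""
    rng = np.random.default_rng(seed)
    parts = []
    for state, sdf in census.groupby("state"):
        sizes = sdf.groupby("cluster").size()   # households per cluster
        pps = (sizes / sizes.sum()).values      # selection prob. proportional to size
        n_clus = min(clusters_needed(sdf, rho_by_state[state]), len(sizes))
        # note: as in the paper's footnote, this treats PPS without replacement
        # only approximately
        chosen = rng.choice(sizes.index, size=n_clus, replace=False, p=pps)
        for c in chosen:  # clusters have >= 300 households by construction
            hh = sdf[sdf["cluster"] == c]
            parts.append(hh.sample(HH_PER_CLUSTER,
                                   random_state=int(rng.integers(0, 2**31))))
    return pd.concat(parts)
```

Repeating `draw_sample` 500 times with different seeds reproduces the design-based simulation loop, with each draw yielding direct state- and municipality-level estimates.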
3 Methods

This section describes the methods we apply to the Mexican Intercensal Survey. The first subsection discusses machine learning, with a special focus on gradient boosting, as it is among the most popular machine-learning methods for poverty mapping. The second subsection then discusses more traditional approaches to poverty mapping, which we use as the benchmark in our performance assessment of the machine-learning methods.

3.1 Machine learning

The goal of machine learning is to develop high-performance algorithms for prediction, classification, and clustering/grouping tasks (Varian, 2014; Athey, 2018; Athey and Imbens, 2019). Such tasks can be divided into supervised and unsupervised learning problems. While unsupervised learning focuses on identifying clusters of observations that are similar in terms of their features, supervised learning uses a set of features to predict some outcome of interest. Supervised learning can be further divided into regression and classification tasks, where regression is concerned with predicting continuous outcomes and classification focuses on categorical outcomes. While there are many machine-learning methods used for regression and classification tasks (e.g., lasso, random forests, or support vector machines), here we fix ideas by focusing on a method that has been particularly popular for poverty mapping: gradient boosting.

Originally developed by Friedman (2001), gradient boosting machines are a family of machine-learning techniques that combine a large number of weak learners to form a stronger ensemble prediction. Unlike some common ensemble techniques (e.g., random forests) that simply average the models in the ensemble, gradient boosting adds models to the ensemble sequentially by fitting new weak learners to the negative gradient of some chosen loss function. With the classic squared-error loss function, this amounts to sequentially fitting new models to the current residuals of the ensemble.[10] Extreme gradient boosting (XGBoost) is a particularly popular implementation of gradient boosting that uses classification and regression trees as the base models or weak learners. The popularity of XGBoost is due to the fact that it is fast, scalable, and has been shown to outperform competing methods across a wide range of problems (Chen and Guestrin, 2016).

[10] See Natekin and Knoll (2013) for an accessible introduction to gradient boosting.
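As an illustration of the residual-fitting view of boosting under squared-error loss, here is a minimal sketch using shallow scikit-learn regression trees as the weak learners. It is our own rendering of the generic idea, not the paper's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=200, learning_rate=0.1, max_depth=2):
    """Sequentially fit shallow trees to the current residuals of the ensemble."""
    base = y.mean()
    pred = np.full(len(y), base)  # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred  # negative gradient of squared error, up to a constant
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return trees, base

def predict(trees, base, X, learning_rate=0.1):
    """Ensemble prediction: base value plus the sum of (shrunken) tree predictions."""
    return base + learning_rate * sum(t.predict(X) for t in trees)
```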
Classification and regression trees are based on a sequential binary splitting of the covariate space that serves to partition the data into subsamples that are increasingly homogeneous in terms of the outcome variable. A tree begins with a single root node that is split into two child nodes according to some decision rule. A decision rule pertains to a single explanatory variable, and observations are assigned to the child nodes depending on whether they meet some condition associated with that variable.[11] The child nodes can be further split into new child nodes based on different decision rules that involve other explanatory variables and cutoff points. Any child node that is not further split is referred to as a terminal node or leaf, and each leaf is associated with a parameter that provides a prediction for the subsample associated with that leaf. For any given observation, the ensemble prediction used by XGBoost is the sum of that observation's predictions across all trees in the ensemble.

[11] For example, for a continuous explanatory variable, one determines whether a given observation falls above or below some numeric cutoff.

In what follows, we discuss the key features of the XGBoost algorithm, drawing on the presentation provided in Chen and Guestrin (2016). Let $y_i^d$ denote the outcome of interest (e.g., direct estimates of poverty rates) for observation or entity i, and let $x_i$ denote a vector of explanatory variables. In line with the above, the predicted value of the outcome $y_i^p$ is given by the sum of predictions across the individual trees in the ensemble:

$$y_i^p = \sum_t f_t(x_i) \qquad (4)$$

where the trees are indexed by t and $f_t(x_i)$ gives the prediction of a given tree as a function of $x_i$. To build the trees used in the model, XGBoost minimizes the following regularized objective function:

$$L = \sum_i l(y_i^d, y_i^p) + \sum_t r(f_t) \qquad (5)$$

where $l(y_i^d, y_i^p)$ is the loss associated with the i-th observation and $r(f_t)$ is a regularization term that serves to mitigate overfitting. In our implementation, we use the classic squared-error loss such that $l(y_i^d, y_i^p) = (y_i^d - y_i^p)^2$. XGBoost uses the following specification of the regularization term:

$$r(f_t) = \gamma M_t + \frac{1}{2} \lambda \sum_j m_{tj}^2 \qquad (6)$$

where $M_t$ is the total number of leaves in tree t, $m_{tj}$ is the prediction associated with the j-th leaf in a given tree, and $\gamma$ and $\lambda$ are hyperparameters that must be chosen by the user (we discuss our selection of hyperparameters below). The regularization term serves to mitigate overfitting by penalizing complicated trees and smoothing the predictions associated with each tree. That is, the method relies on a large number of weak learners, and the regularization term helps control the complexity of the learners. It is important to note that the above specification of the regularization term is only one of many possible specifications, though Chen and Guestrin (2016) claim that it is simpler than the leading alternatives and easier to parallelize. Further note that when the hyperparameters are set to zero, the objective function becomes equivalent to that used in traditional gradient boosting.

To minimize the objective function, the model is trained in a sequential manner, building one tree at a time. The objective function at iteration t can be written as follows:

$$L_t = \sum_i l(y_i^d, y_{i,t-1}^p + f_t(x_i)) + r(f_t) \qquad (7)$$

where $y_{i,t}^p = y_{i,t-1}^p + f_t(x_i)$. Each iteration then adds the tree $f_t(x_i)$ that greedily minimizes the objective function at that iteration.
More specifically, XGBoost uses a second-order Taylor series expansion to approximate the loss function at any given iteration:

$$\tilde{L}_t = \sum_i \left[ g(y_i^d, y_{i,t-1}^p) f_t(x_i) + \frac{1}{2} h(y_i^d, y_{i,t-1}^p) f_t(x_i)^2 \right] + r(f_t) \qquad (8)$$

where $g(\cdot,\cdot)$ and $h(\cdot,\cdot)$ denote the first- and second-order derivatives of the loss function $l(y_i^d, y_{i,t-1}^p)$ with respect to $y_{i,t-1}^p$. That is, the point $y_{i,t-1}^p$ is taken as the basis for the expansion. The above conveniently omits the first term of the series, $l(y_i^d, y_{i,t-1}^p)$, because it is a constant.

The XGBoost algorithm is based on a re-expression of Eq. (8). Let $z_j$ denote the set of observations that belong to the j-th leaf of the tree being built at iteration t. Substituting in the explicit form of the regularization term and noting that all observations that belong to the same leaf receive the same prediction $m_{tj}$, we can write

$$\tilde{L}_t = \sum_j \left[ \sum_{i \in z_j} g(y_i^d, y_{i,t-1}^p) m_{tj} + \frac{1}{2} \sum_{i \in z_j} h(y_i^d, y_{i,t-1}^p) m_{tj}^2 \right] + \gamma M_t + \frac{1}{2} \lambda \sum_j m_{tj}^2 \qquad (9)$$

or, equivalently,

$$\tilde{L}_t = \sum_j \left[ G_j m_{tj} + \frac{1}{2} (H_j + \lambda) m_{tj}^2 \right] + \gamma M_t \qquad (10)$$

where $G_j$ and $H_j$ represent the sums over $g(\cdot,\cdot)$ and $h(\cdot,\cdot)$ for all observations in $z_j$. Note that the summation in the above is over j, which indexes the leaves in the tree.

For a given tree structure, we can find the optimal prediction values by minimizing the objective function with respect to $m_{tj}$, which yields $m_{tj}^* = -G_j/(H_j + \lambda)$. With a squared-error loss function, where $g(y_i^d, y_{i,t-1}^p) = -2(y_i^d - y_{i,t-1}^p)$ and $h(y_i^d, y_{i,t-1}^p) = 2$, the optimal prediction values are effectively the (penalized) average of the residuals associated with a given leaf. We can then substitute the optimal prediction values into the objective function to obtain the following:

$$\tilde{L}_t^* = -\frac{1}{2} \sum_j \frac{G_j^2}{H_j + \lambda} + \gamma M_t \qquad (11)$$

which gives us the minimal objective function for a given tree structure. Ideally, one would enumerate all possible tree structures and then use the tree associated with the smallest value of the (minimized) objective function, but this is often intractable. XGBoost instead implements an alternative approach that seeks to minimize the objective at a given iteration by greedily optimizing one level of the tree at a time.

To this end, consider splitting a node in a given tree into "left" and "right" child nodes based on some candidate splitting rule. Let $G_L$ and $G_R$ represent $G_j$ for the left and right child nodes, respectively. Similarly, let $H_L$ and $H_R$ represent $H_j$ for the child nodes. We can then find the reduction in the objective function as follows:

$$\Delta \tilde{L}_t^* = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (12)$$

where the first two terms in parentheses capture the loss associated with the child nodes and the third term captures the loss associated with the parent node being split. When splitting a node in any given tree, XGBoost then evaluates Eq. (12) for every possible split rule for every available explanatory variable and chooses the split associated with the maximal reduction in the objective function.[12] XGBoost then continues to split nodes in a given tree until a stopping criterion is met (e.g., a user-specified maximum tree depth). Once a given tree is built, XGBoost moves to building the next tree in the ensemble and continues to do so until reaching the user-specified number of trees to build.

[12] This is known as the "exact greedy algorithm." XGBoost also has an approximate algorithm that can be used for large datasets. See Chen and Guestrin (2016) for details.
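To see Eqs. (10)-(12) in action, the following sketch computes the optimal leaf weight and the gain from one candidate split under squared-error loss, where g = -2(y - ŷ) and h = 2 for each observation. Variable names are our own illustration.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf prediction m* = -G / (H + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """Reduction in the objective from a candidate split, Eq. (12)."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# Squared-error gradients at the current ensemble prediction y_hat:
y = np.array([0.2, 0.5, 0.8, 0.4])
y_hat = np.full(4, y.mean())
g, h = -2 * (y - y_hat), np.full(4, 2.0)
print(split_gain(g, h, np.array([True, False, False, True]), lam=1.0, gamma=0.0))
```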
The above references several hyperparameters that must be chosen by the user, including the hyperparameters in the regularization term (i.e., $\gamma$ and $\lambda$), the maximum tree depth, and the number of trees in the ensemble. There are two additional hyperparameters that can be used to further prevent overfitting. First, one can use column or feature subsampling, where each tree in the ensemble is built based on a (uniformly) random subset of all explanatory variables. This procedure can also speed up the algorithm because it reduces the number of covariates that must be searched across when creating new splits. Second, one can use shrinkage by scaling the predictions associated with each tree by a factor $\eta$, which is often called the learning rate. The predicted value of the outcome at iteration t is then given by $y_{i,t}^p = y_{i,t-1}^p + \eta f_t(x_i)$. Shrinkage reduces the influence of each tree so that no individual tree has too much influence on the ensemble prediction.

One possible approach to hyperparameter selection is to use a grid search where the user (1) specifies the domain of the hyperparameters, (2) applies the model to every combination of hyperparameters, and (3) selects the hyperparameters that perform best in terms of some metric (e.g., mean squared prediction error in the validation dataset). Grid search, however, is computationally inefficient because hyperparameter proposals are uninformed by previous evaluations. We instead use a Bayesian hyperparameter optimization procedure, which sequentially updates a probability model that relates the hyperparameter space to the performance metric and then uses that model to intelligently explore the hyperparameter space (Bergstra et al., 2011). Specifically, our implementation of gradient boosting relies on the open-source library XGBoost, which we combine with the Hyperopt library that conducts Bayesian hyperparameter optimization using the tree-structured Parzen estimator.
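A minimal sketch of this pipeline, combining XGBoost with Hyperopt's tree-structured Parzen estimator, is given below. The synthetic data, search space, and evaluation split are our own assumptions, not the paper's exact configuration.

```python
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                      # stand-in covariates
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=1000)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

space = {
    "max_depth": hp.choice("max_depth", [2, 3, 4, 6]),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "reg_lambda": hp.loguniform("reg_lambda", np.log(0.1), np.log(10.0)),
    "gamma": hp.uniform("gamma", 0.0, 1.0),
}

def objective(params):
    """Validation MSE for one hyperparameter proposal (to be minimized)."""
    model = xgb.XGBRegressor(n_estimators=200, **params)
    model.fit(X_tr, y_tr)
    return float(np.mean((model.predict(X_val) - y_val) ** 2))

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)  # note: for hp.choice, `best` reports the index of the chosen option
```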
Finally, regarding our application to the Mexican data, recall that mapping procedures entail an estimation stage and a prediction stage. For the machine-learning methods, we perform this procedure at two different levels of analysis, namely the PSU and municipality levels. Our focus is nevertheless on obtaining estimates of poverty rates at the municipality level, meaning that when we conduct estimation and prediction at the PSU level, we further aggregate the predictions to the municipality level using population-weighted averaging. Further, recall that we have two alternative sets of covariates available, including geo-referenced and census-based variables. Given that the geo-referenced data are only available at the municipality level, we conduct estimation and prediction at the municipality level for any implementation that uses these covariates. When using census-based covariates, however, we consider implementations at both the PSU and municipality levels.[13]

[13] In all cases, we pass every eligible variable to the machine-learning model and let the algorithm conduct variable selection.

3.2 Traditional small-area methods

We consider three alternative traditional methods for obtaining small area estimates of poverty as a way to benchmark the performance of the non-parametric machine-learning approach.[14] Specifically, we consider area-level models, unit-level models, and a method that attempts to combine unit- and area-level models.

[14] For a thorough review of small area estimation methods, refer to Rao and Molina (2015); for a review of methods in the context of poverty mapping, refer to Molina et al. (2022) as well as the review by Pfeffermann (2013) on other developments in the area.

First proposed by Fay and Herriot (1979), area-level models conduct estimation and prediction using data aggregated to the geographic area of interest (e.g., the PSU or municipality level). The area-level model consists of two basic parts. The first part of the model links the actual or true poverty rate $y_i^a$ for the i-th entity to a vector of area-level covariates $x_i$ as follows:

$$y_i^a = x_i \beta + \varepsilon_i \qquad (13)$$

where $\beta$ is a vector of parameters to be estimated and $\varepsilon_i$ represents random area-level effects that are assumed to be mean zero with a constant variance $\sigma_\varepsilon^2$. Since we lack information on the true poverty rates, this model cannot be immediately fit. Instead, direct estimates of poverty rates from survey data, $\hat{y}_i^d$, are used as the outcome variable. These estimates are nevertheless subject to sampling error because they are based on the area-specific sample data. The second part of the model thus assumes that the direct estimates are centered around the true poverty rates as follows:

$$\hat{y}_i^d = y_i^a + \omega_i \qquad (14)$$

This is referred to as the sampling model and is assumed to have heteroskedastic errors with $var(\omega_i | y_i^a) = \sigma_{\omega,i}^2$. These variances are assumed to be known, although in reality they are estimated using the survey data. The best linear unbiased predictor (BLUP) is obtained by predicting $y_i^a$ through the model as $x_i \hat{\beta} + \hat{\varepsilon}_i$, where $\hat{\beta}$ is the weighted least squares estimator of $\beta$ under the Fay-Herriot model and $\hat{\varepsilon}_i = \psi_i (\hat{y}_i^d - x_i \hat{\beta})$ is the BLUP of the area effect $\varepsilon_i$, with $\psi_i = \sigma_\varepsilon^2 / (\sigma_\varepsilon^2 + \sigma_{\omega,i}^2)$. In essence, the final estimates for areas contained in the survey are a weighted average of the direct estimate $\hat{y}_i^d$ and the synthetic estimator $x_i \hat{\beta}$, where the weights, $\psi_i$ and $(1 - \psi_i)$, are determined by the quality of the model and the quality of the direct estimator. For non-sampled entities, the area-level model simply uses the synthetic estimates to produce predictions. As Molina et al. (2022) note, normality assumptions are not required for the estimates to be BLUP, but they are needed for estimation of the mean squared error (MSE). Finally, similar to the machine-learning approach, the area-level model can be implemented at the PSU or municipality level.[15]

[15] Estimates at the PSU level may be aggregated to a higher level; see Rao and Molina (2015) for more details.
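A minimal sketch of the Fay-Herriot shrinkage step may help fix ideas: given direct estimates, their sampling variances, and area-level covariates, it alternates weighted least squares for $\beta$ with a simplified method-of-moments update for $\sigma_\varepsilon^2$, then shrinks toward the synthetic prediction. The variance update and all names are our simplification, not the paper's estimation code.

```python
import numpy as np

def fay_herriot(y_d, X, var_omega, n_iter=50):
    """EBLUP under the Fay-Herriot model y_d = X beta + eps + omega.

    y_d: direct estimates (n,); X: area-level covariates (n, p);
    var_omega: known sampling variances of the direct estimates (n,).
    """
    sig2 = np.var(y_d)  # starting value for the area-effect variance
    for _ in range(n_iter):
        w = 1.0 / (sig2 + var_omega)                            # GLS weights
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y_d))
        resid = y_d - X @ beta
        # crude moment update: E[resid^2] ~ sig2 + var_omega (no df correction)
        sig2 = max(np.mean(resid**2 - var_omega), 1e-10)
    psi = sig2 / (sig2 + var_omega)           # shrinkage toward the synthetic part
    eblup = psi * y_d + (1 - psi) * (X @ beta)
    return eblup, beta, sig2
```

For non-sampled areas, only the synthetic part `X @ beta` would be used, as described above.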
Regarding unit-level models, the well-known methods of Elbers et al. (2003) and Molina and Rao (2010) are based on the nested error model originally proposed by Battese et al. (1988). This model assumes that (transformed) household income relates linearly to a vector of household-level covariates according to the following model:

$$w_{iv} = x_{iv} \delta + \mu_i + \xi_{iv} \qquad (15)$$

where $w_{iv}$ denotes the transformed equivalized income of household v in location i, $x_{iv}$ represents a vector of household-specific characteristics, $\delta$ is a vector of coefficients on the household characteristics, $\mu_i$ is a location-specific random effect such that $\mu_i \sim N(0, \sigma_\mu^2)$, and $\xi_{iv}$ is a household-level error term such that $\xi_{iv} \sim N(0, \sigma_\xi^2)$. The two errors are assumed to be independent of each other.[16] We use a particular version known as the "census empirical best" method (CensusEB), which is a variant of the "empirical best" (EB) method of Molina and Rao (2010).[17] It is evident that the dependent variable of the model is not the poverty status of a given household. Instead, the method attempts to replicate the welfare distribution based on an assumed data generating process.

[16] Elbers et al. (2003) offer the option of accommodating non-normally distributed errors; however, despite this flexibility, Corral et al. (2021b) illustrate how even under non-normality the method lags Molina and Rao's (2010) EB approach.
[17] As opposed to empirical best estimates, census empirical best estimates do not include survey observations when calculating area-level poverty rates. Corral et al. (2021b) show that when the sample size is small relative to the population, the difference between the two methods is negligible. For additional discussion, see Molina (2019).

The parameter estimates obtained when fitting model (15) to the survey data, $(\hat{\delta}, \hat{\sigma}_\mu^2, \hat{\sigma}_\xi^2)$, are applied to the census household-level data to generate a simulated value of $w_{iv}$ for each household in the census. From the simulated census welfare vector it is possible to obtain estimates of any welfare indicator, not just poverty headcount rates. The parameters are fit through a suitable method such as REML or Henderson's Method III (Henderson, 1953).[18] The $w_{iv}$ for each household is simulated M times using Monte Carlo simulation, where each simulated welfare value m follows:

$$w_{ivm}^* = x_{iv} \hat{\delta} + \mu_{im}^* + \xi_{ivm}^*$$

where $\mu_{im}^*$ and $\xi_{ivm}^*$ are drawn from their assumed distributions.[19] Here, for sampled municipalities, $\mu_{im}^*$ is generated as $\mu_{im}^* \sim N(\hat{\mu}_i, \hat{\sigma}_\mu^2 (1 - \hat{\kappa}_i))$, where $\hat{\mu}_i = \hat{\kappa}_i (\bar{w}_i - \bar{x}_i \hat{\delta})$, $\hat{\kappa}_i = \hat{\sigma}_\mu^2 / (\hat{\sigma}_\mu^2 + \hat{\sigma}_\xi^2 / n_i)$, and $n_i$ is the sample size for municipality i. For municipalities that are not part of the sample, $\mu_{im}^*$ is generated as $\mu_{im}^* \sim N(0, \hat{\sigma}_\mu^2)$, which corresponds to the method used in ELL for all municipalities regardless of whether or not they are in the sample. Thus, EB makes more use of the survey data for predictions corresponding to sampled areas. Note that EB methods only ensure that the area-level means of the dependent variable are empirical best linear unbiased predictors (EBLUP); they do not guarantee the same for poverty, which is a non-linear parameter. Hence the importance of replicating the welfare distribution.

[18] Henderson's Method III is coupled with GLS to obtain the regression coefficients.
[19] See Corral et al. (2021b) for a more detailed description.
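The Monte Carlo step just described can be sketched for a single municipality as follows; `delta_hat`, `sig2_mu`, and `sig2_xi` stand for the fitted parameters, and `w_bar`/`x_bar` for the survey means in the municipality. This is our illustrative rendering of the simulation logic, not the authors' code.

```python
import numpy as np

def simulate_welfare(X_census, delta_hat, sig2_mu, sig2_xi,
                     w_bar=None, x_bar=None, n_i=0, M=100, seed=0):
    """Draw M simulated welfare vectors for one municipality (CensusEB style).

    For sampled areas (n_i > 0) the location effect is drawn around its
    conditional mean; for non-sampled areas it is drawn from N(0, sig2_mu).
    """
    rng = np.random.default_rng(seed)
    H = X_census.shape[0]  # census households in the municipality
    if n_i > 0:
        kappa = sig2_mu / (sig2_mu + sig2_xi / n_i)
        mu_mean = kappa * (w_bar - x_bar @ delta_hat)
        mu_var = sig2_mu * (1 - kappa)
    else:
        mu_mean, mu_var = 0.0, sig2_mu
    mu = rng.normal(mu_mean, np.sqrt(mu_var), size=M)    # one draw per replicate
    xi = rng.normal(0.0, np.sqrt(sig2_xi), size=(M, H))  # household-level errors
    return X_census @ delta_hat + mu[:, None] + xi       # (M, H) simulated welfare
```

A poverty rate for the municipality is then the average of the indicator `(w < z)` across households and replicates, for a chosen poverty line z.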
One major aspect of concern for small area estimation is the estimation of noise. A statistical agency in a given country may seek to undertake a small area estimation application because the estimates derived from survey data may not be sufficiently precise, or estimates may not even be possible because an area was not sampled. Hence, a big concern surrounding small area estimation is the estimation of MSEs for each area's estimate. Estimates of noise are obtained via bootstrap replications in which the model's assumptions are exploited in order to obtain an estimated MSE (González-Manteiga et al., 2008).[20] This aspect is one of the biggest differences between poverty maps based on small area estimation and those based on machine learning. Statistical agencies across the globe will assess an estimate's noise before deciding whether to publish it; hence the importance of such estimates.

[20] For a much more detailed discussion of the estimation of noise, refer to Rao and Molina (2015); for a discussion focused on ELL and EB estimates, refer to Corral et al. (2021b) and Corral et al. (2022).

The key assumption of the unit-level model is that the distribution of the explanatory variables is similar between the survey used for estimation and the census used for prediction. This assumption is generally violated if the census is outdated, which is often the case in developing countries, and in this event the application of the unit-level model will lead to similarly outdated poverty maps (Lange et al., 2018). Importantly, machine-learning approaches to poverty mapping are commonly motivated by the fact that censuses are outdated and are thus viewed as a way to produce timely poverty maps in a data-constrained environment. Stated differently, given the differing data requirements of the unit-level model and the machine-learning approach, the two methods are not generally considered to be direct competitors. The unit-level model nevertheless serves as a useful performance benchmark, as it is often viewed as the "gold standard" of poverty mapping.

Unlike unit-level models, unit-context models were specifically developed to be used in off-census years. Initially developed by Nguyen (2012) and further refined by Masaki et al. (2020), these models are based on a simple modification of the unit-level approach: rather than modeling household-level income as a function of household-level characteristics, unit-context models instead use only area-level covariates from the census. Specifically, the unit-context model replaces the survey-based covariates $x_{iv}$ in Eq. (15) with census aggregates at the municipality and PSU levels, $x_i$ and $x_{ic}$, and thus relaxes the assumption that the survey-based covariates match the moments of those from the census. Like the area-level model, the unit-context model is a direct competitor to the machine-learning approach because it is feasible to implement with the same data. For this reason, we use the unit-context model as a reference model when we evaluate the performance of machine-learning methods. However, it is worth noting that the unit-context model has been shown to produce biased and noisy poverty estimates, largely due to its inability to explain between-household variation in income (Corral et al., 2022).

4 Model Validation

The coefficient of determination, or R², is the most commonly used metric for validating poverty maps generated using machine-learning methods (see Jean et al. (2016), Pokhriyal and Jacques (2017), Steele et al. (2017), Lee and Braithwaite (2020), Yeh et al. (2020), or Chi et al. (2022)). In this section, we first critically assess this validation approach by discussing three ways in which the R² can be misleading in terms of model assessment. We then argue that a better alternative is to validate poverty estimates using the mean squared error, which is not subject to many of the limitations of the R². Finally, we outline another validation procedure that focuses more directly on what is of ultimate concern with poverty mapping, namely informing the targeting of resources for the purposes of poverty alleviation.

Let $y_i^a$ denote the actual or true poverty indicator for the i-th geographical entity. Further, let $y_i^d$ and $y_i^p$ denote the direct estimate and model-based prediction of the poverty indicator, respectively.
The standard method for assessing the predictive performance of model-based poverty estimates calculates the R² from a regression like the following:

$$y_i^d = \alpha + \delta y_i^p + \zeta_i \qquad (16)$$

where $\alpha$ and $\delta$ are the regression coefficients and $\zeta$ represents the error term.[21] The R² for the above regression can be written as follows:

$$R_d^2 = \frac{\sigma_d^2 - \sigma_\zeta^2}{\sigma_d^2} \qquad (17)$$

where $\sigma_d^2$ is the variance of the direct estimate and $\sigma_\zeta^2$ is the variance of $\zeta$. $R_d^2$ thus conveys information about how much of the variance of the direct estimate is explained by the model-based prediction.

[21] Note that any regression of this form can only be estimated for the sampled regions with direct estimates.

While unbiased, direct estimates can be imprecise estimates of true poverty indicators (Corral et al., 2021b). Ideally, one would assess the performance of model-based predictions on the basis of the true poverty indicators themselves. That is, model-based predictions are more appropriately assessed using the R² from the following regression:

$$y_i^a = \alpha + \delta y_i^p + \upsilon_i \qquad (18)$$

where $\upsilon$ represents the error term. The R² for this regression can be written as

$$R_a^2 = \frac{\sigma_a^2 - \sigma_\upsilon^2}{\sigma_a^2} \qquad (19)$$

where $\sigma_a^2$ is the variance of the true poverty indicator and $\sigma_\upsilon^2$ is the variance of $\upsilon$. We are specifically interested in characterizing the bias of $R_d^2$ when it is used to assess the true quantity of interest, $R_a^2$.

Let $y_i^d = y_i^a + \omega_i$, where $\omega$ is a mean-zero error term with variance $\sigma_\omega^2$. That is, we view this as a classical measurement error problem where the direct estimate of the poverty indicator is a noisy estimate of the true poverty indicator. Note that the preceding equation implies that $\zeta_i = \upsilon_i + \omega_i$ and $\sigma_\zeta^2 = \sigma_\upsilon^2 + \sigma_\omega^2$. Following Majeske et al. (2010), we then define the bias as:

$$B = R_a^2 - R_d^2 \qquad (20)$$

Substituting Eqs. (17) and (19) into our expression for the bias, we have:

$$B = \frac{\sigma_a^2 - \sigma_\upsilon^2}{\sigma_a^2} - \frac{\sigma_d^2 - \sigma_\upsilon^2 - \sigma_\omega^2}{\sigma_d^2} = \frac{\sigma_\omega^2 (\sigma_a^2 - \sigma_\upsilon^2)}{\sigma_d^2 \sigma_a^2} = \frac{\sigma_\omega^2}{\sigma_d^2} R_a^2 \qquad (21)$$

Finally, we can substitute Eq. (21) into Eq. (20) and rearrange, which gives

$$R_d^2 = R_a^2 \, \frac{\sigma_d^2 - \sigma_\omega^2}{\sigma_d^2} \qquad (22)$$

and establishes a basic relationship between $R_d^2$ and $R_a^2$. See Majeske et al. (2010) for additional mathematical details.

In Eq. (22) above, we see that $R_d^2 = R_a^2$ when $\sigma_\omega^2 = 0$ or, equivalently, when $y_i^d = y_i^a$ for all i. In the presence of measurement error, or when $\sigma_\omega^2 > 0$, we then see that $R_d^2$ will be biased downward, with the magnitude of the bias determined by the size of $(\sigma_d^2 - \sigma_\omega^2)/\sigma_d^2$. Validation exercises based on $R_d^2$ will thus tend to mischaracterize the performance of model-based approaches to poverty mapping, with the resulting estimates of R² being biased downward. One implication of the above is that $R_d^2$ is context-dependent and will generally be influenced by the design of the survey data (e.g., larger sample sizes will tend to reduce measurement error, all else equal). In particular, a poorly performing model applied to a low bias setting may yield an estimate of $R_d^2$ that is higher than an estimate of $R_d^2$ from a strong model applied to a high bias setting. The presence of bias in the measures of R² can thus reverse the rank ordering of models when comparisons are made across contexts.[22]

[22] See Yeh et al. (2020), Chi et al. (2022), or Aiken et al. (2022) for this type of cross-context comparison of R² values.
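A quick numerical check of Eq. (22), with simulated true rates, an imperfect prediction, and noisy direct estimates (all quantities hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
y_a = rng.beta(2, 5, N)                 # "true" poverty rates
y_p = y_a + rng.normal(0, 0.05, N)      # model predictions with error
y_d = y_a + rng.normal(0, 0.10, N)      # noisy direct estimates (omega, sd = 0.10)

def r2(y, x):
    """Regression R^2 with intercept equals the squared correlation."""
    return np.corrcoef(y, x)[0, 1] ** 2

R2_a, R2_d = r2(y_a, y_p), r2(y_d, y_p)
implied = R2_a * (np.var(y_d) - 0.10**2) / np.var(y_d)  # Eq. (22)
print(R2_d, implied)                                    # approximately equal
```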
There is an additional form of bias in R² that can lead to overoptimistic conclusions about model validity. To illustrate this, note that $R_a^2$ in Eq. (19) is mathematically equivalent to the squared (Pearson) correlation coefficient between $y_i^a$ and $y_i^p$. A well-known property of the correlation coefficient is that it is invariant to any positive affine transformation of its arguments, implying that the squared correlation coefficient is invariant to any affine transformation (Gujarati, 2003).[23] We can thus write

$$R_a^2(y_i^a, y_i^p) = R_a^2(y_i^a, q_1 + q_2 y_i^p) \qquad (23)$$

where $q_1$ and $q_2$ are some arbitrary constants. The implication of the above is that the R² is insensitive to systematic bias in the model-based poverty estimates. For example, consider some unbiased poverty estimate $y_i^u$, meaning that $E(y_i^u - y_i^a) = 0$. Now consider another estimate $y_i^b$ that is identical to $y_i^u$ up to some locational shift, implying that $y_i^b = y_i^u + q$ and $E(y_i^b - y_i^a) = q$. While $y_i^b$ is biased (i.e., it will systematically over- or under-estimate poverty), it will achieve an R² that is identical to that of $y_i^u$.

[23] More specifically, the correlation coefficient is invariant to any positive affine transformation because it changes sign when one of its arguments is scaled by some negative constant. The squared correlation coefficient is then invariant to any affine transformation.

A final issue with R² as a performance metric is that it is affected by the variance of the true poverty measures. To see this, consider the following re-expression of $R_a^2$:

$$R_a^2 = 1 - \frac{\frac{1}{N} \sum_i (y_i^a - \alpha - \delta y_i^p)^2}{\frac{1}{N} \sum_i (y_i^a - \bar{y}^a)^2} = 1 - \frac{MSE(y_i^a, \alpha + \delta y_i^p)}{Var(y_i^a)} \qquad (24)$$

where N is the number of geographical entities. The above shows that R² can be viewed as a normalized version of the mean squared error, with the variance of the true poverty indicator used as the normalization factor. This approach to normalization produces an additional source of context-dependence in R². Specifically, consider two models that perform equally well in that they achieve the same mean squared error. However, assume that one model is applied to a context where the variance of $y_i^a$ is large and the other is applied to a low variance context. The above then implies that the model applied to the high variance context will achieve a higher R², even though the performance of the two models is arguably identical.

Rather than relying on R² as a performance metric, one can avoid regression-based model assessment altogether by calculating the mean squared error directly:

$$MSE(y_i^a, y_i^p) = \frac{1}{N} \sum_i (y_i^a - y_i^p)^2 \qquad (25)$$

which avoids the normalization issue and provides a more direct understanding of the expected squared error of the estimates. The mean squared error in Eq. (25) is sensitive to systematic biases in the predictions. Reconsidering the biased and unbiased poverty estimates described above (i.e., $y_i^u$ and $y_i^b$), we can write $MSE(y_i^b, y_i^a) = MSE(y_i^u + q, y_i^a) = MSE(y_i^u, y_i^a) + q^2$. Systematic biases in the estimates will thus be penalized by the mean squared error. Further, with multiple estimates for each entity, one can calculate entity-specific mean squared errors, which can be decomposed into parts associated with the bias and variance of the estimates. Where $y_{ik}^p$ denotes estimate $k = 1, 2, \ldots, K$ for the i-th entity, we have

$$MSE_i = \frac{1}{K} \sum_k (y_i^a - y_{ik}^p)^2 = \frac{1}{K} \sum_k (y_{ik}^p - \bar{y}_i^p)^2 + (\bar{y}_i^p - y_i^a)^2 \qquad (26)$$

where the first term on the right-hand side captures the variance of the estimates for entity i and the second term captures the (squared) bias.[24]

[24] See Hastie et al. (2009) for a derivation and detailed discussion of the bias-variance decomposition.
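The decomposition in Eq. (26) is straightforward to compute across repeated samples; a minimal sketch with hypothetical array names:

```python
import numpy as np

def mse_decomposition(y_true, y_hat):
    """Entity-specific MSE split into variance and squared bias, Eq. (26).

    y_true: (N,) true poverty rates; y_hat: (K, N) estimates from K samples.
    """
    mean_hat = y_hat.mean(axis=0)
    variance = ((y_hat - mean_hat) ** 2).mean(axis=0)
    bias_sq = (mean_hat - y_true) ** 2
    mse = ((y_hat - y_true) ** 2).mean(axis=0)
    assert np.allclose(mse, variance + bias_sq)  # the identity holds exactly
    return mse, variance, bias_sq
```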
The primary reason for obtaining granular poverty estimates is to improve the targeting of resources. While understanding the error of the poverty estimates is useful in this regard, the ultimate concern is to understand how different methodological choices affect poverty alleviation by influencing the efficiency of targeting. Consequently, we also consider how machine-learning methods perform in terms of reducing poverty in the context of the census created from the Mexican Intercensal Survey. Similar to Elbers et al. (2007), we propose simulating a hypothetical poverty-alleviation program that distributes resources to households according to a given set of small area estimates.[25] In addition to the estimates, the procedure requires true household-level incomes from a census-type dataset. The procedure we use in our simulations is as follows (a code sketch of steps 1-3 appears at the end of this section):

1. Using the true household-level incomes, calculate the poverty headcount and poverty gap for the country as a whole for some chosen poverty line. Using the poverty gap measure, determine the budget required to completely eradicate poverty.[26] Calculate an individual-level transfer amount by dividing the total budget by the number of people living in poverty.

2. Using a given set of small area estimates, order the municipalities according to their estimated poverty headcounts, from most to least poor. For the poorest municipality, augment the true per capita income of all households by an amount equal to the transfer (i.e., each household receives an amount equal to the transfer times the number of household members). Continue this distribution procedure for the second poorest municipality, the third poorest municipality, and so on until there are not enough resources to cover all households in a given municipality. For the marginal municipality, distribute the remaining budget equally among individuals in that municipality.

3. Using the post-transfer household incomes, calculate an updated poverty headcount to assess the effects of the distribution scheme on poverty alleviation.

4. Repeat steps 2 and 3 for each estimation method for all 500 sets of small area estimates corresponding to each sample.

[25] See Yeh et al. (2020), Chi et al. (2022), and Aiken et al. (2022) for examples of other studies that have conducted targeting simulations.
[26] This requires knowing who is poor and how poor they are. The transfer is enough to bring every poor individual up to the poverty threshold.

The benchmark poverty alleviation from this scheme corresponds to the results of applying the above to the true municipal poverty rankings; the results for each method are compared to that benchmark. In our implementation, we fix the poverty line at the 25th percentile of per capita income prior to transfers. The benchmark poverty rate, after delivering transfers to people in municipalities ranked by their true poverty rate, is 19.9 percent.
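The following condenses steps 1-3 for one set of estimates, assuming a pandas DataFrame `hh` with hypothetical columns `muni`, `income_pc`, and `members` (integer household size), and a Series `est_rate` of estimated municipal headcounts indexed by municipality. It is our illustration of the procedure, not the authors' code.

```python
import numpy as np
import pandas as pd

def targeting_poverty_rate(hh: pd.DataFrame, est_rate: pd.Series) -> float:
    """Distribute a poverty-gap budget down the estimated ranking; return the
    post-transfer poverty headcount."""
    # Step 1: poverty line, budget to eradicate poverty, per-person transfer
    z = np.quantile(np.repeat(hh["income_pc"], hh["members"]), 0.25)
    people = hh["members"].sum()
    poor = hh["income_pc"] < z
    budget = ((z - hh.loc[poor, "income_pc"]) * hh.loc[poor, "members"]).sum()
    transfer = budget / hh.loc[poor, "members"].sum()

    # Step 2: transfer to all households, from most to least poor municipality
    income = hh["income_pc"].copy()
    for muni in est_rate.sort_values(ascending=False).index:
        idx = hh.index[hh["muni"] == muni]
        cost = transfer * hh.loc[idx, "members"].sum()
        if cost <= budget:
            income.loc[idx] += transfer
            budget -= cost
        else:  # marginal municipality: split the remaining budget equally
            income.loc[idx] += budget / hh.loc[idx, "members"].sum()
            break

    # Step 3: post-transfer poverty headcount (person-weighted)
    return ((income < z) * hh["members"]).sum() / people
```

Running this for each method's 500 sets of estimates, and once for the true rankings, yields the comparison described above.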
5 Results

Recall from Section 4 that the R² measure often used to assess the performance of machine-learning models is biased downward.[27] To understand the extent of this bias in practice, we calculate the R² associated with gradient boosting when using both the direct estimates and the true poverty rates as the basis for performance assessment. We conduct this comparison at two different levels of estimation, namely the PSU and municipality levels. All R² calculations are nevertheless made at the municipality level, meaning that the PSU-level model is estimated at that level and then the poverty estimates are aggregated to the municipality level. For each of our 500 samples, we thus calculate four R² measures: (1) PSU-level estimates evaluated against the direct poverty estimates, (2) PSU-level estimates evaluated against the true poverty rates, (3) municipality-level estimates evaluated against the direct poverty estimates, and (4) municipality-level estimates evaluated against the true poverty rates. All models use only census-based covariates.[28]

[27] For examples, see Jean et al. (2016), Pokhriyal and Jacques (2017), Steele et al. (2017), Lee and Braithwaite (2020), Yeh et al. (2020), or Chi et al. (2022).
[28] To ensure that our results are not affected by the composition of the sample, we calculate all R² measures for each sample using only those municipalities with direct poverty estimates.

Panel (a) of Figure 1 presents these results, and we find evidence of considerable downward bias for both the PSU- and municipality-level models. For example, for the PSU-level model, we find that the median "true" R² is 0.88 whereas the median "direct" R² is 0.59. Interestingly, the true R² tends to be higher for the PSU-level model than the municipality-level model, but the opposite is the case for the direct R². This suggests that model selection based on the direct R² can be misleading in that the direct R² incorrectly suggests that the municipality is the preferred level of estimation. More specifically, we find that the true R² identifies the PSU as the preferred level of estimation for 437 of the 500 samples, but the direct R² only selects the PSU level for four of the 500 samples. The direct R² thus identifies the incorrect level of estimation for 433 of the 500 samples.

[Figure 1: R² estimates at the PSU and municipality levels. Panel (a): Direct versus true. Panel (b): Unconstrained versus constrained. Note: This figure presents the R² from a regression where on the right-hand side we have the model-based estimates and on the left-hand side we have either the true poverty rate or the direct estimates of poverty. Since we have 500 samples drawn from our census, we have 500 sets of model-based estimates and, consequently, 500 R² measures.]

In Section 4, we discussed two other ways in which the R² can be misleading for assessing model performance: it is invariant to locational shifts and it is influenced by the variance of the true poverty rates. Using the Mexican data, we can assess the relevance of the former issue by comparing the true R² to a constrained version of the true R² that fixes the intercept and coefficient to zero and one, respectively. By constraining the R² in this manner, we remove its ability to adjust for systematic differences between the poverty estimates and the true poverty rates. Panel (b) of Figure 1 presents the results, and we see that, as expected, the unconstrained R² overstates the model's performance, albeit only to a small degree.
We nevertheless find that even these small differences can lead to model selection issues. In particular, the constrained R2 identifies the PSU level as the appropriate level of estimation for 355 of the 500 samples, whereas the unconstrained version, as mentioned, selects the PSU level for 437 of the 500 samples. The commonly used unconstrained R2 thus selects the incorrect model for 82 of the 500 samples.

Given the shortcomings of the R2, we have argued that the empirical MSE of the poverty estimates at the area level is a better metric for validating the performance of a particular set of estimates. Panel (a) of Figure 2 thus presents our empirical MSE results for several alternative models. The first three models apply gradient boosting at the municipality level using three alternative covariate sets: geo-referenced covariates (“GIS-MUN”), census-based covariates (“CEN-MUN”), and all available covariates (“ALL-MUN”). The fourth model applies gradient boosting at the PSU level and uses only census-based covariates (“CEN-PSU”).29 The final three models are traditional methods that we use to benchmark the performance of gradient boosting: unit-level (CensusEB), unit-context, and area-level models. We consider three alternative implementations of the area-level model that, much like for gradient boosting, vary the level of analysis and the covariates used. The empirical best (CensusEB) model is often considered the “gold standard” of small area estimation and corresponds to an ideal scenario where up-to-date, micro-level census data are available. For these traditional models, all available covariates enter linearly (i.e., we do not include interaction or quadratic terms).

One advantage of the MSE is that it can be calculated for each geographical entity. As such, each box plot in panel (a) of Figure 2 summarizes the distribution of municipality-level MSE estimates rather than the aggregate MSE for each sample (i.e., the results are based on Eq. [26]).

It is useful to first consider how gradient boosting performs relative to the traditional models when using the same census-based covariates. As is evident in panels (a) and (b) of Figure 2, the gradient boosting methods using census aggregate data, Gradient boosting (CEN-PSU) and Gradient boosting (CEN-MUN), perform better than the unit-context and Fay-Herriot models despite using similar data, and approximate the performance of unit-level (CensusEB) estimates. The median MSE for the CensusEB model is 0.003 while that of the gradient boosting methods is 0.002, and the methods have similar interquartile ranges. The results suggest that gradient boosting methods can approximate CensusEB unit-level methods and will likely outperform unit-context and Fay-Herriot methods by a considerable margin, which is quite encouraging for cases where estimates are needed but the survey and census data are not contemporaneous.30

29 Recall that the geo-referenced covariates are only available at the municipality level.
30 The unit-context models include covariates at the PSU and municipality levels. The location effects are specified at the municipality level.

Figure 2: Empirical design MSE and bias for select models. (a) Design MSE; (b) Design bias. Note: The figure presents the design MSE and bias for each applied method. The design MSE is approximated as the mean squared difference between the model-based estimate for a given municipality (obtained from each of the 500 samples drawn from the census) and the municipality’s true poverty rate. The bias is the average of those same differences. The box plots show the municipality-level spread of the design MSE and bias of each method’s estimates.
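As a minimal sketch of how the design MSE and bias described in the note above might be computed, assume a hypothetical array stacking each method’s estimates across the 500 samples, with NaN marking municipalities absent from a given sample; the names are illustrative, not the paper’s implementation.

```python
import numpy as np

def design_mse_and_bias(estimates, true_rates):
    """estimates: (n_samples, n_munis) model-based poverty estimates, one row
    per simulated survey sample (NaN where a municipality is not estimated);
    true_rates: (n_munis,) census-based "true" poverty rates."""
    errors = estimates - true_rates          # broadcasts across samples
    mse = np.nanmean(errors ** 2, axis=0)    # municipality-level design MSE
    bias = np.nanmean(errors, axis=0)        # municipality-level design bias
    # Decomposition underlying Eq. [26]: MSE = variance + bias^2.
    variance = np.nanvar(estimates, axis=0)
    assert np.allclose(mse, variance + bias ** 2, equal_nan=True)
    return mse, bias
```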
In line with the findings and recommendations of Corral et al. (2022), gradient boosting methods appear to outperform the traditional small area methods recommended for off-census years. The unit-context and Fay-Herriot models are the direct competitors to the machine-learning models, which are often motivated by a lack of up-to-date census data. Also in line with Corral et al. (2022), area-level models outperform unit-context models in terms of bias and MSE.

Having established that, in the context of the Mexican data and with covariates derived from census data, gradient boosting methods can perform just as well as CensusEB methods, we turn to gradient boosting’s performance with geo-referenced data, Gradient boosting (GIS-MUN).31 The gradient boosting model with GIS-based covariates is our closest approximation to the standard implementation of machine learning in the context of poverty mapping; we contrast it with an area-level model using similar data, Area-level (GIS-MUN).32 The median MSE for the gradient boosting model with GIS-based covariates is 0.008 (a root MSE of about 0.09), meaning that the median municipality’s error is roughly nine percentage points. The area-level model using geo-referenced covariates achieves a median MSE of 0.01.

The bias associated with poverty estimates is especially important given the policy relevance of understanding absolute living standards. In panel (b) of Figure 2, we thus present box plots of the municipality-level bias estimates associated with each model (see Eq. [26] for the bias-variance decomposition of the MSE). The median bias for all models is approximately zero, with the possible exception of the unit-context model and the area-level model estimated at the PSU level.33 Perhaps the more interesting result is that the bias estimates from the gradient boosting model with geo-referenced covariates exhibit considerable variation, with minimum and maximum biases reaching -0.43 and 0.43, respectively. That is, the standard implementation with remotely-sensed covariates is severely biased for some municipalities, even though its median bias is approximately zero. All other models, including the competing unit-context and area-level models, exhibit much less variation. The performance of machine learning thus appears to depend heavily on the covariates used in the model, with census-based covariates considerably improving model performance.

Another way to see that the census-based covariates have greater predictive power than the remotely-sensed covariates is to examine feature importance. One popular measure of feature importance summarizes the gain (i.e., the reduction of the loss function) associated with a given feature. That is, for a given run of the model, feature importance captures the average gain of a given feature, where the gains are averaged across all trees in the model.34

31 Note that the underlying model for CensusEB estimates relies on household-level microdata. Consequently, the data requirements for CensusEB are even higher.
32 A notable advantage of the Fay-Herriot and gradient boosting models is their relative ease of application: they usually take considerably less time to implement than CensusEB and unit-context models.
33 The same issue for unit-context models is highlighted by Corral et al. (2021a) and Corral et al. (2022).
34 In addition, the average gains for a given run of the model are normalized by dividing each by the sum of all average gains for that run.
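For a tree-based gradient boosting library such as XGBoost (a plausible choice here, though the sketch below is our own and not the paper’s code), the normalized gain importance described in footnote 34 might be extracted as follows; `samples` and the covariate objects are hypothetical placeholders.

```python
import numpy as np
import xgboost as xgb

def normalized_gain_importance(model, feature_names):
    # Average gain per feature across all trees, normalized to sum to one
    # (mirroring the normalization described in footnote 34). Features the
    # booster never splits on are assigned zero gain.
    raw = model.get_booster().get_score(importance_type="gain")
    gains = np.array([raw.get(name, 0.0) for name in feature_names])
    return gains / gains.sum()

# Hypothetical loop over the 500 simulated samples:
# importances = []
# for X, y in samples:
#     model = xgb.XGBRegressor(n_estimators=200).fit(X, y)
#     importances.append(normalized_gain_importance(model, list(X.columns)))
# mean_importance = np.mean(importances, axis=0)
# top_20 = np.argsort(mean_importance)[::-1][:20]
```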
Figure 3 plots the 20 covariates with the highest feature importance after running gradient boosting with all covariates on each sample and averaging each covariate’s importance across the 500 samples. We find that all of the top 10 covariates are census-based, with only five geo-referenced covariates entering the top 20.

Figure 3: Feature importance. Note: The figure illustrates the average feature importance across the 500 models fit to the 500 samples of the census data. The importance corresponds to the reduction of the loss function associated with each feature.

Figure 4 presents MSE and bias estimates for various models across different “poverty quantiles.” That is, we order all municipalities by their true poverty rates, divide them into 50 quantiles, and calculate the average MSE and bias for each quantile. For the sake of simplicity, we only present results for two machine-learning models, the municipality-level models using only geo-referenced and only census-based covariates, alongside select traditional methods. Panel (a) presents the MSE results and panel (b) presents the bias results. In both plots, gradient boosting with census-based covariates performs quite similarly to the “gold standard” unit-level model across all quantiles. However, gradient boosting with geo-referenced covariates again departs considerably from this benchmark, particularly at the lowest and highest quantiles. The bias results are particularly notable: the model with geo-referenced covariates systematically underestimates poverty at the lowest quantiles and overestimates poverty at the highest quantiles. The standard implementation thus misrepresents the spatial distribution of poverty.

An additional consideration is how the methods perform for out-of-sample predictions. Because the experiment consists of 500 samples taken from the census data, a given municipality may be present in some samples and absent in others. Consequently, to evaluate out-of-sample properties, we rank municipalities by their likelihood of selection across the 500 samples. The likelihood of being present ranges from 0.09 to 1, with a median of 0.48 and a mean of 0.53. Among the 20 percent of municipalities least likely to be sampled, the gradient boosting model with GIS-based covariates is the worst performing in terms of MSE, followed closely by the unit-context models (Figure 5). In terms of bias among these least likely to be sampled municipalities, the gradient boosting model with geo-referenced covariates shows very wide variability, while the unit-context models show downward bias. The other gradient boosting models perform similarly to CensusEB models in terms of bias and MSE.
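A minimal sketch of this out-of-sample grouping, assuming a hypothetical boolean matrix recording which municipalities enter each of the 500 samples (again, the names are illustrative rather than the paper’s objects):

```python
import numpy as np

def mse_by_selection_likelihood(presence, mse, share=0.20):
    """presence: (n_samples, n_munis) boolean matrix, True when a municipality
    appears in that simulated sample; mse: (n_munis,) design MSE values."""
    likelihood = presence.mean(axis=0)   # selection likelihood per municipality
    order = np.argsort(likelihood)       # least likely to most likely
    k = int(share * likelihood.size)
    least_likely, most_likely = order[:k], order[-k:]
    # Summarize performance within the bottom and top 20 percent.
    return np.median(mse[least_likely]), np.median(mse[most_likely])
```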
In terms of MSE, the performance of the different methods seems to improve with the likelihood of being selected. Among the municipalities most likely to be selected, the gradient boosting model with geo-referenced covariates is still the worst performing (Figure 6). In terms of bias, however, the Fay-Herriot model at the municipality level performs best among these municipalities, illustrating the value of the method’s EBLUP, which combines the direct estimate with the model-based prediction for sampled areas.

Figure 4: Empirical average MSE and bias by poverty quantiles for select models. (a) MSE; (b) Bias. Note: The figure illustrates the average empirical MSE and bias for the municipalities in each quantile, where quantiles are formed by ordering municipalities by their true poverty rate.

Figure 5: Empirical design MSE and bias with alternative machine-learning models among the 20 percent of municipalities least likely to be sampled. (a) Design MSE; (b) Design bias. Note: The design MSE and bias are computed as in Figure 2, restricted to the municipalities least likely to be sampled.

Figure 6: Empirical design MSE and bias with alternative machine-learning models among the 20 percent of municipalities most likely to be sampled. (a) Design MSE; (b) Design bias. Note: The design MSE and bias are computed as in Figure 2, restricted to the municipalities most likely to be sampled.

While we have focused on the performance of gradient boosting up to this point, several other machine-learning models can be used for poverty mapping. Leading alternatives include the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996), the random forest (Breiman, 2001), and Bayesian additive regression trees (BART) (Chipman et al., 2010). While the lasso is a simple way to conduct variable selection and regularization in the context of linear regression, the random forest and BART models are similar to gradient boosting in that they all rely on regression tree frameworks. We ask whether these alternative models can perform better than gradient boosting within the confines of the standard implementation, which relies exclusively on remotely-sensed covariates.
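Such a comparison can be mimicked with off-the-shelf learners. The sketch below uses scikit-learn stand-ins for the lasso, random forest, and gradient boosting (BART is omitted for lack of a comparable scikit-learn implementation); `X_gis`, `y`, and `X_gis_all` are hypothetical placeholders for the paper’s data.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV

# X_gis: hypothetical municipality-level geo-referenced covariates from one
# simulated survey sample; y: the corresponding direct poverty estimates.
models = {
    "lasso": LassoCV(cv=5),
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
# Each learner is fit to the same covariates so that differences in MSE and
# bias reflect the model class rather than the data:
# predictions = {name: m.fit(X_gis, y).predict(X_gis_all)
#                for name, m in models.items()}
```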
Panels (a) and (b) of Figure 7 present the MSE and bias results for all models. With the possible exception of the lasso, we find that all models perform comparably, meaning that the poor performance of gradient boosting with geo-referenced covariates cannot be remedied by simply adopting a different machine-learning model.

Figure 7: Empirical design MSE and bias with alternative machine-learning models. (a) Design MSE; (b) Design bias. Note: The MSE and bias are computed as in Figure 2. All methods rely on GIS covariates only and are fit at the municipality level.

Finally, recall from Section 4 that we argued that the ultimate concern with poverty mapping is to understand how different methodological choices affect poverty alleviation. Figure 8 thus compares the machine-learning and traditional approaches in terms of their ability to reduce poverty in the context of our targeting simulations. In these simulations, a hypothetical poverty-alleviation program targeted on the basis of true municipality-level poverty rates is able to reduce the overall poverty headcount to just under 20 percent. While the poverty rates achieved by all models fall close to this benchmark, we find that gradient boosting with geo-referenced covariates is one of the worst performing models. With the exception of the area-level model using geo-referenced covariates, the standard implementation is thus outperformed by all traditional methods, including its direct competitors. When census-based covariates are incorporated, however, gradient boosting achieves poverty rates comparable to those of the traditional methods, yet again showing that an exclusive reliance on remotely-sensed data can limit the performance of machine learning.

Figure 8: Poverty targeting. Note: The figure illustrates the poverty reduction potential of relying on each method’s estimates to rank municipalities and prioritize the poorest municipalities in the cash transfer scheme described in Section 4. The box plots show the spread of the 500 targeting simulations conducted with each method’s estimates.

The results of the targeting simulation are relevant to cases where targeting is based solely on the ranking of municipalities; there, all methods appear to do a solid job. Nevertheless, there are many instances where the targeting mechanism relies on the method’s poverty estimates themselves. For example, Senegal’s Unique National Registry (RNU, by its French acronym) of vulnerable households relies on estimates from poverty maps to determine the quotas for sampling by municipality (Ferre, 2018). It is in those cases that a method yielding unbiased and precise estimates should be employed.
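As a stylized illustration of such a simulation (not the paper’s exact transfer scheme, which Section 4 defines), one might rank municipalities by estimated poverty, treat the poorest until a fixed population share is covered, and compare the resulting headcounts; all inputs and the full-coverage assumption below are our own.

```python
import numpy as np

def post_transfer_poverty_rate(estimates, true_rates, population, coverage=0.2):
    """Stylized targeting simulation: municipalities are ranked by their
    estimated poverty rate (poorest first) and 'treated' until a fixed share
    of the national population is covered. The strong assumption that treated
    municipalities are lifted entirely out of poverty is illustrative only."""
    order = np.argsort(estimates)[::-1]                  # poorest first
    cum_pop = np.cumsum(population[order])
    treated = order[cum_pop <= coverage * population.sum()]
    poor = true_rates * population                       # poor persons per municipality
    return (poor.sum() - poor[treated].sum()) / population.sum()
```

Ranking on each method’s estimates rather than the true rates then reveals how much poverty reduction is forgone by targeting on imperfect maps.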
6 Conclusion

This paper set out to validate traditional small area estimation and machine-learning approaches to poverty mapping. Machine-learning approaches have become popular across the globe for producing highly disaggregated estimates of poverty, which often feed into policy making. The methods are particularly popular in the most data-deprived settings, where census data are collected infrequently and, even when collected, may not be contemporaneous with a household survey. Hence, the methods are also a popular alternative for cases where the census data are too old for use in a traditional unit-level small area estimation exercise. To the best of our knowledge, a rigorous validation of machine-learning methods like the one conducted here, in which machine-learning methods are compared both to traditional small area estimates and to a true welfare measure, has not been done before.

We first showed analytically that the R2 can be a misleading measure of model performance: it is biased downward when direct estimates are used as the reference measure of model performance, it is insensitive to differences in location and scale between the reference measure and the model predictions, and it is influenced by the variance of the reference measure. One practical implication is that it can be misleading to compare the performance of poverty mapping methods across different datasets, as is done in Yeh et al. (2020), Chi et al. (2022), and Aiken et al. (2022), among others. Most notably, survey design elements can affect measurement error in the direct estimates (e.g., through different sample sizes), which in turn influences the magnitude of the bias in the R2. All else equal, models applied to low-bias settings will appear to perform better than those applied to high-bias settings.

Our simulations built on these analytical results. First, we showed empirically that the direct R2 exhibits considerable downward bias, with the true R2 being roughly 35 to 50 percent higher than the direct R2. We also showed that this bias has implications for model selection, as the direct R2 commonly identified the incorrect level of estimation in our data. Second, we assessed the performance of machine-learning methods and found that they may outperform, or achieve bias and MSE similar to, the best performing traditional small area estimation approaches, particularly for out-of-sample predictions. Third, we compared the potential benefits of publicly available GIS covariates and showed that, for the data used here, GIS covariates do not outperform census aggregate data. This result suggests that there may be considerable gains from linking census and administrative data for poverty mapping in off-census years. Fourth, we expanded the validations of Corral et al. (2022) and illustrated that Fay-Herriot models remain the preferred approach for off-census years among traditional small area estimation methods. Finally, we presented evidence that, despite differences in performance across models, all models are useful for targeting. While some models may outperform others in their potential poverty reduction when their estimates are used for targeting, even the worst performing models achieve decent poverty reduction, though potentially at considerable cost.

The results found here should be caveated by the fact that they depend on the Mexican Intercensal data used. Results and model performance may differ considerably in a different country or with different data. It is quite possible that geo-referenced covariates perform better in less-developed economies than observed here, but there is no guarantee that this will be the case.
Validating the usefulness of a particular type of data is not straightforward, as it may not work equally well everywhere; it is for this reason that the comparisons here are made in a more holistic manner. When comparing the performance of geo-referenced covariates to other types of data, one should use the same method, as is done here. On the one hand, we show that gradient boosting using census aggregates at the municipality level comes very close to the best performing method’s estimates. On the other hand, the same method with GIS covariates falls short of competing with a model in which only census aggregates are used.

The paper opens considerable avenues for future research. One avenue is methodological: determining how the noise of machine-learning methods can be accurately estimated, since the noise measures presented here rely on having the true poverty rates, which is not possible in the vast majority of contexts. A second avenue concerns external validity: the results suggest that most methods would do a solid job for spatial targeting exercises, but a key question is whether this will hold elsewhere and under what circumstances one method will outperform others.

References

Aiken, E., Bellue, S., Karlan, D., Udry, C., and Blumenstock, J. E. (2022). Machine learning and phone data can improve targeting of humanitarian aid. Nature, 603:864–870.

Athey, S. (2018). The impact of machine learning on economics. In Agrawal, A., Gans, J., and Goldfarb, A., editors, The economics of artificial intelligence: An agenda, pages 507–547. University of Chicago Press, Chicago, IL.

Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 11:685–725.

Battese, G. E., Harter, R. M., and Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401):28–36.

Bedi, T., Coudouel, A., and Simler, K. (2007). More than a pretty picture: Using poverty maps to design better policies and interventions. World Bank Publications.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24.

Blumenstock, J., Cadamuro, G., and On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264):1073–1076.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

Chi, G., Fang, H., Chatterjee, S., and Blumenstock, J. E. (2022). Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences, 119(3):1–11.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.

Corral, P., Himelein, K., McGee, K., and Molina, I. (2021a). A map of the poor or a poor map? Mathematics, 9(21):2780.

Corral, P., Molina, I., Cojocaru, A., and Segovia, S. (2022). Guidelines to small area estimation for poverty mapping. The World Bank, Washington, DC.

Corral, P., Molina, I., and Nguyen, M. (2021b). Pull your small area estimates up by the bootstraps. Journal of Statistical Computation and Simulation, 91(16):3304–3357.

Elbers, C., Fujii, T., Lanjouw, P., Özler, B., and Yin, W. (2007). Poverty alleviation through geographic targeting: How much does disaggregation help? Journal of Development Economics, 83(1):198–213.
Elbers, C., Lanjouw, J. O., and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71(1):355–364.

Fay, R. E. and Herriot, R. A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a):269–277.

Ferre, C. (2018). Sénégal - Résultats de l’enquête de mise à jour du RNU: Registre National Unique. World Bank Group, Washington, DC.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232.

González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., and Santamaría, L. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78(5):443–462.

Grosh, M. E. and Muñoz, J. (1996). A manual for planning and implementing the Living Standards Measurement Study Survey. The World Bank, Washington, DC.

Gujarati, D. N. (2003). Basic Econometrics. McGraw-Hill, New York, NY.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer, New York, NY.

Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics, 9(2):226–252.

Hentschel, J., Lanjouw, J. O., Lanjouw, P., and Poggi, J. (1998). Combining census and survey data to study spatial dimensions of poverty. World Bank Policy Research Working Paper No. 1928.

Hersh, J., Engstrom, R., and Mann, M. (2021). Open data for algorithms: Mapping poverty in Belize using open satellite derived features and machine learning. Information Technology for Development, 27(2):263–292.

Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., and Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794.

Lange, S., Pape, U. J., and Pütz, P. (2018). Small area estimation of poverty under structural change. World Bank Policy Research Working Paper No. 8472.

Lee, K. and Braithwaite, J. (2020). High-resolution poverty maps in sub-Saharan Africa. arXiv preprint arXiv:2009.00544.

Majeske, K. D., Lynch-Caris, T., and Brelin-Fornari, J. (2010). Quantifying R2 bias in the presence of measurement error. Journal of Applied Statistics, 37(4):667–677.

Masaki, T., Newhouse, D., Silwal, A. R., Bedada, A., and Engstrom, R. (2020). Small area estimation of non-monetary poverty with geospatial data. World Bank Policy Research Working Paper No. 9383.

Molina, I. (2019). Desagregación de datos en encuestas de hogares: Metodologías de estimación en áreas pequeñas.

Molina, I., Corral, P., and Nguyen, M. (2022). Estimation of poverty and inequality in small areas: Review and discussion. TEST, pages 1–24.

Molina, I. and Morales, D. (2009). Small area estimation of poverty indicators. Estadistica e Investigacion Operativa, 25(3).

Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3):369–385.

Natekin, A. and Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7(21):1–21.

Nguyen, V. C. (2012). A method to update poverty maps. The Journal of Development Studies, 48(12):1844–1863.

Pfeffermann, D. (2013). New important developments in small area estimation. Statistical Science, 28(1):40–68.

Pokhriyal, N. and Jacques, D. C. (2017). Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences, 114(46):E9783–E9792.
Rao, J. and Molina, I. (2015). Small area estimation. John Wiley & Sons, Hoboken, NJ, 2nd edition.

Smythe, I. S. and Blumenstock, J. E. (2022). Geographic microtargeting of social assistance with high-resolution poverty maps. Proceedings of the National Academy of Sciences, 119(32):e2120025119.

Steele, J. E., Sundsøy, P. R., Pezzulo, C., Alegana, V. A., Bird, T. J., Blumenstock, J., Bjelland, J., Engø-Monsen, K., de Montjoye, Y.-A., Iqbal, A. M., Hadiuzzaman, K. N., Lu, X., Wetter, E., Tatem, A. J., and Bengtsson, L. (2017). Mapping poverty using mobile phone and satellite data. Journal of The Royal Society Interface, 14(127):20160690.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Torabi, M. and Rao, J. (2014). On small area estimation under a sub-area level model. Journal of Multivariate Analysis, 127:36–55.

Tzavidis, N., Zhang, L.-C., Luna, A., Schmid, T., and Rojas-Perilla, N. (2018). From start to finish: A framework for the production of small area official statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(4):927–979.

van der Weide, R., Blankespoor, B., Elbers, C., and Lanjouw, P. (2022). How accurate is a poverty map based on remote sensing data? World Bank Policy Research Working Paper No. 10171.

Varian, H. R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives, 28(2):3–28.

Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., Ermon, S., and Burke, M. (2020). Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11(1):1–11.