Policy Research Working Paper 10512

Small Area Estimation of Poverty and Wealth Using Geospatial Data: What Have We Learned So Far?

David Newhouse

Development Economics, Development Data Group
June 2023

Abstract

This paper offers a nontechnical review of selected applications that combine survey and geospatial data to generate small area estimates of wealth or poverty. Publicly available data from satellites and phones predict poverty and wealth accurately across space when evaluated against census data, and their use in model-based estimates improves the accuracy and efficiency of direct survey estimates. Although the evidence is scant, models based on interpretable features appear to predict at least as well as estimates derived from convolutional neural networks. Estimates for sampled areas are significantly more accurate than those for non-sampled areas due to informative sampling. In general, estimates benefit from using geospatial data at the most disaggregated level possible. Tree-based machine learning methods appear to generate more accurate estimates than linear mixed models. Small area estimates using geospatial data can improve the design of social assistance programs, particularly when the existing targeting system is poorly designed.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The author may be contacted at dnewhouse@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished.
The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Small Area Estimation of Poverty and Wealth Using Geospatial Data: What Have We Learned So Far?

David Newhouse (World Bank Group)

JEL codes: C53, I32. Keywords: poverty, small area estimation, poverty mapping, satellite data, machine learning

We thank Partha Lahiri for his encouragement to write this article; William Bell, Chris Elbers, Carolina Franco, and Josh Merfeld for helpful comments on a previous draft; participants at the 2022 Small Area Estimation conference at the University of Maryland College Park; and Haishan Fu and Keith Garrett for their support and encouragement.

1. Introduction

Using geospatial data as auxiliary data for small area estimation is an old idea. Proof of concept was initially demonstrated thirty-five years ago by Battese, Harter, and Fuller (1988), who combined survey data with early imagery from the Landsat satellite to predict the area under corn and soybean production in 11 counties in Iowa. That paper is widely cited in the small area estimation literature, with nearly 1,100 citations on Google Scholar as of May 2023. But the paper is better known for another seminal contribution, as it was the first to develop and apply the well-known nested-error unit-level model, with a conditional random effect specified at the target area level, to estimate means for small areas.
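In the notation commonly used in this literature, the nested-error unit-level model of Battese, Harter, and Fuller (1988) can be written as:

```latex
% Nested-error unit-level model (Battese, Harter, and Fuller, 1988)
y_{ij} = x_{ij}'\beta + u_i + e_{ij},
\qquad u_i \sim N(0, \sigma_u^2),
\qquad e_{ij} \sim N(0, \sigma_e^2),
```

where i indexes target areas, j indexes units (households or, in the original application, farms) within area i, u_i is the conditional random effect specified at the target area level, and e_ij is the unit-level error.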
From 1988 to about 2015, economists and statisticians devoted considerable effort to refining this model in various ways, with a particularly important innovation introduced by Molina and Rao (2010) to estimate indicators other than means, such as poverty headcount rates, using simulation techniques. In the meantime, the publication of Elbers, Lanjouw, and Lanjouw (2003), which used a slightly different unit-level model, popularized the use of small area estimation at the World Bank. Nonetheless, until relatively recently virtually all applications during this time used census or other administrative data as auxiliary data, ignoring geospatial data as a potential source of auxiliary data from which surveys could “borrow strength” to improve the measurement of socioeconomic outcomes. Geospatial data were rediscovered as a potential source of auxiliary data in the mid-2010s, as advances in computing power and storage enabled geospatial data to become publicly available at a wide scale; as surveys began to be regularly implemented on tablets that collect geocoordinates; and as a new generation of data scientists, economists, and statisticians discovered the potential of geospatial data to improve socioeconomic measurement. This, in turn, sparked interest in combining survey and satellite indicators for the purposes of small area estimation. Using appropriate methods for this type of “data fusion” is important because small area poverty estimates have implications for the targeting and evaluation of public interventions and can shed light on economic geography more generally. At the same time, in part because of recent advances in machine learning algorithms, different disciplines and authors have taken very different methodological approaches to combining geospatial data and survey data for the purposes of small area estimation. This paper provides a nontechnical review of selected evidence from this relatively new literature.
It builds on two recent reviews (Burke, 2021; McBride et al., 2022) but focuses exclusively on the small area estimation of wealth and poverty, devoting particular attention to differences in statistical methodology across studies. In particular, it ignores some of the excellent recent work on agricultural crops and yields (Lobell et al., 2020; Erciulescu et al., 2019), labor (Merfeld et al., 2022), and other indicators. There is now a robust literature documenting that estimates of wealth and poverty derived from survey and geospatial data are correlated with benchmarks derived from surveys or censuses. The strength of these correlations varies widely and depends on a myriad of factors, including the country context, the method used for prediction, the target area for prediction, the exact indicator being predicted, the choice of geospatial variables, and the nature of the training and evaluation data. Because the literature is relatively new, no consensus has yet emerged around the optimal prediction method in different contexts. Furthermore, comparisons of alternative prediction methods in the same geographic context remain rare, and some of the few examples of these comparisons have not yet been published in peer-reviewed journals. Therefore, most of the evidence presented below on comparisons across alternative models should be interpreted as tentative priors based on limited evidence from specific contexts. This review is divided into three main sections. The first section begins by very briefly describing some of the many publicly available geospatial indicators. It then reviews selected studies from a rapidly growing literature evaluating the accuracy of small area estimates of wealth and poverty using geospatial data, documenting strong correlations across several studies when compared with census-based estimates.
I then briefly touch on three related issues: the sensitivity of accuracy to the nature of the training data; the more limited ability of geospatial data to predict variation in welfare across time than across space; and the important distinction between sampled and non-sampled target areas when considering the accuracy of estimates. The second section focuses on comparisons between different types of statistical methods for cross-sectional predictions, including the nature of the geospatial features and the different types of models used for prediction. The third section briefly discusses an important recent paper describing how survey and geospatial data were combined to target poor households in Togo (Aiken et al., 2022). The final section concludes with a summary of key points and suggestions for further research.

2. Small area estimates of poverty and wealth with geospatial data

a. What types of geospatial features are publicly available?

Geospatial data are typically obtained from satellites, mobile phones, or internet activity. Satellite indicators have a few key advantages over mobile phone and internet data, including the public availability of a large number of indicators, in many cases derived from publicly available imagery provided by the Sentinel-2 and Landsat satellites. Proprietary high-resolution satellite imagery, from companies such as Maxar, Planet, Airbus, and others, can also be used either directly as an input into deep learning models or to derive interpretable features such as building footprints, roads, and vehicles. Unlike call detail records, satellite-based indicators typically cover the entire country and therefore avoid selection bias. Call Detail Records (CDR) from mobile phones, in addition to only representing mobile phone users, are also more difficult to obtain for privacy reasons.
However, CDR can in some contexts provide more informative indicators, such as location information, cell phone behavior, connection quality, and device type. Internet records such as Twitter usage can also be informative (Tonneau et al., 2022). Information from online platforms also suffers from selection bias, however, since only a portion of the population uses them in developing countries, and it is difficult to estimate the extent to which this source of bias affects estimates. A rich variety of geospatial indicators derived from satellite imagery have become publicly available and can be found in Google Earth Engine, Microsoft Planetary Computer, and other freely accessible websites. These offer access to several climate-related variables as well as a host of predictive features such as night-time lights, land classification, year of switch from pervious to impervious surface, estimates of net primary production, cell tower placement, a wide variety of climate and temperature variables, pollution estimates from the Sentinel-5P satellite, a variety of soil quality measures, and countless other geospatial indicators. Meta has also publicly released the Relative Wealth Index, based on the pioneering work of Chi et al. (2021). Modeled population estimates from WorldPop, Meta, or Google are also critical inputs into small area estimation, as they are both strong predictors of welfare and essential for aggregating predictions to higher administrative levels. Information on building footprints is also valuable when it can be obtained. WorldPop has made statistical information on building footprints available for much of Africa (Dooley et al., 2020); these are derived by Ecopia using Maxar imagery. The Microsoft Planetary Computer also now contains building footprint data for a variety of countries, including most of Europe and the Americas, and parts of Africa and Southeast Asia.
Google recently released a new version of its Open Buildings layer covering Africa and Southeast Asia, and the German Aerospace Center recently released the World Settlement Footprint global database of 3-D building footprints (Esch et al., 2023). Liu et al. (2023) recently showed that building footprints can be modeled accurately using Sentinel-1 and Sentinel-2 imagery, but the resulting indicator data have not yet been publicly released. Dynamic information on building footprints should become increasingly available in the near future. In addition, a variety of data pertaining to agriculture and food security are posted online through FAO's Hand-in-Hand geospatial platform, which contains information on food security, crops, and vegetation. Recent subnational estimates of crop type and yield are only available for a few countries at this time, but coverage will likely expand significantly in the coming years. Overall, an impressive amount of geospatial imagery and indicators is already publicly available, and more should come online in the next few years.

b. Geospatial data predict poverty and wealth accurately across space

Several studies have examined how predictions of wealth or poverty derived from linking survey and geospatial data compare with either survey- or census-based measures of poverty and welfare. Accuracy is often assessed using R², defined as:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²    (1)

where i is the target area, yᵢ is the reference measure of poverty or welfare for target area i, ŷᵢ is the predicted value for target area i, and ȳ is the mean across target areas. Some studies instead report the Pearson correlation between the predicted and reference measure, which can be squared to obtain R². Table 1 lists the actual or implied R² values reported by several studies. An important early paper on how geospatial data predict poverty and wealth was Jean et al. (2016).
This paper used “deep learning” in the form of a convolutional neural network (CNN) to predict welfare, using daytime imagery taken from Google Earth and the luminosity of night-time lights in several countries in Sub-Saharan Africa. Each layer of the CNN successively filters the original image into more condensed abstract features, until the final layer represents the predicted luminosity value. Jean et al. (2016) transfer the features from the penultimate layer of the CNN to a ridge regression that estimates the value of an asset index or per capita consumption in withheld villages. In Jean et al. (2016), the target areas are survey clusters, the reference measure yᵢ is the average value of a wealth index or per capita consumption for cluster i, taken from the household survey and withheld from the training sample, and ŷᵢ are predictions generated from convolutional neural network models. Out-of-sample R² is assessed through survey cross-validation and varied from 0.37 to 0.55 for per capita consumption, and from 0.55 to 0.75 for asset wealth. However, because the model was trained on night-time lights, accuracy declined precipitously when the model attempted to predict within the lower portion of the per capita consumption distribution. In the African countries considered, most poor households live in dark rural areas, and a model trained to predict only night-time lights cannot distinguish welfare levels among them.

Babenko et al. (2017) improved upon this method by using daytime imagery to train a CNN model directly using survey data on per capita income. They trained the CNN model to predict the share of the population in extreme and moderate poverty in different Áreas Geoestadísticas Básicas (AGEBs), small areas analogous to a census block, based on per capita income data collected in the 2014 MCS-ENIGH household survey. The prediction of AGEB-level poverty rates achieved an R² of 0.47 when compared with survey estimates from withheld AGEBs. Interestingly, this is within the range reported by Jean et al. (2016) for per capita consumption, despite the difference in welfare measure (income vs. consumption) and country context. A prediction model based solely on land-cover classification predicted equally well, however, and models that used both predicted poverty (from the CNN) and the land-cover classification achieved an R² of 0.57. This suggests that the CNN-based estimates of poverty in this case did not capture all of the imagery features correlated with average household per capita income. We further touch on the differences between interpretable and CNN-derived features below.

Table 1: Comparison of accuracy across different sources

Country | Target area | Indicator | Survey data | Source of validation data | Estimation method* | R² against validation data | Source
Bangladesh | Upazilla | Predicted consumption | 2014 | Census-based ELL estimates | BSEM | 0.95 | Steele et al. (2017)
Burkina Faso | Commune | Predicted poverty | 2018 EHCVM | Census-based EBP estimates | EBP | 0.63 | Edochie et al. (forthcoming)
Madagascar | Commune | Asset index | Census | Design-based simulation using census | XGB | 0.80 | Merfeld and Newhouse (2023)
Malawi | Traditional Authority | Poverty rates | 2019 IHS | Census-based EBP estimates | BSEM | 0.81 | Van der Weide et al. (2022)
Malawi | Traditional Authority | Asset index | Census | Design-based simulation using census | XGB | 0.79 | Merfeld and Newhouse (2023)
Malawi | Traditional Authority | Poverty rates | Census | Poverty rates based on predicted welfare in census | XGB | 0.84 | Merfeld and Newhouse (2023)
Mexico | AGEB | Per capita income | 2014 MCS-ENIGH | Census estimates | CNN | 0.47 | Babenko et al. (2017)
Mexico | Municipality | Per capita income | 2014 MCS-ENIGH | EBP estimates using intercensus | EBP | 0.74 in-sample; 0.49 out-of-sample | Newhouse et al. (2022)
Mexico | Municipality | Per capita labor income | 2015 intercensus | Design-based simulation using intercensus | EBP | 0.86 in-sample; 0.64 out-of-sample | Newhouse et al. (2022)
Mozambique | Locality | Asset index | 2018 census | Design-based simulation using census | XGB | 0.85 | Merfeld and Newhouse (2023)
Senegal | Commune | Non-monetary poverty | Census | Cross-validation | GPR | 0.83 | Pokhriyal and Jacques (2018)
Sri Lanka | DS Division | Predicted consumption | HIES 2012 | ELL estimates from census | OLS | 0.61 | Engstrom et al. (2022)
Sri Lanka | DS Division | Predicted consumption | HIES 2012 | ELL estimates from census | CNN | 0.39 | Engstrom et al. (2022)
Sri Lanka | DS Division | Non-monetary poverty | 2012 census | Census | EBP | 0.77 | Masaki et al. (2022)
Sri Lanka | DS Division | Asset index | Census | Design-based simulation using census | XGB | 0.83 | Merfeld and Newhouse (2023)
Sub-Saharan Africa | Village | Asset index | DHS | Cross-validation | CNN | 0.70 | Yeh et al. (2020)
Sub-Saharan Africa | Village | Consumption | LSMS | Cross-validation | CNNTL | 0.37 to 0.55 | Jean et al. (2016)
Sub-Saharan Africa | Village | Asset index | DHS | Cross-validation | CNNTL | 0.55 to 0.75 | Jean et al. (2016)
Sub-Saharan Africa | Village | Asset index | DHS | Cross-validation | XGB | 0.56 | Chi et al. (2021)
Tanzania | District | Non-monetary poverty | 2018 | Census | EBP | 0.77 | Masaki et al. (2022)
Northeastern Tanzania | District | Non-monetary poverty | Simulated census sample | Census | EBP | 0.96 | Masaki et al. (2022)
Togo | Canton | Asset index | Pooled DHS | Independent survey | XGB | 0.84 | Chi et al. (2021)

Note: BSEM = Bayesian Structural Equation Model; EBP = Empirical Best Predictor; GPR = Gaussian Process Regression; OLS = Ordinary Least Squares; CNN = Convolutional Neural Network; CNNTL = Convolutional Neural Network with Transfer Learning; XGB = XGBoost.

Newhouse et al.
(2022) use the CNN poverty estimates and land cover classification from Babenko et al. (2017) as inputs into Empirical Best Predictor (EBP) models to predict poverty at the municipality level. The EBP model provides a simple framework for combining the two features in a linear mixed model, in addition to offering a well-established parametric bootstrap method for estimating uncertainty (González-Manteiga et al., 2008). Official estimates for municipalities developed by the government based on the household-level intercensus were used as the reference for comparison. The R² of the estimates was 0.74 for sampled municipalities, but only 0.49 for non-sampled municipalities. Because this is only based on one sample, the paper also performed design-based simulations using a measure of per capita labor income taken from the intercensus. In the simulations, the R² was 0.86 for sampled areas and 0.64 for out-of-sample areas. I further discuss the difference in accuracy between sampled and non-sampled areas below. Steele et al. (2017) predict a wealth index and per capita expenditure in Bangladesh using a hierarchical Bayes structural equation model, allowing for spatial covariance. The paper differs from many others by also incorporating call detail record (CDR) features from mobile phones in addition to satellite features. Results were validated both using cross-validation and using poverty estimates derived from the 2012 census. The results showed that it is much easier to predict wealth than per capita consumption, a finding consistent with Jean et al. (2016). When using cross-validation for evaluation, out-of-sample R² for village-level estimates was 0.76 for wealth as opposed to 0.36 for consumption. However, when comparing Upazilla-level (sub-district level) estimates with previous estimates derived using traditional small area estimates from the 2010 census, R² was a much higher 0.95.
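The accuracy metrics reported in these validation exercises are either the R² of equation (1) or a squared Pearson correlation between predictions and the reference measure. A minimal sketch of both, using made-up numbers rather than data from any of the cited papers:

```python
# Illustrative computation of the two accuracy metrics used in the
# validation exercises above: R-squared as defined in equation (1),
# and the squared Pearson correlation. All numbers are hypothetical.

def r_squared(y, y_hat):
    """R^2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

def pearson_squared(y, y_hat):
    """Squared Pearson correlation between reference values and predictions."""
    n = len(y)
    my, mf = sum(y) / n, sum(y_hat) / n
    cov = sum((yi - my) * (fi - mf) for yi, fi in zip(y, y_hat))
    vy = sum((yi - my) ** 2 for yi in y)
    vf = sum((fi - mf) ** 2 for fi in y_hat)
    return cov ** 2 / (vy * vf)

# Hypothetical target-area poverty rates (reference) and model predictions.
y     = [0.10, 0.25, 0.40, 0.55, 0.70]
y_hat = [0.12, 0.22, 0.45, 0.50, 0.66]

print(round(r_squared(y, y_hat), 3))
print(round(pearson_squared(y, y_hat), 3))
```

The two metrics are not interchangeable: the squared Pearson correlation implicitly allows predictions to be rescaled and recentered, so it is always at least as large as the R² of equation (1) and flatters predictors with systematic level or scale bias.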
Similarly, Pokhriyal and Jacques (2017) combine CDR and satellite data with census data to predict non-monetary poverty across communes in Senegal. They use Gaussian process regression, a non-parametric machine learning method, and evaluate their estimates using 10-fold cross-validation. The estimates for non-monetary poverty across communes achieved an out-of-sample R² of 0.83 and a rank correlation of 0.87. Chi et al. (2021) used several Demographic and Health Surveys (DHS) and a mix of publicly available and proprietary geospatial data to predict an asset index for 2.4 km grid cells across 135 countries. The authors trained the model on the asset index available in the DHS, using data for 56 countries. Proprietary predictors include internet connectivity information obtained from Meta. In validation exercises, R² varied greatly, depending on the context, level of geographic disaggregation, and comparison indicator. When validated using cross-validation across enumeration areas in the survey data, the R² of the estimates was 0.56, similar to the 0.6 value reported by Yeh et al. (2020) for Africa. Meanwhile, when validated against independent wealth measures, R² was 0.60 across rural Kenyan villages, 0.70 across Nigerian Local Government Areas, and 0.84 across Togolese cantons. But when validated against the predicted probability of being poor, or predicted per capita consumption or income, R² values are much lower: 0.04 across Malawian villages, 0.17 in rural Kenya, and approximately 0.3 across Mexican municipalities (Gualavisi and Newhouse, 2022; Chi et al., 2021; Newhouse et al., 2022). This is due to key differences between wealth and predicted per capita consumption, including the fact that the latter is expressed in per capita terms. Masaki et al. (2022) also consider the prediction of non-monetary poverty in Tanzania and Sri Lanka.
Their study used census data from both countries to construct a non-monetary welfare index, and classified households whose index fell below a percentile threshold roughly equal to the prevailing national poverty rate as non-monetarily poor. The analysis combines survey-based estimates with publicly available geospatial indicators using an Empirical Best Predictor model following Molina and Rao (2010). Relative to direct survey estimates, the correlation with the census rose from 0.72 to 0.88 in Sri Lanka and from 0.77 to 0.88 in Tanzania. Unlike previous papers integrating survey and geospatial data, this one estimates the efficiency gain as well as the gain in accuracy due to incorporating geospatial data and finds that it is roughly equivalent to expanding the size of the sample by a factor of between three and five. Van der Weide et al. (2022) generate small area estimates of monetary poverty in Malawi for Traditional Authorities by combining survey data with publicly available geospatial features. Like Steele et al. (2017), their study uses a Bayesian structural equation model that accounts for spatial correlation across areas and validates the predictions against census-based estimates. It finds a correlation between the geospatial-based and census estimates above 0.9, although some individual target areas show substantial discrepancies between the census and geospatial-based estimates due to differences in the data used for prediction. Krennmair and Schmid (2022) propose a mixed effects random forest model, which is tested in a design-based simulation using household-level covariates in Mexico. The results demonstrate the benefits of applying machine learning methods over traditional linear models. Design-based simulations from the Mexican state of Nuevo León indicate that this approach reduces median relative bias by about 20 percent relative to the more typical approach of applying an empirical best predictor (EBP) model with a transformation.
The paper also evaluates a random effect block residual bootstrap approach to estimating uncertainty and finds that it performs well. Merfeld and Newhouse (2023) evaluate small area estimates of an asset index for four countries: Madagascar, Malawi, Mozambique, and Sri Lanka. In addition, the paper evaluates small area estimates of poverty for Malawi obtained using publicly available geospatial auxiliary data. This study compares linear EBP models with three different types of machine learning models: extreme gradient boosting, boosted regression forests, and Cubist regression. Unlike Krennmair and Schmid (2022), the machine learning models do not include a conditional random effect. Despite that, the results indicate that all three machine learning methods generate substantially more accurate estimates than the linear EBP model, particularly out of sample. Of the machine learning methods, boosted regression forests and extreme gradient boosting perform equally well except in Sri Lanka, where extreme gradient boosting produces slightly more accurate estimates. The random effect block residual bootstrap approach proposed by Krennmair and Schmid (2022) also works well when applied to gradient boosting, providing coverage rates ranging from 94 to 97 percent. Finally, Lee and Braithwaite (2022) propose an innovative iterative process that combines tree-based machine learning and deep learning to generate wealth estimates. This paper first trains a model using extreme gradient boosting to predict the probability of being in different wealth classes using interpretable geospatial features, and then uses the predicted values from this procedure to train a convolutional neural network on satellite imagery. The predicted probabilities of being in different wealth classes from the convolutional neural network are subsequently fed back into the boosting model. This process is repeated iteratively until convergence. This improves upon the methodology in Jean et al.
(2016) by avoiding the use of night-time lights to train models. The authors report that the average out-of-sample R² from withheld countries is 0.90. In general, several studies suggest that small area estimates generated by combining survey and geospatial data are more accurate than those based solely on survey data, sometimes by significant margins. This is notable because small area estimates based on geospatial data are subject to model bias; for example, a model that uses night-time lights as a predictor may underestimate poverty in a poor area that happens to contain a highway, if the high level of night-time lights associated with highways makes the area look less poor from the sky than it actually is. However, at least when predicting poverty rates at higher levels such as subdistricts, the evidence so far indicates that model-based estimates based on geospatial indicators are more accurate than direct survey estimates. This implies that the benefits of reducing sampling error by supplementing survey data with model-based predictions derived from geospatial indicators outweigh the introduction of model bias.

c. Geospatial data are a second-best option when recent census data are unavailable

Although geospatial data are strongly correlated with welfare across space, recent census data remain the gold standard for auxiliary data for small area estimation. Unfortunately, in many cases census data are old or unavailable, which creates two problems. First, because survey and census data cannot typically be linked at the household level to preserve confidentiality, it is standard when estimating a unit-level model to assume that the predictors follow the same distribution in the survey and the census data. This assumption becomes less tenable as the temporal gap between the census and survey increases.
If the census and survey distributions are sufficiently different, estimating a linked model with primary-sampling-unit-level (PSU) aggregates from the census is preferable to assuming a common distribution (Lange, Pape, and Putz, 2018). A second and more important problem is that old census data, even if they are linked directly to the survey data, cannot reflect any changes affecting wealth or poverty that occurred since the census. Current geospatial data may be more likely to reflect current conditions than old census data, especially when they include geospatial indicators, such as precipitation and vegetation, that better predict short-run shocks. When it comes to census data, how old is too old? Or, put another way, at what age do census-based predictions become less accurate than current geospatial predictions? It is difficult to know, but Newhouse et al. (2022) offer one small piece of evidence on this point. When evaluated against Mexican 2015 small area poverty estimates based on the intercensus, 2010 census-based estimates are more accurate than 2015 estimates based on geospatial indicators (correlation of 0.91 vs. 0.86). However, this is only representative of one context, and regional patterns of poverty in Mexico may have been more static during this time than in other contexts.

d. Prediction accuracy is very sensitive to the training data

Many of the papers discussed above predict poverty rates or average asset indices estimated at the village level. Since these are often derived from surveys with a limited number of observations per village, this raises the issue of noise in the dependent variable. In fact, correlations between predicted values and census-based estimates depend critically on the extent of noise in the training data, which will reduce measured accuracy. For example, Engstrom et al.
(2022) considered how the accuracy of predictions depends on the size and nature of the sample used to estimate average per capita consumption at the GN division level. That analysis correlated interpretable geospatial features with predicted per capita consumption imputed into a census. Model R² fell from 0.61 when using the mean over all census households, to 0.55 when using the mean over thirty households, and further to 0.40 when taking the mean over 8 households per enumeration area. Differences in the extent of noise present in the training data, as well as the reference evaluation measure, will not necessarily affect the ranking of different types of models within the same context. But they explain much of the wide variation in R² observed in Table 1 across different studies, which underscores the benefits of evaluation studies that compare different methods in the same context, using the same reference measure. Gualavisi and Newhouse (2022) offer another stark example of how sensitive predictive accuracy is to the source of training data. Using a census extract from 10 districts in Malawi, the analysis compared estimates of average village welfare imputed into a household census with estimates derived from combining a survey with publicly available geospatial indicators. However, it also considers a third option, which involves hypothetically supplementing the survey with a partial registry, a “microcensus” that interviews all households in a randomly selected 450 of the 4,500 villages with geolocated data. This involves a two-step approach, where welfare is first predicted into the partial registry and then a geospatial model is trained against the partial registry predictions. Using a partial registry in this way yields an R² of 0.35, as opposed to 0.01 for the geospatial poverty map based on survey data and 0.02 for the wealth estimates from Chi et al. (2021). These R² values are much lower than those cited in the previous section.
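The attenuation of measured accuracy by noise in the training or reference data can be illustrated with a small simulation. In the sketch below (all data simulated, so only the qualitative pattern matters), even a predictor that recovers true village welfare perfectly appears less accurate when it is scored against survey means computed from a handful of households per village:

```python
# Simulation of how noise in the training/reference measure attenuates
# measured R^2. A "perfect" predictor of true village welfare is scored
# against survey means computed from n households per village.
import random

random.seed(0)

def measured_r2(n_households, n_villages=500):
    truth, reference = [], []
    for _ in range(n_villages):
        mu = random.gauss(0.0, 1.0)             # true village welfare
        obs = [mu + random.gauss(0.0, 1.0)      # household-level noise
               for _ in range(n_households)]
        truth.append(mu)                        # the (perfect) prediction
        reference.append(sum(obs) / n_households)
    # Squared Pearson correlation between prediction and noisy reference.
    n = n_villages
    mt, mr = sum(truth) / n, sum(reference) / n
    cov = sum((t - mt) * (r - mr) for t, r in zip(truth, reference))
    vt = sum((t - mt) ** 2 for t in truth)
    vr = sum((r - mr) ** 2 for r in reference)
    return cov ** 2 / (vt * vr)

r2_8, r2_30, r2_200 = measured_r2(8), measured_r2(30), measured_r2(200)
print(r2_8, r2_30, r2_200)   # measured accuracy rises with households per village
```

With unit-variance household noise, the expected measured R² is roughly 1/(1 + 1/n), so scoring against means of 8 households caps measured accuracy near 0.9 even for a perfect model, mirroring the pattern in Engstrom et al. (2022).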
The weak correlation between the Chi et al. (2021) estimates and these census-based predictions reflects the challenge of distinguishing between village welfare levels in this context. In particular, the sample consists of 4,500 villages, in 10 poor Malawian districts, for which names could be matched between the census and the Unified Beneficiary Registry data containing household geocoordinates. In this context, the Chi et al. (2021) estimates may perform poorly because they come from a model trained to predict wealth, while the benchmark reference indicator is per capita consumption; these are conceptually different measures of welfare that tend to diverge more in rural areas and among the extreme poor (Ngo and Christiaensen, 2019). For the standard geospatial poverty map based on the survey, the low R2 is also due to the paucity of survey data, as there are only 16 households per enumeration area in the survey sample. Besides the disappointing performance of the standard geospatial poverty estimates and the Meta wealth index estimates in this challenging context, this exercise also illustrates how much including additional data from the partial registry improves predictive performance. The partial registry effectively adds valuable information to the training data when using geospatial data for small area estimation. This enables the development of a much more accurate prediction model, using proxy welfare variables that are cheaper and easier to collect. The importance of training data raises the question of whether estimating different models for different geographies, such as urban and rural areas, improves prediction accuracy. Estimating separate models may improve the accuracy of the estimates by better accounting for heterogeneity across regions. However, these models also utilize less training data, which reduces the richness of the prediction model in the typical case when the sample is used to select or tune models.
Newhouse et al. (2022) provide some evidence on this question, comparing monetary poverty estimates in Mexico derived from a national model, separate models for urban and rural areas, and separate models for each of six state groupings. Compared to a baseline national model, estimating models separately for urban and rural areas leads to a minor improvement in accuracy, raising the correlation with census-based estimates from 0.86 to 0.87 in sampled areas and from 0.70 to 0.71 in non-sampled areas. Estimating separate models for six groups of states led to a similarly minor improvement. These findings for Mexico, however, do not necessarily generalize to other contexts, and additional evidence comparing the accuracy of models specified at different levels would be useful. Tree-based machine learning methods, such as those discussed below, provide more flexible alternatives that explicitly model interactions between predictors. Use of these models should therefore mitigate any benefit from estimating multiple models in disaggregated geographic regions or areas.

e. Out-of-sample predictions are significantly less accurate than in-sample predictions

Informative sampling occurs when sampling probabilities for primary sampling units vary in a way that is correlated with the outcome of interest, such as welfare. This is typically the case in two-stage samples in which primary sampling units are sampled with probability proportional to size, since population size is correlated with most outcomes of interest. Informative samples, if not appropriately adjusted, produce biased estimates. The standard approach to adjust for informative sampling is to weight observations by the inverse probability of selection. This is usually straightforward when estimating EBP models with household surveys containing sample weights, for example when using the R EMDI and Stata SAE software packages.
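The standard adjustment can be sketched in a few lines: each observation is weighted by the inverse of its selection probability when fitting the model. The data-generating process and probabilities below are simulated for illustration only; the EMDI and SAE packages handle this weighting internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated survey (illustrative): PSUs selected with probability proportional
# to size, where size is correlated with the outcome (informative sampling).
n = 1000
size = rng.uniform(100, 1000, size=n)        # PSU population size
p_select = size / size.sum() * 200           # selection probability (illustrative scale)
x = rng.normal(size=(n, 2))
y = x @ np.array([1.0, -0.5]) + 0.001 * size + rng.normal(size=n)

# Inverse-probability weights offset the over-representation of large PSUs
# when estimating the model parameters.
weights = 1.0 / p_select
model = LinearRegression().fit(x, y, sample_weight=weights)
```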
This protects against bias from informative sampling within sampled areas, but not against bias in predictions for areas that are not included in the sample (Pfeffermann and Sverchkov, 2009). The bias in estimates for non-sampled areas can be severe. Table 2 shows evidence on the extent to which informative sampling leads to less accurate predictions in non-sampled areas. The extent of this bias appears to depend on the country context and the nature of the sample. In the examples reported in Table 2, the estimates in Burkina Faso are a particular outlier, with an out-of-sample R2 of 0.21. It is difficult to know exactly why the out-of-sample estimates for Burkina Faso are so inaccurate, but this likely reflects large differences in the relationship between the covariates and the outcome across sampled and non-sampled areas. In addition, the source of validation data in that case is model-based EBP estimates derived from the census, which are also subject to bias due to informative sampling when predicting out of sample.

Table 2: Predictions using geospatial data are much less accurate in non-sampled areas

| Country | Indicator | Source of survey data | Source of validation data | R2 for sampled areas | R2 for non-sampled areas |
|---|---|---|---|---|---|
| Burkina Faso | Poverty | 2018 EHCVM | Census-based EBP estimates | 0.76 | 0.21 |
| Madagascar | Wealth | 2018 census (simulated samples) | Census | 0.83 | 0.62 |
| Mexico | Poverty | 2014 MCS-ENIGH | EBP estimates using Intercensus | 0.74 | 0.49 |
| Mexico | Poverty | 2015 Intracensus | Design-based simulation using Intercensus | 0.88 | 0.64 |
| Malawi | Wealth | 2018 census (simulated samples) | Census | 0.79 | 0.53 |
| Malawi | Poverty | 2018 census (simulated samples) | Derived from census-based predictions | 0.76 | 0.64 |
| Mozambique | Wealth | 2019 census (simulated samples) | Census | 0.85 | 0.71 |
| Sri Lanka | Wealth | 2012 census (simulated samples) | Census | 0.90 | 0.80 |

Results from household-level EBP models using geospatial auxiliary data. Sources: Edochie et al. (forthcoming), Newhouse et al.
(2022), Merfeld and Newhouse (2023). Pfeffermann and Sverchkov (2007) proposed a bias correction for out-of-sample areas when the probability of selection of an area into the sample is known, but first-stage sampling probabilities are rarely available to analysts, and this correction has not, to our knowledge, been implemented in any small area estimation software package. Estimating a separate model with inverse probability weighting for out-of-sample areas may improve the accuracy of out-of-sample estimates, by giving greater weight to areas that were less likely to be sampled, based on observable characteristics, when estimating model parameters. As discussed in more detail below, tree-based machine learning methods are also more robust to this source of bias and therefore tend to generate more accurate out-of-sample predictions. Until either the use of tree-based machine learning or explicit bias correction becomes routine, however, predictions for out-of-sample areas based on publicly available geospatial indicators should be treated with great caution.

f. Predictions across time are much less accurate than predictions across space

In contrast to the many studies that have used satellite imagery to predict welfare levels across space, only two published studies to our knowledge have evaluated intertemporal predictions using geospatial data. Yeh et al. (2020) attempt to use daytime imagery to predict changes in wealth measured in the Demographic and Health Surveys. The CNN was only able to explain 15 to 17 percent of the estimated changes across African villages, however. When using self-reported changes in assets from the most recent survey, the figure rose to 35 percent. When aggregating up to districts, and using self-reported changes from the endline survey, imagery can explain about half of the variation in self-reported changes.
However, self-reported changes are subject to recall error and may only capture major changes in assets. Meanwhile, Khachiyan et al. (2022) look at variation across census blocks in the US, and find that a CNN trained on daytime imagery predicted half of the change in population density between 2000 and 2020, and 42 percent of the variation in income change between 2000 and 2017. Both Khachiyan et al. (2022) and Yeh et al. (2020) use convolutional neural networks for prediction. Further studies would be very useful to test the ability of different methods and data sources to predict changes over time.

g. Summary of key lessons on using geospatial data for small area prediction

Overall, the main conclusion from this nascent literature is that geospatial data are strongly predictive of geographic variation in wealth and poverty. Exactly how predictive they are varies depending on a myriad of factors. In general, wealth is easier to predict than consumption. Nonetheless, when national geospatial estimates of poverty or wealth are compared against census-based estimates for sampled areas, the R2 values appear to consistently range from about 0.74 to 0.95, as shown in Table 1. Because geospatial data are particularly predictive of population density (Leasure et al, 2020, Engstrom et al, 2020) and population density is systematically related to economic welfare (Castaneda et al, 2018), geospatial data can help “fill in” two-stage household surveys with model-based predictions, boosting efficiency and accuracy. Most of the studies that have compared geospatial estimates with direct survey estimates find that the model-based estimates are more accurate, although the comparisons are not shown here. There is also evidence that prediction accuracy is highly dependent on the strength of the training data, suggesting that partial registries that collect proxy indicators may be a valuable supplement to household survey data when publicly available geospatial data can be linked.
Finally, out-of-sample estimates are generally less accurate than in-sample estimates, and occasionally very inaccurate. This suggests that there are benefits from including as many target areas as possible in the sample, and from additional research and tools to improve out-of-sample prediction accuracy. At the same time, the early literature has utilized a dizzying number of approaches to integrating survey and geospatial data. Many of these papers train a CNN model directly on imagery, others use machine learning methods applied to specific features, while others use linear mixed models. Those that use either deep learning or tree-based machine learning often ignore, or may not properly estimate or evaluate, uncertainty, and many of the papers ignore the well-established statistical literature on small area estimation. The following section explores these issues in greater detail.

3. Statistical methods for prediction

a. Interpretable features predict at least as well as deep learning from imagery

One aspect in which the existing literature on geospatial data fusion has diverged involves the nature of the geospatial indicators used. While several studies obtain predictions directly from imagery using deep learning techniques such as CNNs, others instead generate predictions from interpretable features such as land classification types, night-time light luminosity, building density, and so on. When using interpretable features, predictions can be obtained using linear models, which may be regularized, or a tree-based machine learning algorithm such as extreme gradient boosting. It is possible to use both, as demonstrated in Lee and Braithwaite (2022), but using deep learning entails several additional costs. Deep learning models are complex and effectively a “black box” to users. Furthermore, they require specialized skills to understand and deploy, and thousands of training data points to perform well.
On the other hand, the number of EAs in survey data is typically less than five hundred. Pre-trained CNNs can help circumvent the need for more data, but little is currently understood about how the specific nature of the architecture or pre-training affects prediction accuracy or bias. A few existing studies shed light on the relative predictive power of deep learning and interpretable features. As noted above, Babenko et al. (2017) compare poverty predictions obtained from a CNN with those obtained from land classification data, and from both together. The benchmark measure of truth, derived from the census, was equally correlated with the CNN-based and land-classification-based predictions, and the correlation improved moderately when both were included. Engstrom et al. (2022) directly compare CNN-based estimates of headcount poverty rates with feature-based estimates in Sri Lanka. A variety of features were used, including roof type, shadows, cars, and road types. In that context, feature-based prediction is more accurate, with an R2 of 0.61 and a mean absolute error of 3.2 pp, as opposed to an R2 of 0.39 and a mean absolute error of 5.5 pp for the CNN-based estimates. Ayush et al. (2022) also derive several interpretable features such as trucks, maritime vessels, vehicles, and aircraft. They then compare predictions obtained from these features in a gradient boosting model with those obtained from a CNN trained on nightlights data, as in Jean et al. (2016). When predicting average per capita consumption across villages, the out-of-sample R2 of the interpretable features is 0.54, as compared with 0.41 when using the CNN trained on night-time lights. While this is a lower-bound measure of the accuracy of the CNN, since it was trained on night-time lights, it is consistent with Engstrom et al. (2022).
Finally, as noted above, Lee and Braithwaite (2022) employ an innovative approach that first uses interpretable features to predict the DHS wealth index, and then supplements that with poverty predicted from a deep learning model. This approach, however, shows limited benefits from adding direct CNN estimates to feature-based estimates for specific countries, with increases in R2 of only about 0 to 2 points across four of the five countries. The exception is South Africa, where there is an increase of about 8 points. These results also suggest that in many contexts adding deep learning estimates to existing predictions based on gradient boosting offers limited additional improvement in accuracy.

b. Household or sub-area models are usually preferable to area-level models when sub-area data are available

The pros and cons of different models for small area estimation in different contexts have been a subject of contention for many years, and there is not yet a consensus among statisticians and practitioners. For the purposes of this discussion, we assume that recent census data are unavailable, necessitating the use of geospatial data. In most cases, geospatial data are available in the form of zonal statistics at the “sub-area level”, where the sub-area is a geographic area such as a grid cell or village. A zonal statistic, for example, could be the average night-time luminosity in the village or grid cell. Linking survey and geospatial data at the level of the enumeration area or village is quite common in practice. The Demographic and Health Surveys publicly release jittered geocoordinates for each EA in most cases, facilitating this type of linking. Meanwhile, the target area is typically a more aggregate administrative unit, such as a district or subdistrict. Finally, we define the regional level as a level above the target area for which the survey is considered to be representative. Many analysts use Bayesian modeling for small area estimation.
We focus here, however, on empirical best predictor (EBP) models (Jiang and Lahiri, 2006, Molina and Rao, 2010), rather than purely Bayesian models, for three reasons. First, when they have been compared, EBP models give similar results to hierarchical Bayesian models (Guadarrama et al, 2016). Second, national statistics offices may be more comfortable with empirical Bayesian than fully Bayesian methods, partly due to discomfort with assuming a prior distribution. Finally, multiple well-documented and user-friendly software packages implement EBP methods, such as the EMDI and SAE packages in R, and the SAE package in Stata (Kreutzmann et al, 2019, Molina and Marhuenda, 2015, Nguyen et al, 2018). We focus on EBP models because they incorporate a conditional random effect that conditions on the sample, which is effectively used as a prior estimate. This distinguishes EBP models from two other popular alternatives: M-quantile (Chambers and Tzavidis, 2006) and ELL (Elbers, Lanjouw, and Lanjouw, 2003). Including a conditional random effect is particularly important when the auxiliary data are linked to the survey at the sub-area or area level (Masaki et al, 2022). In this case, the sample contains more information relative to the auxiliary data than when using a typical household census, because the auxiliary data are the same for all households within a sub-area. This mechanically introduces correlation across households in a village, increasing the variance of the area effect. This in turn increases the weight given to the sample relative to the prediction in the empirical best predictor model when using aggregate predictors, as opposed to household-level predictors. Within the set of EBP models, three main classes of model can be used to generate area-level poverty estimates using existing publicly available software packages.
The first type of EBP model for poverty estimation is a household-level model, as follows:

(2)  T(y_isar) = x_s'β1 + x_a'β2 + d_r'β3 + u_a + e_isar,

where T(y_isar) is a transformed measure of the economic welfare of household i, typically measured as per capita income or consumption. Household i lives in sub-area s, target area a, and region r. x_s is a vector of predictors aggregated to the sub-area level, so that they are constant within a particular sub-area. x_a is a vector of predictors aggregated to the area level. d_r is a vector of state dummies. u_a is a conditional random effect specified at the area level, and e_isar is a stochastic error term. In practice, u_a and e_isar are typically assumed to be normal. While it is possible to relax the assumption of normally distributed random effects (Diallo and Rao, 2018), we know of no software package that implements empirical best predictor models with non-normal error terms. The assumption that the stochastic error terms are distributed normally necessitates transforming the dependent variable (Tzavidis et al, 2018). For household models, a log functional form has traditionally been used, dating back to Elbers, Lanjouw, and Lanjouw (2003). But recently a number of adaptive transformations that select a parameter to best fit the data have become increasingly popular, following their implementation in the publicly available EMDI software package. Examples of such adaptive transformations include the log-shift and Box-Cox transformations, in which a transformation parameter is selected through restricted maximum likelihood. Masaki et al. (2022) and Newhouse et al. (2022) take a different approach, employing a rank order transformation that forces the dependent variable to follow a normal distribution, following Peterson and Cavanaugh (2019). While this does not guarantee that the residuals are normal, it brings them far closer to normality in those contexts.
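The idea behind the rank order transformation can be sketched in a few lines: ranks are mapped through the standard normal quantile function, forcing the transformed variable to be approximately standard normal. This is a simplified sketch of the idea, not the EMDI implementation, and the rank offset used here is an illustrative choice.

```python
import numpy as np
from scipy.stats import rankdata, norm

def rank_order_transform(y):
    """Map ranks of y to standard normal quantiles."""
    n = len(y)
    return norm.ppf((rankdata(y) - 0.5) / n)

rng = np.random.default_rng(2)
# Heavily right-skewed variable, like per capita consumption
welfare = np.exp(rng.normal(size=5000))
z = rank_order_transform(welfare)
# z is approximately standard normal regardless of the original distribution
```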
This transformation can be reversed under additional assumptions, although a back-transformation is not necessary for estimating headcount poverty. The second type of model is a “sub-area model” specified at the lowest level at which the geospatial auxiliary data can be linked to the household model. This model assumes the form

(3)  p̂_s = x_s'β1 + x_a'β2 + d_r'β3 + u_a + e_s,

where p̂_s is the estimated poverty rate for each sub-area s. This is effectively a unit-level model with the sub-area as the unit. Because the dependent variable is a proportion, it can be transformed using the arcsin transformation, which is part of the EMDI and EMDIplus software packages. The final option is to specify the model at the area level, which is the target area, following the long and distinguished literature spawned by Fay and Herriot (1979):

(4)  p̂_a = x_a'β1 + u_a + e_a.

When testing the area-level models, we follow the recommendation of most software packages by specifying a simple version, namely a linear model with no variance smoothing. In particular, we obtain variance estimates for target areas by using the Horvitz-Thompson variance approximation. However, direct estimates of the variance at the target area level are imprecise, which motivates the use of model-based small area estimation in general. Smoothing these variance estimates prior to estimation would likely generate more accurate predictions (Bell, 2008, You, 2021). In addition, accounting for spatial and/or temporal correlation in a Fay-Herriot model can also improve prediction accuracy (Singh et al, 2005, Chandra, Salvati and Chambers, 2015). A variant of the Fay-Herriot model that allows for spatial autocorrelation based on Petrucci and Salvati (2006) has been implemented in the R EMDI package, while the R SAErobust package also implements the spatio-temporal correlation models proposed by Rao and Yu (1994) and Marhuenda et al (2013).
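Setting aside the EBP machinery, the core of the household-level model (2) is a linear mixed model with a random intercept for each target area. A minimal sketch on simulated data, using statsmodels (the EMDI and SAE packages implement the full EBP estimator, including the conditional random effect and MSE estimation, which this sketch omits):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulated data (illustrative): households nested in sub-areas and areas,
# with a predictor x_s that is constant within each sub-area.
n_areas, sub_per_area, hh_per_sub = 50, 5, 8
area = np.repeat(np.arange(n_areas), sub_per_area * hh_per_sub)
subarea = np.repeat(np.arange(n_areas * sub_per_area), hh_per_sub)
x_s = rng.normal(size=n_areas * sub_per_area)[subarea]   # sub-area aggregate predictor
u_a = rng.normal(scale=0.5, size=n_areas)                # area random effect
y = 1.0 + 0.8 * x_s + u_a[area] + rng.normal(size=len(area))

df = pd.DataFrame({"y": y, "x_s": x_s, "area": area})
fit = smf.mixedlm("y ~ x_s", df, groups=df["area"]).fit()
# fit.fe_params holds the estimated fixed effects (intercept and slope on x_s);
# fit.random_effects holds the predicted area effects used by EBP-type predictors.
```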
Testing area-level models with variance smoothing and spatial and temporal correlation structures against household and sub-area models that use sub-area predictors is a useful area for further research. It is impossible to develop a general rule about the relative accuracy of different models, because their relative accuracy depends on the nature of the data. This is especially true when comparing across different sources of auxiliary data. For example, census data aggregated to the target area level may be preferable to geospatial data available at the sub-area level because they are more predictive of welfare, but current geospatial data may be preferable to old census data even if the latter are available at the household level. Nonetheless, when considering a single source of auxiliary data, the household and sub-area models enjoy the important advantage of using auxiliary data at a more disaggregated level. The use of more spatially disaggregated data can lead to more accurate estimates of non-linear functions such as headcount poverty rates, though these gains in accuracy may be negligible in some contexts. An additional benefit from using more spatially disaggregated estimates is the additional precision gained by exploiting the additional variation across sub-areas within areas. Whether the benefit comes in the form of increased accuracy, precision, or both, when considering a single source of auxiliary data it is generally preferable to use a household or sub-area model rather than an area-level model when possible. Although the general principle of using the most spatially disaggregated auxiliary data possible seems straightforward, there is not yet full consensus on this point in the literature. For example, Corral et al.
(2021) argue that, when considering model-based bias, household models with aggregate variables suffer from omitted variable bias, and therefore recommend using an area-level model rather than a household-level model when the auxiliary data consist solely of sub-area or area-level means. Newhouse et al. (2022), however, show that this source of omitted variable bias is equally present in area-level and household models that use auxiliary data drawn from the population, such as census or administrative data. Furthermore, this source of model-based bias disappears when considering design-model bias, taking the expected value of the predictions prior to drawing the sample. Omitted variable bias is therefore not a relevant concern when selecting between different types of models. Corral et al. (2022) nonetheless recommend the use of an area-level model rather than a household-level model when the predictors are available at the sub-area level, largely on the basis of results from a particular model-based simulation. This model-based simulation takes a simple random sample of households from all PSUs in the population. When the model-based simulation is altered to use a more realistic two-stage sample design in which a subset of PSUs are selected, the household model with sub-area means generates more accurate predictions than the area-level model.3 Using a two-stage sample effectively increases sampling error in the direct estimates, which in turn increases the benefit of using more geographically disaggregated auxiliary data to fill in the geographic gaps of the sample. Using a one-stage sample, on the other hand, makes the direct survey estimates more accurate, which favors area-level models in this case. This partly illustrates why it is important to be careful before inferring general results from particular model-based simulations (Tzavidis et al, 2018).
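The effect of the sampling design on the precision of direct estimates can be illustrated with a small simulation (illustrative only, not the Corral et al. setup): a two-stage sample that first selects a subset of PSUs yields a noisier area-level direct estimate than a simple random sample of the same size.

```python
import numpy as np

rng = np.random.default_rng(4)

# One area with 100 PSUs of 50 households each; welfare varies across PSUs.
psu_means = rng.normal(loc=10, scale=2, size=100)
pop = psu_means[:, None] + rng.normal(size=(100, 50))

def srs_estimate():
    # One-stage design: simple random sample of 200 households from the area
    return rng.choice(pop.ravel(), size=200, replace=False).mean()

def two_stage_estimate():
    # Two-stage design: sample 4 PSUs, then all 50 households in each
    psus = rng.choice(100, size=4, replace=False)
    return pop[psus].mean()

srs_var = np.var([srs_estimate() for _ in range(2000)])
two_stage_var = np.var([two_stage_estimate() for _ in range(2000)])
# Clustering inflates the variance of the direct estimate, because households
# within a PSU are more alike than households drawn across the whole area.
```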
Relative to the area-level model, the household model benefits from using auxiliary data at the sub-area level. The greater variation of more spatially disaggregated data is particularly important when using algorithmic variable selection methods such as LASSO or stepwise regression to select models, which is increasingly common among practitioners. The availability of sub-area variation also becomes more important when forcing the model to include dummy variables at the regional level, the level for which the sample survey is considered to be representative. Forcing the inclusion of regional dummies in the model selection process generally increases the predictive accuracy of the model, by controlling for fixed characteristics of the region. In addition, including regional dummies enables model selection algorithms such as LASSO and stepwise regression to prioritize variables that best explain within-regional variation for inclusion in the model. Finally, since these algorithms use the sample to determine how many variables to select, using more disaggregated predictors ensures that a richer and more accurate predictive model is selected. The household model may also slightly benefit from predicting a continuous welfare variable, rather than discarding information about how close a household is to the poverty line by first converting it to an estimated headcount poverty rate, although it is not clear that this difference is important empirically. Table 3 shows empirical comparisons of accuracy, as measured by R2, from selected evaluations. R2 is shown because it is commonly reported and it tends to track closely with the Spearman rank correlation, which is in turn useful for evaluating targeting performance. In most cases, unfortunately, the evidence reported below is based on a single real-life survey.

3 Code is available upon request from the author.
Only in Mexico, to our knowledge, is there simulation evidence comparing area-level and household-level models using geospatial auxiliary data. In each case, models are selected using LASSO and regional dummies are included.

Table 3: R2 by method and in or out of sample, relative to validation data

| Country | Indicator | Sample | Validation data | In-sample, household-level | In-sample, area-level | Out-of-sample, household-level | Out-of-sample, area-level |
|---|---|---|---|---|---|---|---|
| Burkina Faso | Headcount poverty | Single survey | Census-based EBP estimates | 0.76 | 0.56 | 0.21 | 0.26 |
| Sri Lanka | Non-monetary poverty | Single survey | Census | 0.77 | 0.71 | N/A | N/A |
| Tanzania | Non-monetary poverty | Single survey | Census | 0.77 | 0.78 | N/A | N/A |
| Mexico | Headcount poverty | Single sample | Census-based EBP estimates (in-sample) | 0.74 | 0.63 | 0.49 | 0.44 |
| Mexico | Labor income poverty | Design-based simulation | Intracensus | 0.89 | 0.88 | 0.64 | 0.56 |

Source: Edochie et al. (forthcoming), Masaki et al. (2022), Newhouse et al. (2022). The results for the design-based simulation for Mexico reported in the bottom row are based on area-level predictors, while all other rows are based on sub-area level predictors.

In general, the household model predicts more accurately than the area-level model in contexts where they have been directly compared. The one notable exception is out-of-sample areas in Burkina Faso. However, for in-sample areas the household-level model is significantly more accurate, such that the household-level model is more accurate overall (results not shown). In Tanzania, the area-level model also generates slightly more accurate predictions than the household-level model, although the difference is negligible. Interestingly, in the one simulation comparison in Mexico, the two models perform essentially equally well in-sample but the household model is moderately more accurate out-of-sample.
Mexico also differs from the other cases in using a relatively small number of proprietary geospatial variables, which may partly explain why estimates are far more accurate out-of-sample in Mexico than in Burkina Faso. In addition, the simulation results reported for Mexico are based on area-level aggregates instead of sub-area level aggregates, due to the lack of sub-area identifiers in the census data. This may explain why the household model and area-level model perform equally well in sampled areas in this context. The comparisons reported in Table 3 are far from conclusive and should be interpreted with caution, since all except one are based on a single sample. In addition, the area-level models estimated here use direct survey estimates of variance as inputs into the model, obtained using the Horvitz-Thompson approximation of variance. As noted above, using smoothed variance estimates and accounting for spatial correlation should increase the accuracy of area-level models. Finally, the evaluation metrics are often themselves EBP estimates based on household census data, since official welfare measures are never observed in the census. Nonetheless, despite the limited evidence so far, the household-level model appears to generate more accurate predictions than the area-level model in the majority of cases, sometimes by substantial margins. Additional evidence would be useful to get a better sense of the conditions under which household models or area-level models generate more accurate estimates. As noted above, a key benefit of incorporating sub-area level auxiliary data in a household model framework is increased efficiency. In Burkina Faso, the mean estimated mean squared error for sampled areas was about half as large when estimating a household model with sub-area predictors (Edochie et al, forthcoming).
A similar 45 percent reduction in mean MSE was observed for in-sample municipalities in Mexico when incorporating sub-area level predictors (Newhouse et al, 2022). While these are only two contexts, they suggest a large efficiency gain when using a household model with sub-area level predictors, relative to an area-level model. Another option is the sub-area level model given in equation (3), in which the unit of analysis is the sub-area and the dependent variable is the sub-area-level poverty rate, in the spirit of Torabi and Rao (2014). Unfortunately, there is virtually no empirical evidence to our knowledge on the relative accuracy of estimates produced by a sub-area versus a household model. In Mexico, when using a single household survey sample, the R2 of the sub-area model was 0.70 relative to the evaluation benchmark, less than the 0.74 value for the household-level model. However, this pertains to only one context, and further research is needed to rigorously evaluate these different types of models. Many of the same studies also compare coverage rates across the models, which are useful for evaluating the accuracy of uncertainty estimates. If estimates of uncertainty are unbiased, coverage rates should be approximately equal to 95 percent. Table 4 lists estimated coverage rates from selected studies. When estimating the household model, confidence intervals are calculated based on mean squared error, under the assumption that the point estimates are unbiased. In terms of coverage rates, the area-level model generally fares better than the household model, providing coverage rates of 92% in Sri Lanka and Tanzania as opposed to 84% and 75% for the household model. The differences were more muted in Mexico, although unfortunately no coverage statistics were provided for the design-based simulation.
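Coverage rates of this sort can be computed as the share of areas whose validation value falls inside the nominal 95 percent interval built from the estimated MSE. A minimal sketch on simulated estimates (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(5)

n_areas = 500
truth = rng.normal(size=n_areas)                    # validation values (e.g., census-based)
est = truth + rng.normal(scale=0.2, size=n_areas)   # model-based point estimates
rmse = np.full(n_areas, 0.2)                        # estimated root MSE per area

# 95% intervals assume unbiased, approximately normal estimates
lower, upper = est - 1.96 * rmse, est + 1.96 * rmse
coverage = np.mean((truth >= lower) & (truth <= upper))
# With correctly estimated MSE, coverage is close to 0.95; underestimated
# MSE (rmse set too small) would push coverage below the nominal rate.
```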
The moderate underestimation of uncertainty in the household model may be due to the omission of a random effect at the sub-area level, although the direction of the bias from omitting a sub-area random effect depends on the structure of the data (Marhuenda et al., 2017). One way to get a sense of the magnitude of this downward bias in estimated uncertainty is to compare coverage rates with those of direct estimates. The standard cluster-robust variance estimator for surveys also underestimates uncertainty because it fails to account for the correlation in poverty status across enumeration areas within target areas. This leads to similarly low coverage rates for direct estimates in Sri Lanka and Tanzania, and much lower coverage rates in Mexico. When compared with the downward bias in standard direct variance estimates, the magnitude of the downward bias in the household model MSE estimates does not seem large enough to warrant serious concern.

Table 4: Coverage rates, relative to validation data

Country     Indicator             Sample         Validation data        In-sample                      Out-of-sample
                                                                        Hh-level  Area-level  Direct   Hh-level  Area-level
Sri Lanka   Non-monetary poverty  Single survey  Census                 84%       92%         76%      N/A       N/A
Tanzania    Non-monetary poverty  Single survey  Census                 75%       92%         76%      N/A       N/A
Mexico      Headcount poverty     Single sample  Census-based EBP       77%       82%         39%      83%       80%
                                                 estimates (in-sample)

Source: Edochie et al. (forthcoming), Masaki et al. (2022), Newhouse et al. (2022)

c. Tree-based machine learning methods appear to predict more accurately than linear models

One of the key differences across the studies listed in Table 1 is the choice of prediction method. Many of the studies use linear mixed models, drawing on the traditional methods popular in small area estimation. Others use more sophisticated tree-based machine learning approaches such as regression forests or gradient boosting. Regression forests are a generalization of decision trees.
They take the average prediction of a continuous dependent variable over many decision trees, each derived from repeated random samples of the data and of candidate variables, a process known as bagging (Breiman, 2001). Bagging makes random forests much more robust to small perturbations in the data than single decision trees. Extreme Gradient Boosting (XGBoost), meanwhile, is a generalization of random forests (Chen and Guestrin, 2016). This method generates predictions based on the sum of a sequence of regression forests, which are estimated by iteratively predicting the residual from the sum of the previous regression forests. To our knowledge, Krennmair et al. (2022) is the first paper to specify a model that combines a conditional random effect with tree-based machine learning, specifically random forests. When applied to Austrian income data, the mixed effect random forest model tends to generate more accurate predictions of mean income than traditional EBP. The authors conclude that random forest models offer substantial advantages over linear models in the presence of complex and non-linear interactions between covariates. Merfeld and Newhouse (2023) compare estimates produced using gradient boosting applied to geospatial data with those from a linear mixed effect EBP model. This differs from Krennmair et al. (2022) in two ways: it uses extreme gradient boosting instead of a single random forest model, and it assumes an unconditional random area effect instead of a conditional random area effect. The predictions are evaluated against census aggregates (for the asset index) or whether predicted welfare in the census falls below a threshold (for poverty in Malawi). Uncertainty is estimated using a block random effects bootstrap approach, as proposed by Chambers and Chandra (2013) and applied in Krennmair et al. (2022). The results in Table 5 show that XGBoost, even without the conditional random effect, is more accurate than EBP in each case.
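The residual-fitting logic of boosting described above can be sketched in a few lines. This is a deliberately tiny stand-in, not the estimators used in the papers: each round fits a one-split "stump" (a hypothetical simplification of a regression tree) to the current residuals and adds a shrunken version of its prediction to the ensemble.

```python
def fit_stump(xs, ys):
    """One-split regression stump: pick the threshold on x minimizing SSE."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fit to the
    residuals of the running ensemble and added with shrinkage."""
    stumps = []
    residuals = list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: learning_rate * sum(s(x) for s in stumps)

# Toy data: a noisy step function, which a single linear fit would miss.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
model = boost(xs, ys)
print(model(1), model(6))  # predictions near 1 and near 3
```

The iterative residual fitting is what lets tree ensembles capture the sharp non-linearities and interactions that the papers above credit for their accuracy gains over linear mixed models.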
For out-of-sample areas, the greater flexibility of the XGBoost algorithm eliminates much of the selection bias associated with out-of-sample prediction using linear models under informative sampling. In results not shown here, coverage rates vary from 94 to 97 percent, reflecting the success of the bootstrap procedure in estimating uncertainty accurately.

Table 5: R2 by method and in or out of sample

Country      Indicator                    In-sample         Out-of-sample
                                          EBP    XGBoost    EBP    XGBoost
Madagascar   Asset index                  0.83   0.87       0.62   0.76
Malawi       Asset index                  0.79   0.91       0.54   0.71
Mozambique   Asset index                  0.84   0.91       0.70   0.81
Sri Lanka    Asset index                  0.80   0.83       0.64   0.81
Malawi       Estimated headcount poverty  0.75   0.91       0.63   0.78

4. Small area estimates using geospatial data can help improve on poor targeting systems

A key application of small area estimation is assisting the identification of the poorest households through geographic targeting. Traditionally, cash transfer programs use Proxy Means Tests (PMTs) to identify the poorest households, which utilize a registry of verifiable characteristics to assign each household a score (Coady, Grosh, and Hoddinott 2004). The weight applied to each characteristic is typically determined by regressing log per capita consumption on these proxy indicators. However, registries are typically very costly and time-consuming to collect and update. In part because of this, recent research has explored the use of satellite and phone data as an alternative means of identifying poor households. A recent paper (Aiken et al., 2022) evaluates an innovative two-step approach to identify the poorest households in Togo, which was applied to the Novissi cash transfer program. The first step entailed using the Meta relative wealth index from Chi et al. (2022) to identify the poorest hundred Cantons, which are the third administrative level in Togo, out of 397 total Cantons.
Within these identified Cantons, the team used CDR data to identify poor households, using a model trained against per capita consumption collected as part of a phone survey in September 2020. Targeting accuracy was then evaluated against a Proxy Means Test constructed from an independent phone survey representative of all cell phone subscribers in the country. The headline result is that this two-step targeting approach outperformed two hypothetical feasible alternatives based on geographic targeting, as shown in Table 6. The first feasible alternative considered is a transfer of equal value to all persons within the poorest prefectures, which are the second administrative level in Togo, out of 40 total prefectures. The second alternative provided a transfer of equal value to all individuals within the poorest Cantons. Both of these hypothetical alternatives simulated transferring cash to all households in the poorest geographic areas, prefectures in the first case and Cantons in the second, until 29 percent of the population was covered. This 29 percent threshold was selected to match the coverage of the two-step approach that combined the relative wealth index with CDR data. The Meta relative wealth index, however, is a measure of asset wealth rather than a measure of household-size-adjusted consumption. This raises the question of whether targeting could be further improved if the 100 poorest Cantons were identified using small area estimates of poverty, derived from combining survey and geospatial auxiliary data along the lines of the studies discussed above, instead of small area estimates of wealth. The paper does not address this question directly, because Canton-level poverty estimates derived from combining survey data on per capita consumption with geospatial data were not considered in the set of feasible options.
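The Spearman rank correlations used to evaluate targeting in Table 6 are simply Pearson correlations of the ranks of two scores, with ties assigned average ranks. The sketch below implements this from scratch on hypothetical data (the scores shown are invented, not from the Togo evaluation).

```python
def rank(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions i..j, 1-based
        for k in order[i:j + 1]:
            ranks[k] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical data: predicted welfare scores vs. a benchmark (e.g. PMT) score.
predicted = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
benchmark = [0.3, 0.4, 0.2, 0.8, 0.5, 0.6]
print(spearman(predicted, benchmark))
```

Because only the ordering of scores matters, this metric rewards a targeting mechanism for ranking households correctly, regardless of how well the underlying scores are calibrated in levels.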
The first feasible alternative considered simulated the provision of a uniform transfer to all individuals within the poorest prefectures. There are only 40 prefectures in Togo, however, as opposed to 397 Cantons. This means that the simulated prefecture transfer differs in two ways from the simulated Canton procedure. First, it is based on many fewer prefectures, which reduces targeting accuracy because no attempt is made to distinguish hypothetical recipients within prefectures. Second, the prefecture estimates are estimates of predicted per capita consumption instead of wealth, which all else equal should improve targeting, since a measure of predicted per capita consumption is used for evaluation. Comparing the two feasible alternatives in Table 6 indicates that in this case, targeting based on less wealthy Cantons is more accurate than targeting based on poor prefectures, as the benefits of more disaggregated targeting outweigh the disadvantage of targeting based on predicted wealth. However, this difference is not large, given the much smaller number of prefectures, as the difference in rank correlation is only 0.04. This suggests that geographic targeting could be further improved if the 100 poorest Cantons were determined based on estimates of per capita consumption rather than wealth.

Table 6: Spearman rank correlation of alternative targeting mechanisms relative to independent PMT estimates

                                                Spearman rank correlation   Area under the curve
100 poorest Cantons plus machine learning       0.45                        0.73
with phone data
Feasible alternatives
  Uniform targeting within poorest prefectures  0.34                        0.66
  Uniform targeting within poorest cantons      0.38                        0.68
Infeasible alternatives
  Asset index                                   0.51                        0.75
  Progress out of Poverty Index                 0.63                        0.81
  Proxy Means Test                              0.72                        0.85

Source: Aiken et al. (2022)

5. Conclusion

Thirty-five years after the publication of Battese, Harter, and Fuller (1988) and seven years after the publication of Jean et al.
(2016), the literature on combining survey and geospatial data to predict wealth and poverty is maturing rapidly. It is clear that indicators derived from geospatial data are strongly predictive of wealth and poverty across space in several contexts, although the extent of this correlation depends on many factors. The accuracy of predictions, however, is particularly sensitive to the nature of the household data on welfare or wealth used to train the model. It is also clear that the coverage of the training sample matters. Estimates for out-of-sample areas are almost always less accurate than for in-sample areas, because informative sampling introduces bias, and because Bayesian and empirical Bayesian methods do not benefit from sample-based priors. In addition, the evaluation benchmark in out-of-sample areas, if it is a model-based estimate, will also be biased due to informative sampling. Finally, the few studies that have considered the prediction of changes over time have found that this is much harder than prediction across space, even for medium to long-run changes. Seeing short-run changes in welfare from space may prove challenging but is a challenge worth taking on. The recent literature has offered more discussion than evidence regarding the pros and cons of different methodologies. One dividing line has been the choice of “direct CNNs” trained directly on survey data as opposed to utilizing interpretable geospatial features in a mixed linear or tree-based machine learning model. In Uganda and Sri Lanka, where both approaches have been compared, it seems that the interpretable features approach does at least as well as directly training CNNs, but the evidence on this question remains scant. A second fault line has emerged over the level at which to specify linear models.
As a general rule, predictions typically benefit from using the most spatially disaggregated data possible, so if sub-area-level data are available, area-level models should only be used as a last resort. This is particularly true when considering an evaluation criterion that combines accuracy and precision, such as mean squared error, since the use of more granular auxiliary data appears to have a larger beneficial effect on precision than accuracy. More research could shed light on the pros and cons of “sub-area models” that predict poverty rates with the sub-area as the unit of observation, vis-à-vis household-level models with sub-area mean predictors, where repeated simulations are used to generate poverty estimates. Finally, there are new developments applying tree-based machine learning techniques. These generally offer greater predictive power than linear models at the cost of parsimony and transparency (Efron, 2020). Tree-based machine learning models are more robust to outliers and less susceptible to bias arising from informative sampling when predicting out of sample. Two important obstacles to more widespread adoption of machine learning methods have recently been surmounted. The first was the lack of asymptotic theory. Asymptotic theory has, however, recently been provided for generalized random forests, a large class of methods that includes boosted regression forests, which is a type of gradient boosting (Athey et al., 2019). The second obstacle was the lack of an accepted method for uncertainty estimation. Although Athey et al. (2019) developed methods to estimate uncertainty for boosted regression forests, the random effect residual bootstrap developed by Chambers and Chandra (2013) and first applied by Krennmair and Schmid (2022) is also an attractive and simple option that appears to work well for wealth prediction using extreme gradient boosting in multiple contexts (Merfeld and Newhouse, 2023).
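The bootstrap idea can be sketched compactly. The code below is a simplified, hypothetical rendering in the spirit of the random effect block bootstrap of Chambers and Chandra (2013), not their exact procedure: estimated area effects and whole blocks of within-area residuals are resampled to generate replicate data sets, and the spread of replicate prediction errors estimates the MSE. All inputs (fixed_part, area_effects, the shrinkage weight gamma) are illustrative stand-ins for fitted model quantities.

```python
import random
import statistics

def block_bootstrap_mse(fixed_part, area_effects, residuals_by_area,
                        gamma=0.8, n_boot=500, seed=7):
    """Bootstrap MSE of a simple shrinkage predictor of area means.

    fixed_part: dict area -> fitted fixed-effect mean for that area
    area_effects: dict area -> estimated random area effect
    residuals_by_area: dict area -> list of unit-level residuals
    Returns dict area -> bootstrap MSE of the area-mean prediction.
    """
    rng = random.Random(seed)
    areas = list(fixed_part)
    effect_pool = list(area_effects.values())
    block_pool = list(residuals_by_area.values())
    sq_errors = {a: [] for a in areas}
    for _ in range(n_boot):
        for a in areas:
            # Resample an area effect and one whole residual block
            # (a block is one area's vector of unit-level residuals).
            true_mean = fixed_part[a] + rng.choice(effect_pool)
            block = rng.choice(block_pool)
            ys = [true_mean + e for e in block]
            # Shrinkage predictor: pull the sample mean toward the fixed
            # part, loosely mimicking an EBLUP with weight gamma.
            prediction = fixed_part[a] + gamma * (statistics.mean(ys) - fixed_part[a])
            sq_errors[a].append((prediction - true_mean) ** 2)
    return {a: statistics.mean(v) for a, v in sq_errors.items()}

# Hypothetical fitted quantities for three areas.
fixed = {"A": 0.30, "B": 0.45, "C": 0.20}
effects = {"A": 0.02, "B": -0.03, "C": 0.01}
resids = {"A": [0.05, -0.04, 0.01], "B": [-0.02, 0.03, -0.01], "C": [0.04, -0.05, 0.02]}
mse = block_bootstrap_mse(fixed, effects, resids)
```

Resampling residuals in blocks, rather than unit by unit, is the feature that preserves the within-area dependence structure that the standard variance estimators discussed above fail to capture.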
While the potential of regularly pairing survey data with geospatial data is clear, more work on research and tools is needed to further instill confidence in the estimates and facilitate use. Research could benefit from more comparative work on methods, ideally utilizing design-based simulations using georeferenced census data. These can examine several outstanding research questions, including the relative benefit of convolutional neural networks versus simpler estimation approaches, quantifying the benefits of including conditional random effects when using machine learning models, probing the robustness of methods for estimating the uncertainty associated with tree-based machine learning estimates, experimenting with different geospatial features, and determining the age threshold at which census-based estimates become less accurate than geospatial estimates in different contexts. The question of which features best predict changes is also important. For example, changes in the rate of building construction, the characteristics of new buildings, or changes in crop types and forecasted yields have not yet, to our knowledge, been tested as correlates of welfare changes. In addition to further research, further improvements to open-source tools are critical to make these techniques more accessible and educate users. This is particularly important to facilitate the adoption of more sophisticated methods in developing countries given the financial and technical constraints faced by national statistical offices and other practitioners. The R packages emdi, sae, SAEforest, and grf, all available on CRAN, are examples of good practice, as the documentation for each is clear, comprehensive, and up to date. No comparably user-friendly package exists for neural network models at this time.
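One building block for such tools is the spatial join that attaches grid-cell indicators (nighttime lights, building footprints, and so on) to georeferenced survey observations. A minimal stdlib sketch is below; the field names (lat, lon, ntl) and the regular 0.1-degree lat/lon grid are illustrative assumptions, not a description of any existing package.

```python
import math

def cell_id(lat, lon, resolution_deg=0.1):
    """Index of the grid cell containing a point, on a regular lat/lon grid."""
    row = math.floor((lat + 90) / resolution_deg)
    col = math.floor((lon + 180) / resolution_deg)
    return (row, col)

def link_geospatial(survey_rows, cell_indicators, resolution_deg=0.1):
    """Attach cell-level geospatial indicators to georeferenced survey rows.

    survey_rows: list of dicts with 'lat' and 'lon' keys
    cell_indicators: dict mapping (row, col) cell ids to indicator dicts
    Rows falling in cells with no indicator data keep only their own fields.
    """
    linked = []
    for row in survey_rows:
        cid = cell_id(row["lat"], row["lon"], resolution_deg)
        linked.append({**row, **cell_indicators.get(cid, {})})
    return linked

# Hypothetical inputs: two survey clusters and a grid of nighttime-lights values.
survey = [{"hh": 1, "lat": 6.17, "lon": 1.23}, {"hh": 2, "lat": 6.25, "lon": 1.21}]
grid = {cell_id(6.17, 1.23): {"ntl": 4.2}, cell_id(6.25, 1.21): {"ntl": 1.1}}
merged = link_geospatial(survey, grid)
```

Production tools would of course use proper geospatial libraries and raster formats, but the core operation, assigning each survey location to a grid cell and merging that cell's indicators, is this simple join.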
No matter how many new and improved methods are published by statisticians, they are unlikely to be used in practice without software that is accessible to non-specialists and thoroughly documented. User-friendly features automating common diagnostics and parallelizing across multiple cores to speed estimation, as implemented in the emdi package, are very valuable. Finally, software that makes it simple to obtain and link publicly available geospatial indicators with survey data will also help facilitate data integration. As these tools are developed, small area estimates that combine survey data with publicly available geospatial data will inevitably become more popular worldwide, belatedly fulfilling the promise demonstrated thirty-five years ago by Battese, Harter, and Fuller (1988).

References

Aiken, E., Bellue, S., Karlan, D., Udry, C., & Blumenstock, J. E. (2022). Machine learning and phone data can improve targeting of humanitarian aid. Nature, 603(7903), 864-870. Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests. The Annals of Statistics, 47(2), 1148-1178. Ayush, K., Uzkent, B., Tanmay, K., Burke, M., Lobell, D., & Ermon, S. (2021, May). Efficient poverty mapping from high resolution remote sensing images. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 1, pp. 12-20). Babenko, B., Hersh, J., Newhouse, D., Ramakrishnan, A., & Swartz, T. (2017). Poverty mapping using convolutional neural networks trained on high and medium resolution satellite images, with an application in Mexico. arXiv preprint arXiv:1711.06323. Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28-36. Bell, W. R. (2008, August). Examining sensitivity of small area inferences to uncertainty about sampling error variances.
In Proceedings of the American Statistical Association, Survey Research Methods Section (Vol. 327, p. 334). Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32. Burke, M., Driscoll, A., Lobell, D. B., & Ermon, S. (2021). Using satellite imagery to understand and promote sustainable development. Science, 371(6535), eabe8628. Chambers, R., & Chandra, H. (2013). A random effect block bootstrap for clustered data. Journal of Computational and Graphical Statistics, 22(2), 452-470. Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). Chi, G., Fang, H., Chatterjee, S., & Blumenstock, J. E. (2022). Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences, 119(3), e2113658119. Coady, D., Grosh, M. E., & Hoddinott, J. (2004). Targeting of transfers in developing countries: Review of lessons and experience. Corral, P., Himelein, K., McGee, K., & Molina, I. (2021). A Map of the Poor or a Poor Map?. Mathematics, 9(21), 2780. Corral, P., Molina, I., Cojocaru, A., & Segovia, S. (2022). Guidelines to Small Area Estimation for Poverty Mapping, World Bank. Diallo, M. S., & Rao, J. N. K. (2018). Small area estimation of complex parameters under unit-level models with skew-normal errors. Scandinavian Journal of Statistics, 45(4), 1092-1116. Dooley, C. A., Boo, G., Leasure, D. R., & Tatem, A. J. (2020). Gridded maps of building patterns throughout sub-Saharan Africa, version 1.1. University of Southampton: Southampton, UK. Edochie, I., Newhouse, D., Schmid, T., Tzavidis, N., Foster, E., Ouedraogo, A., Sanoh, A., & Savadogo, A. (forthcoming). Small Area Estimates of Poverty in Four West African Countries. Mimeo. Efron, B. (2020). Prediction, estimation, and attribution. International Statistical Review, 88, S28-S59. Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003).
Micro-level estimation of poverty and inequality. Econometrica, 71(1), 355-364. Engstrom, R., Newhouse, D., & Soundararajan, V. (2020). Estimating small-area population density in Sri Lanka using surveys and geo-spatial data. PLoS ONE, 15(8), e0237063. Engstrom, R., Hersh, J., & Newhouse, D. (2022). Poverty from space: Using high resolution satellite imagery for estimating economic well-being. The World Bank Economic Review, 36(2), 382-412. Erciulescu, A. L., Cruze, N. B., & Nandram, B. (2019). Model-based county level crop estimates incorporating auxiliary sources of information. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(1), 283-303. Esch, T., Brzoska, E., Dech, S., Leutner, B., Palacios-Lopez, D., Metz-Marconcini, A., ... & Zeidler, J. (2022). World Settlement Footprint 3D - A first three-dimensional survey of the global building stock. Remote Sensing of Environment, 270, 112877. Fay III, R. E., & Herriot, R. A. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277. González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78(5), 443-462. Guadarrama, M., Molina, I., & Rao, J. N. K. (2016). A comparison of small area estimation methods for poverty mapping. Statistics in Transition New Series, 17(1), 41-66. Gualavisi, M., & Newhouse, D. L. (2022). Integrating Survey and Geospatial Data to Identify the Poor and Vulnerable, Policy Research Working Paper no. 10257. Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790-794. Jiang, J., & Lahiri, P. (2006). Mixed model prediction and small area estimation. Test, 15, 1-96. Krennmair, P., Wurz, N., & Schmid, T. (2022).
Tree-Based Machine Learning in Small Area Estimation, The Survey Statistician vol. 86, 22-31. Kreutzmann, A. K., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M., & Tzavidis, N. (2019). The R package emdi for estimating and mapping regionally disaggregated indicators. Journal of Statistical Software, 91. Lange, S., Pape, U. J., & Pütz, P. (2022). Small area estimation of poverty under structural change. Review of Income and Wealth, 68, S264-S281. Leasure, D. R., Jochem, W. C., Weber, E. M., Seaman, V., & Tatem, A. J. (2020). National population mapping from sparse survey data: A hierarchical Bayesian modeling framework to account for uncertainty. Proceedings of the National Academy of Sciences, 117(39), 24173-24179. Lee, K., & Braithwaite, J. (2022). High-resolution poverty maps in Sub-Saharan Africa. World Development, 159, 106028. Khachiyan, A., Thomas, A., Zhou, H., Hanson, G., Cloninger, A., Rosing, T., & Khandelwal, A. K. (2022). Using Neural Networks to Predict Microspatial Economic Growth. American Economic Review: Insights, 4(4), 491-506. Krennmair, P., & Schmid, T. (2022). Flexible domain prediction using mixed effects random forests. Journal of the Royal Statistical Society Series C, 71(5), 1865-1894. Liu, E., Meng, C., Kolodner, M., Sung, E. J., Chen, S., Burke, M., ... & Ermon, S. (2023). Building Coverage Estimation with Low-resolution Remote Sensing Imagery. arXiv preprint arXiv:2301.01449. Lobell, D. B., Azzari, G., Burke, M., Gourlay, S., Jin, Z., Kilic, T., & Murray, S. (2020). Eyes in the sky, boots on the ground: Assessing satellite- and ground-based approaches to crop yield measurement and analysis. American Journal of Agricultural Economics, 102(1), 202-219. Lobell, D. B., Di Tommaso, S., Burke, M., & Kilic, T. (2021).
Twice Is Nice: The Benefits of Two Ground Measures for Evaluating the Accuracy of Satellite-Based Sustainability Estimates. Remote Sensing, 13(16), 3160. Marhuenda, Y., Molina, I., Morales, D., & Rao, J. N. K. (2017). Poverty mapping in small areas under a twofold nested error regression model. Journal of the Royal Statistical Society. Series A (Statistics in Society), 1111-1136. Masaki, T., Newhouse, D., Silwal, A. R., Bedada, A., & Engstrom, R. (2022). Small area estimation of non-monetary poverty with geospatial data. Statistical Journal of the IAOS, (Preprint), 1-17. McBride, L., Barrett, C. B., Browne, C., Hu, L., Liu, Y., Matteson, D. S., ... & Wen, J. (2022). Predicting poverty and malnutrition for targeting, mapping, monitoring, and early warning. Applied Economic Perspectives and Policy, 44(2), 879-892. Merfeld, J. D., Newhouse, D., Weber, M., & Lahiri, P. (2022). Combining Survey and Geospatial Data Can Significantly Improve Gender-Disaggregated Estimates of Labor Market Outcomes, World Bank Policy Research Working Paper no. 10175. Merfeld, J. D., & Newhouse, D. (2023). Improving Estimates of Mean Welfare and Uncertainty for Small Areas in Developing Countries. Molina, I., & Marhuenda, Y. (2015). sae: an R package for small area estimation. R Journal, 7(1), 81. Molina, I., & Rao, J. N. K. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3), 369-385. Newhouse, D., Merfeld, J., Ramakrishnan, A. P., Swartz, T., & Lahiri, P. (2022). Small Area Estimation of Monetary Poverty in Mexico using Satellite Imagery and Machine Learning. Ngo, D. K., & Christiaensen, L. (2019). The performance of a consumption augmented asset index in ranking households and identifying the poor. Review of Income and Wealth, 65(4), 804-833. Nguyen, M., Corral Rodas, P. A., Azevedo, J. P., & Zhao, Q. (2018). sae: A Stata package for unit level small area estimation. World Bank Policy Research Working Paper, (8630). Peterson, R. A., & Cavanaugh, J.
E. (2020). Ordered quantile normalization: a semiparametric transformation built for the cross-validation era. Journal of Applied Statistics, 47(13-15), 2312-2327. Pfeffermann, D., & Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within the selected areas. Journal of the American Statistical Association, 102(480), 1427-1439. Pfeffermann, D., & Sverchkov, M. (2009). Inference under informative sampling. In Handbook of Statistics (Vol. 29, pp. 455-487). Elsevier. Pokhriyal, N., & Jacques, D. C. (2017). Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences, 114(46), E9783-E9792. Rao, J. N., & Molina, I. (2015). Small area estimation. John Wiley & Sons. Chandra, H., Salvati, N., & Chambers, R. (2015). A spatially nonstationary Fay-Herriot model for small area estimation. Journal of Survey Statistics and Methodology, 3(2), 109-135. Singh, B. B., Shukla, G. K., & Kundu, D. (2005). Spatio-temporal models in small area estimation. Survey Methodology, 31(2), 183. Steele, J. E., Sundsøy, P. R., Pezzulo, C., Alegana, V. A., Bird, T. J., Blumenstock, J., ... & Bengtsson, L. (2017). Mapping poverty using mobile phone and satellite data. Journal of The Royal Society Interface, 14(127), 20160690. Tonneau, M., Adjodah, D., Palotti, J., Grinberg, N., & Fraiberger, S. (2022). Multilingual Detection of Personal Employment Status on Twitter. arXiv preprint arXiv:2203.09178. Tzavidis, N., Salvati, N., Pratesi, M., & Chambers, R. (2008). M-quantile models with application to poverty mapping. Statistical Methods and Applications, 17, 393-411. Tzavidis, N., Zhang, L. C., Luna, A., Schmid, T., Rojas-Perilla, N., Gordon, L. R., ... & Zimmermann, T. (2018). From start to finish. Journal of the Royal Statistical Society. Series A (Statistics in Society), 181(4), 927-979. Van Der Weide, R., Blankespoor, B., Elbers, C., & Lanjouw, P. (2022).
How Accurate Is a Poverty Map Based on Remote Sensing Data? World Bank Policy Research Working Paper no. 10171. Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., ... & Burke, M. (2020). Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11(1), 2583. You, Y. (2021). Small area estimation using Fay-Herriot area level model with sampling variance smoothing and modeling. Survey Methodology, 47(2), 361-371.