Policy Research Working Paper 10512

Small Area Estimation of Poverty and Wealth Using Geospatial Data: What Have We Learned So Far?

David Newhouse

Development Economics, Development Data Group
June 2023

Abstract

This paper offers a nontechnical review of selected applications that combine survey and geospatial data to generate small area estimates of wealth or poverty. Publicly available data from satellites and phones predict poverty and wealth accurately across space when evaluated against census data, and their use in model-based estimates improves the accuracy and efficiency of direct survey estimates. Although the evidence is scant, models based on interpretable features appear to predict at least as well as estimates derived from convolutional neural networks. Estimates for sampled areas are significantly more accurate than those for non-sampled areas due to informative sampling. In general, estimates benefit from using geospatial data at the most disaggregated level possible. Tree-based machine learning methods appear to generate more accurate estimates than linear mixed models. Small area estimates using geospatial data can improve the design of social assistance programs, particularly when the existing targeting system is poorly designed.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The author may be contacted at dnewhouse@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished.
The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Small Area Estimation of Poverty and Wealth Using Geospatial Data: What Have We Learned So Far?

David Newhouse (World Bank Group)

JEL codes: C53, I32. Keywords: poverty, small area estimation, poverty mapping, satellite data, machine learning

We thank Partha Lahiri for his encouragement to write this article; William Bell, Chris Elbers, Carolina Franco, and Josh Merfeld for helpful comments on a previous draft; participants at the 2022 Small Area Estimation conference at the University of Maryland College Park; and Haishan Fu and Keith Garrett for their support and encouragement.

1. Introduction

Using geospatial data as auxiliary data for small area estimation is an old idea. Proof of concept was initially demonstrated thirty-five years ago by Battese, Harter, and Fuller (1988), who combined survey data with early imagery from the Landsat satellite to predict the area under corn and soybean production in 11 counties in Iowa. That paper is widely cited in the small area estimation literature, with nearly 1,100 citations on Google Scholar as of May 2023. But the paper is better known for another seminal contribution, as it was the first to develop and apply the well-known nested-error unit-level model, with a conditional random effect specified at the target area level, to estimate means for small areas.
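In the notation commonly used in this literature, the nested-error unit-level model of Battese, Harter, and Fuller (1988) can be written as:

```latex
% Nested-error unit-level model (Battese, Harter, and Fuller, 1988)
y_{ij} = x_{ij}'\beta + u_i + e_{ij},
\qquad u_i \sim N(0, \sigma_u^2),
\qquad e_{ij} \sim N(0, \sigma_e^2),
```

where i indexes target areas, j indexes units (households or, in the original application, farms) within area i, u_i is the conditional random effect specified at the target area level, and e_ij is the unit-level error.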
From 1988 to about 2015, economists and statisticians devoted considerable effort to refining this model in various ways, with a particularly important innovation introduced by Molina and Rao (2010) to estimate indicators other than means, such as poverty headcount rates, using simulation techniques. In the meantime, the publication of Elbers, Lanjouw, and Lanjouw (2003), which used a slightly different unit-level model, popularized the use of small area estimation at the World Bank. Nonetheless, until relatively recently virtually all applications during this time used census or other administrative data as auxiliary data, ignoring geospatial data as a potential source of auxiliary data from which surveys could “borrow strength” to improve the measurement of socioeconomic outcomes. Geospatial data were rediscovered as a potential source of auxiliary data in the mid-2010s, as advances in computing power and storage enabled geospatial data to become publicly available at a wide scale; as surveys began to be regularly implemented on tablets that collect geocoordinates; and as a new generation of data scientists, economists, and statisticians discovered the potential of geospatial data to improve socioeconomic measurement. This, in turn, sparked interest in combining survey and satellite indicators for the purposes of small area estimation. Using appropriate methods for this type of “data fusion” is important because small area poverty estimates have implications for the targeting and evaluation of public interventions and can shed light on economic geography more generally. At the same time, in part because of recent advances in machine learning algorithms, different disciplines and authors have taken very different methodological approaches to combining geospatial data and survey data for the purposes of small area estimation. This paper provides a nontechnical review of selected evidence from this relatively new literature.
It builds on two recent reviews (Burke, 2021; McBride et al., 2022) but focuses exclusively on the small area estimation of wealth and poverty, devoting particular attention to differences in statistical methodology across studies. In particular, it ignores some of the excellent recent work on agricultural crops and yields (Lobell et al., 2020; Erciulescu et al., 2019), labor (Merfeld et al., 2022), and other indicators. There is now a robust literature documenting that estimates of wealth and poverty derived from survey and geospatial data are correlated with benchmarks derived from surveys or censuses. The strength of these correlations varies widely and depends on a myriad of factors, including the country context, the method used for prediction, the target area for prediction, the exact indicator being predicted, the choice of geospatial variables, and the nature of the training and evaluation data. Because the literature is relatively new, no consensus has yet emerged around the optimal prediction method in different contexts. Furthermore, comparisons of alternative prediction methods in the same geographic context remain rare, and some of the few examples of these comparisons have not yet been published in peer-reviewed journals. Therefore, most of the evidence presented below on comparisons across alternative models should be interpreted as tentative priors based on limited evidence from specific contexts. This review is divided into three main sections. The first section begins by very briefly describing some of the many publicly available geospatial indicators. It then reviews selected studies from a rapidly growing literature evaluating the accuracy of small area estimates of wealth and poverty using geospatial data, documenting strong correlations across several studies when compared with census-based estimates.
I then briefly touch on three related issues: the sensitivity of accuracy to the nature of the training data; the more limited ability of geospatial data to predict variation in welfare across time than across space; and the important distinction between sampled and non-sampled target areas when considering the accuracy of estimates. The second section focuses on comparisons between different types of statistical methods for cross-sectional predictions, including the nature of the geospatial features and the different types of models used for prediction. The third section briefly discusses an important recent paper describing how survey and geospatial data were combined to target poor households in Togo (Aiken et al., 2022). The final section concludes with a summary of key points and suggestions for further research.

2. Small area estimates of poverty and wealth with geospatial data

a. What types of geospatial features are publicly available?

Geospatial data are typically obtained from satellites, mobile phones, or internet activity. Satellite indicators have a few key advantages over mobile phone and internet data, including the public availability of a large number of indicators, in many cases derived from publicly available imagery provided by the Sentinel-2 and Landsat satellites. Proprietary high-resolution satellite imagery, from companies such as Maxar, Planet, Airbus, and others, can also be used either directly as an input into deep learning models or to derive interpretable features such as building footprints, roads, and vehicles. Unlike call detail records, satellite-based indicators typically cover the entire country and therefore avoid selection bias. Call Detail Records (CDR) from mobile phones, in addition to only representing mobile phone users, are also more difficult to obtain for privacy reasons.
However, CDR can in some contexts provide more informative indicators, such as location information, cell phone behavior, connection quality, and device type. Internet records such as Twitter usage can also be informative (Tonneau et al., 2022). Information from online platforms also suffers from selection bias, however, since only a portion of the population uses them in developing countries, and it is difficult to estimate the extent to which this source of bias affects estimates. A rich variety of geospatial indicators derived from satellite imagery have become publicly available and can be found in Google Earth Engine, Microsoft Planetary Computer, and other freely accessible websites. These offer access to several climate-related variables as well as a host of predictive features such as night-time lights, land classification, year of switch from pervious to impervious surface, estimates of net primary production, cell tower placement, a wide variety of climate and temperature variables, pollution estimates from the Sentinel-5P satellite, a variety of soil quality measures, and countless other geospatial indicators. Meta has also publicly released the Relative Wealth Index, based on the pioneering work of Chi et al. (2021). Modeled population estimates from WorldPop, Meta, or Google are also critical inputs into small area estimation, as they are both strong predictors of welfare and essential for aggregating predictions to higher administrative levels. Information on building footprints is also valuable when it can be obtained. WorldPop has made statistical information on building footprints available for much of Africa (Dooley et al., 2020); these are derived by Ecopia using Maxar imagery. The Microsoft Planetary Computer also now contains building footprint data for a variety of countries, including most of Europe and the Americas, and parts of Africa and Southeast Asia.
Google recently released a new version of its Open Buildings layer covering Africa and Southeast Asia, and the German Aerospace Center recently released the World Settlement Footprint global database of 3-D building footprints (Esch et al., 2023). Liu et al. (2023) recently showed that building footprints can be modeled accurately using Sentinel-1 and Sentinel-2 imagery, but the resulting indicator data have not yet been publicly released. Dynamic information on building footprints should become increasingly available in the near future. In addition, a variety of data pertaining to agriculture and food security are posted online through FAO's Hand-in-Hand geospatial platform, which contains information on food security, crops, and vegetation. Recent subnational estimates of crop type and yield are only available for a few countries at this time, but coverage will likely expand significantly in the coming years. Overall, an impressive amount of geospatial imagery and indicators is already publicly available, and more should come online in the next few years.

b. Geospatial data predict poverty and wealth accurately across space

Several studies have examined how predictions of wealth or poverty derived from linking survey and geospatial data compare with either survey- or census-based measures of poverty and welfare. Accuracy is often assessed using R², defined as:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²    (1)

where i is the target area, yᵢ is the reference measure of poverty or welfare for target area i, ŷᵢ is the predicted value for target area i, and ȳ is the mean across target areas. Some studies instead report the Pearson correlation between the predicted and reference measure, which can be squared to obtain R². Table 1 lists the actual or implied R² values reported by several studies. An important early paper on how geospatial data predict poverty and wealth was Jean et al. (2016).
This paper used “deep learning” in the form of a convolutional neural network (CNN) to predict welfare, using daytime imagery taken from Google Earth and the luminosity of night-time lights in several countries in Sub-Saharan Africa. Each layer of the CNN successively filters the original image into more condensed abstract features, until the final layer represents the predicted luminosity value. Jean et al. (2016) transfer the features from the penultimate layer of the CNN to a ridge regression that estimates the value of an asset index or per capita consumption in withheld villages. In Jean et al. (2016), the target areas are survey clusters, the reference measure yᵢ is the average value of a wealth index or per capita consumption for cluster i, taken from the household survey and withheld from the training sample, and ŷᵢ are predictions generated from convolutional neural network models. Out-of-sample R² is assessed through survey cross-validation and varied from 0.37 to 0.55 for per capita consumption, and from 0.55 to 0.75 for asset wealth. However, because the model was trained on night-time lights, accuracy declined precipitously when the model attempted to predict within the lower portion of the per capita consumption distribution. In the African countries considered, most poor households live in dark rural areas, and a model trained to predict only night-time lights cannot distinguish welfare levels among them.

Babenko et al. (2017) improved upon this method by using daytime imagery to train a CNN model directly using survey data on per capita income. They trained the CNN model to predict the share of the population in extreme and moderate poverty in different Áreas Geoestadísticas Básicas (AGEBs), small areas analogous to a census block, based on per capita income data collected in the 2014 MCS-ENIGH household survey. The prediction of AGEB-level poverty rates achieved an R² of 0.47 when compared with survey estimates from withheld AGEBs. Interestingly, this is within the range reported by Jean et al. (2016) for per capita consumption, despite the difference in welfare measure (income vs. consumption) and country context. A prediction model based solely on land-cover classification predicted equally well, however, and models that used both predicted poverty (from the CNN) and the land-cover classification achieved an R² of 0.57. This suggests that the CNN-based estimates of poverty in this case did not capture all of the imagery features correlated with average household per capita income. We further touch on the differences between interpretable and CNN-derived features below.

Table 1: Comparison of accuracy across different sources

Country | Target area | Indicator | Survey data | Source of validation data | Estimation method* | R² against validation data | Source
Bangladesh | Upazilla | Predicted consumption | 2014 | Census-based ELL estimates | BSEM | 0.95 | Steele et al. (2017)
Burkina Faso | Commune | Predicted poverty | 2018 EHCVM | Census-based EBP estimates | EBP | 0.63 | Edochie et al. (forthcoming)
Madagascar | Commune | Asset index | Census | Design-based simulation using census | XGB | 0.80 | Merfeld and Newhouse (2023)
Malawi | Traditional Authority | Poverty rates | 2019 IHS | Census-based EBP estimates | BSEM | 0.81 | Van der Weide et al. (2022)
Malawi | Traditional Authority | Asset index | Census | Design-based simulation using census | XGB | 0.79 | Merfeld and Newhouse (2023)
Malawi | Traditional Authority | Poverty rates | Census | Poverty rates based on predicted welfare in census | XGB | 0.84 | Merfeld and Newhouse (2023)
Mexico | AGEB | Per capita income | 2014 MCS-ENIGH | Census estimates | CNN | 0.47 | Babenko et al. (2017)
Mexico | Municipality | Per capita income | 2014 MCS-ENIGH | EBP estimates using intercensus | EBP | 0.74 in-sample; 0.49 out-of-sample | Newhouse et al. (2022)
Mexico | Municipality | Per capita labor income | 2015 intercensus | Design-based simulation using intercensus | EBP | 0.86 in-sample; 0.64 out-of-sample | Newhouse et al. (2022)
Mozambique | Locality | Asset index | 2018 census | Design-based simulation using census | XGB | 0.85 | Merfeld and Newhouse (2023)
Senegal | Commune | Non-monetary poverty | Census | Cross-validation | GPR | 0.83 | Pokhriyal and Jacques (2018)
Sri Lanka | DS Division | Predicted consumption | HIES 2012 | ELL estimates from census | OLS | 0.61 | Engstrom et al. (2022)
Sri Lanka | DS Division | Predicted consumption | HIES 2012 | ELL estimates from census | CNN | 0.39 | Engstrom et al. (2022)
Sri Lanka | DS Division | Non-monetary poverty | 2012 census | Census | EBP | 0.77 | Masaki et al. (2022)
Sri Lanka | DS Division | Asset index | Census | Design-based simulation using census | XGB | 0.83 | Merfeld and Newhouse (2023)
Sub-Saharan Africa | Village | Asset index | DHS | Cross-validation | CNN | 0.70 | Yeh et al. (2020)
Sub-Saharan Africa | Village | Consumption | LSMS | Cross-validation | CNNTL | 0.37 to 0.55 | Jean et al. (2016)
Sub-Saharan Africa | Village | Asset index | DHS | Cross-validation | CNNTL | 0.55 to 0.75 | Jean et al. (2016)
Sub-Saharan Africa | Village | Asset index | DHS | Cross-validation | XGB | 0.56 | Chi et al. (2021)
Tanzania | District | Non-monetary poverty | 2018 | Census | EBP | 0.77 | Masaki et al. (2022)
Northeastern Tanzania | District | Non-monetary poverty | Simulated census sample | Census | EBP | 0.96 | Masaki et al. (2022)
Togo | Canton | Asset index | Pooled DHS | Independent survey | XGB | 0.84 | Chi et al. (2021)

Note: BSEM = Bayesian Structural Equation Model; EBP = Empirical Best Predictor; GPR = Gaussian Process Regression; OLS = Ordinary Least Squares; CNN = Convolutional Neural Network; CNNTL = Convolutional Neural Network with Transfer Learning; XGB = XGBoost.

Newhouse et al.
(2022) use the CNN poverty estimates and land cover classification from Babenko et al. (2017) as inputs into Empirical Best Predictor (EBP) models to predict poverty at the municipality level. The EBP model provides a simple framework for combining the two features in a linear mixed model, in addition to offering a well-established parametric bootstrap method for estimating uncertainty (González-Manteiga et al., 2008). Official estimates for municipalities developed by the government based on the household-level intercensus were used as the reference for comparison. The R² of the estimates was 0.74 for sampled municipalities, but only 0.49 for non-sampled municipalities. Because this is only based on one sample, the paper also performed design-based simulations using a measure of per capita labor income taken from the intercensus. In the simulations, the R² was 0.86 for sampled areas and 0.64 for out-of-sample areas. I further discuss the difference in accuracy between sampled and non-sampled areas below. Steele et al. (2017) predict a wealth index and per capita expenditure in Bangladesh using a hierarchical Bayes structural equation model, allowing for spatial covariance. The paper differs from many others by also incorporating call detail record (CDR) features from mobile phones in addition to satellite features. Results were validated both using cross-validation and using poverty estimates derived from the 2012 census. The results showed that it is much easier to predict wealth than per capita consumption, a finding consistent with Jean et al. (2016). When using cross-validation for evaluation, out-of-sample R² for village-level estimates was 0.76 for wealth as opposed to 0.36 for consumption. However, when comparing Upazilla-level (sub-district level) estimates with previous estimates derived using traditional small area estimates from the 2010 census, R² was a much higher 0.95.
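The accuracy metrics reported in these validation exercises are either the R² of equation (1) or a squared Pearson correlation between predictions and the reference measure. A minimal sketch of both, using made-up numbers rather than data from any of the cited papers:

```python
# Illustrative computation of the two accuracy metrics used in the
# validation exercises above: R-squared as defined in equation (1),
# and the squared Pearson correlation. All numbers are hypothetical.

def r_squared(y, y_hat):
    """R^2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

def pearson_squared(y, y_hat):
    """Squared Pearson correlation between reference values and predictions."""
    n = len(y)
    my, mf = sum(y) / n, sum(y_hat) / n
    cov = sum((yi - my) * (fi - mf) for yi, fi in zip(y, y_hat))
    vy = sum((yi - my) ** 2 for yi in y)
    vf = sum((fi - mf) ** 2 for fi in y_hat)
    return cov ** 2 / (vy * vf)

# Hypothetical target-area poverty rates (reference) and model predictions.
y     = [0.10, 0.25, 0.40, 0.55, 0.70]
y_hat = [0.12, 0.22, 0.45, 0.50, 0.66]

print(round(r_squared(y, y_hat), 3))
print(round(pearson_squared(y, y_hat), 3))
```

The two metrics are not interchangeable: the squared Pearson correlation implicitly allows predictions to be rescaled and recentered, so it is always at least as large as the R² of equation (1) and flatters predictors with systematic level or scale bias.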
Similarly, Pokhriyal and Jacques (2017) combine CDR and satellite data with census data to predict non-monetary poverty across communes in Senegal. They use Gaussian process regression, a non-parametric machine learning method, and evaluate their estimates using 10-fold cross-validation. The estimates for non-monetary poverty across communes achieved an out-of-sample R² of 0.83 and a rank correlation of 0.87. Chi et al. (2021) used several Demographic and Health Surveys (DHS) and a mix of publicly available and proprietary geospatial data to predict an asset index for 2.4 km grid cells across 135 countries. The authors trained the model on the asset index available in the DHS, using data for 56 countries. Proprietary predictors include internet connectivity information obtained from Meta. In validation exercises, R² varied greatly, depending on the context, level of geographic disaggregation, and comparison indicator. When validated using cross-validation across enumeration areas in the survey data, the R² of the estimates was 0.56, similar to the 0.6 value reported by Yeh et al. (2020) for Africa. Meanwhile, when validated against independent wealth measures, R² was 0.60 across rural Kenyan villages, 0.70 across Nigerian Local Government Areas, and 0.84 across Togolese cantons. But when validated against the predicted probability of being poor, or predicted per capita consumption or income, R² values are much lower: 0.04 across Malawian villages, 0.17 in rural Kenya, and approximately 0.3 across Mexican municipalities (Gualavisi and Newhouse, 2022; Chi et al., 2021; Newhouse et al., 2022). This is due to key differences between wealth and predicted per capita consumption, including the fact that the latter is expressed in per capita terms. Masaki et al. (2022) also consider the prediction of non-monetary poverty in Tanzania and Sri Lanka.
Their study used census data from both countries to construct a non-monetary welfare index, and classified households whose index fell below a percentile threshold roughly equal to the prevailing national poverty rate as non-monetarily poor. The analysis combines survey-based estimates with publicly available geospatial indicators using an Empirical Best Predictor model following Molina and Rao (2010). Relative to direct survey estimates, the correlation with the census rose from 0.72 to 0.88 in Sri Lanka and from 0.77 to 0.88 in Tanzania. Unlike previous papers integrating survey and geospatial data, this one estimates the efficiency gain as well as the gain in accuracy due to incorporating geospatial data and finds that it is roughly equivalent to expanding the size of the sample by a factor of between three and five. Van der Weide et al. (2022) generate small area estimates of monetary poverty in Malawi for Traditional Authorities by combining survey data with publicly available geospatial features. Like Steele et al. (2017), their study uses a Bayesian structural equation model that accounts for spatial correlation across areas and validates the predictions against census-based estimates. It finds a correlation between the geospatial-based and census estimates above 0.9, although some individual target areas show substantial discrepancies between the census and geospatial-based estimates due to differences in the data used for prediction. Krennmair and Schmid (2022) propose a mixed effects random forest model, which is tested in a design-based simulation using household-level covariates in Mexico. The results demonstrate the benefits of applying machine learning methods over traditional linear models. Design-based simulations from the Mexican state of Nuevo León indicate that this approach reduces median relative bias by about 20 percent relative to the more typical approach of applying an empirical best predictor (EBP) model with a transformation.
The paper also evaluates a random effect block residual bootstrap approach to estimating uncertainty and finds that it performs well. Merfeld and Newhouse (2023) evaluate small area estimates of an asset index for four countries: Madagascar, Malawi, Mozambique, and Sri Lanka. In addition, the paper evaluates small area estimates of poverty for Malawi obtained using publicly available geospatial auxiliary data. This study compares linear EBP models with three different types of machine learning models: extreme gradient boosting, boosted regression forests, and Cubist regression. Unlike Krennmair and Schmid (2022), the machine learning models do not include a conditional random effect. Despite that, the results indicate that all three machine learning methods generate substantially more accurate estimates than the linear EBP model, particularly out of sample. Of the machine learning methods, boosted regression forests and extreme gradient boosting perform equally well except in Sri Lanka, where extreme gradient boosting produces slightly more accurate estimates. The random effect block residual bootstrap approach proposed by Krennmair and Schmid (2022) also works well when applied to gradient boosting, providing coverage rates ranging from 94 to 97 percent. Finally, Lee and Braithwaite (2022) propose an innovative iterative process that combines tree-based machine learning and deep learning to generate wealth estimates. This paper first trains a model using extreme gradient boosting to predict the probability of being in different wealth classes using interpretable geospatial features, and then uses the predicted values from this procedure to train a convolutional neural network on satellite imagery. The predicted probabilities of being in different wealth classes from the convolutional neural network are subsequently fed back into the boosting model. This process is repeated iteratively until convergence. This improves upon the methodology in Jean et al.
(2016) by avoiding the use of night-time lights to train models. The authors report that the average out-of-sample R² from withheld countries is 0.90. In general, several studies suggest that small area estimates generated by combining survey and geospatial data are more accurate than those based solely on survey data, sometimes by significant margins. This is notable because small area estimates based on geospatial data are subject to model bias; for example, a model that uses night-time lights as a predictor may underestimate poverty in a poor area that happens to contain a highway, if the high level of night-time lights associated with highways makes the area look less poor from the sky than it actually is. However, at least when predicting poverty rates at higher levels such as subdistricts, the evidence so far indicates that model-based estimates based on geospatial indicators are more accurate than direct survey estimates. This implies that the benefits of reducing sampling error by supplementing survey data with model-based predictions derived from geospatial indicators outweigh the introduction of model bias.

c. Geospatial data are a second-best option when recent census data are unavailable

Although geospatial data are strongly correlated with welfare across space, recent census data remain the gold standard for auxiliary data for small area estimation. Unfortunately, in many cases census data are old or unavailable, which creates two problems. First, because survey and census data cannot typically be linked at the household level to preserve confidentiality, it is standard when estimating a unit-level model to assume that the predictors follow the same distribution in the survey and the census data. This assumption becomes less tenable as the temporal gap between the census and survey increases.
If the census and survey distributions are sufficiently different, estimating a linked model with primary-sampling-unit-level (PSU) aggregates from the census is preferable to assuming a common distribution (Lange, Pape, and Putz, 2018). A second and more important problem is that old census data, even if they are linked directly to the survey data, cannot reflect any changes affecting wealth or poverty that occurred since the census. Current geospatial data may be more likely to reflect current conditions than old census data, especially when they include geospatial indicators, such as precipitation and vegetation, that better predict short-run shocks. When it comes to census data, how old is too old? Or, put another way, at what age do census-based predictions become less accurate than current geospatial predictions? It is difficult to know, but Newhouse et al. (2022) offer one small piece of evidence on this point. When evaluated against Mexican 2015 small area poverty estimates based on the intercensus, 2010 census-based estimates are more accurate than 2015 estimates based on geospatial indicators (correlation of 0.91 vs. 0.86). However, this is only representative of one context, and regional patterns of poverty in Mexico may have been more static during this time than in other contexts.

d. Prediction accuracy is very sensitive to the training data

Many of the papers discussed above predict poverty rates or average asset indices estimated at the village level. Since these are often derived from surveys with a limited number of observations per village, this raises the issue of noise in the dependent variable. In fact, correlations between predicted values and census-based estimates depend critically on the extent of noise in the training data, which will reduce measured accuracy. For example, Engstrom et al.
(2022) considered how the accuracy of predictions depends on the size and nature of the sample used to estimate average per capita consumption at the GN division level. That analysis correlated interpretable geospatial features with predicted per capita consumption imputed into a census. Model R² fell from 0.61 when using the mean over all census households, to 0.55 when using the mean over thirty households, and further to 0.40 when taking the mean over 8 households per enumeration area. Differences in the extent of noise present in the training data, as well as the reference evaluation measure, will not necessarily affect the ranking of different types of models within the same context. But they explain much of the wide variation in R² observed in Table 1 across different studies, which underscores the benefits of evaluation studies that compare different methods in the same context, using the same reference measure. Gualavisi and Newhouse (2022) offer another stark example of how sensitive predictive accuracy is to the source of training data. Using a census extract from 10 districts in Malawi, the analysis compared estimates of average village welfare imputed into a household census with estimates derived from combining a survey with publicly available geospatial indicators. However, it also considers a third option, which involves hypothetically supplementing the survey with a partial registry, a “microcensus” that interviews all households in a randomly selected 450 of the 4,500 villages with geolocated data. This involves a two-step approach, where welfare is first predicted into the partial registry and then a geospatial model is trained against the partial registry predictions. Using a partial registry in this way yields an R² of 0.35, as opposed to 0.01 for the geospatial poverty map based on survey data and 0.02 for the wealth estimates from Chi et al. (2021). These R² values are much lower than those cited in the previous section.
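The attenuation of measured accuracy by noise in the training or reference data can be illustrated with a small simulation. In the sketch below (all data simulated, so only the qualitative pattern matters), even a predictor that recovers true village welfare perfectly appears less accurate when it is scored against survey means computed from a handful of households per village:

```python
# Simulation of how noise in the training/reference measure attenuates
# measured R^2. A "perfect" predictor of true village welfare is scored
# against survey means computed from n households per village.
import random

random.seed(0)

def measured_r2(n_households, n_villages=500):
    truth, reference = [], []
    for _ in range(n_villages):
        mu = random.gauss(0.0, 1.0)             # true village welfare
        obs = [mu + random.gauss(0.0, 1.0)      # household-level noise
               for _ in range(n_households)]
        truth.append(mu)                        # the (perfect) prediction
        reference.append(sum(obs) / n_households)
    # Squared Pearson correlation between prediction and noisy reference.
    n = n_villages
    mt, mr = sum(truth) / n, sum(reference) / n
    cov = sum((t - mt) * (r - mr) for t, r in zip(truth, reference))
    vt = sum((t - mt) ** 2 for t in truth)
    vr = sum((r - mr) ** 2 for r in reference)
    return cov ** 2 / (vt * vr)

r2_8, r2_30, r2_200 = measured_r2(8), measured_r2(30), measured_r2(200)
print(r2_8, r2_30, r2_200)   # measured accuracy rises with households per village
```

With unit-variance household noise, the expected measured R² is roughly 1/(1 + 1/n), so scoring against means of 8 households caps measured accuracy near 0.9 even for a perfect model, mirroring the pattern in Engstrom et al. (2022).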
The weak correlation between the Chi et al. (2021) estimates and these census-based predictions reflects the challenge of distinguishing between village welfare levels in this context. In particular, the sample consists of 4,500 villages, in 10 poor Malawian districts, for which names could be matched between the census and the Unified Beneficiary Registry data containing household geocoordinates. In this context, the Chi et al. (2021) estimates may perform poorly because they come from a model trained to predict wealth, while the benchmark reference indicator is per capita consumption; these are conceptually different measures of welfare that tend to diverge more in rural areas and among the extreme poor (Ngo and Christiaensen, 2019). For the standard geospatial poverty map based on the survey, the low R2 is also due to the paucity of survey data, as there are only 16 households per enumeration area in the survey sample. Besides the disappointing performance of the standard geospatial poverty estimates and the Meta wealth index estimates in this challenging context, this exercise also illustrates how much including additional data from the partial registry improves predictive performance. The partial registry effectively adds valuable information to the training data when using geospatial data for small area estimation. This enables the development of a much more accurate prediction model, using proxy welfare variables that are cheaper and easier to collect. The importance of training data raises the question of whether estimating different models for different geographies, such as urban and rural areas, improves prediction accuracy. Estimating separate models may improve the accuracy of the estimates by better accounting for heterogeneity across regions. However, these models also utilize less training data, which reduces the richness of the prediction model in the typical case when the sample is used to select or tune models.
Newhouse et al. (2022) provide some evidence on this question, comparing monetary poverty estimates in Mexico derived from a national model, separate models for urban and rural areas, and separate models for each of six state groupings. Compared to a baseline national model, estimating models separately for urban and rural areas leads to a minor improvement in accuracy, raising the correlation with census-based estimates from 0.86 to 0.87 in sampled areas and from 0.70 to 0.71 in non-sampled areas. Estimating separate models for six groups of states led to a similarly minor improvement. These findings for Mexico, however, do not necessarily generalize to other contexts, and additional evidence comparing the accuracy of models specified at different levels would be useful. Tree-based machine learning methods, such as those discussed below, provide more flexible alternatives that explicitly model interactions between predictors. Use of these models should therefore mitigate any benefit from estimating multiple models in disaggregated geographic regions or areas.

e. Out-of-sample predictions are significantly less accurate than in-sample predictions

Informative sampling occurs when sampling probabilities for primary sampling units vary in a way that is correlated with the outcome of interest, such as welfare. This is typically the case in two-stage samples in which primary sampling units are sampled with probability proportional to size, since population size is correlated with most outcomes of interest. Informative samples, if not appropriately adjusted, produce biased estimates. The standard approach to adjust for informative sampling is to weight observations by the inverse probability of selection. This is usually straightforward when estimating EBP models with household surveys containing sample weights, for example when using the R EMDI and Stata SAE software packages.
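The standard adjustment can be sketched in a few lines: each observation is weighted by the inverse of its selection probability when fitting the model. The data-generating process and probabilities below are simulated for illustration only; the EMDI and SAE packages handle this weighting internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated survey (illustrative): PSUs selected with probability proportional
# to size, where size is correlated with the outcome (informative sampling).
n = 1000
size = rng.uniform(100, 1000, size=n)        # PSU population size
p_select = size / size.sum() * 200           # selection probability (illustrative scale)
x = rng.normal(size=(n, 2))
y = x @ np.array([1.0, -0.5]) + 0.001 * size + rng.normal(size=n)

# Inverse-probability weights offset the over-representation of large PSUs
# when estimating the model parameters.
weights = 1.0 / p_select
model = LinearRegression().fit(x, y, sample_weight=weights)
```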
This protects against bias from informative sampling within sampled areas, but not against bias in predictions for areas that are not included in the sample (Pfeffermann and Sverchkov, 2009). The bias in estimates for non-sampled areas can be severe. Table 2 shows evidence on the extent to which informative sampling leads to less accurate predictions in non-sampled areas. The extent of this bias appears to depend on the country context and the nature of the sample. In the examples reported in Table 2, the estimates in Burkina Faso are a particular outlier, with an out-of-sample R2 of 0.21. It is difficult to know exactly why the out-of-sample estimates for Burkina Faso are so inaccurate, but this likely reflects large differences in the relationship between the covariates and the outcome across sampled and non-sampled areas. In addition, the source of validation data in that case is model-based EBP estimates derived from the census, which are also subject to bias due to informative sampling when predicting out of sample.

Table 2: Predictions using geospatial data are much less accurate in non-sampled areas

| Country | Indicator | Source of survey data | Source of validation data | R2 for sampled areas | R2 for non-sampled areas |
|---|---|---|---|---|---|
| Burkina Faso | Poverty | 2018 EHCVM | Census-based EBP estimates | 0.76 | 0.21 |
| Madagascar | Wealth | 2018 census (simulated samples) | Census | 0.83 | 0.62 |
| Mexico | Poverty | 2014 MCS-ENIGH | EBP estimates using Intercensus | 0.74 | 0.49 |
| Mexico | Poverty | 2015 Intracensus | Design-based simulation using Intercensus | 0.88 | 0.64 |
| Malawi | Wealth | 2018 census (simulated samples) | Census | 0.79 | 0.53 |
| Malawi | Poverty | 2018 census (simulated samples) | Derived from census-based predictions | 0.76 | 0.64 |
| Mozambique | Wealth | 2019 census (simulated samples) | Census | 0.85 | 0.71 |
| Sri Lanka | Wealth | 2012 census (simulated samples) | Census | 0.90 | 0.80 |

Results from household-level EBP models using geospatial auxiliary data. Sources: Edochie et al. (forthcoming), Newhouse et al.
(2022), Merfeld and Newhouse (2023). Pfeffermann and Sverchkov (2007) proposed a bias correction for out-of-sample areas when the probability of selection of an area into the sample is known, but first-stage sampling probabilities are rarely available to analysts, and this correction has not, to our knowledge, been implemented in any small area estimation software package. Estimating a separate model with inverse probability weighting for out-of-sample areas may improve the accuracy of out-of-sample estimates, by giving greater weight to areas that were less likely to be sampled, based on observable characteristics, when estimating model parameters. As discussed in more detail below, tree-based machine learning methods are also more robust to this source of bias and therefore tend to generate more accurate out-of-sample predictions. Until either the use of tree-based machine learning or explicit bias correction becomes routine, however, predictions for out-of-sample areas based on publicly available geospatial indicators should be treated with great caution.

f. Predictions across time are much less accurate than predictions across space

In contrast to the many studies that have used satellite imagery to predict welfare levels across space, only two published studies to our knowledge have evaluated intertemporal predictions using geospatial data. Yeh et al. (2020) attempt to use daytime imagery to predict changes in wealth measured in the Demographic and Health Surveys. The CNN was only able to explain 15 to 17 percent of the estimated changes across African villages, however. When using self-reported changes in assets from the most recent survey, the figure rose to 35 percent. When aggregating up to districts, and using self-reported changes from the endline survey, imagery can explain about half of the variation in self-reported changes.
However, self-reported changes are subject to recall error and may only capture major changes in assets. Meanwhile, Khachiyan et al. (2022) look at variation across census blocks in the US, and find that a CNN trained on daytime imagery predicted half of the change in population density between 2000 and 2020, and 42 percent of the variation in income change between 2000 and 2017. Both Khachiyan et al. (2022) and Yeh et al. (2020) use convolutional neural networks for prediction. Further studies would be very useful to test the ability of different methods and data sources to predict changes over time.

g. Summary of key lessons on using geospatial data for small area prediction

Overall, the main conclusion from this nascent literature is that geospatial data are strongly predictive of geographic variation in wealth and poverty. Exactly how predictive they are varies depending on a myriad of factors. In general, wealth is easier to predict than consumption. Nonetheless, when national geospatial estimates of poverty or wealth are compared against census-based estimates for sampled areas, the R2 values appear to consistently range from about 0.74 to 0.95, as shown in Table 1. Because geospatial data are particularly predictive of population density (Leasure et al, 2020, Engstrom et al, 2020) and population density is systematically related to economic welfare (Castaneda et al, 2018), geospatial data can help “fill in” two-stage household surveys with model-based predictions, boosting efficiency and accuracy. Most of the studies that have compared geospatial estimates with direct survey estimates find that the model-based estimates are more accurate, although the comparisons are not shown here. There is also evidence that prediction accuracy is highly dependent on the strength of the training data, suggesting that partial registries that collect proxy indicators may be a valuable supplement to household survey data when publicly available geospatial data can be linked.
Finally, out-of-sample estimates are generally less accurate than in-sample estimates, and occasionally very inaccurate. This suggests that there are benefits from including as many target areas as possible in the sample, and from additional research and tools to improve out-of-sample prediction accuracy. At the same time, the early literature has utilized a dizzying number of approaches to integrating survey and geospatial data. Many of these papers train a CNN model directly on imagery, others use machine learning methods applied to specific features, while others use linear mixed models. Those that use either deep learning or tree-based machine learning often ignore, or may not properly estimate or evaluate, uncertainty, and many of the papers ignore the well-established statistical literature on small area estimation. The following section explores these issues in greater detail.

3. Statistical methods for prediction

a. Interpretable features predict at least as well as deep learning from imagery

One aspect in which the existing literature on geospatial data fusion has diverged involves the nature of the geospatial indicators used. While several studies obtain predictions directly from imagery using deep learning techniques such as CNNs, others instead generate predictions from interpretable features such as land classification types, night-time light luminosity, building density, and so on. When using interpretable features, predictions can be obtained using linear models, which may be regularized, or a tree-based machine learning algorithm such as extreme gradient boosting. It is possible to use both, as demonstrated in Lee and Braithwaite (2022), but using deep learning entails several additional costs. Deep learning models are complex and effectively a “black box” to users. Furthermore, they require specialized skills to understand and deploy, and thousands of training data points to perform well.
On the other hand, the number of EAs in survey data is typically less than five hundred. Pre-trained CNNs can help circumvent the need for more data, but little is currently understood about how the specific nature of the architecture or pre-training affects prediction accuracy or bias. A few existing studies shed light on the relative predictive power of deep learning and interpretable features. As noted above, Babenko et al. (2017) compare poverty predictions obtained from a CNN with those obtained from land classification data, and from both together. The benchmark measure of truth, derived from the census, was equally correlated with the CNN-based and land-classification-based predictions, and the correlation improved moderately when both were included. Engstrom et al. (2022) directly compare CNN-based estimates of headcount poverty rates with feature-based estimates in Sri Lanka. A variety of features were used, including roof type, shadows, cars, and road types. In that context, feature-based prediction is more accurate, with an R2 of 0.61 and a mean absolute error of 3.2 pp, as opposed to an R2 of 0.39 and a mean absolute error of 5.5 pp for the CNN-based estimates. Ayush et al. (2022) also derive several interpretable features such as trucks, maritime vessels, vehicles, and aircraft. They then compare predictions obtained from these features in a gradient boosting model with those obtained from a CNN trained on nightlights data, as in Jean et al. (2016). When predicting average per capita consumption across villages, the out-of-sample R2 of the interpretable features is 0.54, as compared with 0.41 when using the CNN trained on night-time lights. While this is a lower-bound measure of the accuracy of the CNN, since it was trained on night-time lights, it is consistent with Engstrom et al. (2022).
Finally, as noted above, Lee and Braithwaite (2022) employ an innovative approach that first uses interpretable features to predict the DHS wealth index, and then supplements that with poverty predicted from a deep learning model. This approach, however, shows limited benefits from adding direct CNN estimates to feature-based estimates for specific countries, with increases in R2 of only about 0 to 2 points across four of the five countries. The exception is South Africa, where there is an increase of about 8 points. These results also suggest that in many contexts adding deep learning estimates to existing predictions based on gradient boosting offers limited additional improvement in accuracy.

b. Household or sub-area models are usually preferable to area-level models when sub-area data are available

The pros and cons of different models for small area estimation in different contexts have been a subject of contention for many years, and there is not yet a consensus among statisticians and practitioners. For the purposes of this discussion, we assume that recent census data are unavailable, necessitating the use of geospatial data. In most cases, geospatial data are available in the form of zonal statistics at the “sub-area level”, where the sub-area is a geographic area such as a grid cell or village. A zonal statistic, for example, could be the average night-time luminosity in the village or grid cell. Linking survey and geospatial data at the level of the enumeration area or village is quite common in practice. The Demographic and Health Surveys publicly release jittered geocoordinates for each EA in most cases, facilitating this type of linking. Meanwhile, the target area is typically a more aggregate administrative unit, such as a district or subdistrict. Finally, we define the regional level as a level above the target area for which the survey is considered to be representative. Many analysts use Bayesian modeling for small area estimation.
We focus here, however, on empirical best predictor (EBP) models (Jiang and Lahiri, 2006, Molina and Rao, 2010), rather than purely Bayesian models, for three reasons. First, when they have been compared, EBP models give similar results to hierarchical Bayesian models (Guadarrama et al, 2016). Second, national statistics offices may be more comfortable with empirical Bayesian than fully Bayesian methods, partly due to discomfort with assuming a prior distribution. Finally, multiple well-documented and user-friendly software packages implement EBP methods, such as the EMDI and SAE packages in R, and the SAE package in Stata (Kreutzmann et al, 2019, Molina and Marhuenda, 2015, Nguyen et al, 2018). We focus on EBP models because they incorporate a conditional random effect that conditions on the sample, which is effectively used as a prior estimate. This distinguishes EBP models from two other popular alternatives: M-quantile (Chambers and Tzavidis, 2006) and ELL (Elbers, Lanjouw, and Lanjouw, 2003). Including a conditional random effect is particularly important when the auxiliary data are linked to the survey at the sub-area or area level (Masaki et al, 2022). In this case, the sample contains more information relative to the auxiliary data than when using a typical household census, because the auxiliary data are the same for all households within a sub-area. This mechanically introduces correlation across households in a village, increasing the variance of the area effect. This in turn increases the weight given to the sample relative to the prediction in the empirical best predictor model when using aggregate predictors, as opposed to household-level predictors. Within the set of EBP models, three main classes of model can be used to generate area-level poverty estimates using existing publicly available software packages.
The first type of EBP model for poverty estimation is a household-level model, as follows:

(2)  T(y_isar) = x_s'β1 + x_a'β2 + d_r'β3 + u_a + e_isar,

where T(y_isar) is a transformed measure of the economic welfare of household i, typically measured as per capita income or consumption. Household i lives in sub-area s, target area a, and region r. x_s is a vector of predictors aggregated to the sub-area level, so that they are constant within a particular sub-area. x_a is a vector of predictors aggregated to the area level. d_r is a vector of state dummies. u_a is a conditional random effect specified at the area level, and e_isar is a stochastic error term. In practice, u_a and e_isar are typically assumed to be normal. While it is possible to relax the assumption of normally distributed random effects (Diallo and Rao, 2018), we know of no software package that implements empirical best predictor models with non-normal error terms. The assumption that the stochastic error terms are distributed normally necessitates transforming the dependent variable (Tzavidis et al, 2018). For household models, a log functional form has traditionally been used, dating back to Elbers, Lanjouw, and Lanjouw (2003). But recently a number of adaptive transformations that select a parameter to best fit the data have become increasingly popular, following their implementation in the publicly available EMDI software package. Examples of such adaptive transformations include the log-shift and Box-Cox transformations, in which a transformation parameter is selected through restricted maximum likelihood. Masaki et al. (2022) and Newhouse et al. (2022) take a different approach, employing a rank order transformation that forces the dependent variable to follow a normal distribution, following Peterson and Cavanaugh (2019). While this does not guarantee that the residuals are normal, it brings them far closer to normality in those contexts.
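The idea behind the rank order transformation can be sketched in a few lines: ranks are mapped through the standard normal quantile function, forcing the transformed variable to be approximately standard normal. This is a simplified sketch of the idea, not the EMDI implementation, and the rank offset used here is an illustrative choice.

```python
import numpy as np
from scipy.stats import rankdata, norm

def rank_order_transform(y):
    """Map ranks of y to standard normal quantiles."""
    n = len(y)
    return norm.ppf((rankdata(y) - 0.5) / n)

rng = np.random.default_rng(2)
# Heavily right-skewed variable, like per capita consumption
welfare = np.exp(rng.normal(size=5000))
z = rank_order_transform(welfare)
# z is approximately standard normal regardless of the original distribution
```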
This transformation can be reversed under additional assumptions, although a back-transformation is not necessary for estimating headcount poverty. The second type of model is a “sub-area model” specified at the lowest level at which the geospatial auxiliary data can be linked to the household model. This model assumes the form

(3)  p̂_s = x_s'β1 + x_a'β2 + d_r'β3 + u_a + e_s,

where p̂_s is the estimated poverty rate for each sub-area s. This is effectively a unit-level model with the sub-area as the unit. Because the dependent variable is a proportion, it can be transformed using the arcsin transformation, which is part of the EMDI and EMDIplus software packages. The final option is to specify the model at the area level, which is the target area, following the long and distinguished literature spawned by Fay and Herriot (1979):

(4)  p̂_a = x_a'β1 + u_a + e_a.

When testing the area-level models, we follow the recommendation of most software packages by specifying a simple version, namely a linear model with no variance smoothing. In particular, we obtain variance estimates for target areas by using the Horvitz-Thompson variance approximation. However, direct estimates of the variance at the target area level are imprecise, which motivates the use of model-based small area estimation in general. Smoothing these variance estimates prior to estimation would likely generate more accurate predictions (Bell, 2008, You, 2021). In addition, accounting for spatial and/or temporal correlation in a Fay-Herriot model can also improve prediction accuracy (Singh et al, 2005, Chandra, Salvati and Chambers, 2015). A variant of the Fay-Herriot model that allows for spatial autocorrelation based on Petrucci and Salvati (2006) has been implemented in the R EMDI package, while the R SAErobust package also implements the spatio-temporal correlation models proposed by Rao and Yu (1994) and Marhuenda et al (2013).
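Setting aside the EBP machinery, the core of the household-level model (2) is a linear mixed model with a random intercept for each target area. A minimal sketch on simulated data, using statsmodels (the EMDI and SAE packages implement the full EBP estimator, including the conditional random effect and MSE estimation, which this sketch omits):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulated data (illustrative): households nested in sub-areas and areas,
# with a predictor x_s that is constant within each sub-area.
n_areas, sub_per_area, hh_per_sub = 50, 5, 8
area = np.repeat(np.arange(n_areas), sub_per_area * hh_per_sub)
subarea = np.repeat(np.arange(n_areas * sub_per_area), hh_per_sub)
x_s = rng.normal(size=n_areas * sub_per_area)[subarea]   # sub-area aggregate predictor
u_a = rng.normal(scale=0.5, size=n_areas)                # area random effect
y = 1.0 + 0.8 * x_s + u_a[area] + rng.normal(size=len(area))

df = pd.DataFrame({"y": y, "x_s": x_s, "area": area})
fit = smf.mixedlm("y ~ x_s", df, groups=df["area"]).fit()
# fit.fe_params holds the estimated fixed effects (intercept and slope on x_s);
# fit.random_effects holds the predicted area effects used by EBP-type predictors.
```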
Testing area-level models with variance smoothing and spatial and temporal correlation structures against household and sub-area models that use sub-area predictors is a useful area for further research. It is impossible to develop a general rule about the relative accuracy of different models, because their relative accuracy depends on the nature of the data. This is especially true when comparing across different sources of auxiliary data. For example, census data aggregated to the target area level may be preferable to geospatial data available at the sub-area level because they are more predictive of welfare, but current geospatial data may be preferable to old census data even if the latter are available at the household level. Nonetheless, when considering a single source of auxiliary data, the household and sub-area models enjoy the important advantage of using auxiliary data at a more disaggregated level. The use of more spatially disaggregated data can lead to more accurate estimates of non-linear functions such as headcount poverty rates, though these gains in accuracy may be negligible in some contexts. An additional benefit from using more spatially disaggregated estimates is the additional precision gained by exploiting the additional variation across sub-areas within areas. Whether the benefit comes in the form of increased accuracy, precision, or both, when considering a single source of auxiliary data it is generally preferable to use a household or sub-area model rather than an area-level model when possible. Although the general principle of using the most spatially disaggregated auxiliary data possible seems straightforward, there is not yet full consensus on this point in the literature. For example, Corral et al.
(2021) argue that, when considering model-based bias, household models with aggregate variables suffer from omitted variable bias, and therefore recommend using an area-level model rather than a household-level model when the auxiliary data consist solely of sub-area or area-level means. Newhouse et al. (2022), however, show that this source of omitted variable bias is equally present in area-level and household models that use auxiliary data drawn from the population, such as census or administrative data. Furthermore, this source of model-based bias disappears when considering design-model bias, taking the expected value of the predictions prior to drawing the sample. Omitted variable bias is therefore not a relevant concern when selecting between different types of models. Corral et al. (2022) nonetheless recommend the use of an area-level model rather than a household-level model when the predictors are available at the sub-area level, largely on the basis of results from a particular model-based simulation. This model-based simulation takes a simple random sample of households from all PSUs in the population. When the model-based simulation is altered to use a more realistic two-stage sample design in which a subset of PSUs are selected, the household model with sub-area means generates more accurate predictions than the area-level model.3 Using a two-stage sample effectively increases sampling error in the direct estimates, which in turn increases the benefit of using more geographically disaggregated auxiliary data to fill in the geographic gaps of the sample. Using a one-stage sample, on the other hand, makes the direct survey estimates more accurate, which favors area-level models in this case. This partly illustrates why it is important to be careful before inferring general results from particular model-based simulations (Tzavidis et al, 2018).
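The effect of the sampling design on the precision of direct estimates can be illustrated with a small simulation (illustrative only, not the Corral et al. setup): a two-stage sample that first selects a subset of PSUs yields a noisier area-level direct estimate than a simple random sample of the same size.

```python
import numpy as np

rng = np.random.default_rng(4)

# One area with 100 PSUs of 50 households each; welfare varies across PSUs.
psu_means = rng.normal(loc=10, scale=2, size=100)
pop = psu_means[:, None] + rng.normal(size=(100, 50))

def srs_estimate():
    # One-stage design: simple random sample of 200 households from the area
    return rng.choice(pop.ravel(), size=200, replace=False).mean()

def two_stage_estimate():
    # Two-stage design: sample 4 PSUs, then all 50 households in each
    psus = rng.choice(100, size=4, replace=False)
    return pop[psus].mean()

srs_var = np.var([srs_estimate() for _ in range(2000)])
two_stage_var = np.var([two_stage_estimate() for _ in range(2000)])
# Clustering inflates the variance of the direct estimate, because households
# within a PSU are more alike than households drawn across the whole area.
```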
Relative to the area-level model, the household model benefits from using auxiliary data at the sub-area level. The greater variation of more spatially disaggregated data is particularly important when using algorithmic variable selection methods such as LASSO or stepwise regression to select models, which is increasingly common among practitioners. The availability of sub-area variation also becomes more important when forcing the model to include dummy variables at the regional level, the level for which the sample survey is considered to be representative. Forcing the inclusion of regional dummies in the model selection process generally increases the predictive accuracy of the model, by controlling for fixed characteristics of the region. In addition, including regional dummies enables model selection algorithms such as LASSO and stepwise regression to prioritize variables that best explain within-regional variation for inclusion in the model. Finally, since these algorithms use the sample to determine how many variables to select, using more disaggregated predictors ensures that a richer and more accurate predictive model is selected. The household model may also slightly benefit from predicting a continuous welfare variable, rather than discarding information about how close a household is to the poverty line by first converting it to an estimated headcount poverty rate, although it is not clear that this difference is important empirically. Table 3 shows empirical comparisons of accuracy, as measured by R2, from selected evaluations. R2 is shown because it is commonly reported and it tends to track closely with the Spearman rank correlation, which is in turn useful for evaluating targeting performance. In most cases, unfortunately, the evidence reported below is based on a single real-life survey.

3 Code is available upon request from the author.
Only in Mexico, to our knowledge, is there simulation evidence comparing area-level and household-level models using geospatial auxiliary data. In each case, models are selected using LASSO and regional dummies are included.

Table 3: R2 by method and in or out of sample, relative to validation data

| Country | Indicator | Sample | Validation data | In-sample, household-level | In-sample, area-level | Out-of-sample, household-level | Out-of-sample, area-level |
|---|---|---|---|---|---|---|---|
| Burkina Faso | Headcount poverty | Single survey | Census-based EBP estimates | 0.76 | 0.56 | 0.21 | 0.26 |
| Sri Lanka | Non-monetary poverty | Single survey | Census | 0.77 | 0.71 | N/A | N/A |
| Tanzania | Non-monetary poverty | Single survey | Census | 0.77 | 0.78 | N/A | N/A |
| Mexico | Headcount poverty | Single sample | Census-based EBP estimates (in-sample) | 0.74 | 0.63 | 0.49 | 0.44 |
| Mexico | Labor income poverty | Design-based simulation | Intracensus | 0.89 | 0.88 | 0.64 | 0.56 |

Source: Edochie et al. (forthcoming), Masaki et al. (2022), Newhouse et al. (2022). The results for the design-based simulation for Mexico reported in the bottom row are based on area-level predictors, while all other rows are based on sub-area level predictors.

In general, the household model predicts more accurately than the area-level model in contexts where they have been directly compared. The one notable exception is out-of-sample areas in Burkina Faso. However, for in-sample areas the household-level model is significantly more accurate, such that the household-level model is more accurate overall (results not shown). In Tanzania, the area-level model also generates slightly more accurate predictions than the household-level model, although the difference is negligible. Interestingly, in the one simulation comparison in Mexico, the two models perform essentially equally well in-sample but the household model is moderately more accurate out-of-sample.
Mexico also differs from the other cases in using a relatively small number of proprietary geospatial variables, which may partly explain why estimates are far more accurate out-of-sample in Mexico than in Burkina Faso. In addition, the simulation results reported for Mexico are based on area-level aggregates instead of sub-area level aggregates, due to the lack of sub-area identifiers in the census data. This may explain why the household model and area-level model perform equally well in sampled areas in this context. The comparisons reported in Table 3 are far from conclusive and should be interpreted with caution, since all except one are based on a single sample. In addition, the area-level models estimated here use direct survey estimates of variance as inputs into the model, obtained using the Horvitz-Thompson approximation of variance. As noted above, using smoothed variance estimates and accounting for spatial correlation should increase the accuracy of area-level models. Finally, the evaluation metrics are often themselves EBP estimates based on household census data, since official welfare measures are never observed in the census. Nonetheless, despite the limited evidence so far, the household-level model appears to generate more accurate predictions than the area-level model in the majority of cases, sometimes by substantial margins. Additional evidence would be useful to get a better sense of the conditions under which household models or area-level models generate more accurate estimates. As noted above, a key benefit of incorporating sub-area level auxiliary data in a household model framework is increased efficiency. In Burkina Faso, the mean estimated mean squared error for sampled areas was about half as large when estimating a household model with sub-area predictors (Edochie et al, forthcoming).
A similar 45 percent reduction in mean MSE was observed for in-sample municipalities in Mexico when incorporating sub-area level predictors (Newhouse et al, 2022). While these are only two contexts, they suggest a large efficiency gain when using a household model with sub-area level predictors, relative to an area-level model. Another option is the sub-area level model given in equation (3), in which the unit of analysis is the sub-area and the dependent variable is the sub-area-level poverty rate, in the spirit of Torabi and Rao (2014). Unfortunately, there is virtually no empirical evidence to our knowledge on the relative accuracy of estimates produced by a sub-area versus a household model. In Mexico, when using a single household survey sample, the R2 of the sub-area model was 0.70 relative to the evaluation benchmark, less than the 0.74 value for the household-level model. However, this pertains to only one context, and further research is needed to rigorously evaluate these different types of models. Many of the same studies also compare coverage rates across the models, which are useful for evaluating the accuracy of uncertainty estimates. If estimates of uncertainty are unbiased, coverage rates should be approximately equal to 95 percent. Table 4 lists estimated coverage rates from selected studies. When estimating the household model, confidence intervals are calculated based on mean squared error, under the assumption that the point estimates are unbiased. In terms of coverage rates, the area-level model generally fares better than the household model, providing coverage rates of 92% in Sri Lanka and Tanzania as opposed to 84% and 75% for the household model. The differences were more muted in Mexico, although unfortunately no coverage statistics were provided for the design-based simulation.
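Coverage rates of this sort can be computed as the share of areas whose validation value falls inside the nominal 95 percent interval built from the estimated MSE. A minimal sketch on simulated estimates (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(5)

n_areas = 500
truth = rng.normal(size=n_areas)                    # validation values (e.g., census-based)
est = truth + rng.normal(scale=0.2, size=n_areas)   # model-based point estimates
rmse = np.full(n_areas, 0.2)                        # estimated root MSE per area

# 95% intervals assume unbiased, approximately normal estimates
lower, upper = est - 1.96 * rmse, est + 1.96 * rmse
coverage = np.mean((truth >= lower) & (truth <= upper))
# With correctly estimated MSE, coverage is close to 0.95; underestimated
# MSE (rmse set too small) would push coverage below the nominal rate.
```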
The moderate underestimation of uncertainty in the household model may be due to the omission of a random effect at the sub-area level, although the direction of the bias from omitting a sub-area random effect depends on the structure of the data (Marhuenda et al., 2017). One way to get a sense of the magnitude of this downward bias in estimated uncertainty is to compare coverage rates with those of direct estimates. The standard cluster-robust variance estimator for surveys also underestimates uncertainty because it fails to account for the correlation in poverty status across enumeration areas within target areas. This leads to similarly low coverage rates for direct estimates in Sri Lanka and Tanzania, and much lower coverage rates in Mexico. When compared with the downward bias in standard direct variance estimates, the magnitude of the downward bias in the household model MSE estimates does not seem large enough to warrant serious concern.

Table 4: Coverage rates, relative to validation data

Country     Indicator             Sample         Validation data        In-sample                      Out-of-sample
                                                                        Hh-level  Area-level  Direct   Hh-level  Area-level
Sri Lanka   Non-monetary poverty  Single survey  Census                 84%       92%         76%      N/A       N/A
Tanzania    Non-monetary poverty  Single survey  Census                 75%       92%         76%      N/A       N/A
Mexico      Headcount poverty     Single sample  Census-based EBP       77%       82%         39%      83%       80%
                                                 estimates (in-sample)

Source: Edochie et al. (forthcoming), Masaki et al. (2022), Newhouse et al. (2022)

c. Tree-based machine learning methods appear to predict more accurately than linear models

One of the key differences across the studies listed in Table 1 is the choice of prediction method. Many of the studies use linear mixed models, drawing on the traditional methods popular in small area estimation. Others use more sophisticated tree-based machine learning approaches such as regression forests or gradient boosting. Regression forests are a generalization of decision trees.
They take the average prediction of a continuous dependent variable over many decision trees, each derived from repeated random samples of the data and of candidate variables, a process known as bagging (Breiman, 2001). Bagging makes random forests much more robust to small perturbations in the data than single decision trees. Extreme Gradient Boosting (XGBoost), meanwhile, is a generalization of random forests (Chen and Guestrin, 2016). This method generates predictions based on the sum of a sequence of regression forests, which are estimated by iteratively predicting the residual from the sum of the previous regression forests. To our knowledge, Krennmair et al. (2022) is the first paper to specify a model that combines a conditional random effect with tree-based machine learning, specifically random forests. When applied to Austrian income data, the mixed effect random forest model tends to generate more accurate predictions of mean income than traditional EBP. The authors conclude that random forest models offer substantial advantages over linear models in the presence of complex and non-linear interactions between covariates. Merfeld and Newhouse (2023) compare estimates produced using gradient boosting applied to geospatial data with those from a linear mixed effect EBP model. This differs from Krennmair et al. (2022) in two ways: it uses extreme gradient boosting instead of a single random forest model, and it assumes an unconditional random area effect instead of a conditional random area effect. The predictions are evaluated against census aggregates (for the asset index) or whether predicted welfare in the census falls below a threshold (for poverty in Malawi). Uncertainty is estimated using a block random effects bootstrap approach, as proposed by Chambers and Chandra (2013) and applied in Krennmair et al. (2022). The results in Table 5 show that XGBoost, even without the conditional random effect, is more accurate than EBP in each case.
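The residual-fitting logic of boosting described above can be sketched in a few lines. This is a deliberately tiny stand-in, not the estimators used in the papers: each round fits a one-split "stump" (a hypothetical simplification of a regression tree) to the current residuals and adds a shrunken version of its prediction to the ensemble.

```python
def fit_stump(xs, ys):
    """One-split regression stump: pick the threshold on x minimizing SSE."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fit to the
    residuals of the running ensemble and added with shrinkage."""
    stumps = []
    residuals = list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: learning_rate * sum(s(x) for s in stumps)

# Toy data: a noisy step function, which a single linear fit would miss.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
model = boost(xs, ys)
print(model(1), model(6))  # predictions near 1 and near 3
```

The iterative residual fitting is what lets tree ensembles capture the sharp non-linearities and interactions that the papers above credit for their accuracy gains over linear mixed models.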
For out-of-sample areas, the greater flexibility of the XGBoost algorithm eliminates much of the selection bias associated with out-of-sample prediction using linear models under informative sampling. In results not shown here, coverage rates vary from 94 to 97 percent, reflecting the success of the bootstrap procedure in estimating uncertainty accurately.

Table 5: R2 by method and in or out of sample

Country      Indicator                    In-sample         Out-of-sample
                                          EBP    XGBoost    EBP    XGBoost
Madagascar   Asset index                  0.83   0.87       0.62   0.76
Malawi       Asset index                  0.79   0.91       0.54   0.71
Mozambique   Asset index                  0.84   0.91       0.70   0.81
Sri Lanka    Asset index                  0.80   0.83       0.64   0.81
Malawi       Estimated headcount poverty  0.75   0.91       0.63   0.78

4. Small area estimates using geospatial data can help improve on poor targeting systems

A key application of small area estimation is assisting the identification of the poorest households through geographic targeting. Traditionally, cash transfer programs use Proxy Means Tests (PMTs) to identify the poorest households, which utilize a registry of verifiable characteristics to assign each household a score (Coady, Grosh, and Hoddinott 2004). The weight applied to each characteristic is typically determined by regressing log per capita consumption on these proxy indicators. However, registries are typically very costly and time-consuming to collect and update. In part because of this, recent research has explored the use of satellite and phone data as an alternative means of identifying poor households. A recent paper (Aiken et al., 2022) evaluates an innovative two-step approach to identify the poorest households in Togo, which was applied to the Novissi cash transfer program. The first step entailed using the Meta relative wealth index from Chi et al. (2022) to identify the poorest hundred Cantons, which are the third administrative level in Togo, out of 397 total Cantons.
Within these identified Cantons, the team used CDR data to identify poor households, using a model trained against per capita consumption collected as part of a phone survey in September 2020. Targeting accuracy was then evaluated against a Proxy Means Test constructed from an independent phone survey representative of all cell phone subscribers in the country. The headline result is that this two-step targeting approach outperformed two hypothetical feasible alternatives based on geographic targeting, as shown in Table 6. The first feasible alternative considered is a transfer of equal value to all persons within the poorest prefectures, which are the second administrative level in Togo, out of 40 total prefectures. The second alternative provided a transfer of equal value to all individuals within the poorest Cantons. Both of these hypothetical alternatives simulated transferring cash to all households in the poorest geographic areas, prefectures in the first case and Cantons in the second, until 29 percent of the population was covered. This 29 percent threshold was selected to match the coverage of the two-step approach that combined the relative wealth index with CDR data. The Meta relative wealth index, however, is a measure of asset wealth rather than a measure of household-size-adjusted consumption. This raises the question of whether targeting could be further improved if the 100 poorest Cantons were identified using small area estimates of poverty, derived from combining survey and geospatial auxiliary data along the lines of the studies discussed above, instead of small area estimates of wealth. The paper does not address this question directly, because Canton-level poverty estimates derived from combining survey data on per capita consumption with geospatial data were not considered in the set of feasible options.
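The Spearman rank correlations used to evaluate targeting in Table 6 are simply Pearson correlations of the ranks of two scores, with ties assigned average ranks. The sketch below implements this from scratch on hypothetical data (the scores shown are invented, not from the Togo evaluation).

```python
def rank(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions i..j, 1-based
        for k in order[i:j + 1]:
            ranks[k] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical data: predicted welfare scores vs. a benchmark (e.g. PMT) score.
predicted = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
benchmark = [0.3, 0.4, 0.2, 0.8, 0.5, 0.6]
print(spearman(predicted, benchmark))
```

Because only the ordering of scores matters, this metric rewards a targeting mechanism for ranking households correctly, regardless of how well the underlying scores are calibrated in levels.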
The first feasible alternative considered simulated the provision of a uniform transfer to all individuals within the poorest prefectures. There are only 40 prefectures in Togo, however, as opposed to 397 Cantons. This means that the simulated prefecture transfer differs in two ways from the simulated Canton procedure. First, it is based on many fewer prefectures, which reduces targeting accuracy because no attempt is made to distinguish hypothetical recipients within prefectures. Second, the prefecture estimates are estimates of predicted per capita consumption instead of wealth, which all else equal should improve targeting, since a measure of predicted per capita consumption is used for evaluation. Comparing the two feasible alternatives in Table 6 indicates that in this case, targeting based on less wealthy Cantons is more accurate than targeting based on poor prefectures, as the benefits of more disaggregated targeting outweigh the disadvantage of targeting based on predicted wealth. However, this difference is not large, given the much smaller number of prefectures, as the difference in rank correlation is only 0.04. This suggests that geographic targeting could be further improved if the 100 poorest Cantons were determined based on estimates of per capita consumption rather than wealth.

Table 6: Spearman rank correlation of alternative targeting mechanisms relative to independent PMT estimates

                                                Spearman rank correlation   Area under the curve
100 poorest Cantons plus machine learning       0.45                        0.73
with phone data
Feasible alternatives
  Uniform targeting within poorest prefectures  0.34                        0.66
  Uniform targeting within poorest cantons      0.38                        0.68
Infeasible alternatives
  Asset index                                   0.51                        0.75
  Progress out of Poverty Index                 0.63                        0.81
  Proxy Means Test                              0.72                        0.85

Source: Aiken et al. (2022)

5. Conclusion

Thirty-five years after the publication of Battese, Harter, and Fuller (1988) and seven years after the publication of Jean et al.
(2016), the literature on combining survey and geospatial data to predict wealth and poverty is maturing rapidly. It is clear that indicators derived from geospatial data are strongly predictive of wealth and poverty across space in several contexts, although the extent of this correlation depends on many factors. The accuracy of predictions, however, is particularly sensitive to the nature of the household data on welfare or wealth used to train the model. It is also clear that the coverage of the training sample matters. Estimates for out-of-sample areas are almost always less accurate than for in-sample areas, because informative sampling introduces bias, and because Bayesian and empirical Bayesian methods do not benefit from sample-based priors. In addition, the evaluation benchmark in out-of-sample areas, if it is a model-based estimate, will also be biased due to informative sampling. Finally, the few studies that have considered the prediction of changes over time have found that this is much harder than prediction across space, even for medium to long-run changes. Seeing short-run changes in welfare from space may prove challenging but is a challenge worth taking on. The recent literature has offered more discussion than evidence regarding the pros and cons of different methodologies. One dividing line has been the choice of “direct CNNs” trained directly on survey data as opposed to utilizing interpretable geospatial features in a mixed linear or tree-based machine learning model. In Uganda and Sri Lanka, where both approaches have been compared, it seems that the interpretable features approach does at least as well as directly training CNNs, but the evidence on this question remains scant. A second fault line has emerged over the level at which to specify linear models.
As a general rule, predictions typically benefit from using the most spatially disaggregated data possible, so if sub-area-level data are available, area-level models should only be used as a last resort. This is particularly true when considering an evaluation criterion that combines accuracy and precision, such as mean squared error, since the use of more granular auxiliary data appears to have a larger beneficial effect on precision than accuracy. More research could shed light on the pros and cons of “sub-area models” that predict poverty rates with the sub-area as the unit of observation, vis-à-vis household-level models with sub-area mean predictors, where repeated simulations are used to generate poverty estimates. Finally, there are new developments applying tree-based machine learning techniques. These generally offer greater predictive power than linear models at the cost of parsimony and transparency (Efron, 2020). Tree-based machine learning models are more robust to outliers and less susceptible to bias arising from informative sampling when predicting out of sample. Two important obstacles to more widespread adoption of machine learning methods have recently been surmounted. The first was the lack of asymptotic theory. Asymptotic theory has, however, recently been provided for generalized random forests, a large class of methods that includes boosted regression forests, which is a type of gradient boosting (Athey et al., 2019). The second obstacle was the lack of an accepted method for uncertainty estimation. Although Athey et al. (2019) developed methods to estimate uncertainty for boosted regression forests, the random effect residual bootstrap developed by Chambers and Chandra (2013) and first applied by Krennmair and Schmid (2022) is also an attractive and simple option that appears to work well for wealth prediction using extreme gradient boosting in multiple contexts (Merfeld and Newhouse, 2023).
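The bootstrap idea can be sketched compactly. The code below is a simplified, hypothetical rendering in the spirit of the random effect block bootstrap of Chambers and Chandra (2013), not their exact procedure: estimated area effects and whole blocks of within-area residuals are resampled to generate replicate data sets, and the spread of replicate prediction errors estimates the MSE. All inputs (fixed_part, area_effects, the shrinkage weight gamma) are illustrative stand-ins for fitted model quantities.

```python
import random
import statistics

def block_bootstrap_mse(fixed_part, area_effects, residuals_by_area,
                        gamma=0.8, n_boot=500, seed=7):
    """Bootstrap MSE of a simple shrinkage predictor of area means.

    fixed_part: dict area -> fitted fixed-effect mean for that area
    area_effects: dict area -> estimated random area effect
    residuals_by_area: dict area -> list of unit-level residuals
    Returns dict area -> bootstrap MSE of the area-mean prediction.
    """
    rng = random.Random(seed)
    areas = list(fixed_part)
    effect_pool = list(area_effects.values())
    block_pool = list(residuals_by_area.values())
    sq_errors = {a: [] for a in areas}
    for _ in range(n_boot):
        for a in areas:
            # Resample an area effect and one whole residual block
            # (a block is one area's vector of unit-level residuals).
            true_mean = fixed_part[a] + rng.choice(effect_pool)
            block = rng.choice(block_pool)
            ys = [true_mean + e for e in block]
            # Shrinkage predictor: pull the sample mean toward the fixed
            # part, loosely mimicking an EBLUP with weight gamma.
            prediction = fixed_part[a] + gamma * (statistics.mean(ys) - fixed_part[a])
            sq_errors[a].append((prediction - true_mean) ** 2)
    return {a: statistics.mean(v) for a, v in sq_errors.items()}

# Hypothetical fitted quantities for three areas.
fixed = {"A": 0.30, "B": 0.45, "C": 0.20}
effects = {"A": 0.02, "B": -0.03, "C": 0.01}
resids = {"A": [0.05, -0.04, 0.01], "B": [-0.02, 0.03, -0.01], "C": [0.04, -0.05, 0.02]}
mse = block_bootstrap_mse(fixed, effects, resids)
```

Resampling residuals in blocks, rather than unit by unit, is the feature that preserves the within-area dependence structure that the standard variance estimators discussed above fail to capture.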
While the potential of regularly pairing survey data with geospatial data is clear, more work on research and tools is needed to further instill confidence in the estimates and facilitate use. Research could benefit from more comparative work on methods, ideally utilizing design-based simulations using georeferenced census data. These can examine several outstanding research questions, including the relative benefit of convolutional neural networks versus simpler estimation approaches, quantifying the benefits of including conditional random effects when using machine learning models, probing the robustness of methods for estimating the uncertainty associated with tree-based machine learning estimates, experimenting with different geospatial features, and determining the age threshold at which census-based estimates become less accurate than geospatial estimates in different contexts. The question of which features best predict changes is also important. For example, changes in the rate of building construction, the characteristics of new buildings, or changes in crop types and forecasted yields have not yet, to our knowledge, been tested as correlates of welfare changes. In addition to further research, further improvements to open-source tools are critical to make these techniques more accessible and educate users. This is particularly important to facilitate the adoption of more sophisticated methods in developing countries given the financial and technical constraints faced by national statistical offices and other practitioners. The R packages emdi, sae, SAEforest, and grf, all available on CRAN, are examples of good practice, as the documentation for each is clear, comprehensive, and up to date. No comparably user-friendly package exists for neural network models at this time.
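One building block for such tools is the spatial join that attaches grid-cell indicators (nighttime lights, building footprints, and so on) to georeferenced survey observations. A minimal stdlib sketch is below; the field names (lat, lon, ntl) and the regular 0.1-degree lat/lon grid are illustrative assumptions, not a description of any existing package.

```python
import math

def cell_id(lat, lon, resolution_deg=0.1):
    """Index of the grid cell containing a point, on a regular lat/lon grid."""
    row = math.floor((lat + 90) / resolution_deg)
    col = math.floor((lon + 180) / resolution_deg)
    return (row, col)

def link_geospatial(survey_rows, cell_indicators, resolution_deg=0.1):
    """Attach cell-level geospatial indicators to georeferenced survey rows.

    survey_rows: list of dicts with 'lat' and 'lon' keys
    cell_indicators: dict mapping (row, col) cell ids to indicator dicts
    Rows falling in cells with no indicator data keep only their own fields.
    """
    linked = []
    for row in survey_rows:
        cid = cell_id(row["lat"], row["lon"], resolution_deg)
        linked.append({**row, **cell_indicators.get(cid, {})})
    return linked

# Hypothetical inputs: two survey clusters and a grid of nighttime-lights values.
survey = [{"hh": 1, "lat": 6.17, "lon": 1.23}, {"hh": 2, "lat": 6.25, "lon": 1.21}]
grid = {cell_id(6.17, 1.23): {"ntl": 4.2}, cell_id(6.25, 1.21): {"ntl": 1.1}}
merged = link_geospatial(survey, grid)
```

Production tools would of course use proper geospatial libraries and raster formats, but the core operation, assigning each survey location to a grid cell and merging that cell's indicators, is this simple join.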
No matter how many new and improved methods are published by statisticians, they are unlikely to be used in practice without software that is accessible to non-specialists and thoroughly documented. User-friendly features automating common diagnostics and parallelizing across multiple cores to speed estimation, as implemented in the emdi package, are very valuable. Finally, software that makes it simple to obtain and link publicly available geospatial indicators with survey data will also help facilitate data integration. As these tools are developed, small area estimates that combine survey data with publicly available geospatial data will inevitably become more popular worldwide, belatedly fulfilling the promise demonstrated thirty-five years ago by Battese, Harter, and Fuller (1988).

References

Aiken, E., Bellue, S., Karlan, D., Udry, C., & Blumenstock, J. E. (2022). Machine learning and phone data can improve targeting of humanitarian aid. Nature, 603(7903), 864-870. Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests. The Annals of Statistics, 47(2), 1148-1178. Ayush, K., Uzkent, B., Tanmay, K., Burke, M., Lobell, D., & Ermon, S. (2021, May). Efficient poverty mapping from high resolution remote sensing images. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 1, pp. 12-20). Babenko, B., Hersh, J., Newhouse, D., Ramakrishnan, A., & Swartz, T. (2017). Poverty mapping using convolutional neural networks trained on high and medium resolution satellite images, with an application in Mexico. arXiv preprint arXiv:1711.06323. Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28-36. Bell, W. R. (2008, August). Examining sensitivity of small area inferences to uncertainty about sampling error variances.
In Proceedings of the American Statistical Association, Survey Research Methods Section (Vol. 327, p. 334). Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32. Burke, M., Driscoll, A., Lobell, D. B., & Ermon, S. (2021). Using satellite imagery to understand and promote sustainable development. Science, 371(6535), eabe8628. Chambers, R., & Chandra, H. (2013). A random effect block bootstrap for clustered data. Journal of Computational and Graphical Statistics, 22(2), 452-470. Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). Chi, G., Fang, H., Chatterjee, S., & Blumenstock, J. E. (2022). Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences, 119(3), e2113658119. Coady, D., Grosh, M. E., & Hoddinott, J. (2004). Targeting of transfers in developing countries: Review of lessons and experience. Corral, P., Himelein, K., McGee, K., & Molina, I. (2021). A Map of the Poor or a Poor Map?. Mathematics, 9(21), 2780. Corral, P., Molina, I., Cojocaru, A., & Segovia, S. (2022). Guidelines to Small Area Estimation for Poverty Mapping, World Bank. Diallo, M. S., & Rao, J. N. K. (2018). Small area estimation of complex parameters under unit-level models with skew-normal errors. Scandinavian Journal of Statistics, 45(4), 1092-1116. Dooley, C. A., Boo, G., Leasure, D. R., & Tatem, A. J. (2020). Gridded maps of building patterns throughout sub-Saharan Africa, version 1.1. University of Southampton: Southampton, UK. Edochie, I., Newhouse, D., Schmid, T., Tzavidis, N., Foster, E., Ouedraogo, A., Sanoh, A., & Savadogo, A. (forthcoming). Small Area Estimates of Poverty in Four West African Countries. Mimeo. Efron, B. (2020). Prediction, estimation, and attribution. International Statistical Review, 88, S28-S59. Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003).
Micro-level estimation of poverty and inequality. Econometrica, 71(1), 355-364. Engstrom, R., Newhouse, D., & Soundararajan, V. (2020). Estimating small-area population density in Sri Lanka using surveys and geo-spatial data. PLoS ONE, 15(8), e0237063. Engstrom, R., Hersh, J., & Newhouse, D. (2022). Poverty from space: Using high resolution satellite imagery for estimating economic well-being. The World Bank Economic Review, 36(2), 382-412. Erciulescu, A. L., Cruze, N. B., & Nandram, B. (2019). Model-based county level crop estimates incorporating auxiliary sources of information. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(1), 283-303. Esch, T., Brzoska, E., Dech, S., Leutner, B., Palacios-Lopez, D., Metz-Marconcini, A., ... & Zeidler, J. (2022). World Settlement Footprint 3D - A first three-dimensional survey of the global building stock. Remote Sensing of Environment, 270, 112877. Fay III, R. E., & Herriot, R. A. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277. González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78(5), 443-462. Guadarrama, M., Molina, I., & Rao, J. N. K. (2016). A comparison of small area estimation methods for poverty mapping. Statistics in Transition New Series, 17(1), 41-66. Gualavisi, M., & Newhouse, D. L. (2022). Integrating Survey and Geospatial Data to Identify the Poor and Vulnerable, Policy Research Working Paper no. 10257. Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790-794. Jiang, J., & Lahiri, P. (2006). Mixed model prediction and small area estimation. Test, 15, 1-96. Krennmair, P., Wurz, N., & Schmid, T. (2022).
Tree-Based Machine Learning in Small Area Estimation, The Survey Statistician vol. 86, 22-31. Kreutzmann, A. K., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M., & Tzavidis, N. (2019). The R package emdi for estimating and mapping regionally disaggregated indicators. Journal of Statistical Software, 91. Lange, S., Pape, U. J., & Pütz, P. (2022). Small area estimation of poverty under structural change. Review of Income and Wealth, 68, S264-S281. Leasure, D. R., Jochem, W. C., Weber, E. M., Seaman, V., & Tatem, A. J. (2020). National population mapping from sparse survey data: A hierarchical Bayesian modeling framework to account for uncertainty. Proceedings of the National Academy of Sciences, 117(39), 24173-24179. Lee, K., & Braithwaite, J. (2022). High-resolution poverty maps in Sub-Saharan Africa. World Development, 159, 106028. Khachiyan, A., Thomas, A., Zhou, H., Hanson, G., Cloninger, A., Rosing, T., & Khandelwal, A. K. (2022). Using Neural Networks to Predict Microspatial Economic Growth. American Economic Review: Insights, 4(4), 491-506. Krennmair, P., & Schmid, T. (2022). Flexible domain prediction using mixed effects random forests. Journal of the Royal Statistical Society Series C, 71(5), 1865-1894. Liu, E., Meng, C., Kolodner, M., Sung, E. J., Chen, S., Burke, M., ... & Ermon, S. (2023). Building Coverage Estimation with Low-resolution Remote Sensing Imagery. arXiv preprint arXiv:2301.01449. Lobell, D. B., Azzari, G., Burke, M., Gourlay, S., Jin, Z., Kilic, T., & Murray, S. (2020). Eyes in the sky, boots on the ground: Assessing satellite- and ground-based approaches to crop yield measurement and analysis. American Journal of Agricultural Economics, 102(1), 202-219. Lobell, D. B., Di Tommaso, S., Burke, M., & Kilic, T. (2021).
Twice Is Nice: The Benefits of Two Ground Measures for Evaluating the Accuracy of Satellite-Based Sustainability Estimates. Remote Sensing, 13(16), 3160. Marhuenda, Y., Molina, I., Morales, D., & Rao, J. N. K. (2017). Poverty mapping in small areas under a twofold nested error regression model. Journal of the Royal Statistical Society. Series A (Statistics in Society), 1111-1136. Masaki, T., Newhouse, D., Silwal, A. R., Bedada, A., & Engstrom, R. (2022). Small area estimation of non-monetary poverty with geospatial data. Statistical Journal of the IAOS, (Preprint), 1-17. McBride, L., Barrett, C. B., Browne, C., Hu, L., Liu, Y., Matteson, D. S., ... & Wen, J. (2022). Predicting poverty and malnutrition for targeting, mapping, monitoring, and early warning. Applied Economic Perspectives and Policy, 44(2), 879-892. Merfeld, J. D., Newhouse, D., Weber, M., & Lahiri, P. (2022). Combining Survey and Geospatial Data Can Significantly Improve Gender-Disaggregated Estimates of Labor Market Outcomes, World Bank Policy Research Working Paper no. 10175. Merfeld, J. D., & Newhouse, D. (2023). Improving Estimates of Mean Welfare and Uncertainty for Small Areas in Developing Countries. Molina, I., & Marhuenda, Y. (2015). sae: an R package for small area estimation. R Journal, 7(1), 81. Molina, I., & Rao, J. N. K. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3), 369-385. Newhouse, D., Merfeld, J., Ramakrishnan, A. P., Swartz, T., & Lahiri, P. (2022). Small Area Estimation of Monetary Poverty in Mexico using Satellite Imagery and Machine Learning. Ngo, D. K., & Christiaensen, L. (2019). The performance of a consumption augmented asset index in ranking households and identifying the poor. Review of Income and Wealth, 65(4), 804-833. Nguyen, M., Corral Rodas, P. A., Azevedo, J. P., & Zhao, Q. (2018). sae: A Stata package for unit level small area estimation. World Bank Policy Research Working Paper, (8630). Peterson, R. A., & Cavanaugh, J.
E. (2020). Ordered quantile normalization: a semiparametric transformation built for the cross-validation era. Journal of Applied Statistics, 47(13-15), 2312-2327. Pfeffermann, D., & Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within the selected areas. Journal of the American Statistical Association, 102(480), 1427-1439. Pfeffermann, D., & Sverchkov, M. (2009). Inference under informative sampling. In Handbook of Statistics (Vol. 29, pp. 455-487). Elsevier. Pokhriyal, N., & Jacques, D. C. (2017). Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences, 114(46), E9783-E9792. Rao, J. N., & Molina, I. (2015). Small area estimation. John Wiley & Sons. Chandra, H., Salvati, N., & Chambers, R. (2015). A spatially nonstationary Fay-Herriot model for small area estimation. Journal of Survey Statistics and Methodology, 3(2), 109-135. Singh, B. B., Shukla, G. K., & Kundu, D. (2005). Spatio-temporal models in small area estimation. Survey Methodology, 31(2), 183. Steele, J. E., Sundsøy, P. R., Pezzulo, C., Alegana, V. A., Bird, T. J., Blumenstock, J., ... & Bengtsson, L. (2017). Mapping poverty using mobile phone and satellite data. Journal of The Royal Society Interface, 14(127), 20160690. Tonneau, M., Adjodah, D., Palotti, J., Grinberg, N., & Fraiberger, S. (2022). Multilingual Detection of Personal Employment Status on Twitter. arXiv preprint arXiv:2203.09178. Tzavidis, N., Salvati, N., Pratesi, M., & Chambers, R. (2008). M-quantile models with application to poverty mapping. Statistical Methods and Applications, 17, 393-411. Tzavidis, N., Zhang, L. C., Luna, A., Schmid, T., Rojas-Perilla, N., Gordon, L. R., ... & Zimmermann, T. (2018). From start to finish. Journal of the Royal Statistical Society. Series A (Statistics in Society), 181(4), 927-979. Van Der Weide, R., Blankespoor, B., Elbers, C., & Lanjouw, P. (2022).
How Accurate Is a Poverty Map Based on Remote Sensing Data? World Bank Policy Research Working Paper no. 10171. Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., ... & Burke, M. (2020). Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11(1), 2583. You, Y. (2021). Small area estimation using Fay-Herriot area level model with sampling variance smoothing and modeling. Survey Methodology, 47(2), 361-371.