Small Area Estimation of Non-Monetary Poverty with Geospatial Data

This paper evaluates the benefits of combining household surveys with satellite and other geospatial data to generate small area estimates of non-monetary poverty. Using data from Tanzania and Sri Lanka and applying a household-level empirical best (EB) predictor mixed model, we find that combining survey data with geospatial data significantly improves both the precision and accuracy of our non-monetary poverty estimates. While the EB predictor model moderately underestimates standard errors of those point estimates, coverage rates are similar to standard survey-based standard errors that assume independent outcomes across clusters.


Policy Research Working Paper 9383
This paper uses data from Sri Lanka and Tanzania to evaluate the benefits of combining household surveys with geographically comprehensive geospatial indicators to generate small area estimates of non-monetary poverty. The preferred estimates are generated by utilizing subarea-level geospatial indicators in a household-level empirical best predictor mixed model with a normalized welfare measure. Mean squared errors are estimated using a parametric bootstrap procedure. The resulting estimates are highly correlated with non-monetary poverty calculated from the full census in both countries, and the gain in precision is comparable to increasing the size of the sample by a factor of three in Sri Lanka and five in Tanzania. The empirical best predictor model moderately underestimates uncertainty, but coverage rates are similar to standard survey-based estimates that assume independent outcomes across clusters. A variety of checks, including adding noise to the welfare measure and model-based and design-based simulations, confirm that the main results are robust. The results demonstrate that combining household survey data with subarea-level geospatial indicators can greatly increase the precision of survey estimates of non-monetary poverty at comparatively low cost.

This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dnewhouse@worldbank.org.

Introduction
The proliferation of big data obtained from satellites and mobile devices, as well as research demonstrating that satellite and mobile phone records are strongly correlated with household welfare, has sparked great interest in statistical methods that combine big data with survey data. 1 A key motivation for complementing survey data with comprehensive big data is the potential for small area estimation, which can produce more precise and granular estimates of socioeconomic indicators by combining household-level survey data in surveyed enumeration areas with predicted values from all enumeration areas, whether sampled or not. Established methods for small area estimation typically utilize household survey data and contemporaneous household census data, by first estimating a prediction model of welfare in the survey data and then using the estimated model parameters to simulate welfare in the census. 2 Because traditional small area estimation links household survey with census auxiliary data through the estimated model parameters, the set of potential predictor variables is limited to those present in both the survey and the census. Furthermore, when the wording of questions or timing of the data collection differs significantly between the census and the survey, small area estimates using the standard method can suffer from bias. Finally, because this method relies on census data, a new census is required to credibly estimate welfare changes for small areas. Census data are typically dated and collected once per decade at best, which presents a significant hurdle to deriving timely small area poverty estimates.
To overcome these challenges, we link survey data with geospatial and remote-sensing auxiliary data at the subarea level. 3 Subarea refers to the lowest geographic level at which spatial auxiliary data can be merged with household surveys containing data on welfare and poverty, while area refers to the target level for the small area poverty estimates. In recent household surveys, GPS coordinates of households, enumeration areas, or villages are often available, which allows analysts to link survey data with other auxiliary sources of remote-sensing or geospatial data and improve the precision of poverty estimates.
Although linking surveys with geospatial auxiliary data generates less precise small area estimates than using a census, it offers several important benefits. Geographic linking allows the relationship between the spatial auxiliary indicators and economic welfare to be estimated directly in the prediction model, avoiding the restrictive assumption that predictors must be identically distributed in the survey and auxiliary data. In addition, some spatial auxiliary data are collected continuously and are highly predictive of subarea poverty, and both the quality and availability of geospatial data are improving rapidly. Compared to other forms of "big data" such as mobile phone data, indicators derived from satellite imagery are attractive because of the comparative ease of acquiring data and the absence of selection bias.
Despite the potential value of this approach, only a couple of existing studies have quantified the gain in the precision of small area estimates achieved by supplementing survey data with geospatial data, in the context of estimating crop acreage. 4 Existing studies that validate small area estimates of non-monetary poverty generated from mobile phone or geospatial data have, by and large, focused on extrapolation, or the use of big data to predict welfare in countries that are not surveyed. This differs from small area estimation, in which the objective is to predict poverty within the area covered by the sample, at a more granular level. To do this efficiently, it is crucial to use the survey data not only to train the prediction model, but also as a direct input into the estimates.
This study therefore makes three main contributions. First, it applies the prevailing framework for small area estimation to combine household survey data with geographically comprehensive geospatial indicators in an efficient way. Second, it evaluates the extent to which incorporating geospatial variables at the subarea level improves the precision of small-area poverty estimates. Finally, it assesses which of the commonly used SAE models, unit-level models (Elbers, Lanjouw, and Lanjouw, 2003; Molina and Rao, 2010) or the Fay-Herriot area-level model (Fay and Herriot, 1979), are best suited for combining survey and geospatial data to produce efficient and accurate estimates of both area-level poverty rates and the uncertainty associated with them.
We test these models in the context of generating small area estimates of non-monetary poverty in two developing countries, Sri Lanka and Tanzania. These countries were selected due to the availability of both census and survey data with subarea-level identifiers and matching polygon shapefiles close to the village level. The welfare prediction model uses survey data to estimate the value of a household welfare index as a function of subarea characteristics, and therefore differs from standard small area estimation models that predict welfare using household characteristics. The resulting estimates provide a large efficiency gain compared with direct survey estimates.
We mainly consider Empirical Best Predictor (EBP) models, which have a long history in small area estimation and can accommodate subarea-level auxiliary data to produce estimates of poverty rates and their mean squared error for target areas. EBP models differ from the "ELL method" (Elbers, Lanjouw, and Lanjouw 2003), that has traditionally been used for poverty mapping by the World Bank because they incorporate additional information from the survey by conditioning the area effect on survey residuals. 5 This can lead to substantial efficiency gains when the variation in welfare or poverty contributed by the sample is large, relative to the variation contributed by the auxiliary data. On the other hand, a drawback of EBP models is the required assumption that the stochastic error terms in the model follow a normal distribution. While this normality assumption can be tested by examining the residuals from the prediction model, violating the assumption will introduce bias into the estimates.
In the traditional case when the auxiliary data are household-level data from a census, the sample is a small fraction of the size of the census and EBP models typically lead to minor efficiency gains. 6 In these cases, it is not obvious that EBP models are preferred to simulations based on standard random effect models that do not condition the area effect on the sample but allow for departures from normality. When the auxiliary data are only available at the subarea level, however, the effective size and statistical power of the auxiliary data are much less than that of a full household census. The share of variation contributed by the sample therefore becomes larger, and as shown in results presented below, EBP models provide a large gain in efficiency compared with traditional ELL methods, which do not incorporate Empirical Best methods.
We compare the estimates generated by a household EBP model with direct estimates obtained solely from the survey, as well as the well-known Fay-Herriot area-level model. The latter sacrifices precision by discarding the variation in the geospatial indicators across subareas within target areas. Once the area estimates are obtained from the household EBP and Fay-Herriot models, we compare them to non-monetary poverty rates calculated directly from the full census. The availability of census data provides a credible benchmark for assessing how different methods and their variants perform, in terms of the accuracy of both the small area point estimates and their confidence intervals. We compare the predictions from different methods in terms of their precision, their accuracy, and the extent to which the estimated confidence intervals contain the true value, which is known as the coverage rate.
The main result is that incorporating remote sensing data in an EBP framework substantially improves the accuracy and precision of small area estimates of non-monetary poverty, largely by incorporating information from non-sampled subareas. 8 This comes at no cost to coverage rates in Sri Lanka and a moderate cost in Tanzania, compared with standard direct estimates. The corresponding efficiency improvement is comparable to roughly tripling the size of the survey in Sri Lanka and quintupling it in Tanzania. 9 The small area estimates moderately underestimate mean squared error by, first, failing to account for uncertainty in estimated variance parameters from the model, second, by incorrectly assuming that sample observations are independent within areas, and third, by failing to incorporate sample weights in model estimation. Estimated coverage rates remain respectable, however, at 75 percent in Tanzania and 84 percent in Sri Lanka when benchmarked. This is comparable to the 76 percent coverage rate in both countries when using standard direct estimates. In Tanzania, the estimates from the unit level model are roughly as accurate and moderately more efficient than those from the area-level Fay-Herriot model. In Sri Lanka, where the poverty rate is low and the Fay-Herriot model is less predictive, the estimates from the unit-level model are substantially more accurate and efficient than the Fay-Herriot estimates.
These results are robust to a variety of robustness checks that explore alternative implementation options, including the absence of benchmarking and the use of a noisier welfare measure. When using the noisier welfare measure, the gain in efficiency is not as large in Sri Lanka, on the order of doubling the size of the sample, but the predictions remain accurate and coverage rates remain high. Finally, to further probe the robustness of the results, we implement both a model-based simulation and a design-based simulation of 250 randomly selected samples covering three regions in Northeast Tanzania. The small area estimates perform exceedingly well in the model-based simulation, and the design-based simulation also shows that the small area estimates substantially improve on the direct estimates in terms of efficiency and accuracy while maintaining high coverage rates.

The remainder of the paper is organized as follows. Section II describes the data. Section III describes the methodology and estimators that are evaluated. Section IV assesses results in terms of efficiency, accuracy, and coverage. Section V considers a variety of robustness checks of the main method. Section VI considers the results of model-based and design-based simulations, and section VII concludes.

II. Description of the data
A. Constructing measures of non-monetary welfare and poverty in the census

The analysis is conducted using measures of non-monetary poverty constructed from the Sri Lankan and Tanzanian censuses, which were each conducted in 2012. Non-monetary poverty, unlike monetary poverty, is directly observed in the census population and can therefore provide a benchmark for evaluation. For each country, we identified a set of household asset and demographic proxies for welfare. Principal component analysis, weighted according to household size, was used to derive a score for each proxy welfare indicator. The welfare proxies used in the index and their estimated scores, or loading factors, are listed in Table 1. Each household's non-monetary welfare measure is obtained by summing the product of each loading factor and the household's value of the associated variable. No subsequent adjustment for household size was made, reflecting the non-rival nature of the index variables within the household, although household size is included as a welfare proxy in the index for both countries. The results are broadly reasonable, as higher levels of education for the household head are associated with higher non-monetary welfare in both countries, assets and dwelling conditions are positively associated with welfare in Sri Lanka, and non-agricultural work is strongly associated with non-monetary welfare in Tanzania.
This measure of non-monetary welfare can be compared with a poverty line threshold to classify each household as poor or non-poor. We set the threshold at the 4th percentile of non-monetary welfare in Sri Lanka and the 20th percentile in Tanzania, to reflect prevailing national monetary poverty rates. In other words, households were classified as non-monetarily poor if their non-monetary welfare, defined as the score of the first principal component, fell in the bottom four percentiles in Sri Lanka and the bottom quintile in Tanzania. Below in the robustness checks section, we also consider results when the poverty rates of the two countries are swapped.
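The index construction and percentile-based classification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the household-size weighting is applied here through a weighted covariance matrix, all variable names are hypothetical, and the data are simulated.

```python
# Sketch: welfare index from the first principal component of asset/demographic
# proxies, weighted by household size, then a percentile-based poverty cutoff.
import numpy as np

def welfare_index(X, hh_size):
    """Score of the first principal component of standardized proxies,
    with households weighted by size via a weighted covariance matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    w = hh_size / hh_size.sum()
    Xc = Xs - (w[:, None] * Xs).sum(axis=0)     # weighted centering
    cov = (w[:, None] * Xc).T @ Xc              # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    loadings = eigvecs[:, -1]                   # eigenvector of largest eigenvalue
    return Xc @ loadings                        # household welfare scores

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                  # six illustrative welfare proxies
hh_size = rng.integers(1, 9, size=1000).astype(float)

score = welfare_index(X, hh_size)
threshold = np.percentile(score, 20)            # 20th percentile, as for Tanzania
poor = score < threshold
```

By construction, roughly 20 percent of households fall below the threshold; using the 4th percentile instead reproduces the Sri Lanka cutoff.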

B. Constructing synthetic household surveys
With a measure of non-monetary welfare and poverty for each census household in hand, we turn to drawing a synthetic household survey from each country's census. The synthetic survey and the auxiliary geospatial data are key inputs into the small area estimation procedures. To draw the synthetic survey, we utilize the actual two-stage sample conducted by the National Statistics Offices for two household budget surveys: the 2018 Tanzania Household Budget Survey and the 2016 Sri Lanka Household Income and Expenditure Survey. These surveys were merged with the census at the subarea level, which is the GN Division in Sri Lanka and the village in Tanzania. After retaining the GN Divisions and EAs present in the budget survey, we randomly select census households in each matching EA to match the number of households in each EA for each survey. Finally, we merged the sample weights from the household budget surveys for each subarea. Essentially, this procedure draws a survey that mimics as much as possible the sample drawn by the NSO for the budget surveys.
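The draw described above can be sketched as follows. Column names (`ea_id`, `n_sampled`, `weight`) and the toy sizes are illustrative assumptions, not the actual census or survey layout.

```python
# Sketch: draw a synthetic survey from the census by sampling, within each
# enumeration area retained from the budget-survey design, as many census
# households as were actually interviewed there, carrying over survey weights.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy census: 5 EAs with 100 households each
census = pd.DataFrame({
    "ea_id": np.repeat(np.arange(5), 100),
    "hh_id": np.arange(500),
})

# EA sample sizes and weights from the (hypothetical) budget survey
survey_design = pd.DataFrame({
    "ea_id": [0, 2, 4],          # EAs that appear in the budget survey
    "n_sampled": [12, 12, 12],   # households interviewed in each EA
    "weight": [3.1, 2.4, 5.0],   # sample weight carried over to the draw
})

synthetic = (
    census.merge(survey_design, on="ea_id")   # keep only surveyed EAs
          .groupby("ea_id", group_keys=False)
          .apply(lambda g: g.sample(n=int(g["n_sampled"].iloc[0]), random_state=1))
)
```

The inner merge drops non-sampled EAs, and the per-group `sample` call reproduces each EA's realized sample size.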

C. Remote sensing data
The auxiliary data for the small area estimation exercise are drawn from a large candidate pool of satellite-based information, most of which is derived from publicly available layers and imagery. These include night-time lights from the Visible Infrared Imaging Remote Sensor (VIIRS), at a spatial resolution of 15 arc-seconds, precipitation data from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS), elevation and slope taken from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) satellite, global forest cover change from Hansen (2013), and estimates of built-up area from the Global Human Settlement Layer (GHSL). From this last layer, we compute the percentage of total built-up area observed in 2014 that was constructed prior to 1975 or during 1975-1990, 1990-2000, and 2000-2014.

The Sri Lanka indicators were also supplemented by a variety of spatial "texture" features derived from a cloud-free mosaic of 2017-2018 Sentinel-2 imagery, which is collected every 5 days by Sentinel-2 sensors on board two satellites, Sentinel 2A and 2B. This imagery is made publicly available by the European Space Agency and is the highest spatial resolution (10m per pixel) optical data that are publicly available. The "texture" features were calculated using Jordan Graesser's Sp.Feas package, an open-source Python library for processing contextual image features from satellite imagery. These contextual features have been shown to be strongly correlated with poverty and population density (Engstrom et al, 2017 and 2019a). The following contextual features were considered as candidate predictors:

- Fourier Transform (FT), which is used to detect high or low frequency of lines.
- Gabor Filter, a linear Gaussian filter used for edge detection (Mehrota et al, 1992).
- Histogram of Oriented Gradients (HOG), which captures the orientation and magnitude of the shades of the image (Dalal and Triggs, 2005).
- Lacunarity (LAC), which describes the extent of gaps and holes in a texture. Low-lacunarity geometric objects are homogeneous because all gap sizes are the same, whereas high-lacunarity objects are heterogeneous (Myint et al, 2006).
- Line Support Regions (LSR), which characterize line attributes (Ünsalan and Boyer, 2004).
- Normalized Difference Vegetation Index (NDVI), the most widely used vegetation index, which provides information about the health and amount of vegetation.
- PanTex, a built-up presence index derived from the grey-level co-occurrence matrix (Pesaresi et al, 2008).
- Structural Feature Sets (SFS), statistical measures used to extract the structural features of direction lines (Huang et al, 2007).

These spatial features are calculated by comparing pixels with their neighbors and then reporting this value back to the individual pixel (in this case 10m). The number of neighboring pixels considered in the comparison is the scale, which varies by the feature being calculated. For most features, we use scales of 3, 5, and 7, which correspond to squares of 3 pixels by 3 pixels, 5 pixels by 5 pixels, and 7 pixels by 7 pixels.
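To make the moving-window logic concrete, the sketch below computes one simple contextual feature, the local standard deviation of a single band at scales 3, 5, and 7. This is an illustrative stand-in for the features listed above, which were produced with the Sp.Feas package; the toy 64x64 band substitutes for an actual 10m Sentinel-2 band.

```python
# Sketch: a moving-window "texture" feature. For each pixel, summarize its
# scale x scale neighborhood and report the value back to that pixel.
import numpy as np
from scipy.ndimage import uniform_filter

def local_std(band, scale):
    """Standard deviation within a scale x scale window around each pixel."""
    mean = uniform_filter(band, size=scale)
    mean_sq = uniform_filter(band * band, size=scale)
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))

rng = np.random.default_rng(2)
band = rng.random((64, 64))            # stand-in for a 10 m Sentinel-2 band

# One feature layer per scale, stacked into a (scale, row, col) array
features = np.stack([local_std(band, s) for s in (3, 5, 7)])
```

Larger scales smooth over wider neighborhoods, so the same feature computed at scales 3, 5, and 7 captures texture at progressively coarser spatial frequencies.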
The remote sensing data used for Tanzania, while largely derived from publicly available imagery, were supplemented by proprietary data on building footprints produced by Ecopia and Maxar. 10 The publicly available layers considered for Tanzania included:

- Night-time lights, taken from the same VIIRS imagery used for Sri Lanka.
- Population estimates taken from WorldPop, which are in turn derived from the 2011 census and distributed using measures of built-up area.
- Precipitation estimates taken from Willmott and Matsuura (2018).
- Elevation and slope taken from Jarvis (2008) and downloaded from AidData's geoquery database.
- Built-up area from the same GHSL data used for Sri Lanka, for the same periods.
- Climatic region, obtained from Kottek et al (2006). This indicator divides land into five main climate groups (tropical, dry, temperate, continental, and polar), with each group divided based on seasonal precipitation and temperature patterns, based on a raster version of the map available at 0.5° resolution.

The full set of satellite indicators used for each country is listed below in Table 2.

D. Geographic structure of Sri Lanka and Tanzania
Before describing the methodology used in the study, we briefly review the geographic structure of the two countries. Table 3 shows the subareas, areas, and super-areas for each country. Subareas are the lowest level for which shapefiles were obtained, and are therefore the most disaggregated level for which geospatial data could be linked with the survey. Areas are the target domains for the small area estimates, which are at a higher level than the subareas, but below the level at which the survey is considered to be representative. Finally, super-areas are the lowest level for which the household budget survey is considered to be representative.

III. Methods
This section describes the three methods that are evaluated: Direct survey estimates, the Fay-Herriot area-level model, and the household-level model. It also discusses important methodological considerations involved when estimating the household-level model, such as differences between software packages, transforming the dependent variable, the model selection procedure, and benchmarking. The final subsection considers the criteria used to evaluate the performance of different estimators.

A. Direct survey estimates
Direct estimates obtained from the survey serve both as a benchmark against which to assess the increase in precision from incorporating satellite data, and as an input into the Fay-Herriot area-level models. We use two methods to obtain direct survey-based estimates. For both methods, the mean poverty estimate for each area is the weighted average, across households, of the dummy variable indicating whether each household is poor. Each household is weighted according to its corresponding population weight in the survey. The two methods used to obtain direct estimates differ in how they estimate the variance of the means. The typical method of obtaining variance estimates clusters the residuals by enumeration area, which accounts for positive correlation within enumeration areas but assumes that disturbances across enumeration areas are independent. This assumption underestimates the variance of area-level poverty estimates, particularly if welfare is highly correlated across enumeration areas within target areas.
The Horvitz-Thompson (H-T) approximation offers an alternative, more conservative, approach to estimating the variance of the area poverty rates (Horvitz and Thompson, 1952). This approach is more conservative because it relaxes the standard assumption that households' poverty statuses are independent if they live in different enumeration areas within the same target area. This comes, however, at the cost of imposing an alternative assumption, which is that the probability of one household appearing in the sample is independent of the probability of any other household appearing in the sample. This assumption is clearly violated in standard two-stage sample designs, as the probability of a household appearing in the sample is greater if a household in its same enumeration area also appears in the sample. Nonetheless, the violation of this second assumption underestimates variance by less than the standard assumption of clustered standard errors, and therefore yields less biased variance estimates than the standard approach. We include results for direct estimates based on both the Horvitz-Thompson approach and the standard approach and, in results presented below in Table 9, find substantially higher coverage rates for the former.
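A rough numerical sketch of the two direct estimators is given below. These are textbook linearized variance formulas, not the exact formulas used in the paper or in survey software (which include refinements such as finite-population corrections), and all data are simulated.

```python
# Sketch: weighted direct poverty estimate for one area, with (i) a clustered
# variance that treats EAs as independent and (ii) an approximation that
# treats households as independent draws. Both are illustrative linearizations.
import numpy as np

def direct_estimate(poor, w, ea):
    p_hat = np.sum(w * poor) / np.sum(w)
    z = w * (poor - p_hat) / np.sum(w)      # linearized score per household

    # (i) clustered: sum scores within EAs, assume EAs are independent
    ea_ids = np.unique(ea)
    zc = np.array([z[ea == c].sum() for c in ea_ids])
    C = len(ea_ids)
    var_clustered = C / (C - 1) * np.sum(zc**2)

    # (ii) household-independence approximation
    n = len(poor)
    var_indep = n / (n - 1) * np.sum(z**2)
    return p_hat, var_clustered, var_indep

rng = np.random.default_rng(3)
ea = np.repeat(np.arange(8), 10)            # 8 EAs, 10 households each
poor = rng.binomial(1, 0.25, size=80).astype(float)
w = rng.uniform(1.0, 4.0, size=80)

p_hat, v_cl, v_indep = direct_estimate(poor, w, ea)
```

Which estimator is larger in any given sample depends on the correlation structure of the data and the weights; the point of the sketch is only to show where the independence assumptions enter each formula.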

B. Fay-Herriot area-level model
The Fay-Herriot model was first introduced by Fay and Herriot (1979) to model incomes for small areas of fewer than 1,000 persons in the United States, and is perhaps the best-known and most widely used small area estimator. It was derived as an application of the James-Stein (1961) estimator using data aggregated to the target area level, and is also the Empirical Best Linear Unbiased Predictor (EBLUP) of the area-level regression. The small area estimates are a weighted average of the synthetic regression prediction and the direct survey estimate, with the weights inversely related to the estimated variance of the direct and synthetic estimates. Many variants of the Fay-Herriot estimator have been developed that employ transformations and account for spatial and intertemporal correlation, among other refinements.
We estimate the following basic area-level Fay-Herriot model:

(1)  θ̂_a = X_a β + u_a + e_a

where θ̂_a is the direct estimate of headcount poverty in area a obtained from the survey; X_a is a vector of area-level aggregate remote sensing variables, created by taking the population-weighted average of subarea-level aggregates; u_a is a random effect, assumed to be normally distributed with mean 0 and variance σ²_u; and e_a is the sampling error, also assumed to be normally distributed with mean 0 and variance σ²_ea. The variance σ²_ea is set equal to the variance of the direct survey estimate, calculated using the Horvitz-Thompson approximation described above, as recommended by Halbmeier et al (2019).
As noted above, the predictions resulting from the Fay-Herriot model can be expressed as a weighted average of the direct estimates and the model predictions, with the weight on the direct estimate equal to γ_a = σ̂²_u / (σ̂²_u + σ̂²_ea), known as the shrinkage factor. The shrinkage factor is the weight given in each area to the direct estimator relative to the model prediction. It declines as the variance of the direct sample estimate σ̂²_ea increases, because the direct sample estimate is less precise. Conversely, the shrinkage factor increases as the estimated random effect becomes less precise and σ̂²_u increases, giving greater weight to the direct sample estimate.

The Fay-Herriot estimator, unfortunately, has several weaknesses for the purposes of generating small area estimates of poverty using spatial auxiliary data. The main shortcoming is that the Fay-Herriot model requires aggregating the auxiliary data to the area level, which discards variation in the auxiliary data at the sub-area level that would increase the precision of the estimates. Implementing a sub-area model that appropriately adapts the Fay-Herriot model to address this issue, such as the one proposed by Torabi and Rao (2014), can incorporate auxiliary data at the sub-area level but is currently constrained by the lack of available software. In addition, the Fay-Herriot model is based on direct estimates of poverty rates, and therefore discards information on the variation in welfare within the poor and non-poor portions of the welfare distribution. Another issue with the Fay-Herriot model is that, when no transformation is applied, the poverty rate is assumed to be a linear function of the predictors. The Fay-Herriot model also faces challenges when no households are poor in a significant number of areas: for these areas, either the prediction from the auxiliary data must be ignored, the sample must be ignored, or the area must be dropped from the estimation.
Finally, the Fay-Herriot model does not account for uncertainty in the survey-based estimates of variance, which tends to underestimate the variance of the prediction.
Despite these limitations, the Fay-Herriot model is much simpler to apply and explain than a unit level model. It generally provides credible estimates and is a standard workhorse of small area estimation. We estimate a Fay-Herriot model by aggregating all indicators to the area level, weighting by population. The Fay-Herriot models are estimated using the Stata FHSAE command (Corral, et al, 2018). We utilize the "Chandra method", which estimates the prediction model using GLS (Chandra et al, 2013).
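The weighted-average logic of the Fay-Herriot estimator can be sketched numerically as follows. This is a simplified illustration, not the paper's estimator: it uses OLS for the synthetic prediction and a crude method-of-moments variance estimate rather than the GLS-based "Chandra method", and all data are simulated.

```python
# Sketch: Fay-Herriot EBLUP as a shrinkage-weighted average of direct
# estimates and a synthetic regression prediction.
import numpy as np

def fay_herriot(theta_direct, X, var_e):
    # Synthetic prediction (OLS here; GLS/ML refinements omitted for clarity)
    beta, *_ = np.linalg.lstsq(X, theta_direct, rcond=None)
    synth = X @ beta
    resid = theta_direct - synth

    # Crude method-of-moments estimate of the random-effect variance,
    # truncated at zero as is standard
    sigma2_u = max(np.mean(resid**2 - var_e), 0.0)

    # Shrinkage factor: weight on the direct estimate in each area
    gamma = sigma2_u / (sigma2_u + var_e)
    return gamma * theta_direct + (1 - gamma) * synth, gamma

rng = np.random.default_rng(4)
A = 40
X = np.column_stack([np.ones(A), rng.normal(size=A)])   # area-level predictors
true_theta = 0.3 + 0.1 * X[:, 1]
var_e = rng.uniform(0.001, 0.01, size=A)                # sampling variances
theta_direct = true_theta + rng.normal(0.0, np.sqrt(var_e))

fh_estimates, gamma = fay_herriot(theta_direct, X, var_e)
```

Areas with noisier direct estimates (larger `var_e`) receive smaller `gamma` and are pulled harder toward the regression prediction, exactly the behavior described above.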

C. Household-level model
In contrast to area level models, unit level models are specified at the level of the individual unit, which in this case is the household. 11 Although the auxiliary data only varies at the sub-area level, predicting welfare at the household level fully utilizes the information in the sample survey on the level and variability of household welfare. In addition, unlike sub-area models, well-established software packages are available that utilize empirical best methods to estimate unit-level models with area-level mixed effects.
The EBP model can be written as follows: 12

(3)  G(Y_i) = β₁ X_sa + β₂ X_a + β₃ X_r + η_a + ε_i

where Y_i is the non-monetary welfare for household i, which in each country is the principal component score constructed using the loading factors in Table 1, and G(Y_i) is a transformed version of welfare. Traditionally, ELL uses the log function to transform welfare, but any monotonic transformation can be used. As described in greater detail below, we use an ordered quantile normal transformation for the primary set of results.
Xsa is a set of satellite-derived indicators that vary at the sub-area (village or GN Division) level.
Xa is a set of satellite-derived indicators that vary at the area level. The areas are sub-districts in Sri Lanka and districts in Tanzania. These predictors are important to include to reduce the variance of the area random effect and thereby mitigate bias in the estimated mean squared error (Elbers, Lanjouw, and Leite, 2008).
Xr is a set of regional dummy variables indicating super-areas for which the survey is considered to be representative. The Sri Lanka specification also includes dummies for the rural and estate sectors, while the Tanzania specification includes a rural dummy. 13

η_a is a mixed effect specified at the area level, assumed to be distributed normally. Although traditionally ELL has specified the area effect at the cluster or enumeration area level, doing so underestimates the mean squared error by failing to account for correlation in model error across clusters within areas (Marhuenda et al, 2018; Das and Chambers, 2018). 14 η_a is conditioned on the sample values of observed welfare, denoted y_s.

ε_i is a household-level idiosyncratic error term, also assumed to be distributed normally.
A key feature of the EBP framework is that the random area effect η_a is conditioned on the sample values of observed welfare y_s. This, along with the assumption that the error components are normally distributed, allows simulated welfare to be generated by drawing the area effect from its conditional distribution:

η_a | y_s ~ N(γ_a ū_a, σ²_η (1 − γ_a)),  where  γ_a = σ²_η / (σ²_η + σ²_ε / n_a)

Here ū_a is the mean regression residual in area a, n_a is the number of sample observations for area a, and γ_a is the shrinkage factor due to conditioning the random effect on the sample mean for area a. The shrinkage in variance due to conditioning the random effect on the sample will be higher when n_a is large or when the variance of the residual ε is low, because in these cases the sample contributes more information about the unexplained component of welfare. Shrinkage falls as the estimated within-area correlation in welfare, σ²_η, declines, because in those cases a larger share of unobserved welfare is idiosyncratic to households rather than common to areas, so the sample residuals convey less information about the true area mean. Similarly, the shrinkage factor γ_a declines as the area sample size n_a declines, because the sample contributes less information about welfare in the area.
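The conditioning step just described can be illustrated with a few lines of code. The formulas follow the standard nested-error shrinkage expressions (as in Molina and Rao, 2010); the specific numbers below are toy values chosen only to show the comparative statics.

```python
# Sketch: conditional distribution of the area random effect in the EBP
# framework. gamma_a = sigma2_eta / (sigma2_eta + sigma2_eps / n_a);
# eta_a | sample ~ N(gamma_a * rbar_a, sigma2_eta * (1 - gamma_a)).

def conditional_area_effect(rbar_a, n_a, sigma2_eta, sigma2_eps):
    """rbar_a: mean regression residual in area a; n_a: area sample size."""
    gamma_a = sigma2_eta / (sigma2_eta + sigma2_eps / n_a)
    cond_mean = gamma_a * rbar_a
    cond_var = sigma2_eta * (1.0 - gamma_a)
    return gamma_a, cond_mean, cond_var

# Larger samples shrink the effect harder toward the sample residual mean
# and tighten the conditional distribution.
g_small, m_small, v_small = conditional_area_effect(0.4, n_a=5, sigma2_eta=0.2, sigma2_eps=1.0)
g_large, m_large, v_large = conditional_area_effect(0.4, n_a=50, sigma2_eta=0.2, sigma2_eps=1.0)
```

With these toy parameters, moving from 5 to 50 sampled households raises the shrinkage factor from 0.5 to about 0.91 and cuts the conditional variance by more than a factor of five, which is exactly the mechanism that makes the EBP more efficient than unconditional ELL-style simulation.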
This basic structure is appealing in part because it is similar to the Elbers, Lanjouw, and Lanjouw (2003) model that has been used extensively for survey-census poverty mapping for decades, with the main difference being the conditioning of the random effect on the mean sample residuals. The concept of shrinkage in the nested error EBP framework is well-established in the small area estimation context (Battese, Harter, and Fuller, 1988, Jiang and Lahiri, 2006, Molina and Rao, 2010, Van der Weide, 2014). In addition, unlike Bayesian spatial autocorrelation models, nested error random effect models do not impose additional smoothness in model errors across neighboring areas. Spatial correlation is instead modeled entirely through the area random effects, which are conditioned on the sample residuals for that area.
As noted above, it is not clear that Empirical Best methods are preferred in the traditional poverty mapping setting where a household census serves as the auxiliary data. While EB methods benefit from incorporating information from the sample, the sample is typically a small share of the census population, leading to limited benefits from using EB. Meanwhile, an important drawback of EB is the required assumption that the components of the error term are normally distributed, which can reduce the accuracy of the estimates if the underlying distribution of the residuals is not normal. The choice of whether to use EB in a typical case, when the auxiliary data are a household census, must weigh the cost of assuming normality against the benefit of obtaining additional information from a relatively small sample.
The calculus changes, however, when the auxiliary data are specified at an aggregate geographic level, such as a subarea or area. In this case, EB estimation is essential to effectively combine the survey data with census-based predictions. Intuitively, the auxiliary data contribute less variation than the sample when they only vary across sub-areas, because the sample, unlike the auxiliary data, contains household-level information. In the EBP framework described above, the use of predictors aggregated to the sub-area level increases the variance of the area effect, σ²_η, which also represents the covariance of unexplained welfare across households within the same area. This covariance rises when using sub-area predictors because all households within each sub-area share the same set of characteristics used to predict welfare in the model. The increase in σ²_η due to the use of regionally aggregated predictors in turn increases the shrinkage factor γ_a, leading to a greater increase in precision when using an EBP model versus the traditional ELL estimator.
The assumption of normality is critical in the EBP framework, as a distributional assumption is required to condition the area-level random effect on the average residual of the regression in that area. In the EBP model, this conditioning is equivalent, in a Bayesian framework, to specifying the average of the residuals as a prior distribution for the random effect under the assumption of normality. The normality assumption is also necessary for the Monte Carlo simulations of the parameters, which are required because headcount poverty is a non-linear function of welfare. In general, software packages generate small area poverty estimates by repeatedly drawing area effects and household idiosyncratic errors from a normal distribution and aggregating the simulation results. This results in biased estimates of poverty rates if the true error terms implied by the model are not in fact normal. Below, we discuss a monotonic transformation of the welfare measure that makes the assumption that the area effect and the household error term are normally distributed more palatable.

Table 4 summarizes the three software packages that we utilize to estimate household-level models. The primary set of results reported below is estimated using a modified version of the R EMDI package (Kreutzmann et al., 2018). 15 We use this package for the main results because it is well-documented, flexible, and fast when running on powerful computers that can exploit parallel processing. In addition, it appends the sample to the census in order to generate slightly more accurate estimates. This package, like others designed for poverty mapping, estimates the mixed model (3) and then repeatedly draws both area effects and household error terms to simulate welfare. Simulated welfare is compared to the poverty line to determine the poverty status of each household in each simulation, which is aggregated across simulations to estimate poverty in each area.
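The simulation stage described above can be sketched as follows. This is a deliberately minimal illustration with made-up inputs, not the EMDI implementation; it omits the parametric bootstrap, household weights, and the model-fitting step that would supply the conditional moments:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_area_poverty(xb, area_ids, cond_mean, cond_var, sigma2_eps,
                          z_line, n_sims=100):
    """Minimal sketch of the EBP simulation stage: xb holds predicted
    (transformed) welfare x'beta for every census household, cond_mean /
    cond_var give the conditional distribution of each area's random
    effect, and z_line is the poverty line on the transformed scale."""
    areas = np.unique(area_ids)
    poor_share = np.zeros((n_sims, len(areas)))
    for s in range(n_sims):
        # one draw of the area effect per area, mapped to its households
        eta = {a: rng.normal(cond_mean[a], np.sqrt(cond_var[a])) for a in areas}
        y_sim = (xb
                 + np.array([eta[a] for a in area_ids])
                 + rng.normal(0, np.sqrt(sigma2_eps), size=len(xb)))
        for j, a in enumerate(areas):
            # headcount: share of simulated welfare below the line
            poor_share[s, j] = np.mean(y_sim[area_ids == a] < z_line)
    return poor_share.mean(axis=0)  # poverty rate estimate per area

rates = simulate_area_poverty(
    xb=np.zeros(200), area_ids=np.repeat([0, 1], 100),
    cond_mean={0: 0.0, 1: 0.0}, cond_var={0: 0.1, 1: 0.1},
    sigma2_eps=0.9, z_line=0.0, n_sims=50)
```

With centered inputs and the line at zero, both simulated area rates land near 50 percent, as expected.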
Estimates of mean squared error are obtained through a parametric bootstrapping approach, which is described in detail in González-Manteiga et al. (2008) and Molina and Marhuenda (2015). We leverage the open nature of the code to make one modification to the EMDI package related to the use of weights, which allows household size to be used as a weight when aggregating the simulated household welfare outcomes into area-level poverty estimates. 16 As a robustness check, we also utilized the second and third versions of an alternative software package, the Stata SAE package (Nguyen et al., 2018). 17 The main differences between the packages are summarized in Table 4 and discussed in Annex A. These differences in design and methodology across packages, particularly the choice of a non-parametric versus parametric bootstrap method, affect the resulting estimates of poverty rates and their mean squared errors. Comparing simulated small area estimates against actual census data provides evidence on how these differences affect estimated poverty rates and mean squared errors in two use-case scenarios.

E. Normalizing transformation
Welfare typically follows a right-skewed distribution, and it is therefore standard practice to transform the welfare indicator before implementing small area estimation models based on simulated welfare. The most common transformation is to take the logarithm of welfare, following Elbers, Lanjouw, and Lanjouw (2003). A recent review article considers the issue of transformations for small area estimation in detail (Tzavidis et al., 2018). It reports design-based simulation results using Mexican data that considered alternative transformations, including log-shift and Box-Cox transformations (Box and Cox, 1964). A log-shift transformation involves adding a constant to welfare prior to taking the log, while the Box-Cox transformation can be written as y(λ) = (y^λ − 1)/λ for λ ≠ 0, and y(λ) = log(y) for λ = 0. Tzavidis et al. (2018) conclude on the basis of the simulation results that log-shift transformations tend to perform well. Stata has implemented two commands that can quickly estimate the parameters for the log-shift and Box-Cox transformations that minimize the skewness of the welfare distribution. These commands, however, only go part of the way towards normalization, because they cannot guarantee that the resulting distribution will have a kurtosis of three, which is characteristic of a normal distribution. Furthermore, the log-shift transformation is not monotonic when the underlying distribution is left-skewed. 18 Finding an appropriate monotonic transformation for the welfare variable is important, as failure to satisfy the normality assumption can lead to systematic bias in EBP models. While benchmarking can be used to at least partially correct this bias, relying on benchmarking becomes more difficult to justify as the extent of the bias increases.
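As an illustration of the Box-Cox transformation just defined: scipy's implementation chooses λ by maximum likelihood rather than by minimizing skewness, so it differs from the Stata commands mentioned above, but the effect on a right-skewed variable is similar:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
welfare = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # right-skewed

# scipy picks lambda by maximum likelihood;
# transformed = (welfare**lmbda - 1) / lmbda for lmbda != 0, log otherwise.
transformed, lmbda = stats.boxcox(welfare)

skew_before = stats.skew(welfare)
skew_after = stats.skew(transformed)
```

For lognormal data the fitted λ is close to zero, so the transformation is close to a plain log, and the skewness of the transformed variable is near zero.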
For these reasons, our preferred estimates utilize a monotonic transformation called the "ordered quantile normalization" to transform welfare. This method transforms the scaled rank of the welfare variable to conform to a normal distribution. Variants of this method date back more than 70 years (Bartlett, 1947; Van der Waerden, 1952). Among several methods proposed in the literature, the ordered quantile normalization most consistently transforms an underlying variable to follow a normal distribution (Peterson and Cavanaugh, 2019). We utilize the orderNorm function in the bestNormalize R package to implement this transformation. The method produces an exactly normal distribution of the transformed variable when the original welfare measure contains no ties, irrespective of its underlying distribution. While a normally distributed outcome variable does not guarantee normality of the residuals, it helps make the distribution of residuals closer to normal, and the ordered quantile normalization leads to smaller discrepancies between the official national poverty rate estimated from the survey and the weighted mean of the small area estimates.
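The idea behind the ordered quantile normalization can be sketched in a few lines. This is our simplified version using the common (rank − 0.5)/n convention; the orderNorm implementation differs in details such as tie handling and out-of-sample extrapolation:

```python
import numpy as np
from scipy import stats

def ordered_quantile_normalize(x):
    """Map each value to the normal quantile of its scaled rank, yielding
    an (approximately) standard normal variable whenever x contains no
    ties, regardless of the original distribution."""
    x = np.asarray(x, dtype=float)
    ranks = stats.rankdata(x)                   # 1..n (average rank for ties)
    return stats.norm.ppf((ranks - 0.5) / len(x))

rng = np.random.default_rng(2)
welfare = rng.exponential(scale=2.0, size=4001)  # heavily right-skewed
z = ordered_quantile_normalize(welfare)
```

Because the transformation depends only on ranks, it is strictly monotonic in the absence of ties, so household orderings, and hence poverty status at a matched percentile line, are preserved.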
To calculate small area estimates of the poverty gap, poverty severity, or inequality, it is critical to retransform the estimates back to the original welfare metric, as is done in the ELL method by exponentiating log welfare in each simulation. However, this retransformation is not necessary when estimating headcount poverty rates. This is because households' poverty status is preserved under any monotonic transformation applied to both the welfare measure and the poverty line threshold. We therefore set the poverty line to the percentile of the transformed welfare measure that corresponds to the assumed national poverty rate, which is 4 percent in Sri Lanka and 20 percent in Tanzania. Because the ordered quantile normalization transformation is based on household ranks within the distribution, applying it entails information loss regarding the shape of the distribution between household ranks, and alters the estimated parameters of the model. To verify that this loss of information is minor compared with the benefits of normalizing the welfare measure, we also consider in the robustness section below a more traditional Box-Cox normalization procedure designed to minimize skewness.
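The invariance of poverty status to monotonic transformations, provided the line is set at the matching percentile, is easy to verify directly (a toy check with simulated welfare and the log as the transformation):

```python
import numpy as np

rng = np.random.default_rng(3)
welfare = rng.lognormal(size=1000)

national_rate = 0.20                        # e.g. the assumed Tanzania rate
line_raw = np.quantile(welfare, national_rate)

log_welfare = np.log(welfare)               # any strictly increasing transform
line_transformed = np.quantile(log_welfare, national_rate)

# The same households are classified as poor under either metric.
poor_raw = welfare < line_raw
poor_transformed = log_welfare < line_transformed
```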

F. Model selection using the Lasso
Model selection in small area estimation remains an unsettled issue and has traditionally been treated as both art and science. Practice varies widely. Tzavidis et al. (2018), for example, select only 6 covariates present in Mexican census and survey data, based in part on the Akaike Information Criterion (AIC) from a standard OLS model. Most small area applications using ELL, however, use a considerably larger set of variables. Zhao and Lanjouw (2008), for example, note that many successful applications include fewer than 20 household-level variables. However, they also advocate including cluster-level means, which can push the number of variables well past 20 or 30.
In recent years, the least absolute shrinkage and selection operator (LASSO) has become an increasingly popular tool for selecting prediction models. Small area estimation offers a natural application of the "post-lasso" procedure, in which the variables selected via the lasso are used as predictors in the mixed model that estimates the parameters of the small area estimation model. 19 The post-lasso procedure offers a convenient and data-driven approach to model selection that, in its most popular variant, maximizes out-of-sample predictive accuracy while roughly equalizing in-sample and out-of-sample R² to prevent overfitting the model to the sample. The LASSO framework formalizes and extends the advice in Zhao and Lanjouw (2008) to use out-of-sample cross-validation to ensure that the model is stable and predicts well out of sample. Several varieties of LASSO have been implemented in popular statistical software. We use the plugin lasso estimator, also known as the rigorous lasso, implemented in Stata version 16 (StataCorp, 2019). This variant of lasso is computationally faster and more parsimonious than others that use cross-validation to minimize mean out-of-sample prediction error, and the resulting models maintain strong predictive power. 20
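A minimal post-lasso sketch on simulated data is shown below. For transparency we solve the lasso with a plain proximal-gradient (ISTA) loop rather than the Stata plugin estimator used in the paper; the selection-then-OLS-refit logic is the same:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Plain ISTA solver for the lasso:
    minimize (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(4)
n, p = 500, 40
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [1.0, -0.8, 0.5]    # only 3 true predictors
y = X @ beta + rng.normal(size=n)

b_lasso = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(b_lasso != 0)

# "Post-lasso": refit the selected columns by OLS; in the paper these
# selected variables then enter the mixed model.
b_post, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
```

The soft-thresholding step sets most spurious coefficients exactly to zero, so the refit uses a small, stable set of predictors.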

G. Benchmarking
Benchmarking refers to the procedure of forcing model-dependent estimators to agree with the direct sample estimates for super-areas, defined as the lowest geographic level for which the sample is considered representative. Benchmarking can be desirable to ensure that the population-weighted average of small area poverty estimates, when aggregated to super-areas, matches official published statistics derived from the survey. Simulations indicate that in the cross-sectional context, simple "ratio" benchmarking performs as well as other methods, and there is no significant loss of efficiency from benchmarking in general (Pfeffermann et al., 2014; Wang et al., 2008). We therefore apply a simple ratio benchmarking procedure by multiplying the estimated headcount rate in each area by a scaling factor unique to each super-area, for both the Fay-Herriot and household-level model estimates. The scaling factor is defined as the ratio of the estimated poverty rate obtained from the survey to the population-weighted mean poverty rate of the small area estimates, for each super-area.
This leaves open the question of how to adjust the mean squared error estimates while benchmarking. Ideally, to obtain accurate estimates of the mean squared error, benchmarking would be performed by the small area estimation package within each bootstrap replication. Unfortunately, this option is not currently implemented in available software packages. As a second-best solution, we scale the mean squared error by the square of the same scaling factor that is applied to the point estimates, which leaves the coefficient of variation unchanged by the benchmarking process. Because the point estimates are slightly underestimated in Tanzania and Sri Lanka, this procedure increases the mean squared error and improves coverage rates. However, as a robustness check, we also report results for estimates in which only the point estimates are benchmarked, and in which neither the point estimates nor the mean squared errors were benchmarked.
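The ratio benchmarking and the accompanying MSE adjustment can be sketched for a single super-area as follows (toy numbers; `ratio_benchmark` is our own illustrative helper):

```python
import numpy as np

def ratio_benchmark(est_rates, mse, pop_weights, survey_super_rate):
    """Sketch of ratio benchmarking for one super-area: scale each area's
    estimated poverty rate so the population-weighted mean matches the
    direct survey estimate, and scale the MSE by the squared factor so
    the coefficient of variation is unchanged."""
    weighted_mean = np.average(est_rates, weights=pop_weights)
    k = survey_super_rate / weighted_mean
    return k * est_rates, k**2 * mse

rates = np.array([0.10, 0.25, 0.40])
mse = np.array([0.0004, 0.0009, 0.0016])
weights = np.array([0.5, 0.3, 0.2])
bench_rates, bench_mse = ratio_benchmark(rates, mse, weights, 0.22)

# The CV (sqrt(MSE) / point estimate) is identical before and after.
cv_before = np.sqrt(mse) / rates
cv_after = np.sqrt(bench_mse) / bench_rates
```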

H. Criteria for evaluating estimates
The synthetic sample, auxiliary data, and methodology can be used to generate small area estimates of headcount poverty and their mean squared error. But how exactly does one evaluate the estimates produced by different small area estimation methods? We examine seven summary statistics, averaged across areas. We give each area equal weight in the evaluation to better reflect the goal of obtaining accurate estimates for each target area rather than estimates that can be aggregated to higher levels.
The seven summary statistics are divided into four buckets as follows: • Mean Poverty Rate and Average Relative Bias. The former represents the simple average of the predicted area headcount rates across areas, prior to benchmarking. For the direct survey estimates, the mean poverty rate is the unweighted average poverty rate across municipalities. The mean of the small area estimates can differ from this unweighted average, due to bias caused by either violations of the normality assumptions or a non-representative sample. Substantial bias would necessitate a greater reliance on benchmarking, which would in turn make the imperfect adjustment for mean squared errors more problematic. Therefore, estimators with mean poverty rates closer to those of the direct estimates are, ceteris paribus, preferred. Meanwhile, the average relative bias equals the average, across areas, of the ratio of the difference between the true and estimated poverty rates to the true poverty rate.
That is, average relative bias = (1/A) Σ_a (Θ*_a − Θ̂_a) / Θ*_a, where A is the number of areas, Θ̂_a is the estimated poverty rate for area a, and Θ*_a is the true poverty rate calculated from the census. Like the average poverty rate, it is a measure of systematic bias in the estimates rather than accuracy. It differs from the average poverty rate, however, by reflecting any systematic bias at the area level after the benchmarking procedure.
• Average Mean Squared Error (MSE) and Average Coefficient of Variation (CV). These measure the degree of uncertainty associated with the point estimates. Estimates of the MSE for the model-based estimates are generated using a parametric bootstrap under the assumption that the model holds, and much remains to be learned about how model misspecification affects MSE estimates. The CV for each area is defined as the ratio of the square root of the estimated mean squared error to the mean estimated poverty rate, a definition sometimes referred to as CV(RMSE). We report the average CV across areas. CVs are an important indicator of uncertainty because in many cases national statistics offices will not publish statistics whose CVs exceed a maximum threshold. 21 Comparisons of the average CV between the model-based and direct estimates should be interpreted with caution, because CVs of the direct estimates are not defined for areas with no poor households in the survey, so the set of areas included in the average differs for direct estimates.
• Correlation and Root Mean Squared Error. These measure the accuracy of the prediction. The correlation represents the simple correlation between the small area estimates and the actual headcount poverty rates calculated from the full census. The Root Mean Squared Error (RMSE) is equal to the square root of the average squared difference between the estimates and the actual census poverty rates. The root mean squared error in this context is a function of the predicted poverty rates and is therefore a measure of accuracy. In contrast, the average mean squared error described above is a measure of precision or uncertainty.
• Average Coverage Rate. The average coverage rate represents the share of areas for which the estimated 95 percent confidence interval contains the actual census poverty rate, and is a standard metric used to assess the performance of estimators in the literature. 22 The upper and lower bounds of the confidence interval are obtained by multiplying the square root of the estimated mean squared error by 1.96 and adding it to and subtracting it from the point estimate.
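For concreteness, the evaluation metrics above can be computed as follows (a sketch with invented numbers; areas are equally weighted, as in the paper, and relative bias follows the (true − estimated)/true convention):

```python
import numpy as np

def evaluate(est, true_rates, mse):
    """Compute the area-level evaluation metrics described above, giving
    each area equal weight. `est` and `mse` are the small area point
    estimates and their estimated MSEs; `true_rates` come from the census."""
    rmse_hat = np.sqrt(mse)
    lower, upper = est - 1.96 * rmse_hat, est + 1.96 * rmse_hat
    return {
        "mean_poverty": est.mean(),
        "avg_relative_bias": np.mean((true_rates - est) / true_rates),
        "avg_mse": mse.mean(),
        "avg_cv": np.mean(rmse_hat / est),
        "correlation": np.corrcoef(est, true_rates)[0, 1],
        "rmse": np.sqrt(np.mean((est - true_rates) ** 2)),
        "coverage": np.mean((true_rates >= lower) & (true_rates <= upper)),
    }

true_rates = np.array([0.05, 0.10, 0.20, 0.30])
est = np.array([0.06, 0.09, 0.22, 0.27])
mse = np.array([0.0004, 0.0004, 0.0009, 0.0009])
metrics = evaluate(est, true_rates, mse)
```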

IV. Main Results
A. Fay-Herriot area model

Table 5 presents the coefficient estimates from the Fay-Herriot area-level model, which is used as a reference point for comparison with the direct estimates and the unit-level model. 23 The area-level predictor variables were selected using the lasso plugin method to prevent overfitting the model to the sample. The set of candidate variables included all area-level remote sensing indicators as well as super-area dummies, although no super-area dummies were selected in Tanzania. In each country, the lasso selected a small number of predictor variables: only 3 in Tanzania and 8 in Sri Lanka. The dependent variable is headcount poverty, expressed as a percentage from zero to one hundred. In Sri Lanka, less annual variability in rainfall as well as a positive rainfall shock in the first quarter are associated with lower poverty rates. In Tanzania, districts with larger buildings are much less poor, as are areas with a larger share of built-up area, while more rural districts tend to be poorer. Despite containing only three variables, the Tanzania model explains half of the variation in non-monetary poverty rates. Meanwhile, the eight Sri Lanka variables explain only 26 percent of the variation in headcount poverty across subdistricts. Overall, the results suggest that the Fay-Herriot estimates, despite using predictors that only vary across areas, can significantly improve on the efficiency of the direct estimates, particularly in Tanzania where the model fit is especially good.

Notes: Coefficients based on Fay-Herriot estimation of poverty rates using the Chandra method. Predictor variables selected using the lasso plug-in method.

B. Household-level model
Next, we turn to the unit-level model, and examine the results of the post-lasso regression used to estimate the model parameters in Table 6. The first striking result is the overall predictive power of the models. The R² from the model is 0.32 in Tanzania and 0.25 in Sri Lanka, which is notable given that the dependent variable, non-monetary welfare, is measured at the household level while all the predictors vary only at the subarea level. A Shapley decomposition indicates that about half of the model's explanatory power in Sri Lanka, and about a third in Tanzania, derives from super-area and sector dummies, which by construction cannot explain any variation in non-monetary poverty across areas within super-areas. This illustrates one of the limitations of relying solely on R² as a measure of the ability of the model to discern poor areas within the super-areas for which credible survey estimates are already available. Below, we consider an alternative, more informative metric: the extent to which the small area estimation procedure increases the precision of area estimates relative to direct survey-based estimates.
At the subarea level, the satellite indicators explain an impressive amount of the variation in average non-monetary welfare. In sub-area-level regressions, predictor variables explain 73 percent of the variation in average non-monetary welfare in Tanzania and 59 percent in Sri Lanka. The subarea model R² of 0.73 for Tanzania is substantially higher than the estimates presented in Jean et al. (2016), which find that features derived from satellite data explain 58 percent of the DHS asset index in Tanzania. The stronger predictive power of the model for Tanzania likely results from the use of a richer welfare index as the dependent variable, as well as the use of interpretable features derived from imagery, such as building footprints, rather than features optimized to predict night-time lights. 24 Meanwhile, the R² of 59 percent for Sri Lanka is remarkably similar to estimates of the explanatory power of monetary poverty reported in Engstrom et al. (2017). 25 A Shapley decomposition indicates that the higher R² in Tanzania, compared with Sri Lanka, can be attributed to more predictive sub-area variables, which predict 13 percent of the variation in Tanzania as opposed to 6 percent in Sri Lanka. This is likely due to two main factors. The first is that the Tanzanian indicators include measures of building footprints at the subarea level. The second is that Tanzania is a less developed setting in which building density and climatic variables, which most of the satellite variables are capturing, are more strongly correlated with non-monetary welfare. 27

C. Evaluation Results
The previous section demonstrated that remote sensing indicators are predictive of both area-level poverty and household non-monetary welfare. This section turns to evaluating the small area estimates generated by different methods. Before considering the evaluation results, Table 7 reports model diagnostics for the unit-level model. The geospatial indicators in the model predict about 37 and 30 percent of the variation in household non-monetary welfare in Sri Lanka and Tanzania, respectively. 28 The conditional R², which reflects the gain in model fit from conditioning the mixed effect on the sample, is about 3 percentage points higher than the marginal R² in both countries. The distributions of both the area effect and the household error term are roughly normal, reflecting the success of the normalizing transformation. The Shapiro-Wilk test fails to reject the null hypothesis of a normally distributed area random effect in both countries. Figure 1, which plots the quantiles of both the area effects and the household residuals against the quantiles of the normal distribution for each country, confirms that both components of the error term are approximately normally distributed, although there are negative outliers in the left tail of the area random effect in Tanzania.

The results for the mean poverty predictions in Table 8 show that the non-benchmarked Fay-Herriot estimates underestimate poverty, by a slight amount in Sri Lanka but by a more significant amount in Tanzania. The difference in precision is more striking. The Fay-Herriot estimators are significantly more precise than the Horvitz-Thompson direct estimates in both countries. Relative to these more conservative direct estimates, the Fay-Herriot procedure shrinks the mean squared error by nearly half in both countries. However, the household-level model, compared with the Horvitz-Thompson estimates, shrinks the mean squared error substantially more: by 70 percent in Sri Lanka and 85 percent in Tanzania.

Table 9 shows the evaluation results against the area poverty rates derived from the census. Average relative bias is highest for the Fay-Herriot estimator in both countries and lowest for the direct estimates. Of greater interest for evaluating accuracy, however, are the results for correlation and root mean squared error. In both countries, the small area estimates are far more accurate than the direct estimates, but the comparison between the household-level model and the Fay-Herriot model is mixed. In Sri Lanka, the household model estimates are more highly correlated with the census than the Fay-Herriot estimates by a moderate amount (0.88 vs. 0.84), but slightly less strongly correlated in Tanzania (0.876 vs. 0.883). The Fay-Herriot predictions have a lower root mean squared error in Sri Lanka, reflecting fewer large outliers, but the root mean squared errors are virtually equal in Tanzania. Overall, with regard to accuracy, both the household-level model and the Fay-Herriot model perform well.
The household model estimates are substantially more precise than the Fay-Herriot model and the direct estimates, but the tighter confidence intervals around the household model estimates may lower coverage rates. The last column of Table 8 shows the coverage rate, which for the household model in Sri Lanka is 84.3%. This is slightly higher than the Horvitz-Thompson estimates (82.2%), moderately higher than the clustered direct estimates (76.1%), but substantially lower than the Fay-Herriot estimates (91.8%). For Tanzania, the coverage rate for the household-level model is 74.8%. This is the lowest of the four estimators but only modestly lower than that of the standard enumeration-area clustered estimates (76.1%). The Horvitz-Thompson and Fay-Herriot estimators achieve much higher coverage rates of 85.5% and 91.8%, respectively.

Figures 2 and 3 give a detailed look at the point estimates and mean squared errors of the area poverty estimates for each method. Sri Lanka is shown in the left panel and Tanzania in the right, and in each country the areas are ordered according to their non-monetary poverty rate in the census. The results clearly show that both the Fay-Herriot and household-level models greatly improve on the accuracy of the direct estimates, especially in areas with higher poverty rates. In Sri Lanka, many of the direct estimates are zero, even in high-poverty areas. The comparison between the household and Fay-Herriot models is less clear-cut, but the Fay-Herriot estimates appear more prone to overestimating poverty in low-poverty areas and underestimating it in high-poverty areas. 29 Figure 3 presents the mean squared errors produced by the different methods. It clearly demonstrates that the household model estimates are substantially more precise than both the Fay-Herriot estimates and the direct estimates in both countries.
The significantly lower mean squared errors generated by the household-level model are associated with a moderate reduction in coverage. This raises the question of whether the household-level model remains more precise than the others after equalizing coverage rates. To shed light on this, we perform a subsequent adjustment that artificially inflates the estimated MSEs from each method by multiplying them by a constant scale factor greater than one, with the scale factor manually selected to achieve a 95 percent coverage rate. This occurs after the benchmarking and is therefore an additional adjustment to the mean squared error.

Finally, Table 11 compares the precision of the unit-level model with the direct survey estimates for super-areas, which are districts in Sri Lanka and regions in Tanzania. The super-areas are levels for which the household survey is considered representative and for which survey-based monetary poverty rates are routinely published. In Sri Lanka, where there are 331 target areas, the mean squared errors for the subdistrict estimates are about sixty percent larger than those of the direct survey estimates for districts. In Tanzania, meanwhile, there are only 159 areas and the same number of super-areas as Sri Lanka. In this case, the average mean squared error of the small area district estimates is slightly lower than that of the direct estimates for regions, while the mean CV is modestly higher. This type of comparison is useful to inform decisions by national statistics offices about whether small area estimates are sufficiently precise to publish.
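The coverage-equalizing adjustment described above amounts to searching for the smallest common MSE inflation factor that pushes empirical coverage to 95 percent. A toy sketch with simulated estimates whose stated MSEs understate the true error variance (all names and numbers here are invented for illustration):

```python
import numpy as np

def coverage(est, true_rates, mse, scale=1.0):
    """Share of areas whose scaled 95% interval contains the truth."""
    half = 1.96 * np.sqrt(scale * mse)
    return np.mean(np.abs(true_rates - est) <= half)

def scale_for_coverage(est, true_rates, mse, target=0.95):
    """Smallest MSE inflation factor (>= 1) reaching the target coverage,
    found on a coarse grid."""
    for s in np.arange(1.0, 50.0, 0.01):
        if coverage(est, true_rates, mse, s) >= target:
            return s
    return np.inf

rng = np.random.default_rng(5)
true_rates = rng.uniform(0.05, 0.40, size=200)
mse = np.full(200, 0.0004)
# understated uncertainty: the true error sd is twice sqrt(mse)
est = true_rates + rng.normal(scale=2 * np.sqrt(mse))
s = scale_for_coverage(est, true_rates, mse)
```

Since the stated standard errors are half the true ones here, the required inflation factor lands near 4 (the square of the understatement).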

V. Robustness checks for the household-level model
The previous section demonstrated that the household-level model produced more efficient estimates than the area-level model and the direct estimates, while both the household-level model and the Fay-Herriot model predicted poverty rates much more accurately than the direct estimates. The household-level model examined above was derived using a particular variant of the empirical best model, however, and it is informative to examine how the performance of the estimates varies with different methodological choices. The fourth and fifth robustness checks experiment with disabling the heteroscedasticity correction in version 3 of the Stata package, and with dropping the use of weights in the simulation stage of the R EMDI package. These are useful checks because, as noted above, the R EMDI package does not implement a heteroscedasticity correction, and the publicly available version does not allow for weights in the simulation stage. The sixth robustness check considers using a Box-Cox transformation rather than the ordered quantile normalization to transform the dependent variable in the model. This gives an indication of how much the ordered quantile normalization improves the results and ensures that the estimates remain reasonable when using a more traditional normalization method. The seventh robustness check considers the predictions when a stepwise procedure, set with a probability threshold of 0.01, is used to select the model instead of the plugin lasso method. The final robustness check is a simple test of the extent to which the results are robust to the prevailing poverty rate. This check swaps the country poverty rates, setting the Tanzanian poverty line at the 4th percentile and the Sri Lankan poverty line at the 20th percentile. These are then compared against the corresponding poverty rates calculated in the census using these new, swapped poverty rates. Table 13 displays the results of these eight robustness checks.
For the purposes of brevity, we report the mean poverty rate, MSE, correlation, and coverage rate. 31 We begin by comparing the results produced by the baseline EMDI estimates with version 2 of the Stata SAE package, which implements the traditional bootstrap method, with and without the empirical best option. For all three estimates, the unweighted averages of the predictions prior to benchmarking are close to the actual poverty rate in the census. The Stata package with the traditional bootstrap, however, gives significantly less precise estimates in both countries, as seen in the higher mean squared errors and CVs. Purely synthetic predictions obtained without the empirical best option are substantially less precise than the empirical best predictions using the traditional clustered bootstrap in Sri Lanka, and moderately less precise in Tanzania. In both countries, however, the synthetic predictions are far less precise than estimates obtained using a parametric bootstrap approach, demonstrating the high efficiency cost of failing to incorporate sample information when the auxiliary data are aggregated to the subarea level. 32 The estimates produced by the Stata SAE with the traditional bootstrap are less accurate than the baseline EMDI estimates in Sri Lanka, but more accurate in Tanzania. The coverage rate of the traditional bootstrap estimates is 93 percent, which is very close to the target 95 percent rate. This high coverage rate results from two countervailing biases in the variance estimate; the traditional bootstrap overstates the variance by failing to hold the composition of the sample constant across replications, which counteracts the downward bias due to the failure to account for the positive correlation among the sample residuals. Overall, the traditional bootstrap performs quite well in these two settings in terms of accuracy and coverage, but poorly in terms of efficiency.
The estimates produced by the latest version of Stata SAE with the parametric bootstrap are slightly less accurate, in terms of their correlation with the full census, than those produced by the EMDI package. 33 The Stata SAE estimates are also less efficient than the baseline EMDI estimates in Sri Lanka, as the mean MSE rises from 4.6 to 6.8. In Tanzania, in contrast, the Stata SAE package with the parametric bootstrap yields estimates that are slightly more efficient than the baseline estimates, but because the estimates are less accurate, the coverage rate falls significantly, from 74.8 to 64.2 percent.
The fifth row of the table presents results without the heteroscedasticity correction. Disabling the correction brings the mean poverty rate closer to the true 4 percent rate in Sri Lanka, but further from the assumed poverty rate of 20 percent in Tanzania. In Sri Lanka, disabling the correction increases the accuracy of the estimates to slightly above the baseline. In Tanzania, disabling the heteroscedasticity correction has minor effects on accuracy and slightly decreases precision, improving the coverage rate. The limited effects of the heteroscedasticity correction likely result from the negligible explanatory power of the "alpha model" that predicts the variance of the residuals. The R² of the alpha model, which like the "beta" prediction model was selected using the plugin lasso method to avoid overfitting, is only 0.004 in Sri Lanka and 0.003 in Tanzania.
The bottom four rows of the table present a variety of robustness checks using the EMDI package. The first of these uses the publicly available version of the package, which does not allow for household weights when aggregating the simulated household poverty outcomes into an area poverty rate. The estimator without weights is slightly less accurate than the weighted estimator in each country, with a corresponding mild decline in both efficiency and coverage. The second EMDI robustness check uses a more standard Box-Cox transformation of the welfare variable instead of the ordered quantile normalization. This leads to considerable downward bias in the estimates in both countries, leading us to prefer the ordered quantile normalization, although the correlation and coverage rate change little under the Box-Cox transformation. The third EMDI robustness check reports the results of a model selected using stepwise regression, with a probability value threshold of 1 percent. The resulting models are overfit. This leads to less accurate predictions than the lasso-selected model in both countries and slightly reduces the mean squared error in Tanzania, leading to a moderate fall in coverage rates in both countries.

32 The penalty for failing to implement EB in a traditional bootstrap setting is smaller in Tanzania than in Sri Lanka, although the estimates without EB are very imprecise in both countries. This is likely because the traditional bootstrap with EB gives less precise estimates in Tanzania than in Sri Lanka, due to the smaller number of areas in Tanzania.
33 This may result from the failure of the Stata SAE package to append the sample to the census when simulating poverty. The sample constitutes about 0.5 percent of Sri Lankan households while the Tanzanian sample contains only about 0.1 percent of households, which is consistent with the greater fall in accuracy in Sri Lanka.
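For readers unfamiliar with the ordered quantile normalization, it is a rank-based inverse normal transform: each welfare value is mapped to the standard normal quantile of its empirical rank. The sketch below uses the common (rank − 0.5)/n offset and average ranks for ties; this is an illustrative stdlib implementation, not the code used by the packages discussed here, which may handle ties and offsets slightly differently.

```python
from statistics import NormalDist

def ordered_quantile_normalize(values):
    """Rank-based inverse normal transform of a list of values.

    Each value is mapped to NormalDist().inv_cdf((rank - 0.5) / n),
    where tied observations share their average 1-based rank.
    """
    n = len(values)
    # Sort indices by value, then assign average ranks to runs of ties.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because the transform depends only on ranks, the result is (approximately) standard normal regardless of how skewed the original welfare distribution is, which is why it outperforms Box-Cox for this application.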
Finally, we report the results of the EMDI estimators when the poverty rates of the two countries are swapped. Not surprisingly, the performance of the model improves in Sri Lanka when the poverty rate is set to 20 percent, as the remote sensing indicators better predict poverty rates when they are larger. When setting the poverty rate to 4 percent in Tanzania, the EMDI estimator tends to underestimate poverty, with a mean predicted poverty rate of 2.8 percent. However, the correlation between the estimates and the true census poverty rate remains strong at 86.1 percent, and the benchmarked coverage rate of 70 percent is respectable.
Overall, of the many robustness checks considered so far, there are only two main cases where the results are particularly sensitive to the choice of method. The first is the use of the traditional rather than parametric bootstrap, which leads to a much higher estimated variance, coefficient of variation, and coverage rate of the estimates. The traditional bootstrap is not recommended when using Empirical Best models on theoretical grounds, since the estimate is conditioned on the sample (Diallo and Rao, 2018). The other case where the results change dramatically is the omission of the empirical best option, which roughly quadruples the MSEs and doubles the CVs compared to the baseline estimates, and yields substantially less accurate estimates in Sri Lanka as measured by the correlation with the census. Otherwise, the results are largely robust to the range of robustness checks considered.
Next, we perform two additional robustness checks. The first of these examines the results of the household model using different benchmarking procedures. In particular, we examine two alternatives to the baseline estimates, in which both the point estimates and standard errors are benchmarked. The first alternative only benchmarks the point estimates, while the second performs no benchmarking. The results are shown in Table 14. In both countries, not benchmarking the standard errors slightly reduces the average mean squared error. Average CV is minimized in both countries by benchmarking only the point estimates. In Sri Lanka, benchmarking only the point estimates slightly reduces the CV while not benchmarking increases it because an additional district is included. 34 This leads to a modest fall in coverage in Sri Lanka, but a more significant reduction in Tanzania, where the coverage rate declines from 74.8 percent with benchmarking to 69.8 percent without. Benchmarking the mean squared errors, all else equal, overstates uncertainty in cases like these where mean poverty is underestimated. This partially mitigates the downward bias in the household model's mean squared error estimates, which leads the benchmarking procedure to improve coverage. The benchmarked estimates are therefore preferred, although the results are qualitatively similar when either the point estimates are benchmarked, or when no benchmarking is carried out.
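The ratio benchmarking compared here rescales the area point estimates so that their population-weighted mean matches an aggregate (direct) estimate, and, when the standard errors are also benchmarked, multiplies each MSE by the square of the same scaling factor. A compact sketch, with function name and signature of our own choosing:

```python
def benchmark(point_estimates, mses, pop_weights, target_rate):
    """Ratio-benchmark small area poverty estimates to an aggregate target.

    Rescales the point estimates so their population-weighted mean equals
    target_rate (e.g. the direct national or regional estimate), and scales
    each MSE by the square of the same factor.
    """
    wmean = (sum(w * p for w, p in zip(pop_weights, point_estimates))
             / sum(pop_weights))
    scale = target_rate / wmean
    bench_points = [p * scale for p in point_estimates]
    bench_mses = [m * scale ** 2 for m in mses]
    return bench_points, bench_mses
```

As the text notes, when mean poverty is underestimated the scaling factor exceeds one, so benchmarking the MSEs inflates them, partially offsetting the household model's downward bias in estimated uncertainty.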
The final robustness check examines how the baseline household-level model fares when random noise is added to the welfare aggregate. This better approximates a monetary welfare measure, such as per capita consumption or income, which contains a greater portion of unexplained variance resulting from both transient welfare shocks and measurement error. We add two error components to the existing measure of non-monetary welfare to construct an alternative, noisier measure of "true welfare". The first component is an area effect, drawn from a normal distribution with mean zero and variance 0.5, and the second is a household-specific error term with mean zero and variance 1. These variance values were chosen arbitrarily, with the aim of achieving a model R² of approximately 0.2, which is similar to the R² when regressing monetary welfare on similar geospatial indicators in Tanzania (Belghith et al., 2020). The poverty line is set according to the new, noisier welfare measure, equal to the 4th percentile of sample welfare in Sri Lanka and the 20th percentile in Tanzania. Table 15 displays the results when estimating area-level poverty using direct estimates, the Fay-Herriot model, and the household model with the noisier welfare aggregate. Household model diagnostics are displayed below the results. In the household model, the additional noise causes the marginal R² to fall to about 0.2 in both cases. This is substantially lower than the values reported when predicting the non-monetary welfare index, which were 0.27 for Sri Lanka and 0.3 for Tanzania (Table 7). However, the conditional R² remains high, even slightly above what is reported for the non-monetary welfare measure in Table 7. This reflects the additional variance in the area-specific component, which is captured by the mean of the sample residuals for each area and therefore incorporated into the small area predictions by the empirical best estimator.
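The noise-injection step just described can be sketched in a few lines. The data layout (a mapping from area to household welfare values) and the function name are ours; the variances match those stated in the text.

```python
import random

def add_noise(welfare_by_area, area_var=0.5, hh_var=1.0, seed=1):
    """Construct a noisier 'true welfare' measure.

    Adds a common area shock (mean 0, variance 0.5) and an idiosyncratic
    household shock (mean 0, variance 1.0) to each household's
    non-monetary welfare, as in the robustness check described above.

    welfare_by_area: dict mapping area id -> list of household welfare values.
    """
    rng = random.Random(seed)
    noisy = {}
    for area, households in welfare_by_area.items():
        eta = rng.gauss(0.0, area_var ** 0.5)  # one draw per area
        noisy[area] = [w + eta + rng.gauss(0.0, hh_var ** 0.5)
                       for w in households]
    return noisy
```

The poverty line would then be recomputed as the relevant percentile (4th for Sri Lanka, 20th for Tanzania) of the noisy sample welfare distribution.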
The overall results, with a few exceptions, are similar to the main results for the non-monetary welfare measure reported in Tables 8 and 9. The household-level model produces the most precise estimates, as judged by mean squared error. The MSE reduction achieved using the household-level model in Sri Lanka is smaller than in the baseline results, reflecting the challenge of accurately predicting the bottom tail of the welfare distribution with a noisier welfare measure. The average CV rises greatly in Sri Lanka when using the noisier measure, due to large positive outliers. These outliers are areas with very low predicted poverty rates yet high mean squared errors, due to the greater variability in the area effect in these simulations. This illustrates that the average CV should be interpreted with caution in settings with low poverty rates and highly variable area effects. Nonetheless, even with a noisier welfare aggregate, the reduction in mean squared error in Sri Lanka compared with the direct estimates is roughly equivalent to doubling the size of the sample. In Tanzania, where the poverty rate is much higher, the household-level model leads to a 78 percent reduction in the mean squared error, a factor comparable to that reported in Table 8. The analogous reduction in average CV is 30 percent, which is also comparable to roughly doubling the sample size.

34 The estimated poverty rate for Ratnapura district is zero in the HIES sample. The CV is therefore not defined for this district after benchmarking, and it is excluded from the average.
Turning to accuracy and coverage, exploiting subarea variation using the household model substantially improves accuracy, as measured by the correlation with the true noisier welfare measure. In Sri Lanka, the correlation is over 4 percentage points higher for the household-level model than for the Fay-Herriot model, while in Tanzania the comparable difference is over 10 percentage points. The correlation in absolute terms remains respectable in Sri Lanka (0.80) and high in Tanzania (0.84) when using the noisier welfare measure. Finally, coverage rates for the household-level model decline only modestly in Tanzania and increase significantly in Sri Lanka in comparison with the direct estimates. In Sri Lanka, the coverage rate reported in Table 15 for the noisy welfare measure is remarkably similar to the one for non-monetary poverty reported in Table 8 (84.9% vs. 83.7%). In Tanzania, the coverage rate is higher for the noisy welfare measure (78.0% vs. 73.6%). Accuracy and coverage are particularly important measures to consider because, unlike estimated precision, they can only be observed when a "true" census population is available. Overall, the results presented in Table 15 indicate that the accuracy and coverage of the household-level model remain high, in comparison with both the Fay-Herriot and direct estimates, when additional noise is introduced into the welfare measure.
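The coverage diagnostic used throughout these tables can be made concrete: it is the share of areas whose normal-approximation 95 percent interval, built from the estimated MSE, contains the census ("true") poverty rate. The exact interval construction in the paper may differ; this is a minimal sketch under the usual normal approximation.

```python
def coverage_rate(estimates, mses, truths, z=1.96):
    """Share of areas whose interval estimate +/- z * sqrt(MSE)
    contains the corresponding census ('true') value."""
    covered = 0
    for est, mse, truth in zip(estimates, mses, truths):
        half = z * mse ** 0.5
        if est - half <= truth <= est + half:
            covered += 1
    return covered / len(estimates)
```

A coverage rate below the nominal 95 percent, as observed for the household model, indicates that the estimated MSEs understate the true uncertainty of the point estimates.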

VI. Simulations
The results presented in the preceding two sections are based on small area estimation using one sample of households for each country. 35 The samples used are particularly relevant, since they reflect the actual enumeration areas sampled by the National Statistics Offices for their household budget surveys. Nonetheless, it is useful to check that the strong performance of the household-level model holds when estimated in simulation settings, as is commonly done in the literature. 36

a. Model-based simulation
We begin by conducting a Monte Carlo simulation study, following the basic structure outlined in Das and Chambers (2017). We perform 100 simulations, drawing a new simulated population of 80 areas each time. Each area contains between 15 and 29 subareas, with the exact number drawn randomly from a uniform distribution. Each subarea likewise contains between 15 and 29 households, also drawn randomly from a uniform distribution. Each household is assumed to contain one member.
At the sub-area level, we draw two X variables, X1sa and X2sa, from a joint normal distribution. All households in the same sub-area are assigned the same random values of X1sa and X2sa. X1sa is distributed with mean 0 and variance 2. The mean of X2sa is set equal to 0.05 times the area index, which varies from 1 to 80, with a variance of 2. The covariance of X1sa and X2sa is set to 0.5. We draw area effects (η_a), subarea effects (u_sa), and an idiosyncratic error term for each household (ε_i). These error terms are drawn from normal distributions with mean zero and variances of 1 for the area effect, 0.5 for the sub-area effect, and 2 for the household idiosyncratic error. We then generate the population welfare measure as

(6)  y_i = 6 + 0.5 * X1sa − 0.6 * X2sa + η_a + u_sa + ε_i

The variance structure of the two X variables and the variance components was selected to achieve an approximate R² of 0.25 when regressing household welfare on X1sa and X2sa.
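The data-generating process in equation (6) can be sketched in a few lines of stdlib Python. The correlated subarea covariates are built here from independent normals, an equivalent construction to the joint normal draw described above; the tuple layout of the output is our own convention.

```python
import random

def simulate_population(n_areas=80, seed=0):
    """Generate one simulated population for the model-based simulation.

    Each area has 15-29 subareas; each subarea has 15-29 one-person
    households. X1, X2 vary at the subarea level with var 2 each and
    covariance 0.5; welfare follows equation (6).
    Returns a list of (area, subarea, welfare) tuples.
    """
    rng = random.Random(seed)
    pop = []
    b = 0.5 / (2 ** 0.5)              # loading on z1 so that cov(X1, X2) = 0.5
    c = (2 - b ** 2) ** 0.5           # residual loading so that var(X2) = 2
    for a in range(1, n_areas + 1):
        eta = rng.gauss(0.0, 1.0)                  # area effect, variance 1
        for s in range(rng.randint(15, 29)):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            x1 = (2 ** 0.5) * z1                   # variance 2
            x2 = 0.05 * a + b * z1 + c * z2        # mean 0.05*a, variance 2
            u = rng.gauss(0.0, 0.5 ** 0.5)         # subarea effect, variance 0.5
            for _ in range(rng.randint(15, 29)):
                eps = rng.gauss(0.0, 2 ** 0.5)     # household error, variance 2
                y = 6 + 0.5 * x1 - 0.6 * x2 + eta + u + eps
                pop.append((a, s, y))
    return pop
```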
From this population, we set the poverty line at the 25th percentile of the welfare distribution and calculate population poverty rates for each area. We then draw a two-stage sample of 3 subareas per area in the first stage and 10 households per sampled subarea in the second. We construct sample weights as the inverse of the probability of being sampled, and take the weighted 25th percentile of the sample distribution of welfare as the poverty line. Finally, we estimate a small area estimation model in the sample using the modified EMDI package, utilizing the population values of X1sa and X2sa to simulate welfare for all households in the population. The resulting simulated welfare aggregate is then compared with the poverty line derived from the sample, and averaged over households for each area to obtain area headcount poverty rates and estimated mean squared errors. 37 This entire process, starting with generating the simulated population, is repeated 100 times. The final output of the procedure is therefore a set of 8,000 population poverty rates and small area estimates, derived from 100 simulations of 80 areas.

Table 16 displays the results. The small area estimates reduce the mean squared error by a factor of 10 and the average CV by a factor of roughly 3 compared with the traditional EA-clustered direct estimates, while maintaining a coverage rate close to the target of 95 percent. Mean poverty is almost exactly 25 percent, and the correlation with poverty rates in the population increases from 0.84 for the direct estimates to 0.98 for the small area estimates. The root mean squared error of the predictions falls by a factor of three. The household-level model performs well in this stylized setting because the residuals are normally distributed by construction and a large share of the variation in simulated welfare is explained by the two predictor variables. It is reassuring, nonetheless, that the small area estimator performs well in a favorable controlled environment with populations that vary across simulations.

36 Evaluations of small area estimation methods using model- and/or design-based simulations include Das and Chambers (2017), Demombynes et al. (2007), Elbers, Lanjouw and Leite (2008), Tarozzi and Deaton (2009), Tarozzi (2011), and Tzavidis et al. (2018), among many others.
37 The results are not benchmarked to super-areas, because super-areas are not defined in this simulation.
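The two-stage sampling and inverse-probability weighting described above can be sketched as follows. `two_stage_sample` and `weighted_percentile` are illustrative helpers of our own, not the EMDI implementation, and the weighted percentile uses one common empirical-CDF convention (implementations differ in interpolation details).

```python
import random

def weighted_percentile(values, weights, pct):
    """Smallest value at which the weighted empirical CDF reaches pct/100."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum / total >= pct / 100.0:
            return v
    return pairs[-1][0]

def two_stage_sample(pop, n_subareas=3, n_hh=10, seed=0):
    """Sample n_subareas per area, then n_hh households per sampled subarea.

    Weights equal the inverse selection probability:
    (subareas in area / n_subareas) * (households in subarea / n_hh).
    pop: list of (area, subarea, welfare) tuples.
    Returns a list of (area, welfare, weight) tuples.
    """
    rng = random.Random(seed)
    by_area = {}
    for a, s, y in pop:
        by_area.setdefault(a, {}).setdefault(s, []).append(y)
    sample = []
    for a, subareas in by_area.items():
        chosen = rng.sample(list(subareas), min(n_subareas, len(subareas)))
        for s in chosen:
            hh = subareas[s]
            picked = rng.sample(hh, min(n_hh, len(hh)))
            w = (len(subareas) / n_subareas) * (len(hh) / n_hh)
            sample.extend((a, y, w) for y in picked)
    return sample
```

The sample poverty line would then be `weighted_percentile([y for _, y, _ in sample], [w for _, _, w in sample], 25)`.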

b. Design-based simulation
While model-based simulations are useful for exploring the performance of the estimator in different populations, their results inevitably depend on the set of assumptions used to construct the simulated data. To verify that the household-level model works well when estimated in multiple samples, this section describes the results of a design-based simulation. To reduce computation time, the simulation is limited to three regions in Northeast Tanzania near the Kenyan border: Arusha, Kilimanjaro, and Tanga. Between them, these three regions contain, according to the census, 24 districts, 2,078 villages, 10,964 enumeration areas, and nearly 1.17 million households. 38 The 2018 Household Budget Survey sampled 1,217 households in 97 enumeration areas from these three regions, an average of 12.5 households per enumeration area. The three regions are disproportionately poor, with a non-monetary poverty rate of 46.5 percent when using a poverty line set at the 20th percentile of the national distribution.
To conduct the design-based simulations, we repeat the following seven-step procedure 250 times:

1. Draw a simple random sample of 4 EAs for each of the 24 districts from the census, which selects a total of 96 EAs from the population of 10,964 census EAs contained in the three regions.
2. Draw a simple random sample of 13 households for each selected EA to generate a sample of 1,248 households.
3. Construct sample weights for each household equal to the inverse of their selection probability. 39
4. Use the ordered quantile normalization to construct normalized non-monetary welfare in the sample.
5. Using the household sample from step 2, the sample weights from step 3, and the normalized welfare measure from step 4, estimate a lasso regression of normalized non-monetary welfare on all candidate geospatial indicators. 40
6. Using the same inputs as the previous step, calculate the 46.5th percentile of the sample distribution of normalized non-monetary welfare to use as the poverty line.

38 We selected three regions for this exercise because the lasso procedure did not select any variables when using only one or two regions while mimicking the sample size of the budget survey. We selected these particular three regions because they are disproportionately poor and because they were the first three regions in numeric order following the capital of Dodoma.
39 The household sample weights are the product of two ratios: the number of census EAs in each district divided by 4, and the number of households in each sampled EA divided by 13. These are multiplied by household size to obtain population weights.
40 As before, we use the plugin method for selecting lambda in the lasso.

Results for the household-level model are reported with and without benchmarking. As above, the benchmarked point estimates were rescaled to match the direct estimates for each of the three regions in the population, for each sample drawn in each simulation, and the mean squared errors were multiplied by the square of the same scaling factor. The direct estimates are computed in each simulation by calculating the weighted share of sample households that are poor in each district, using the weights calculated in step 3 above. We again compare the direct estimates obtained using the Horvitz-Thompson approximation with results from standard enumeration area clustering, which give identical point estimates but differ in their estimated variances.
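The weight construction in step 3 (detailed in footnote 39) can be written as a one-line helper; `hbs_style_weights` is our name, not the package's.

```python
def hbs_style_weights(n_eas_in_district, n_hh_in_ea, hh_size,
                      eas_sampled=4, hh_sampled=13):
    """Design-based simulation population weight for one sampled household.

    The design weight is the product of (census EAs in the district / 4)
    and (households in the sampled EA / 13); multiplying by household size
    converts it to a population weight, as described in footnote 39.
    """
    design_weight = (n_eas_in_district / eas_sampled) * (n_hh_in_ea / hh_sampled)
    return design_weight * hh_size
```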
The estimates produced by the household-level model, however, are far more precise than the Horvitz-Thompson direct estimates, with average mean squared errors about a quarter as large and average CVs about half the size. The household model estimates are only moderately more precise than the standard direct estimates, reflecting a high degree of correlation in household welfare across EAs within districts. The standard direct estimates ignore this correlation and therefore greatly underestimate uncertainty, which is reflected in a coverage rate of 48.7 percent. In contrast, the coverage rate of the benchmarked estimates from the household model is 68 percent, moderately below the coverage rate for all of Tanzania using the HBS sample reported in Table 8 (74.8%). The coverage rate of the household model, despite its estimates being much more precise, is only modestly below that of the Horvitz-Thompson direct estimates. This is because the household model estimates are more accurate, as seen in their higher correlation and lower root mean squared error.
To sum up, when using repeated samples in three poor Tanzanian regions, incorporating remote sensing data in a household-level model dramatically improves precision compared to the Horvitz-Thompson direct estimates. Adding the remote sensing data reduces mean squared error by about 75 percent and CV by about half, which is roughly comparable to quadrupling the sample size. Because the small area estimates are much more accurate than the direct estimates, the increased precision comes at only a modest cost in terms of the coverage rate. Compared with the standard direct estimates that assume independence across enumeration areas, the small area estimates are not only slightly more precise, but also significantly more accurate, with better coverage. Overall, the main results derived for the full country based on the specific sample drawn for the budget survey are maintained when focusing on three Tanzanian regions and generating estimates of the area poverty rates and their mean squared errors using 250 different samples.
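The "comparable to quadrupling the sample size" arithmetic rests on the fact that the variance of a direct estimate falls roughly in proportion to 1/n, so the implied sample-size multiplier is simply the ratio of the two MSEs. A one-line helper (our naming, a back-of-envelope approximation rather than the paper's exact calculation) makes the conversion explicit:

```python
def implied_sample_multiplier(mse_direct, mse_model):
    """Approximate effective sample-size gain from a model-based estimator.

    Since direct-estimate variance scales roughly as 1/n, a model MSE that
    is one quarter of the direct MSE is comparable to quadrupling n.
    """
    return mse_direct / mse_model
```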

VII. Conclusion
Policy makers worldwide increasingly demand more granular data to inform decisions. This paper examines the extent to which supplementing a synthetic sample drawn from the census with geospatial data at the subarea level in Sri Lanka and Tanzania generates more precise and accurate estimates of non-monetary poverty. The paper employs empirical best predictor models because they are more accessible to most statisticians in National Statistics Offices than hierarchical Bayesian models. In addition, empirical best models have a long history in small area estimation, are similar to other commonly used methods, and have been implemented in multiple publicly available software packages. The analysis examines non-monetary poverty in order to evaluate the estimates against census data, which sheds light on the performance of different methodological approaches and software packages.
The results are encouraging. The correlation between the small area estimates and the actual census is roughly 88 percent in both countries, substantially greater than the analogous correlations for the direct survey estimates, which are 73 and 77 percent in Sri Lanka and Tanzania, respectively. The household-level model, compared with standard clustered survey estimates, reduces the mean squared error by about two-thirds in Sri Lanka and four-fifths in Tanzania, which is roughly equivalent to tripling and quintupling the effective sample size. The results are robust to alternative implementation options, to adding additional noise to the welfare measure, and to both a model-based Monte Carlo simulation and a design-based simulation from three regions in Northeast Tanzania. When rescaling mean squared errors to equalize coverage rates, the household-level model remained more efficient than a Fay-Herriot model, especially in Sri Lanka. The strong performance of the household-level EBP model, in comparison with the area-level Fay-Herriot model, is consistent with simulation results reported in Hidiroglou and You (2016). That study concluded, on the basis of a model-based simulation study with a linear model and unit-level auxiliary data, that unit-level models achieve lower bias and higher coverage rates than area-level models.
The household-level EBP model moderately underestimates uncertainty, leading to sizeable drops in coverage rates compared with estimates from the area-level Fay-Herriot model, especially in Tanzania. This occurs because the model does not incorporate sample weights, and assumes that survey data are independent within area when conditioning the random effect on the sample residuals. In addition, both the household and area-level models do not account for uncertainty in the estimated variance components in the model. Nonetheless, the coverage rate of the household model in both cases is comparable to standard survey estimates that cluster on enumeration areas. Furthermore, the household model performs significantly better than the Fay-Herriot model in terms of accuracy and efficiency when additional noise is added to the welfare measure, especially in Tanzania.
The financial cost of this type of small area estimation procedure is generally low and falling rapidly. Much of the auxiliary data used for the small area estimation, such as estimates of built-up area, nighttime lights, and vegetation, are freely available. There are two notable exceptions. First, the calculation of spatial features at a national scale requires constructing a cloud-free mosaic of Sentinel imagery, considerable computing power, and the expertise to implement software to calculate spatial features. Second, the data on Tanzanian building footprints are proprietary. However, data on building footprints are increasingly being released in the public domain and it is not difficult to envision information on building footprints becoming freely available in the coming years, potentially through Open Street Map as it continues to improve in accuracy and coverage. Finally, access to subarea survey identifiers and shapefiles, or access to EA geocoordinates, is necessary to link survey data to geospatial indicators. This is not feasible in all contexts, but the growing popularity of geospatial analysis and CAPI data collection is making geospatial survey analysis more common. A conservative estimate is that the time and expertise required to generate these types of estimates costs $50,000 to $150,000. This is minor compared to the value created by even doubling the effective sample size of nationally representative household surveys that often cost at least a million dollars to field.
The results could be improved by further refinements in software and methods. One avenue for further research is to explore the performance of a subarea-level model, an extension of the Fay-Herriot model specified at the subarea level with an area-level mixed effect (Torabi and Rao, 2015). Estimating such a model at the subarea level makes it easier to properly account for sample design effects. However, modeling poverty rates as a linear function of predictors may generate less accurate estimates, especially when poverty rates are low. It would be useful to better understand how the results of a properly specified subarea model compare with a household-level model using subarea predictors.
A second line of research could focus on improving the household-level model by estimating uncertainty more accurately. The estimated variance of the area random effect, for example, could be adjusted to account for sample design effects among the residuals, which would increase the estimated mean squared errors and improve coverage rates. 43 In addition, benchmarking could be incorporated directly into the bootstrap procedure, to obtain more accurate estimates of benchmarked mean squared errors. Third, it should be fairly straightforward to incorporate the ordered quantile normalization transformation directly into the R EMDI package, which already allows for Box-Cox and log-shift transformations. This would enable the software to estimate poverty gaps and severity in addition to headcount rates when using the ordered quantile normalization transformation.
This study only considered a measure of non-monetary poverty, in order to evaluate the results of the area predictions against actual census data. This raises the important question of the extent to which the results generalize to predicting monetary poverty rates. The types of geospatial indicators considered in this study are moderately less predictive of monetary welfare than non-monetary welfare. For example, when predicting district-level monetary poverty in Tanzania using the same geospatial indicators considered above, the geospatial variables selected by a stepwise procedure explained 23.5 percent of the variation in household per capita consumption, versus 30 percent for a lasso-selected model of non-monetary welfare. 44 This reflects the difficulty of predicting both transient household welfare shocks and measurement error, both of which are present to a greater degree in measures of monetary poverty than non-monetary poverty. It is encouraging that the household-level EBP model continues to perform well when noise is added to non-monetary welfare. The efficiency gain from using the household model was notably smaller in Sri Lanka when using the noisier welfare measure, but even in that context incorporating geospatial data led to an increase in precision roughly equivalent to doubling the sample size. These results are only suggestive, however, and an important outstanding research agenda is to compare small area estimates of monetary poverty obtained using geospatial data to those obtained from standard census-based poverty maps directly from census data. This can confirm that the encouraging results presented above apply to monetary poverty.
Annex A: Software packages for household-level EBP models

As noted in the text, there are five main differences between the packages, which are listed in Table 4. The first important difference is that the Stata SAE package implements a heteroscedasticity correction, following Elbers, Lanjouw and Lanjouw (2003). This correction estimates an "alpha model" that regresses the squared residuals from an OLS regression on subarea characteristics to obtain an estimate of the variance. The estimated variance is then used in a GLS regression to obtain the parameters of the "beta model", which are then used for simulation.
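The alpha model step can be sketched as follows. For clarity this sketch regresses the squared residuals directly on a single covariate via stdlib OLS; the actual correction in Elbers, Lanjouw and Lanjouw (2003) uses a bounded logistic transform of the squared residuals and multiple covariates, so this is a simplified illustration, not the package's implementation.

```python
def alpha_model_variances(residuals, x):
    """Simplified 'alpha model': predict household error variances.

    Regresses squared OLS residuals on a covariate and returns fitted
    variances (floored at a small positive value), which would then be
    used as weights in a GLS fit of the 'beta model'.
    """
    n = len(residuals)
    e2 = [e * e for e in residuals]
    xbar = sum(x) / n
    ybar = sum(e2) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, e2)) / sxx
    intercept = ybar - slope * xbar
    return [max(intercept + slope * xi, 1e-6) for xi in x]
```

As noted in the main text, the alpha model's R² was below 0.005 in both countries, which is why disabling the correction had little effect on the results.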
The second main difference, as noted above, involves the type of bootstrap method. 45 Version 2 of the Stata SAE package uses a traditional non-parametric bootstrap approach, which samples clusters randomly with replacement for each bootstrap replication. The R sae package takes a parametric bootstrap approach, which holds the composition of the sample constant across replications. To do this, the method bootstraps the population instead of the sample. For each replication, a bootstrap population is constructed by drawing from the distribution of the estimated area random effect η and the household idiosyncratic error term ε. The EBP model is then re-estimated using the same households that were present in the sample, thus maintaining the same set of sample households for each replication. Values of η and ε are drawn 100 times from a normal distribution to generate point estimates for poverty in the population. The bootstrap process is repeated 100 times to estimate the mean squared error, for a total of 10,000 simulations. The rationale for this parametric bootstrap procedure is that, when performing small area estimation, the sample has already been realized. The composition of the sample is therefore known and should be held constant across replications. This point is elaborated in greater detail in Diallo and Rao (2018).
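The key mechanics of the parametric bootstrap can be illustrated with a deliberately stylized example: here the "model" is just an area mean with known variance components, whereas the packages re-estimate a full EBP model in each replication. The point being demonstrated is that the bootstrap regenerates the population from the fitted model while keeping the sample positions fixed, which is what distinguishes it from the traditional clustered bootstrap.

```python
import random

def parametric_bootstrap_mse(mu, sigma_eta, sigma_eps,
                             n_pop_per_area, n_sample_per_area,
                             n_areas, reps=200, seed=0):
    """Stylized parametric bootstrap MSE for an area-mean predictor.

    Each replication draws a fresh bootstrap population from the fitted
    model (area effect eta + household error eps), keeps the same sample
    positions, and accumulates the squared error of the sample-based
    predictor against the bootstrap population's true area mean.
    Purely illustrative: the packages bootstrap a full EBP model.
    """
    rng = random.Random(seed)
    sq_err = [0.0] * n_areas
    for _ in range(reps):
        for a in range(n_areas):
            eta = rng.gauss(0.0, sigma_eta)
            pop = [mu + eta + rng.gauss(0.0, sigma_eps)
                   for _ in range(n_pop_per_area)]
            true_mean = sum(pop) / n_pop_per_area
            # Sample composition held fixed: always the first n units.
            est = sum(pop[:n_sample_per_area]) / n_sample_per_area
            sq_err[a] += (est - true_mean) ** 2
    return [s / reps for s in sq_err]
```

Note that the area effect cancels from the error of the sample mean, so the bootstrap MSE here reflects only the idiosyncratic component; in the full EBP setting the re-estimation of the model in each replication also propagates parameter uncertainty.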
A third difference between the EMDI and Stata SAE packages involves the model fitting method. The Stata SAE package uses either a traditional GLS model or Henderson's method three to estimate the parameters of the random effects model. In the simulations below, we use Henderson's method three to fit the model in the Stata SAE packages. The R EMDI package uses the maximum likelihood estimation method implemented in the lme function of the R nlme package (Pinheiro et al., 2020). In most settings, the choice of fitting method should not make a large difference to the model estimates.
A fourth difference is that the EMDI package, like the R SAE package, appends the sample to the census before aggregating poverty estimates. The Stata SAE package does not at this time. This should have a minor impact on the results in cases where the sample size is a small fraction of the census population.
A fifth difference is that empirical best estimation, along with the assumption of normal errors, is required in both version 3 of the Stata SAE package and the R EMDI package, while it is optional in version 2 of the Stata SAE package. The latter, accordingly, allows for non-parametric selection of the error terms when empirical best is not used, which relaxes the assumption of normality. We use version 2 of the Stata SAE package to estimate the model without empirical best estimation as a robustness check to verify that empirical best methods are much more efficient in this context.