Policy Research Working Paper 9383

World Development Report 2021 Background Paper

Small Area Estimation of Non-Monetary Poverty with Geospatial Data

Takaaki Masaki, David Newhouse, Ani Rudra Silwal, Adane Bedada, Ryan Engstrom

Poverty and Equity Global Practice
September 2020

Abstract

This paper uses data from Sri Lanka and Tanzania to evaluate the benefits of combining household surveys with geographically comprehensive geospatial indicators to generate small area estimates of non-monetary poverty. The preferred estimates are generated by utilizing subarea-level geospatial indicators in a household-level empirical best predictor mixed model with a normalized welfare measure. Mean squared errors are estimated using a parametric bootstrap procedure. The resulting estimates are highly correlated with non-monetary poverty calculated from the full census in both countries, and the gain in precision is comparable to increasing the size of the sample by a factor of three in Sri Lanka and five in Tanzania. The empirical best predictor model moderately underestimates uncertainty, but coverage rates are similar to standard survey-based estimates that assume independent outcomes across clusters. A variety of checks, including adding noise to the welfare measure and model-based and design-based simulations, confirm that the main results are robust. The results demonstrate that combining household survey data with subarea-level geospatial indicators can greatly increase the precision of survey estimates of non-monetary poverty at comparatively low cost.

This paper is a product of the Poverty and Equity Global Practice and a background paper for the 2021 World Development Report. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dnewhouse@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Acknowledgments: We are grateful to Nadia Belghith, Carlos Castelan, Robert Cull, Andrew Dabalen, Kristen Himelein, Dean Joliffe, Pierella Paci, Carolina Sanchez-Paramo, and Thomas Walker for their support. We thank Paul Corral, Kristen Himelein, Peter Lanjouw, and William Seitz for insightful comments and suggestions, and Partha Lahiri, Carl Morris, and Roy van der Weide for helpful discussions. Finally, we thank Albina Chuwa, Dilhanie Deepawansa, Keith Garrett, Amara Satharasinghe, and Elizabeth Talbert for their help in obtaining data. All remaining errors are ours.
I. Introduction

The proliferation of big data obtained from satellites and mobile devices, as well as research demonstrating that satellite and mobile phone records are strongly correlated with household welfare, has sparked great interest in statistical methods that combine big data with survey data. 1 A key motivation for complementing survey data with comprehensive big data is the potential for small area estimation, which can produce more precise and granular estimates of socioeconomic indicators by combining household-level survey data in surveyed enumeration areas with predicted values from all enumeration areas, whether sampled or not.

Established methods for small area estimation typically utilize household survey data and contemporaneous household census data, by first estimating a prediction model of welfare in the survey data and then using the estimated model parameters to simulate welfare in the census. 2 Because traditional small area estimation links the household survey with census auxiliary data through the estimated model parameters, the set of potential predictor variables is limited to those present in both the survey and the census. Furthermore, when the wording of questions or the timing of data collection differs significantly between the census and the survey, small area estimates using the standard method can suffer from bias. Finally, because this method relies on census data, a new census is required to credibly estimate welfare changes for small areas. Census data are typically dated and collected once per decade at best, which presents a significant hurdle to deriving timely small area poverty estimates.

To overcome these challenges, we link survey data with geospatial and remote-sensing auxiliary data at the subarea level. 3 Subarea refers to the lowest geographic level at which spatial auxiliary data can be merged with household surveys containing data on welfare and poverty, while area refers to the target level for the small area poverty estimates. In recent household surveys, GPS coordinates of households, enumeration areas, or villages are often available, which allows analysts to link survey data with other auxiliary sources of remote-sensing or geospatial data and improve the precision of poverty estimates.

Although linking surveys with geospatial auxiliary data generates less precise small area estimates than using a census, it offers several important benefits. Geographic linking allows the relationship between the spatial auxiliary indicators and economic welfare to be estimated directly in the prediction model, avoiding the restrictive assumption that predictors must be identically distributed in the survey and auxiliary data. In addition, some spatial auxiliary data are collected continuously and are highly predictive of subarea poverty, and both the quality and availability of geospatial data are improving rapidly. Compared to other forms of "big data" such as mobile phone data, indicators derived from satellite imagery are attractive because of the comparative ease of acquiring data and the absence of selection bias.

Despite the potential value of this approach, only a couple of existing studies have quantified the gain in the precision of small area estimates achieved by supplementing survey data with geospatial data, in the context of estimated crop acreage. 4

1 Examples of papers that find strong correlations between big data and welfare include Yeh et al (2020), Jean et al (2016), Steele et al (2018), Engstrom et al (2017), and Pokhriyal and Jacques (2017), among others.
2 Elbers, Lanjouw, and Lanjouw (2003), Tarozzi and Deaton (2009), and Molina and Rao (2010), among many others.
3 Lange, Pape, and Putz (2018).
Existing studies that validate small area estimates of non-monetary poverty generated from mobile phone or geospatial data have, by and large, focused on extrapolation, or the use of big data to predict welfare in countries that are not surveyed. This differs from small area estimation, in which the objective is to predict poverty within the area covered by the sample, at a more granular level. To do this efficiently, it is crucial to use the survey data not only to train the prediction model, but also as a direct input into the estimates.

This study therefore makes three main contributions. First, it applies the prevailing framework for small area estimation to combine household survey data with geographically comprehensive geospatial indicators in an efficient way. Second, it evaluates the extent to which incorporating geospatial variables at the subarea level improves the precision of small-area poverty estimates. Finally, it assesses which of the commonly used SAE models – unit-level models (Elbers, Lanjouw, and Lanjouw, 2003; Molina and Rao, 2010) and the Fay-Herriot area-level model (Fay and Herriot, 1979) – are best suited for combining survey and geospatial data to produce efficient and accurate estimates of both area-level poverty rates and the uncertainty associated with them.

We test these models in the context of generating small area estimates of non-monetary poverty in two developing countries, Sri Lanka and Tanzania. These countries were selected due to the availability of both census and survey data with subarea-level identifiers and matching polygon shapefiles close to the village level. The welfare prediction model uses survey data to estimate the value of a household welfare index as a function of subarea characteristics, and therefore differs from standard small area estimation models that predict welfare using household characteristics. The resulting estimates provide a large efficiency gain compared with direct survey estimates.

We mainly consider Empirical Best Predictor (EBP) models, which have a long history in small area estimation and can accommodate subarea-level auxiliary data to produce estimates of poverty rates and their mean squared error for target areas. EBP models differ from the "ELL method" (Elbers, Lanjouw, and Lanjouw, 2003), which has traditionally been used for poverty mapping by the World Bank, because they incorporate additional information from the survey by conditioning the area effect on survey residuals. 5 This can lead to substantial efficiency gains when the variation in welfare or poverty contributed by the sample is large relative to the variation contributed by the auxiliary data. On the other hand, a drawback of EBP models is the required assumption that the stochastic error terms in the model follow a normal distribution. While this normality assumption can be tested by examining the residuals from the prediction model, violating the assumption will introduce bias into the estimates.

In the traditional case when the auxiliary data are household-level data from a census, the sample is a small fraction of the size of the census and EBP models typically lead to minor efficiency gains. 6
In these cases, it is not obvious that EBP models are preferred to simulations based on standard random effect models that do not condition the area effect on the sample but allow for departures from normality. When the auxiliary data are only available at the subarea level, however, the effective size and statistical power of the auxiliary data are much less than that of a full household census. The share of variation contributed by the sample therefore becomes larger, and as shown in the results presented below, EBP models provide a large gain in efficiency compared with traditional ELL methods that do not incorporate Empirical Best methods.

4 Battese, Harter, and Fuller (1988), Erciulescu et al (2019).
5 For example, the methods described in Elbers, Lanjouw, and Lanjouw (2003) and Tarozzi and Deaton (2009) are not EBP models.
6 Haslett (2016).

We compare the estimates generated by a household EBP model with direct estimates obtained solely from the survey, as well as the well-known Fay-Herriot area-level model. 7 The latter sacrifices precision by discarding the variation in the geospatial indicators across subareas within target areas. Once the area estimates are obtained from the household EBP and Fay-Herriot models, we compare them to non-monetary poverty rates calculated directly from the full census. The availability of census data provides a credible benchmark for assessing how different methods and their variants perform, in terms of the accuracy of both the small area point estimates and their confidence intervals. We compare the predictions from different methods in terms of their precision, their accuracy, and the extent to which the estimated confidence intervals contain the true value, which is known as the coverage rate.

The main result is that incorporating remote sensing data in an EBP framework substantially improves the accuracy and precision of small area estimates of non-monetary poverty, largely by incorporating information from non-sampled subareas. 8 This comes at no cost to coverage rates in Sri Lanka and a moderate cost in Tanzania, compared with standard direct estimates. The corresponding efficiency improvement is comparable to roughly tripling the size of the survey in Sri Lanka and quintupling it in Tanzania. 9 The small area estimates moderately underestimate the mean squared error by failing to account for uncertainty in the estimated variance parameters from the model, by incorrectly assuming that sample observations are independent within areas, and by failing to incorporate sample weights in model estimation. Estimated coverage rates remain respectable, however, at 75 percent in Tanzania and 84 percent in Sri Lanka when benchmarked. This is comparable to the 76 percent coverage rate in both countries when using standard direct estimates. In Tanzania, the estimates from the unit-level model are roughly as accurate and moderately more efficient than those from the area-level Fay-Herriot model. In Sri Lanka, where the poverty rate is low and the Fay-Herriot model is less predictive, the estimates from the unit-level model are substantially more accurate and efficient than the Fay-Herriot estimates. These results hold under a variety of robustness checks that explore alternative implementation options, including the absence of benchmarking and the use of a noisier welfare measure.
When using the noisier welfare measure, the gain in efficiency is not as large in Sri Lanka, on the order of doubling the size of the sample, but the predictions remain accurate and coverage rates remain high. Finally, to further probe the robustness of the results, we implement both a model-based simulation and a design-based simulation of 250 randomly selected samples covering three regions in Northeast Tanzania. The small area estimates perform exceedingly well in the model-based simulation, and the design-based simulation also shows that the small area estimates substantially improve on the direct estimates in terms of efficiency and accuracy while maintaining high coverage rates.

7 See Fay and Herriot (1979).
8 While the main efficiency improvements occur by incorporating information from non-sampled sub-areas, there are also minor efficiency improvements from combining sample data with synthetic predictions in sampled sub-areas.
9 This is based on a comparison of the average coefficient of variation between the standard direct estimates and the household-level model reported in Table 10.

The remainder of the paper is organized as follows. Section II describes the data. Section III describes the methodology and estimators that are evaluated. Section IV assesses results in terms of efficiency, accuracy, and coverage. Section V considers a variety of robustness checks of the main method. Section VI considers the results of model-based and design-based simulations, and Section VII concludes.

II. Description of the data

A. Constructing measures of non-monetary welfare and poverty in the census

The analysis is conducted using measures of non-monetary poverty constructed from the Sri Lankan and Tanzanian censuses, which were each conducted in 2012. Non-monetary poverty, unlike monetary poverty, is directly observed in the census population and can therefore provide a benchmark for evaluation. For each country, we identified a set of household asset and demographic proxies for welfare. Principal component analysis, weighted according to household size, was used to derive a scoring coefficient, or loading factor, for each proxy welfare indicator. The welfare proxies used in the index and their estimated loading factors are listed in Table 1. Each household's non-monetary welfare measure is obtained by summing the product of each loading factor and the household's value of the associated variable. No subsequent adjustment for household size was made, reflecting the non-rival nature of the index variables within the household, although household size is included as a welfare proxy in the index for both countries. The results are broadly reasonable, as higher levels of education for the household head are associated with higher non-monetary welfare in both countries, assets and dwelling conditions are positively associated with welfare in Sri Lanka, and non-agricultural work is strongly associated with non-monetary welfare in Tanzania.

This measure of non-monetary welfare can be compared with a poverty line threshold to classify each household as poor or non-poor. We set the threshold at the 4th percentile of non-monetary welfare in Sri Lanka and the 20th percentile in Tanzania, to reflect prevailing national monetary poverty rates. In other words, households were classified as non-monetarily poor if their non-monetary welfare, defined as the score of the first principal component, fell in the bottom four percentiles in Sri Lanka and the bottom quintile in Tanzania.
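To fix ideas, the sketch below illustrates one way the weighted principal component index and the poverty flag could be constructed in R. It is a hedged, minimal sketch under stated assumptions, not the paper's code: the data frame census, the list of proxy variables proxy_vars, and the household size variable hhsize are hypothetical names, and the percentile threshold is computed without weights for brevity.

```r
# Minimal sketch (hypothetical names): household-size-weighted PCA welfare
# index and a percentile-based poverty flag.
wpca_index <- function(x, w) {
  cw  <- cov.wt(x, wt = w, cor = TRUE)               # weighted covariance/correlation
  pc1 <- eigen(cw$cor)$vectors[, 1]                  # loadings of the first component
  z   <- scale(x, center = cw$center, scale = sqrt(diag(cw$cov)))
  drop(as.matrix(z) %*% pc1)                         # first principal component score
}

welfare <- wpca_index(census[, proxy_vars], census$hhsize)
# The sign of an eigenvector is arbitrary; flip it if higher scores should
# correspond to higher welfare (e.g., check the loading on head's education).

pov_line <- quantile(welfare, probs = 0.20)          # 20th percentile (Tanzania);
census$poor <- welfare < pov_line                    # use 0.04 for Sri Lanka
```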
In the robustness checks section below, we also consider results when the poverty rates of the two countries are swapped.

Table 1: Factor loadings for census-based welfare index by country

Sri Lanka (scoring coefficients)
  Household size  0.09
  Dependency ratio
    Children 0-14 and 65+  0.04
    Children 0-14 only  -0.05
  Gender ratio  0.04
  Household education
    No schooling  -0.07
    Up to grade 5  -0.14
    Grade 5-10  -0.31
    O or A level  0.27
    College degree or higher  0.16
  Household assets
    House  0.04
    Computer  0.27
    Landline phone  0.23
    TV  0.28
  Housing characteristics
    Roof  0.14
    Private toilet  0.18
    Wall  0.19
    Waste disposal  0.15
    Safe water  0.13
    Main cooking fuel is wood  -0.23
    Electricity for light  0.27
  Head education
    No schooling  -0.11
    Up to grade 5  -0.17
    Grade 5-10  -0.13
    O or A level  0.29
    College degree or higher  0.13
  Age of head in years  0.04
  Head employment status
    Unemployed  -0.0005
    Public sector  0.15
    Private sector  -0.6
    Out of labor force  -0.04
  Male head  0.11
  Head marital status
    Unmarried  -0.04
    Married  0.12
    Widowed  -0.10
    Divorced  -0.06
  Number of variables  35

Tanzania (scoring coefficients)
  Head literate  0.45
  Head ever attended school  0.45
  Head age  -0.20
  Household size  -0.22
  Dependency ratio  -0.41
  Male head  0.12
  Non-agricultural work  0.29
  Non-livestock work  0.25
  Non-fishing work  0.01
  No disability  0.07
  Cash transfer beneficiary  0.13
  Number of variables  11

Notes: Table contains factor loadings from the first principal component, estimated in the census weighting by household size. Dependency ratio is equal to the ratio of non-prime age adults (0 to 14 and 65+) to household size.

B. Constructing synthetic household surveys

With a measure of non-monetary welfare and poverty for each census household in hand, we turn to drawing a synthetic household survey from each country's census. The synthetic survey and the auxiliary geospatial data are the key inputs into the small area estimation procedures. To draw the synthetic survey, we utilize the actual two-stage sample conducted by the National Statistics Offices for two household budget surveys: the 2018 Tanzania Household Budget Survey and the 2016 Sri Lanka Household Income and Expenditure Survey. These surveys were merged with the census at the subarea level, which is the GN Division in Sri Lanka and the village in Tanzania. After retaining the GN Divisions and enumeration areas (EAs) present in the budget survey, we randomly select census households in each matching EA to match the number of sampled households in that EA in each survey. Finally, we merge the sample weights from the household budget surveys for each subarea. Essentially, this procedure draws a survey that mimics as closely as possible the sample drawn by the NSO for the budget surveys.

C. Remote sensing data

The auxiliary data for the small area estimation exercise are drawn from a large candidate pool of satellite-based information, most of which is derived from publicly available layers and imagery. These include night-time lights from the Visible Infrared Imaging Radiometer Suite (VIIRS), at a spatial resolution of 15 arc-seconds, precipitation data from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS), elevation and slope taken from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) satellite, global forest cover change from Hansen (2013), and estimates of built-up area from the Global Human Settlement Layer (GHSL).
From this last layer, we compute the percentage of total built-up area observed in 2014 that was constructed prior to 1975 or during 1975-1990, 1990-2000, and 2000-2014.

The Sri Lanka indicators were also supplemented by a variety of spatial "texture" features derived from a cloud-free mosaic of 2017-2018 Sentinel-2 imagery, which is collected every 5 days by the sensors on board the two Sentinel-2 satellites, Sentinel-2A and 2B. This imagery is made publicly available by the European Space Agency and is the highest spatial resolution (10m per pixel) optical data that are publicly available. The "texture" features were calculated using Jordan Graesser's Sp.Feas package, an open-source Python library for processing contextual image features from satellite imagery. These contextual features have been shown to be strongly correlated with poverty and population density (Engstrom et al, 2017 and 2019a). The following contextual features were considered as candidate predictors:

o Fourier Transform (FT), which is used to detect high or low frequencies of lines.
o Gabor Filter, a linear Gaussian filter used for edge detection (Mehrotra et al, 1992).
o Histogram of Oriented Gradients (HOG), which captures the orientation and magnitude of the shades of the image (Dalal and Triggs, 2005).
o Lacunarity (LAC), which describes the extent of gaps and holes in a texture. Low-lacunarity geometric objects are homogeneous because all gap sizes are the same, whereas high-lacunarity objects are heterogeneous (Myint et al, 2006).
o Line Support Regions (LSR), which characterize line attributes (Ünsalan and Boyer, 2004).
o Normalized Difference Vegetation Index (NDVI), the most widely used vegetation index, which provides information about the health and amount of vegetation.
o PanTex, which is a built-up presence index derived from the grey-level co-occurrence matrix (Pesaresi et al, 2008).
o Structural Feature Sets (SFS), which are statistical measures to extract the structural features of direction lines (Huang et al, 2007).

These spatial features are calculated by comparing pixels with their neighbors and then reporting this value back to the individual pixel (in this case 10m). The number of neighboring pixels considered in the comparison is the scale, which varies by the feature being calculated. For most features, we use scales of 3, 5, and 7, which correspond to squares of 3 pixels by 3 pixels, 5 pixels by 5 pixels, and 7 pixels by 7 pixels.

The remote sensing data used for Tanzania, while largely derived from publicly available imagery, were supplemented by proprietary data on building footprints produced by Ecopia and Maxar. 10 The publicly available layers considered for Tanzania included:

o Night-time lights, taken from the same VIIRS imagery used for Sri Lanka.
o Population estimates taken from WorldPop, which is in turn derived from the 2011 census and distributed using measures of built-up area.
o Precipitation estimates taken from Willmott and Matsuura (2018).
o Elevation and slope taken from Jarvis (2008) and downloaded from AidData's GeoQuery database.
o Built-up area from the same GHSL data used for Sri Lanka, for the same periods.
o Climatic region, obtained from Kottek et al (2006). This indicator divides land into five main climate groups (tropical, dry, temperate, continental, and polar), with each group being divided based on seasonal precipitation and temperature patterns, based on a raster version of the map available at 0.5° resolution.
o Crop yield estimates for maize, sorghum, and rice from the HarvestChoice Dataverse (Wood-Sichra et al, 2016), which capture the estimated yield of various crops (ton/km) at 0.5° resolution.

The full set of satellite indicators used for each country is listed below in Table 2.

10 The team also calculated spatial features using the Sp.Feas package for Tanzania based on Sentinel 2 imagery, but these were found to be highly collinear with the building footprint data and were therefore discarded.

Table 2: Indicators and sources for auxiliary satellite data

Urbanization
  Night-time lights: VIIRS (Sri Lanka); VIIRS* (Tanzania)
  Building footprints: Ecopia and Maxar, via Gates Foundation* (Tanzania)
  Built-up area: Global Urban Footprint* (Tanzania)
  Built-up area: Global Human Settlement Layer* (Sri Lanka); Global Human Settlement Layer (Tanzania)
  Population: WorldPop (Tanzania)
  Agglomeration index: Belghith et al (2020) (see Appendix A) (Tanzania)
  Spatial features: Sentinel 2* (Sri Lanka)

Agro-climatic and other
  Precipitation: CHIRPS* (Sri Lanka); Willmott and Matsuura (2018) (Tanzania)
  Elevation and slope: ASTER sensor (Sri Lanka); Jarvis (2008)* (Tanzania)
  Global forest cover: Hansen (2013) (Sri Lanka)
  Climatic region: Kottek et al (2006) (Tanzania)
  Crop yield estimates: IFPRI HarvestChoice (Wood-Sichra et al, 2016) (Tanzania)
  Normalized Difference Vegetation Index: Sentinel 2 (Sri Lanka); Sentinel 2* (Tanzania)
  Market access: Belghith et al (2020)
  Natural disaster risk: UNEP/DEWA/GRID-Europe

Note: * indicates that at least one variable from this set was selected in the model of welfare.

D. Geographic structure of Sri Lanka and Tanzania

Before describing the methodology used in the study, we briefly review the geographic structure of the two countries. Table 3 shows the subareas, areas, and super-areas for each country. Subareas are the lowest level for which shapefiles were obtained, and are therefore the most disaggregated level for which geospatial data could be linked with the survey. Areas are the target domains for the small area estimates, which are at a higher level than the subareas, but below the level at which the survey is considered to be representative. Finally, super-areas are the lowest level for which the household budget survey is considered to be representative.

Table 3: Geographic structure of Sri Lanka and Tanzania

  Sub-areas: GN Division in Sri Lanka (13,978 in country); village in Tanzania (16,438 in country)
  Areas: DS Division in Sri Lanka (331); district in Tanzania (159)
  Super-areas: district in Sri Lanka (25); region in Tanzania (25)

Source: Authors' calculations from 2012 census data.

III. Methods

This section describes the three methods that are evaluated: direct survey estimates, the Fay-Herriot area-level model, and the household-level model. It also discusses important methodological considerations involved when estimating the household-level model, such as differences between software packages, transforming the dependent variable, the model selection procedure, and benchmarking. The final subsection considers the criteria used to evaluate the performance of different estimators.

A. Direct survey estimates

Direct estimates obtained from the survey serve both as a benchmark against which to assess the increase in precision from incorporating satellite data, and as an input into the Fay-Herriot area-level models. We use two methods to obtain direct survey-based estimates. For both methods, the mean poverty estimate for each area is the weighted average, across households, of the dummy variable indicating whether each household is poor. Each household is weighted according to its corresponding population weight in the survey.
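For illustration, direct estimates of this form can be computed with the R survey package. The sketch below is hypothetical rather than the paper's code: the variable names (psu, strata, hhweight, poor, area) are placeholders for the corresponding fields of the synthetic survey, and the standard errors it reports treat enumeration areas as independent clusters, which is one of the two variance approaches discussed next.

```r
# Minimal sketch (hypothetical variable names): survey-weighted direct
# estimates of area poverty rates from the synthetic household survey.
library(survey)

des <- svydesign(ids = ~psu, strata = ~strata, weights = ~hhweight,
                 data = synth_survey, nest = TRUE)

# Weighted headcount rate by target area; 'poor' is a 0/1 indicator, so the
# weighted mean is the poverty rate. SEs use the design's PSU clustering.
direct <- svyby(~poor, ~area, des, svymean)
head(direct)
```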
The two methods used to obtain direct estimates differ in how they estimate the variance of the means. The typical method of obtaining variance estimates clusters the residuals by enumeration area, which accounts for positive correlation within enumeration areas but assumes that disturbances across enumeration areas are independent. This assumption underestimates the variance of area-level poverty estimates, particularly if welfare is highly correlated across enumeration areas within target areas. The Horvitz-Thompson (H-T) approximation offers an alternative, more conservative, approach to estimating the variance of the area poverty rates (Horvitz and Thompson, 1952). This approach is more conservative because it relaxes the standard assumption that households' poverty statuses are independent if they live in different enumeration areas within the same target area. This comes, however, at the cost of imposing an alternative assumption, which is that the probability of one household appearing in the sample is independent of the probability of any other household appearing in the sample. This assumption is clearly violated in standard two-stage sample designs, as the probability of a household appearing in the sample is greater if a household in its same enumeration area also appears in the sample. Nonetheless, the violation of this second assumption underestimates variance by less than the standard assumption of clustered standard errors, and therefore yields less biased variance estimates than the standard approach. We include results for direct estimates based on both the Horvitz-Thompson approach and the standard approach, and in results presented below in Table 9 we find substantially higher coverage rates for the former.

B. Fay-Herriot area-level model

The Fay-Herriot model was first introduced by Fay and Herriot (1979) to model incomes for small areas of fewer than 1,000 persons in the United States, and is perhaps the best-known and most widely used small area estimator. It was derived as an application of the James-Stein (1961) estimator using data aggregated to the target area level, and is also the Empirical Best Linear Unbiased Predictor (EBLUP) of the area-level regression. The small area estimates are a weighted average of the synthetic regression prediction and the direct survey estimate, with the weights inversely related to the estimated variance of the direct and synthetic estimates. Many variants of the Fay-Herriot estimator have been developed that employ transformations and account for spatial and intertemporal correlation, among other refinements. We estimate the following basic area-level Fay-Herriot model:

\hat{\theta}^{direct}_a = X_a \beta + u_a + e_a    (1)

where \hat{\theta}^{direct}_a is the direct estimate of headcount poverty in area a obtained from the survey, X_a is a vector of area-level aggregate remote sensing variables, created by taking the population-weighted average of subarea-level aggregates, u_a is a random effect assumed to be normally distributed with mean 0 and variance σ²_u, and e_a is the sampling error, also assumed to be normally distributed with mean 0 and variance σ²_ea. The sampling variance σ²_ea is taken as given and set equal to the variance of the direct survey estimate, calculated using the Horvitz-Thompson approximation described above, as recommended by Halbmeier et al (2019).
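As an illustration, an area-level model of this form can be fit in R with the sae package. This is a hedged sketch, not the paper's implementation: the paper estimates the model with the Stata FHSAE command and a GLS ("Chandra") fitting method, whereas sae::mseFH below fits it by REML, and fh_data, theta_direct, var_direct, x1, and x2 are hypothetical names.

```r
# Minimal sketch (not the paper's code): area-level Fay-Herriot model of
# equation (1), fit by REML with the CRAN 'sae' package.
library(sae)

fh <- mseFH(theta_direct ~ x1 + x2,   # direct estimates on area-level covariates
            vardir = var_direct,      # Horvitz-Thompson sampling variances
            method = "REML", data = fh_data)

fh$est$eblup   # Fay-Herriot (EBLUP) small area estimates
fh$mse         # estimated mean squared errors
```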
As noted above, the predictions resulting from the Fay-Herriot model can be expressed as a weighted average of the direct estimates and the model predictions:

\hat{\theta}^{FH}_a = \hat{\gamma}_a \hat{\theta}^{direct}_a + (1 - \hat{\gamma}_a) X_a \hat{\beta}    (2)

where \hat{\gamma}_a = \hat{\sigma}^2_u / (\hat{\sigma}^2_u + \hat{\sigma}^2_{ea}) is the shrinkage factor. The shrinkage factor is the weight given in each area to the direct estimator relative to the model prediction. This declines as the variance of the direct sample estimate σ²_ea increases, because the direct sample estimate is less precise. Conversely, the shrinkage factor increases as the estimated random effect becomes less precise and σ²_u increases, giving greater weight to the direct sample estimate.

The Fay-Herriot estimator, unfortunately, has several weaknesses for the purposes of generating small area estimates of poverty using spatial auxiliary data. The main shortcoming is that the Fay-Herriot model requires aggregating the auxiliary data to the area level, which discards variation in the auxiliary data at the sub-area level that would increase the precision of the estimates. Implementing a sub-area model that appropriately adapts the Fay-Herriot model to address this issue, such as the one proposed by Torabi and Rao (2014), can incorporate auxiliary data at the sub-area level but is currently constrained by the lack of available software. In addition, the Fay-Herriot model is based on direct estimates of poverty rates, and therefore discards information on the variation in welfare within the poor and non-poor portions of the welfare distribution. Another issue with the Fay-Herriot model is that, when no transformation is applied, the poverty rate is assumed to be a linear function of the predictors. The Fay-Herriot model also faces challenges when no households are poor in a significant number of areas: for these areas, either the prediction from the auxiliary data must be ignored, the sample must be ignored, or the area must be dropped from the estimation. Finally, the Fay-Herriot model does not account for uncertainty in the survey-based estimates of variance, which tends to underestimate the variance of the prediction.

Despite these limitations, the Fay-Herriot model is much simpler to apply and explain than a unit-level model. It generally provides credible estimates and is a standard workhorse of small area estimation. We estimate a Fay-Herriot model by aggregating all indicators to the area level, weighting by population. The Fay-Herriot models are estimated using the Stata FHSAE command (Corral et al, 2018). We utilize the "Chandra method", which estimates the prediction model using GLS (Chandra et al, 2013).

C. Household-level model

In contrast to area-level models, unit-level models are specified at the level of the individual unit, which in this case is the household. 11 Although the auxiliary data only vary at the sub-area level, predicting welfare at the household level fully utilizes the information in the sample survey on the level and variability of household welfare. In addition, unlike sub-area models, well-established software packages are available that utilize empirical best methods to estimate unit-level models with area-level mixed effects. The EBP model can be written as follows: 12

G(Y_i) = \beta_1 X_{sa} + \beta_2 X_a + \beta_3 X_r + \eta_a + \epsilon_i    (3)

where:

Y_i is the non-monetary welfare of household i, which in each country is the principal component score constructed using the loading factors in Table 1.

G(Y_i) is a transformed version of welfare.
Traditionally, ELL uses the log function to transform welfare, but any monotonic transformation can be used. As described in greater detail below, we use an ordered quantile normal transformation for the primary set of results.

X_sa is a set of satellite-derived indicators that vary at the sub-area (village or GN Division) level.

X_a is a set of satellite-derived indicators that vary at the area level. The areas are DS Divisions (sub-districts) in Sri Lanka and districts in Tanzania. These predictors are important to include to reduce the variance of the area random effect and thereby mitigate bias in the estimated mean squared error (Elbers, Lanjouw, and Leite, 2008).

X_r is a set of regional dummy variables indicating super-areas for which the survey is considered to be representative. The Sri Lanka specification also includes dummies for the rural and estate sectors, while the Tanzania specification includes a rural dummy. 13

η_a is a random effect specified at the area level, assumed to be normally distributed. Although traditionally ELL has specified the area effect at the cluster or enumeration area level, doing so underestimates the mean squared error by failing to account for correlation in model error across clusters within areas (Marhuenda et al, 2018; Das and Chambers, 2018). 14 η_a is conditioned on the sample values of observed welfare, denoted y_s.

ε_i is a household-level idiosyncratic error term, also assumed to be normally distributed.

11 See Morris (1983) for a comprehensive review of EBLUP models, and Jiang and Lahiri (2006) for a comprehensive review of their application to small area estimation.
12 See Molina and Rao (2010) and Tzavidis et al (2018).
13 Sri Lanka is unique in that the national statistics office designates certain areas as the "Estate Sector". The estate sector mainly consists of tea plantations and is historically economically disadvantaged.
14 Omitting the cluster effect also can lead to bias by simultaneously overestimating the variance of epsilon and the variance of eta, but the resulting biases act in opposite directions and therefore partially cancel each other out.

A key feature of the EBP framework is that the random area effect η_a is conditioned on the sample values of observed welfare y_s. This, along with the assumption that the error components are normally distributed, allows for the use of the following model for generating simulated welfare:

G(Y_i^*) = \beta_1 X_{sa} + \beta_2 X_a + \beta_3 X_r + \hat{\eta}_a + \eta_a^* + \epsilon_i    (4)

\hat{\eta}_a = E[\eta_a | y_s]    (5)

\eta_a^* ~ N(0, \sigma^2_\eta (1 - \gamma_a))    (6)

\gamma_a = \frac{\sigma^2_\eta}{\sigma^2_\eta + \sigma^2_\epsilon / n_a}    (7)

γ_a is the shrinkage factor due to conditioning the random effect on the sample mean for area a, and n_a is the number of sample observations for area a. The shrinkage in variance due to conditioning the random effect on the sample will be higher when n_a is large or when the variance of the residual ε is low, because in these cases the sample contributes more information about the unexplained component of welfare. Shrinkage falls as the estimated within-area correlation in welfare, governed by σ²_η, declines, because in those cases a larger share of unobserved welfare is idiosyncratic to households rather than common to areas, so the sample residuals convey less information about the true area mean. Similarly, the shrinkage factor γ_a declines as the area sample size n_a declines, because the sample contributes less information about welfare in the area.

This basic structure is appealing in part because it is similar to the Elbers, Lanjouw, and Lanjouw (2003) model that has been used extensively for survey-census poverty mapping for decades, with the main difference being the conditioning of the random effect on the mean sample residuals.
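To make the conditioning in equations (5)-(7) concrete, the sketch below shows how a single conditional draw of the area effect could be generated in R. The inputs (estimated variance components, mean sample residual, and area sample size) are hypothetical, and the function is illustrative rather than the code used by the estimation packages.

```r
# Minimal sketch of the empirical best conditioning in equations (5)-(7).
# sigma2_eta, sigma2_eps: estimated variance components of eta and epsilon;
# resid_bar: mean regression residual among sampled households in area a;
# n_a: number of sampled households in area a. All inputs are hypothetical.
draw_area_effect <- function(resid_bar, n_a, sigma2_eta, sigma2_eps) {
  gamma_a <- sigma2_eta / (sigma2_eta + sigma2_eps / n_a)   # shrinkage factor (7)
  eta_hat <- gamma_a * resid_bar                            # conditional mean (5)
  rnorm(1, mean = eta_hat, sd = sqrt(sigma2_eta * (1 - gamma_a)))  # draws for (4), (6)
}

# With many sampled households and a small idiosyncratic variance, gamma_a is
# close to 1 and the draw stays close to the observed mean residual.
draw_area_effect(resid_bar = 0.3, n_a = 120, sigma2_eta = 0.15, sigma2_eps = 0.6)
```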
The concept of shrinkage in the nested error EBP framework is well-established in the small area estimation context (Battese, Harter, and Fuller, 1988; Jiang and Lahiri, 2006; Molina and Rao, 2010; Van der Weide, 2014). In addition, unlike Bayesian spatial autocorrelation models, nested error random effect models do not impose additional smoothness in model errors across neighboring areas. Spatial correlation is instead modeled entirely through the area random effects, which are conditioned on the sample residuals for that area.

As noted above, it is not clear that Empirical Best methods are preferred in the traditional poverty mapping setting where a household census serves as the auxiliary data. While EB methods benefit from incorporating information from the sample, the sample is typically a small share of the census population, leading to limited benefits from using EB. Meanwhile, an important drawback of EB is the required assumption that the components of the error term are normally distributed, which can reduce the accuracy of the estimates if the underlying distribution of the residuals is not normal. The choice of whether to use EB in a typical case, when the auxiliary data are a household census, must weigh the cost of assuming normality against the benefit of obtaining additional information from a relatively small sample.

The calculus changes, however, when the auxiliary data are specified at an aggregate geographic level, such as a subarea or area. In this case, EB estimation is essential to effectively combine the survey data with census-based predictions. Intuitively, the auxiliary data contribute less variation when they only vary across sub-areas, compared to the sample, because the sample, unlike the auxiliary data, contains household-level information. In the EBP framework described above, the use of predictors aggregated to the sub-area level increases the variance of the area effect σ²_η, which also represents the covariance of unexplained welfare across households within the same area. This covariance rises when using sub-area predictors because all households within each sub-area share the same set of characteristics used to predict welfare in the model. The increase in σ²_η due to the use of regionally aggregated predictors in turn increases the shrinkage factor γ_a, leading to a greater increase in precision when using an EBP model versus the traditional ELL estimator.

The assumption of normality is critical in the EBP framework, as a distributional assumption is required to condition the area-level random effect on the average residual of the regression in that area. In the EBP model, this conditioning is equivalent, in a Bayesian framework, to specifying the average of the residuals as a prior distribution for the random effect under the assumption of normality. The normality assumption is also necessary for the Monte Carlo simulations of the parameters, which are required because headcount poverty is a non-linear function of welfare. In general, software packages generate small area poverty estimates by repeatedly drawing area effects and the household idiosyncratic error from a normal distribution and aggregating the simulation results.
This results in biased estimates of poverty rates if the true error terms implied by the model are not in fact normal. Below, we discuss a monotonic transformation of the welfare measure that makes the assumption that the area effect and the household error term are normally distributed more palatable.

D. Software packages to estimate EBP models

Table 4 summarizes the three software packages that we utilize to estimate household-level models. The primary set of results reported below is estimated using a modified version of the R EMDI package (Kreutzmann et al, 2018). 15 We use this package for the main results because it is well-documented, flexible, and fast when running on powerful computers that can exploit parallel processing. In addition, it appends the sample to the census in order to generate slightly more accurate estimates. This package, like others designed for poverty mapping, estimates the mixed model (3) and then repeatedly draws both area effects and household error terms to simulate welfare. Simulated welfare is compared to the poverty line to determine the poverty status of each household in each simulation, which is aggregated across simulations to estimate poverty in each area. Estimates of mean squared error are obtained through a parametric bootstrapping approach, which is described in detail in Gonzalez-Manteiga et al (2008) and Molina and Marhuenda (2015). We leverage the open nature of the code to make one modification to the EMDI package related to the use of weights, which allows for household size to be used as a weight when aggregating the simulated household welfare outcomes into area-level poverty estimates. 16

As a robustness check, we also utilized the second and third versions of an alternative software package, the Stata SAE package (Nguyen et al, 2018). 17 The main differences between the packages are summarized in Table 4 and discussed in Annex A. These differences in design and methodology across packages, particularly the choice of a non-parametric versus parametric bootstrap method, affect the resulting estimates of poverty rates and their mean squared errors. Comparing simulated small area estimates against actual census data provides evidence on how these differences affect estimated poverty rates and mean squared errors in two use-case scenarios.

15 The R EMDI package is available on CRAN. It is an extension of the R SAE package, and runs substantially faster because it allows the user to parallelize the bootstrap simulations across several cores.
16 This modification involved taking a weighted mean instead of a simple mean across areas when aggregating households' simulated poverty status for each area. Code is available upon request from the authors.
17 Version 2 was released in September 2019 and is available for download at: https://github.com/jpazvd/sae. Version 1 is available through the Stata SSC archive. Version 2 implemented a clustered bootstrap at the level of the enumeration area, using the traditional non-parametric bootstrap. Version 3 is a comprehensive update that implements a parametric bootstrap approach similar to the R EMDI package (Corral et al, 2020).

Table 4: Differences between the Stata SAE and R EMDI packages

  Heteroscedasticity correction
    Stata ELL Version 2: Optional
    Stata ELL Version 3: Optional
    Modified R EMDI: Not available
  Type of bootstrap
    Stata ELL Version 2: Traditional non-parametric clustered bootstrap with varying sample composition
    Stata ELL Version 3: Parametric bootstrap with constant sample composition
    Modified R EMDI: Parametric bootstrap with constant sample composition
  Model fitting method
    Stata ELL Version 2: Henderson method 3 with non-parametric bootstrap
    Stata ELL Version 3: Henderson method 3
    Modified R EMDI: Maximum likelihood estimation
  Empirical Best estimation
    Stata ELL Version 2: Optional
    Stata ELL Version 3: Required
    Modified R EMDI: Required
  Sample weights
    Stata ELL Version 2: Optional
    Stata ELL Version 3: Optional
    Modified R EMDI: Not available
  Sample appended to census for simulations
    Stata ELL Version 2: Not available
    Stata ELL Version 3: Not available with sample weights
    Modified R EMDI: Required

E. Normalizing transformation

Welfare typically follows a right-skewed distribution, and it is therefore standard practice to transform the welfare indicator before implementing small area estimation models based on simulated welfare.
The most common transformation is to take the logarithm of welfare, following Elbers, Lanjouw, and Lanjouw (2003). A recent review article considers the issue of transformations for small area estimation in detail (Tzavidis et al, 2018). It reports design-based simulation results using Mexican data that considered alternative transformations, including log-shift and Box-Cox transformations (Box and Cox, 1964). A log-shift transformation involves adding a constant to welfare prior to taking the log, while the Box-Cox transformation can be written as (y^\lambda - 1)/\lambda for \lambda ≠ 0, and \log y for \lambda = 0. Tzavidis et al (2018) conclude on the basis of the simulation results that log-shift transformations tend to perform well.

Stata has implemented two commands that can quickly estimate the parameters for the log-shift and Box-Cox transformations that minimize the skewness of the welfare distribution. These commands, however, only go part of the way towards normalization, because they cannot guarantee that the resulting distribution will have a kurtosis of three, which is characteristic of a normal distribution. Furthermore, the log-shift transformation is not monotonic when the underlying distribution is left-skewed. 18 Finding an appropriate monotonic transformation for the welfare variable is important, as failure to satisfy the normality assumption can lead to systematic bias in EBP models. While benchmarking can be used to at least partially correct this bias, relying on benchmarking becomes more difficult to justify as the extent of the bias increases.

For these reasons, our preferred estimates utilize a monotonic transformation called the "ordered quantile normalization" to transform welfare. This method transforms the scaled rank of the welfare variable to conform with a normal distribution. Variants of this method date back more than 70 years (Bartlett, 1947; Van der Waerden, 1952). Among several methods proposed in the literature, the ordered quantile normalization most consistently transforms an underlying variable to follow a normal distribution (Peterson and Cavanaugh, 2019). We utilize the Ordernorm function in the Bestnormalize R package to implement this transformation. The method produces an exactly normal distribution of the transformed variable when the original welfare measure contains no ties, irrespective of its underlying distribution. While a normally distributed outcome variable does not guarantee normality of the residuals, it helps make the distribution of residuals closer to normal, and the ordered quantile normalization leads to smaller discrepancies between the official national poverty rate estimated from the survey and the weighted mean of the small area estimates.

18 When the underlying distribution is left-skewed, the skew-minimizing transformation becomes log(-y + c) for a constant c, which is not monotonic.
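As an illustration, the transformation can be applied with the orderNorm function in the bestNormalize package. This is a hedged sketch with hypothetical object names (smp, welfare), not the paper's modified code.

```r
# Minimal sketch (hypothetical names): ordered quantile normalization of the
# household welfare index with the CRAN 'bestNormalize' package.
library(bestNormalize)

on <- orderNorm(smp$welfare)   # rank-based mapping to a normal distribution
smp$welfare_n <- on$x.t        # transformed welfare used as the model outcome

# With no ties, the transformed variable is essentially normal in shape:
# skewness near 0 and (raw) kurtosis near 3.
c(skewness = moments::skewness(smp$welfare_n),
  kurtosis = moments::kurtosis(smp$welfare_n))
```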
To calculate small area estimates of the poverty gap, poverty severity, or inequality, it is critical to retransform the estimates back to the original welfare metric, as is done in the ELL method by exponentiating log welfare in each simulation. However, this retransformation is not necessary when estimating headcount poverty rates. This is because households' poverty status is preserved under any monotonic transformation applied to both the welfare measure and the poverty line threshold. We therefore set the poverty line to the percentile of the transformed welfare measure that corresponds to the assumed national poverty rate, which is 4 percent in Sri Lanka and 20 percent in Tanzania. Because the ordered quantile normalization transformation is based on household ranks within the distribution, applying it entails information loss regarding the shape of the distribution between household ranks, and alters the estimated parameters of the model. To verify that this loss of information is minor compared with the benefits of normalizing the welfare measure, we also consider in the robustness section below a more traditional Box-Cox normalization procedure designed to minimize skewness.

F. Model selection using the Lasso

Model selection in small area estimation remains an unsettled issue and has traditionally been treated as both art and science. Practice can vary widely. Tzavidis et al (2018), for example, select only six covariates present in Mexican census and survey data, based in part on the Akaike Information Criterion (AIC) from a standard OLS model. Most small area applications using ELL, however, use a considerably larger set of variables. Zhao and Lanjouw (2008), for example, note that many successful applications include fewer than 20 household-level variables. However, they also advocate including cluster-level means, which can boost the number of variables significantly past 20 or 30.

In recent years, the least absolute shrinkage and selection operator (LASSO) has become an increasingly popular tool for selecting prediction models. Small area estimation offers a natural application of the "post-lasso" procedure, in which the variables selected via the lasso are used as predictors in the mixed model used to estimate the parameters of the small area estimation model. 19 The post-lasso procedure offers the benefit of a convenient and data-driven approach to model selection that, in its most popular variant, maximizes out-of-sample predictive accuracy, while roughly equalizing in- and out-of-sample R2 to prevent overfitting the model to the sample. The LASSO framework formalizes and extends the advice in Zhao and Lanjouw (2008) to use out-of-sample cross-validation to ensure that the model is stable and predicts well out of sample. Several varieties of LASSO have been implemented in popular statistical software. We use the plugin lasso estimator, also known as the rigorous lasso, implemented in Stata version 16 (StataCorp, 2019). This variant of lasso is computationally faster and more parsimonious than others that use cross-validation to minimize mean out-of-sample prediction error, and the resulting models maintain strong predictive power. 20

19 See Hastie et al (2015) for a review of the lasso and related estimators, and Belloni and Chernozhukov (2013) for more details on the post-lasso procedure.
20 We use the variant of the Stata plugin estimator that allows for heteroscedastic errors. The plugin method uses an iterative formula to select the lambda shrinkage parameter in the lasso instead of a grid search. See StataCorp (2019), Belloni and Chernozhukov (2011), and appendix A of Belloni, Chernozhukov, and Hansen (2014) for technical details regarding the implementation of the plugin lasso estimator.
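To illustrate how the pieces of sections C through F fit together, the sketch below selects predictors with a rigorous (plugin-type) lasso and then fits a household-level EBP model on the normalized welfare measure. It is a hypothetical sketch rather than the paper's code: the paper uses Stata's plugin lasso and a modified EMDI package, while here the CRAN hdm and emdi packages stand in, and all data set and variable names (smp, pop, welfare_n, candidate_vars, district) are placeholders.

```r
# Minimal sketch (hypothetical names): plugin-style lasso selection followed by
# an EBP fit with parametric-bootstrap MSE estimates.
library(hdm)
library(emdi)

# 1. Rigorous lasso on the normalized welfare measure; post = TRUE refits OLS
#    on the selected predictors (the "post-lasso" step).
sel  <- rlasso(x = as.matrix(smp[, candidate_vars]), y = smp$welfare_n, post = TRUE)
cf   <- coef(sel)
keep <- setdiff(names(cf)[cf != 0], "(Intercept)")

# 2. Household-level EBP model with the selected subarea/area predictors.
#    Welfare is already normalized, so no further transformation is applied and
#    the poverty line is the percentile matching the assumed national rate.
pov_line <- quantile(smp$welfare_n, probs = 0.20)   # e.g., 20 percent for Tanzania
fit <- ebp(fixed = reformulate(keep, response = "welfare_n"),
           pop_data = pop, pop_domains = "district",
           smp_data = smp, smp_domains = "district",
           threshold = pov_line, transformation = "no",
           L = 100, MSE = TRUE, B = 100, cpus = 4)

head(fit$ind$Head_Count)   # small area headcount estimates
head(fit$MSE$Head_Count)   # parametric-bootstrap MSE estimates
```

In the paper's implementation, the modified EMDI package additionally weights households by household size when aggregating simulated poverty statuses into area estimates; the unmodified CRAN version sketched here does not.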
G. Benchmarking

Benchmarking refers to the procedure of forcing model-dependent estimators to agree with the direct sample estimates for super-areas, defined as the lowest geographic level for which the sample is considered representative. Benchmarking can be desirable to ensure that the population-weighted average of small area poverty estimates, when aggregated to super-areas, matches official published statistics derived from the survey. Simulations indicate that in the cross-sectional context, simple "ratio" benchmarking performs as well as other methods and there is no significant loss of efficiency from benchmarking in general (Pfefferman et al, 2014; Wang et al, 2008). We therefore apply a simple ratio benchmarking procedure by multiplying the estimated headcount rate in each area by a scaling factor unique to each super-area, for both the Fay-Herriot and household-level model estimates. The scaling factor is defined as the ratio of the estimated poverty rate obtained from the survey to the population-weighted mean poverty rate of the small area estimates, for each super-area.

This leaves open the question of how to adjust the mean squared error estimates while benchmarking. Ideally, to obtain accurate estimates of the mean squared error, benchmarking would be performed by the small area estimation package within each bootstrap replication. Unfortunately, this option is not currently implemented in available software packages. As a second-best solution, we scale the mean squared error by the square of the same scaling factor that is applied to the point estimates, which leaves the coefficient of variation unchanged by the benchmarking process. Because the point estimates are slightly underestimated in Tanzania and Sri Lanka, this procedure increases the mean squared error and improves coverage rates. However, as a robustness check, we also report results for estimates in which only the point estimates are benchmarked, and in which neither the point estimates nor the mean squared errors were benchmarked.

H. Criteria for evaluating estimates

The synthetic sample, auxiliary data, and methodology can be used to generate small area estimates of headcount poverty and their mean squared error. But how exactly does one evaluate the estimates produced by different small area estimation methods? We examine seven summary statistics, averaged across areas; a short sketch of how they can be computed follows the list below. We give each area equal weight in the evaluation to better reflect the goal of obtaining accurate estimates for each target area rather than estimates that can be aggregated to higher levels. The seven summary statistics are divided into four buckets as follows:

• Mean Poverty Rate and Average Relative Bias. The former represents the simple average of the predicted area headcount rates across areas, prior to benchmarking. For the direct survey estimates, the mean poverty rate is the unweighted average poverty rate across areas.
The mean of the small area estimates can differ from this unweighted average, due to bias caused by either violations of the normality assumptions or a non-representative sample. Substantial bias would necessitate a greater reliance on benchmarking, which would in turn make the imperfect adjustment of the mean squared errors more problematic. Therefore, estimators with mean poverty rates closer to those of the direct estimates are, ceteris paribus, preferred. Meanwhile, the average relative bias (ARB) equals the average, across areas, of the ratio of the difference between the estimated and true poverty rates to the true poverty rate:

ARB = \frac{1}{N_A} \sum_{a=1}^{N_A} \frac{\hat{\Theta}_a - \Theta^*_a}{\Theta^*_a}

where N_A is the number of areas, \hat{\Theta}_a is the estimated poverty rate for area a, and \Theta^*_a is the true poverty rate calculated from the census. Like the average poverty rate, it is a measure of the systematic bias in the estimates rather than accuracy. It differs from the average poverty rate, however, by reflecting any systematic bias at the area level after the benchmarking procedure.

• Average Mean Squared Error (MSE) and Average Coefficient of Variation (CV). These measure the degree of uncertainty associated with the point estimates. Estimates of the MSE for the model-based estimates are generated using a parametric bootstrap under the assumption that the model holds, and much remains to be learned about how model misspecification affects MSE estimates. The CV for each area is defined as the ratio of the square root of the estimated mean squared error to the mean estimated poverty rate, a definition sometimes referred to as CV(RMSE). We report the average CV across areas. CVs are an important indicator of uncertainty because in many cases national statistics offices will not publish statistics when CVs exceed a maximum threshold. 21 Comparisons of the average CV between the model-based and direct estimates should be interpreted with caution, because CVs of the direct estimates are not defined for areas with no poor households in the survey, leading the set of areas included in the average to differ for direct estimates.

• Correlation and Root Mean Squared Error. These measure the accuracy of the prediction. The correlation represents the simple correlation between the small area estimates and the actual headcount poverty rates calculated from the full census. The Root Mean Squared Error (RMSE) is equal to the square root of the average squared difference between the estimates and the actual census poverty rates:

RMSE = \sqrt{\frac{1}{N_A} \sum_{a=1}^{N_A} (\hat{\Theta}_a - \Theta^*_a)^2}

The root mean squared error in this context is a function of the predicted poverty rates and is therefore a measure of accuracy. In contrast, the average mean squared error described above is a measure of precision or uncertainty.

• Average Coverage Rate. The average coverage rate represents the share of areas for which the estimated 95 percent confidence interval contains the actual census poverty rate, and is a standard metric used to assess the performance of estimators in the literature. 22 The upper and lower bounds of the confidence interval are determined by multiplying the square root of the estimated mean squared error by 1.96 and adding and subtracting it from the point estimate.

21 For example, the US Census Bureau considers serious data quality issues related to sampling error to occur when the estimated coefficients of variation (CV) for the majority of the key estimates are larger than 30 percent.
22 See, for example, Tarozzi and Deaton (2009), Demombynes et al (2007), Elbers, Lanjouw, and Leite (2008), Pratesi and Salvati (2007), and Hidiroglu and Yu (2016), among many others.
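The sketch below shows how these summary statistics can be computed once the estimates are in hand. It assumes three hypothetical numeric vectors with one element per area: est (point estimates), mse (estimated mean squared errors), and truth (census poverty rates).

```r
# Minimal sketch (hypothetical vectors est, mse, truth): the evaluation
# metrics described above, giving every area equal weight.
mean_poverty <- mean(est)                                  # mean poverty rate
arb          <- mean((est - truth) / truth)                # average relative bias
avg_mse      <- mean(mse)                                  # average MSE
avg_cv       <- mean(sqrt(mse) / est)                      # average CV(RMSE)
corr         <- cor(est, truth)                            # correlation with census
rmse         <- sqrt(mean((est - truth)^2))                # accuracy of estimates
coverage     <- mean(truth >= est - 1.96 * sqrt(mse) &
                     truth <= est + 1.96 * sqrt(mse))      # 95% CI coverage rate
```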
The area-level predictor variables were selected using the lasso plugin method, to prevent overfitting the model to the sample. The set of candidate variables included all area-level remote sensing indicators as well as super-area dummies, although no super-area dummies were selected in Tanzania. In each country, the lasso selected a small number of predictor variables, only 3 in Tanzania and 8 in Sri Lanka. The dependent variable is headcount poverty, expressed as a percentage from zero to one hundred. In Sri Lanka, less annual variability in rainfall as well as a positive rainfall shock in the first quarter are associated with lower poverty rates. In Tanzania, districts with larger buildings are much less poor, as are areas with a larger share of built-up area, while more rural districts tend to be poorer. Despite containing only three variables, the Tanzania model explains half of the variation in non-monetary poverty rates. Meanwhile, the eight Sri Lanka variables explain only 26 percent of the variation in headcount poverty across subdistricts. Overall, the results suggest that the Fay-Herriot estimates, despite using predictors that only vary across areas, can significantly improve on the efficiency of the direct estimates, particularly in Tanzania where the model fit is especially good.
21 For example, the US Census Bureau considers serious data quality issues related to sampling error to occur when the estimated coefficients of variation (CV) for the majority of the key estimates are larger than 30 percent.
22 See, for example, Tarozzi and Deaton (2009), Demombynes et al (2007), Elbers, Lanjouw, and Leite (2008), Pratesi and Salvati (2007), Hidiroglou and You (2016), among many others.
23 In particular, examining the Fay-Herriot model allows us to quantify the potential gains in efficiency and predictive accuracy from including sub-area level predictors in these contexts.

Table 5: GLS coefficients from Fay-Herriot estimation

Sri Lanka
  Mean nighttime lights in 2016                        -0.003
  1990 built-up area (GHSL)                            -0.006
  Standard deviation of rainfall                        0.006***
  Rainfall deviation from historical mean, Q1         -12.16***
  Colombo                                              -4.55**
  Gampaha                                              -3.80*
  Matara                                               -1.44
  Ratnapura                                            -6.58***
  Constant                                             -7.37**
  R2                                                    0.26
  Number of observations                                328

Tanzania
  Mean area of nearest 25 buildings                   -19.45***
  Percent built-up area (Global Urban Footprint)       -2.56*
  Share rural                                           0.83
  Constant                                             84.70***
  R2                                                    0.50
  Number of observations                                159

Notes: Coefficients based on Fay-Herriot estimation of poverty rates using the Chandra method. Predictor variables selected using the lasso plug-in method.

B. Household-level model
Next, we turn to the unit level model, and examine the results of the post-lasso regression used to estimate the model parameters in Table 6. The first striking result is the overall predictive power of the models. The R2 from the model is 0.32 in Tanzania and 0.25 in Sri Lanka, which is notable given that the dependent variable, non-monetary welfare, is measured at the household level while all the independent variables vary only at the subarea level. A Shapley decomposition indicates that about half of the model’s explanatory power in Sri Lanka and about a third in Tanzania are derived from super-area and sector dummies, which by construction cannot explain any variation in non-monetary welfare across areas within super-areas.
This illustrates one of the limitations of relying solely on R2 as a measure of the ability of the model to discern poor areas within the super-areas for which credible survey estimates are already available. Below, we consider an alternative, more informative metric, which is the extent to which the small area estimation procedure increases the precision of area estimates relative to direct survey-based estimates.
At the subarea level, the satellite indicators explain an impressive amount of the variation in average non-monetary welfare. In sub-area level regressions, predictor variables explain 73 percent of the variation in average non-monetary welfare in Tanzania and 59 percent in Sri Lanka. The subarea model R2 of 0.73 for Tanzania is substantially higher than the estimates presented in Jean et al (2016), which find that features derived from satellite data explain 58 percent of the DHS asset index in Tanzania. The stronger predictive power of the model for Tanzania likely results from the use of a richer welfare index as the dependent variable, as well as the use of interpretable features derived from imagery such as building footprints, rather than features optimized to predict night-time lights. 24 Meanwhile, the R2 of 59 percent for Sri Lanka is remarkably similar to estimates of the explanatory power of monetary poverty reported in Engstrom et al (2017). 25
24 Ayush et al (2020) also finds that the use of interpretable features instead of features derived from transfer learning increased predictive power by 31 percent in Uganda, which is consistent with the stronger performance seen in these models.
25 Engstrom et al (2017) averages a model-based estimate of poverty derived from the census over large numbers of census households in each village, which makes it similar in practice to a non-monetary measure of poverty.

Table 6: Post-lasso model of normalized household per capita consumption

Sub-area variables, Sri Lanka:
  1990 built-up area -0.30; 2014 built-up area 1.82***; Standard deviation of rain 0.00; Rain Z-score, Q2 0.05*; Rain Z-score squared, Q4 -0.45; Fourier Transform mean, scale 7 0.08***; Line Support Region mean, scale 7 0.47**
Sub-area variables, Tanzania:
  Sum of nighttime lights 0.03*; Mean population (GHSL) 0.01; Minimum agglomeration index -0.01*; Mean std. dev. of size of 5 nearest buildings -0.01; Sum of mean size of 5 nearest buildings 0.30***; Number of buildings within 100 m 0.05**; % of area never built-up, 1975-2015 (GHSL) -0.25; % of area built up 1990 to 2000 (GHSL) 0.13; % of area built up 1975 to 1990 (GHSL) 1.47***; % of area built-up (GUF) 0.04; Precipitation in 2014 0.00; % humid tropical rainforest (Köppen classification) 5.49***; Standard deviation of NDVI 56.67***
Area variables, Sri Lanka:
  1990 built-up area 2.47***; Standard deviation of rain 1.06***; Rain Z-score squared, Q3 -0.49; Rain Z-score squared, Q4 -0.46; Gabor standard deviation, scale 5 -0.83***; Histogram of Oriented Gradients standard deviation, scale 5 -1.74***; Line Support Region mean, scale 7 -0.69**; Structural feature sets mean, scale 7 1.00***
Area variables, Tanzania:
  Mean of nighttime lights 0.15; Minimum of nighttime lights 0.48; Maximum economic costs of drought 0.00
Super-area variables, Sri Lanka:
  Colombo 2.47***; Kalutara 1.06***; Hambantota -0.49; Jaffna -0.46; Mannar -0.83***; Mullaitivu -1.74***; Kilinochchi -0.69**; Batticaloa 1.00***; Trincomalee -0.14; Puttalam 7.33***; Anuradhapura 2.47***; Badulla 1.06***; Moneragala -0.49; Rathnapura -0.46
Super-area variables, Tanzania:
  Morogoro 0.18**; Simiyu -0.21**
Sector, Sri Lanka: Urban 2.65***; Estate -2.02***
Sector, Tanzania: Rural -0.34***
Constant: Sri Lanka -4.49***; Tanzania -1.29***
R2: Sri Lanka 0.251 (of which: sub-area variables 0.060, area variables 0.058, super-area and sector dummies 0.129); Tanzania 0.321 (of which: sub-area variables 0.134, area variables 0.059, super-area and sector dummies 0.104)
Number of observations: Sri Lanka 19,570; Tanzania 9,393
Notes: Stars indicate statistical significance at the 5, 1, and 0.1 percent levels with robust standard errors. Subarea variables are aggregated across GN Divisions in Sri Lanka and villages in Tanzania. Area variables are population-weighted averages of sub-area variables aggregated to the area level. Super-area variables are district dummies in Sri Lanka and regional dummies in Tanzania. Sector dummies are a rural dummy in Tanzania and urban and estate sector dummies in Sri Lanka. See Table 2 for sources of remote sensing indicators.

For Tanzania, the lasso model selected several building footprint and built-up area measures, as well as measures of night-time lights that capture building and population density. Building counts, night-time lights, and built-up area are all positively associated with welfare. Several climatic variables were also selected, reflecting the dependence of rural areas on favorable rainfall patterns. Higher variance in the NDVI vegetation index, reflecting areas that contain a mix of built-up area and green space, is also positively related to welfare. The Sri Lanka model contains measures of built-up area and rainfall, as well as texture features such as the Fourier Transform mean and Line Support Region mean at the subarea level, and other features such as the standard deviation of the Gabor filter and the Histogram of Oriented Gradients at the area level. This is consistent with previous research showing that these texture algorithms or contextual features reflect spatial variability in building and road patterns, as well as poverty. 26
A Shapley decomposition indicates that the higher R2 in Tanzania, compared with Sri Lanka, can be attributed to more predictive sub-area variables, which explain 13 percent of the variation in Tanzania as opposed to 6 percent in Sri Lanka. This is likely due to two main factors. The first is that the Tanzanian indicators include measures of building footprints at the subarea level. The second is that Tanzania is a less developed setting in which building density and climatic variables, which most of the satellite variables are capturing, are more strongly correlated with non-monetary welfare. 27
C. Evaluation Results
The previous section demonstrated that remote sensing indicators are predictive of both area-level poverty and household non-monetary welfare. This section turns to evaluating the small area estimates generated by different methods. Before considering the evaluation results, Table 7 reports model diagnostics for the unit-level model. The geospatial indicators in the model predict about 27 and 30 percent of the variation in household non-monetary welfare in Sri Lanka and Tanzania, respectively. 28
The conditional R2, which reflects the gain in model fit from conditioning the mixed effect on the sample, is about 3 percentage points higher than the marginal R2 in both countries. The distribution of both the area effect and the household error term is roughly normal, reflecting the success of the normalizing transformation. The Shapiro-Wilk test for normality of the area effect fails to reject the null hypothesis of a normally distributed area random effect in both countries. Figure 1, which plots the quantiles of both the area effects and the household residuals against the quantiles of the normal distribution for each country, confirms that the distribution of both components of the error term is approximately normal, although there are negative outliers in the left tail of the area random effect in Tanzania.
Table 8 begins by presenting the (unweighted) mean poverty rates and the uncertainty of the small area estimates, as measured by the mean squared error, multiplied by ten thousand, and the mean coefficient of variation. The table presents the results for direct survey estimates for both the standard method and the Horvitz-Thompson estimates, as well as the Fay-Herriot area model and the unit-level model estimated with the modified EMDI package. All are simple unweighted averages across areas.
26 Engstrom et al. 2019b, Engstrom et al. 2019c, Sandborn and Engstrom 2016.
27 In the simulated surveys drawn from the census, the rural dummy explains 24 percent of the variation in household non-monetary welfare in Tanzania, while the urban and estate sectors only explain 6 percent in Sri Lanka.
28 This corresponds to the marginal R2 in Table 7. The R2 results differ slightly from the post-lasso model because the small area estimation model does not use weights.

Table 7: Details and diagnostics for unit-level models
                                                            Sri Lanka    Tanzania
Poverty rate                                                4.0%         20.0%
Number of variables selected for model                      28           19
Number of households in sample                              19,570       9,393
Number of households out of sample                          4,190,436    8,901,279
Number of sub-areas in survey                               2,487        786
Number of sub-areas in census                               13,985       14,981
Number of areas in survey                                   328          159
Number of areas in census                                   331          159
Unit-level model diagnostics
Marginal R2                                                 0.266        0.299
Conditional R2                                              0.296        0.323
Skewness of area effect                                     0.101        -0.26
Kurtosis of area effect                                     3.385        3.60
Shapiro-Wilk P-value                                        0.558        0.316
Skewness of household error                                 -0.05        -0.12
Kurtosis of household error                                 3.23         3.51
Variance of estimated area effect                           0.036        0.024
Variance of estimated household residual                    0.723        0.682
Percentage of variance of error term due to area effect    5.0%         3.5%
Notes: Table reports the number of selected variables, number of observations, number of sub-areas and target areas, conditional and marginal R2, indicators of normality for the area random effect and household residuals, and the percentage of total variance in the error term accounted for by the area random effect.

Figure 1: Quantile-quantile plots of household error terms and area random effects (left panel: Sri Lanka; right panel: Tanzania)
Notes: Figures report normal quantile-quantile plots of estimated household welfare residuals (error term) and area random effects (random effect).

The results for the mean poverty predictions in Table 8 show that the non-benchmarked Fay-Herriot estimates underestimate poverty, by a slight amount in Sri Lanka but by a more significant amount in Tanzania. The difference in precision is more striking.
The Fay-Herriot estimators are significantly more precise than the Horvitz-Thompson direct estimates in both countries. Relative to these more conservative direct estimates, the Fay-Herriot procedure shrinks the mean squared error by nearly half in both countries. However, the household-level model, compared with the Horvitz-Thompson estimates, shrinks the mean squared error substantially more, by 70 percent in Sri Lanka and 85 percent in Tanzania.

Table 8: Average mean squared error (MSE) and coefficient of variation (CV) of target area estimates of headcount non-monetary poverty, by country and method
                                         Sri Lankan subdistricts               Tanzanian districts
                                         Mean poverty  Mean MSE  Mean CV       Mean poverty  Mean MSE  Mean CV
Direct survey estimates
  Horvitz-Thompson approximation         4.0           15.5      65.2          20.0          73.7      43.3
  EA-clustered variance estimates        4.0           10.9      58.6          20.0          59.2      38.1
Area-level Fay-Herriot model             3.7           8.3       47.9          16.8          35.8      34.2
Household-level EB model                 3.7           4.6       32.0          18.6          11.1      17.5
Notes: Mean poverty refers to the average of area headcount rates prior to benchmarking. Mean MSE is the mean across areas of the mean squared error, estimated using a parametric bootstrap approach, multiplied by 10,000.

Table 9 shows the evaluation results against the area poverty rates derived from the census. Average relative bias is highest for the Fay-Herriot estimator in both countries, and lowest for the direct estimates. Of greater interest for evaluating accuracy, however, are the results for correlation and root mean squared error. In both countries, the small area estimates are far more accurate than the direct estimates, but the comparison between the household-level model and the Fay-Herriot model is mixed. In Sri Lanka, the household model estimates are more highly correlated with the census than the Fay-Herriot estimates by a moderate amount (0.88 vs 0.84), but slightly less strongly correlated in Tanzania (0.876 vs. 0.883). The Fay-Herriot predictions have a lower root mean squared error in Sri Lanka, reflecting fewer large outliers, but the root mean squared error is virtually equal in Tanzania. Overall, with regard to accuracy, both the household-level model and the Fay-Herriot model perform well.
The household model estimates are substantially more precise than the Fay-Herriot model and the direct estimates, but the tighter confidence intervals around the household model estimates may lower coverage rates. The last column of Table 9 shows the coverage rate, which for the household model in Sri Lanka is 84.3%. This is slightly higher than the Horvitz-Thompson estimates (82.2%), moderately higher than the clustered direct estimates (76.1%), but substantially lower than the Fay-Herriot estimates (91.8%). For Tanzania, the coverage rate for the household-level model is 74.8%. This is the lowest of the four estimators but only modestly lower than that of the standard enumeration-area clustered estimates (76.1%). The Horvitz-Thompson and Fay-Herriot estimators achieve much higher coverage rates of 85.5% and 91.8%, respectively.
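The point estimates reported in Table 9 below, unlike the pre-benchmarking means in Table 8, have been rescaled to match the super-area direct estimates. A minimal sketch of the ratio benchmarking and MSE adjustment described earlier, assuming the area estimates, their MSEs, a super-area identifier, and population totals are held in a simple data frame; all names and numbers are illustrative.

```python
import numpy as np
import pandas as pd

def ratio_benchmark(df, direct_super):
    """Ratio benchmarking: scale each area's estimate so that the
    population-weighted mean within each super-area matches the direct
    survey estimate, and scale the MSE by the square of the same factor
    (leaving the coefficient of variation unchanged).

    df           : DataFrame with columns [super_area, est, mse, pop]
    direct_super : dict mapping super_area -> direct survey poverty rate
    """
    out = df.copy()
    for s, grp in df.groupby("super_area"):
        weighted_mean = np.average(grp["est"], weights=grp["pop"])
        factor = direct_super[s] / weighted_mean          # super-area scaling factor
        out.loc[grp.index, "est"] = grp["est"] * factor
        out.loc[grp.index, "mse"] = grp["mse"] * factor ** 2
    return out

# Toy example: two super-areas with two areas each (illustrative numbers only)
df = pd.DataFrame({
    "super_area": ["A", "A", "B", "B"],
    "est": [0.10, 0.20, 0.30, 0.40],
    "mse": [0.001, 0.002, 0.003, 0.004],
    "pop": [1000, 3000, 2000, 2000],
})
print(ratio_benchmark(df, {"A": 0.18, "B": 0.33}))
```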
Table 9: Average relative bias, correlation, root mean squared error, and coverage rate of target area estimates of headcount non-monetary poverty, by country and method
                                 Sri Lankan subdistricts                       Tanzanian districts
                                 ARB     Correlation  RMSE    CR               ARB     Correlation  RMSE    CR
Direct survey estimates
  H-T                            0.257   72.5         0.049   82.2             0.043   77.0         0.088   85.5
  EA-clustering                  0.257   72.5         0.049   76.1             0.043   77.0         0.088   76.1
Area-level Fay-Herriot model     0.395   84.1         0.026   91.8             0.071   88.3         0.052   91.8
Household-level EB model         0.293   87.6         0.034   84.3             0.062   87.6         0.053   74.8
Notes: ARB refers to average relative bias, which is the average ratio of the difference between estimated and census poverty rates to the census poverty rate. Correlation refers to the unweighted correlation across areas between estimated and census non-monetary poverty rates. RMSE refers to root mean squared error, which is the square root of the average squared difference between estimated and census poverty rates. CR stands for coverage rate, which is the share of areas for which the estimated 95 percent confidence interval for the area non-monetary poverty rate contains the census non-monetary poverty rate. Point estimates have been rescaled to match direct survey estimates at the regional level and mean squared errors adjusted accordingly. H-T refers to the Horvitz-Thompson approximation, while EA clustering refers to standard variance estimates clustered by enumeration area.

Figures 2 and 3 give a detailed look at the point estimates and mean squared errors of the area poverty estimates for each method. Sri Lanka is shown in the left panel and Tanzania in the right, and in each country the areas are ordered according to their non-monetary poverty rate in the census. The results clearly show that both the Fay-Herriot and household-level models greatly improve on the accuracy of the direct estimates, especially in areas with higher poverty rates. In Sri Lanka, many of the direct estimates are zero, even in high-poverty areas. The comparison between the household and Fay-Herriot models is less clear, but the Fay-Herriot estimates appear to be more prone to overestimate poverty for low-poverty areas and underestimate poverty for high-poverty areas. 29 Figure 3 presents the mean squared errors produced by the different methods. It clearly demonstrates that the household model estimates are substantially more precise than both the Fay-Herriot estimates and the direct estimates in both countries.
29 As described above, the Fay-Herriot estimates use the predicted poverty rate from the model in areas where no sampled households are poor.
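The statistics reported in Tables 8 and 9 follow directly from the definitions in the evaluation criteria section. A minimal sketch of how they can be computed from arrays of estimated poverty rates, census poverty rates, and estimated MSEs; the function and variable names are illustrative, and the 1.96 half-width mirrors the confidence intervals described earlier.

```python
import numpy as np

def evaluation_stats(est, census, mse):
    """Area-level evaluation statistics of the kind reported in Tables 8 and 9.

    est    : array of estimated area poverty rates (after benchmarking)
    census : array of 'true' poverty rates computed from the full census
    mse    : array of estimated mean squared errors for each area
    """
    half_width = 1.96 * np.sqrt(mse)               # 95 percent interval half-width
    covered = (census >= est - half_width) & (census <= est + half_width)
    return {
        "mean_poverty": est.mean(),
        "avg_relative_bias": np.mean((est - census) / census),
        "correlation": np.corrcoef(est, census)[0, 1],
        "rmse": np.sqrt(np.mean((est - census) ** 2)),
        "avg_cv": np.mean(np.sqrt(mse) / est),
        "coverage_rate": covered.mean(),
    }

# Toy example with four areas (illustrative numbers only)
est = np.array([0.05, 0.12, 0.20, 0.35])
census = np.array([0.04, 0.15, 0.18, 0.40])
mse = np.array([0.0004, 0.0009, 0.0010, 0.0025])
print(evaluation_stats(est, census, mse))
```

Multiplying the MSE array by a constant factor and recomputing the coverage rate is, in essence, the adjustment used below to equalize coverage rates across methods in Table 10.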
Figure 2: Comparison of area poverty estimates by method for Sri Lanka (left) and Tanzania (right); vertical axis: poverty rate
Notes: Figures show predicted asset-based poverty rates generated by the household-level model, the Fay-Herriot model, and direct survey estimates, in comparison with actual asset-based poverty rates calculated from the census.

Figure 3: Comparison of estimated mean squared error by area and method for Sri Lanka (left) and Tanzania (right); vertical axis: mean squared error times 10,000
Notes: Figures show mean squared errors of predicted asset-based poverty rates generated by the household-level model, the Fay-Herriot model, and direct survey estimates. Direct survey estimate MSEs top-coded at 0.015.

The significantly lower mean squared errors generated by the household-level model are associated with a moderate reduction in coverage. This raises the question of whether the household-level model remains more precise than the others after equalizing coverage rates. To shed light on this, we perform a subsequent adjustment. This adjustment artificially inflates the estimated MSEs from each method by multiplying them by a constant scale factor greater than one, with the scale factor manually selected to achieve a 95 percent coverage rate. This occurs after the benchmarking and is therefore an additional adjustment to the mean squared error. Table 10 presents the results. The first column shows the scale factor that was multiplied by the mean squared error to achieve a 95 percent coverage rate, while the second and third columns give the post-adjustment average MSE and CV. 30

Table 10: Measures of uncertainty when mean squared errors are rescaled to achieve a 95 percent coverage rate
                                              Sri Lankan subdistricts            Tanzanian districts
                                              Scale factor  Mean MSE  Mean CV    Scale factor  Mean MSE  Mean CV
Horvitz-Thompson direct estimate with
rescaled MSE                                  N/A           N/A       N/A        2.5           460.3     108.3
Fay-Herriot model with rescaled MSE           2.46          20.4      75.3       1.44          51.5      41.1
Household-level model with rescaled MSE       2.37          10.3      47.9       3.61          40.2      33.2
Notes: Scale factor refers to a constant scaling factor that was multiplied by the estimated mean squared error to inflate the MSEs to achieve 95 percent coverage rates. Mean MSE is multiplied by 10,000.

In Sri Lanka, the household-level model is substantially more efficient than the Fay-Herriot model when the mean squared errors are rescaled, as the average mean squared error is 45 percent lower. The difference between the Fay-Herriot and the household-level model results is smaller in Tanzania, because the Fay-Herriot model performs better. The household-level model remains moderately more efficient after the rescaling, however, as the average mean squared error and average CV are each about 20 percent lower.
Finally, Table 11 compares the precision of the unit-level model with the direct survey estimates for super-areas, which are districts in Sri Lanka and regions in Tanzania. The super-areas are levels for which the household survey is considered representative and survey-based monetary poverty rates are routinely published.
In Sri Lanka, where there are 331 target areas, the mean squared errors for the subdistrict estimates are about sixty percent larger than the direct survey estimates for districts. In Tanzania, meanwhile, there are only 159 areas and the same number of super-areas as Sri Lanka. In this case, the average mean squared error of the small area district estimates is slightly lower than the direct estimates for regions, while the mean CV is modestly higher. This type of comparison is useful to inform decisions by national statistics offices of whether small area estimates are sufficiently precise to publish.

Table 11: Precision of household-level EB model compared with direct estimates for super-areas
                                                              Sri Lanka                Tanzania
                                                              Mean MSE   Mean CV       Mean MSE   Mean CV
Super-area estimates from direct sample estimate
(Horvitz-Thompson approximation)                              1.4        16.8          11.4       15.8
Area estimates from household EB model                        4.6        32.0          11.1       17.5
Notes: Top row shows the unweighted mean squared error (MSE) times 10,000 and coefficient of variation (CV) for non-monetary poverty estimates calculated from the national household budget survey across 25 districts in Sri Lanka and 25 regions in Tanzania. Bottom row shows the unweighted MSE times 10,000 and mean CV for non-monetary poverty estimates that incorporate geospatial data across 331 subdistricts in Sri Lanka and 159 districts in Tanzania, benchmarked to survey estimates for 25 districts in Sri Lanka and 25 regions in Tanzania.
30 In Sri Lanka, it is not possible to rescale the direct estimates to achieve 95 percent coverage because the area-level poverty and variance estimates are zero for more than 5 percent of the sub-districts.

V. Robustness checks for the household-level model
The previous section demonstrated that the household-level model produced more efficient estimates than the area-level model and the direct estimates, while the household-level model and the Fay-Herriot model both predicted poverty rates much more accurately than the direct estimates. The household-level model examined above was derived using a particular variant of the empirical best model, however, and it is informative to examine how the performance of the estimates varies based on different methodological choices.
Table 12 describes a set of eight robustness checks. The first two relate to the choice of software used to estimate the model. The baseline estimates were developed using a modified version of the R EMDI command that allows for weights in the simulation stage. In the first robustness check, this is compared with the latest version 3.0 of the Stata SAE package. This version allows for weights in both the model fitting and simulation stages, and implements the same bootstrap method as the R EMDI package. However, it uses a slightly different fitting method to estimate the model, adds a heteroscedasticity correction, and does not append the sample data to the census. The second robustness check uses version 2.0 of the Stata SAE package. This version of the Stata SAE package employs a traditional clustered bootstrap approach, which estimates the variance of the predicted poverty rates, instead of the parametric bootstrap approach used by the other packages to estimate the mean squared error. The third robustness check does not use the empirical best method, to verify that the estimates are imprecise when only using synthetic predictions from the model.
Version 2.0 of the Stata SAE package is used for this third robustness check because, unlike version 3 and the EMDI package, it allows the empirical best option to be disabled. The fourth and fifth robustness checks experiment with disabling the heteroscedasticity correction in version 3 of the Stata package, and dropping the use of weights in the R EMDI package in the simulation stage. These are useful to check because, as noted above, the R EMDI package does not implement a heteroscedasticity correction, and the publicly available version does not allow for weights in the simulation stage. The sixth robustness check considers using a Box-Cox transformation rather than an Ordered Quantile Normalization to transform the dependent variable in the model. This gives an indication of how much the use of the Ordered Quantile Normalization improves the results and ensures that the estimates remain reasonable when using a more traditional normalization method. The seventh robustness check considers the predictions when a stepwise procedure, with a probability threshold of 0.01, is used to select the model instead of the plugin lasso method. The final robustness check is a simple test of the extent to which the results are robust to the prevailing poverty rate. This check swaps the country poverty rates, setting the Tanzanian poverty line at the 4th percentile and the Sri Lankan poverty line at the 20th percentile. These are then compared against the corresponding poverty rates calculated in the census using these new, swapped poverty rates.

Table 12: Description of robustness checks
0. Baseline estimates. Software: Modified R EMDI. Options: Simulation stage weights, no heteroscedasticity correction, MLE fitting method, welfare transformed using ordered quantile normalization, point and MSE estimates benchmarked to survey super-areas, lasso model selection. Poverty rate of 4% in Sri Lanka and 20% in Tanzania.
1. Stata parametric bootstrap. Software: Stata SAE version 3. Options: Same as (0) but with sample weights, heteroscedasticity correction, and H3 fitting method.
2. Stata traditional bootstrap. Software: Stata SAE version 2. Options: Same as (1) with traditional non-parametric bootstrap instead of parametric bootstrap during the simulation stage.
3. No Empirical Best. Software: Stata SAE version 2. Options: Same as (2) with empirical best and bootstrap options disabled.
4. Stata parametric bootstrap with no heteroscedasticity correction. Software: Stata SAE version 3. Options: Same as (1) with no heteroscedasticity correction.
5. No weights. Software: R EMDI. Options: Same as (0) with no simulation stage weights.
6. Box-Cox transformation. Software: Modified R EMDI. Options: Same as (0) but transform welfare using a Box-Cox transformation with lambda selected to minimize skewness, instead of the ordered quantile normalization.
7. Stepwise model selection. Software: Modified R EMDI. Options: Same as (0) but select the prediction model and heteroscedasticity model (for Stata SAE) using stepwise selection at p=0.01.
8. Swapped poverty rates. Software: Modified R EMDI. Options: Same as (0) but with a poverty rate of 20% in Sri Lanka and 4% in Tanzania.

Table 13 displays the results of these eight robustness checks. For the purposes of brevity, we report the mean poverty, MSE, correlation, and coverage rate. 31 We begin by comparing the results produced by the baseline EMDI estimates with version 2 of the Stata SAE package, which implements the traditional bootstrap method, with and without the Empirical Best option.
For all three estimates, the unweighted averages of the predictions prior to benchmarking are close to the actual poverty rate in the census. The Stata package with the traditional bootstrap, however, gives significantly less precise estimates in both countries, as seen in the higher mean squared errors and CVs. Purely synthetic predictions obtained without the empirical best option are substantially less precise than the empirical best predictions using the traditional clustered bootstrap in Sri Lanka, and moderately less precise in Tanzania. In both countries, however, the synthetic predictions are far less precise than estimates obtained using a parametric bootstrap approach, demonstrating the high efficiency cost of failing to incorporate sample information when the auxiliary data are aggregated to the subarea level. 32 The estimates produced by the Stata SAE with the traditional bootstrap are less accurate than the baseline EMDI estimates in Sri Lanka, but more accurate in Tanzania. The coverage rate of the traditional bootstrap estimates is 93 percent, which is very close to the target 95 percent rate. This high coverage rate results from two countervailing biases in the variance estimate; the traditional bootstrap overstates the variance by failing to hold the composition of the sample constant across replications, which counteracts the downward bias due to the failure to account for the positive correlation among the sample residuals. Overall, the traditional bootstrap performs quite well in these two settings, in terms of accuracy and coverage, but poorly in terms of efficiency.
31 Results for all areas are available upon request.
The estimates produced by the latest version of Stata SAE with the parametric bootstrap are slightly less accurate, in terms of their correlation with the full census, than those produced by the EMDI package. 33 The Stata SAE estimates are also less efficient than the baseline EMDI estimates in Sri Lanka, as the mean MSE rises from 4.6 to 6.8. In Tanzania, in contrast, the Stata SAE package with the parametric bootstrap yields estimates that are slightly more efficient than the baseline estimates, but because the estimates are less accurate, the coverage rate falls significantly from 74.8 to 64.2 percent.
The fifth row of the table presents results without the heteroscedasticity correction. Disabling the correction brings the mean poverty rate closer to the true 4 percent rate in Sri Lanka, but further from the assumed poverty rate of 20 percent in Tanzania. In Sri Lanka, disabling the correction increases the accuracy of the estimates to slightly above the baseline estimates. In Tanzania, disabling the heteroscedasticity correction has minor effects on accuracy and slightly decreases precision, improving the coverage rate. The limited effect of the heteroscedasticity correction likely results from the negligible explanatory power of the “alpha model” that predicts the variance of the residuals. The R2 of the alpha model, which like the “beta” prediction model was selected using the plugin lasso method to avoid overfitting, is only 0.004 in Sri Lanka and 0.003 in Tanzania.
The bottom four rows of the table present a variety of robustness checks using the EMDI package. The first of these uses the publicly available version of the package that does not allow for household weights when aggregating the simulated household poverty outcomes into an area poverty rate.
The estimator without weights is slightly less accurate than the estimates with weights in each country, with a corresponding mild decline in both efficiency and coverage. The second EMDI robustness check uses a more standard Box-Cox transformation of the welfare variable instead of the ordered quantile normalization. This leads to considerable downward bias in the estimates in both countries, leading us to prefer the ordered quantile normalization, although the correlation and coverage rate change little when using the Box-Cox transformation. The third EMDI robustness check reports the results of a model selected using stepwise regression, with a probability value threshold of 1 percent.
32 The penalty for failing to implement EB in a traditional bootstrap setting is smaller in Tanzania than Sri Lanka, although the estimates without EB are very imprecise in both countries. This is likely because the traditional bootstrap with EB gives less precise estimates in Tanzania than Sri Lanka, due to the smaller number of areas in Tanzania.
33 This may result from the failure of the Stata SAE package to append the sample to the census when simulating poverty. The sample constitutes about 0.5 percent of Sri Lankan households while the Tanzanian sample only contains about 0.1 percent of households, which is consistent with the greater fall in accuracy in Sri Lanka.

Table 13: Robustness check results for alternative methods for the household-level model
                                      Sri Lanka                                            Tanzania
                                      Mean     Mean    Mean   Corr   CR                    Mean     Mean    Mean   Corr   CR
                                      poverty  MSE     CV                                  poverty  MSE     CV
Baseline EMDI estimates               3.7      4.6     32.0   87.6   84.3                  18.6     11.1    17.5   87.6   74.8
Stata SAE with traditional bootstrap  3.7      10.9*   43.6   85.5   95.5                  19.7     44.6*   33.4   88.3   93.1
With no Empirical Best                3.7      17.5*   62.4   83.6   99.1                  19.8     55.9*   38.0   89.2   95.6
Stata SAE with parametric bootstrap   3.6      6.8     38.4   85.6   85.8                  21.4     10.0    15.6   85.0   64.2
No heteroscedasticity correction      3.8      5.5     33.9   88.0   84.9                  22.4     10.7    16.0   84.4   66.7
EMDI estimates with:
  No simulation weights               3.7      4.8     31.9   87.5   82.8                  18.3     11.8    17.8   87.4   73.0
  Box-Cox transformation              3.5      4.7     31.2   87.4   81.9                  15.8     13.1    18.7   87.5   76.1
  Stepwise (p=0.01) model selection   3.7      4.7     27.9   83.0   80.7                  18.8     9.8     16.4   86.5   69.8
  Swapped poverty rates               19.6     18.4    17.4   94.8   84.9                  2.8      1.6     32.9   86.1   69.8
Notes: Rows represent different variants of the household-level model as listed in Table 12. Columns represent mean poverty rates, average mean squared error times 10,000, mean coefficient of variation, correlation with the actual census value (Corr), and coverage rate (CR). All means are unweighted across 331 subdistricts in Sri Lanka and 159 districts in Tanzania. All results except for mean poverty reflect benchmarking to survey estimates for 25 districts in Sri Lanka and 25 regions in Tanzania. Asterisks indicate mean estimated variance instead of mean squared error.

The resulting models are overfit. This leads to less accurate predictions than the lasso-selected model in both countries and slightly reduces the mean squared error in Tanzania, leading to a moderate fall in coverage rates in both countries. Finally, we report the results of the EMDI estimators when the poverty rates of the two countries are swapped. Not surprisingly, the performance of the model improves in Sri Lanka when the poverty rate is set to 20 percent, as the remote sensing indicators better predict poverty rates when they are larger.
When setting the poverty rate to 4 percent in Tanzania, the EMDI estimator tends to underestimate poverty, with a mean predicted poverty rate of 2.8 percent. However, the correlation between the estimates and the true census poverty rate remains strong at 86.1 percent, and the benchmarked coverage rate of 70 percent is respectable. Overall, of the many robustness checks considered so far, there are only two main cases where the results are particularly sensitive to the choice of method. The first is the use of the traditional rather than the parametric bootstrap, which leads to a much higher estimated variance, coefficient of variation, and coverage rate of the estimates. The traditional bootstrap is not recommended when using Empirical Best models on theoretical grounds, since the estimate is conditioned on the sample (Diallo and Rao, 2018). The other case where the results change dramatically is the omission of the empirical best option, which roughly quadruples the MSEs and doubles the CVs compared to the baseline estimates, and yields substantially less accurate estimates in Sri Lanka as measured by the correlation with the census. Otherwise, the results are largely robust to the range of robustness checks considered.
Next, we perform two additional robustness checks. The first of these examines the results of the household model using different benchmarking procedures. In particular, we examine two alternatives to the baseline estimates, in which both the point estimates and standard errors are benchmarked. The first alternative only benchmarks the point estimates, while the second performs no benchmarking. The results are shown in Table 14. In both countries, not benchmarking the standard errors slightly reduces the average mean squared error. Average CV is minimized in both countries by benchmarking only the point estimates. In Sri Lanka, benchmarking only the point estimates slightly reduces the CV while not benchmarking increases it because an additional district is included. 34 This leads to a modest fall in coverage in Sri Lanka, but a more significant reduction in Tanzania, where the coverage rate declines from 74.8 percent with benchmarking to 69.8 percent without. Benchmarking the mean squared errors, all else equal, overstates uncertainty in cases like these where mean poverty is underestimated. This partially mitigates the downward bias in the household model’s mean squared error estimates, which leads the benchmarking procedure to improve coverage. The benchmarked estimates are therefore preferred, although the results are qualitatively similar when either only the point estimates are benchmarked, or when no benchmarking is carried out.
The final robustness check examines how the baseline household-level model fares when random noise is added to the welfare aggregate. This better approximates a monetary welfare measure such as per capita consumption or income, which contains a greater portion of unexplained variance resulting from both transient welfare shocks and measurement error. We add two error term components to the existing measure of non-monetary welfare, to make an alternative noisier measure of “true welfare”. The first error component is an area effect, drawn from a normal distribution with mean zero and variance 0.5, and the second is a household-specific error term with mean zero and variance 1.
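A minimal sketch of this noise-injection check, assuming the normalized non-monetary welfare index, an area identifier, and the country's poverty-line percentile are held in simple arrays; the function name, the random seed, and the toy data are illustrative.

```python
import numpy as np

def add_welfare_noise(welfare, area_id, pctile, area_var=0.5, hh_var=1.0, seed=0):
    """Create a noisier 'true welfare' measure by adding an area-level shock
    and a household-level shock to the normalized non-monetary welfare index,
    then set the poverty line at the chosen percentile of the noisy measure.

    welfare : array of normalized household welfare
    area_id : array of area codes, one per household
    pctile  : poverty-line percentile (4 for Sri Lanka, 20 for Tanzania)
    """
    rng = np.random.default_rng(seed)
    areas = np.unique(area_id)
    area_shock = dict(zip(areas, rng.normal(0.0, np.sqrt(area_var), len(areas))))
    noisy = (welfare
             + np.array([area_shock[a] for a in area_id])       # common shock within each area
             + rng.normal(0.0, np.sqrt(hh_var), len(welfare)))  # household-specific shock
    poverty_line = np.percentile(noisy, pctile)
    return noisy, poverty_line

# Toy example (illustrative only): 20 areas of 50 households each
welfare = np.random.default_rng(1).normal(size=1000)
area_id = np.repeat(np.arange(20), 50)
noisy, z = add_welfare_noise(welfare, area_id, pctile=20)
print(z, (noisy < z).mean())
```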
These variance values were chosen arbitrarily, with the aim of achieving a model R2 of approximately 0.2, which is similar to the R2 when regressing monetary welfare on similar geospatial indicators in Tanzania (Belghith et al, 2020). The poverty line is set according to the new, noisier welfare measure, equal to the 4th percentile of the sample welfare in Sri Lanka and the 20th percentile of the sample in Tanzania.
Table 15 displays the results when estimating area-level poverty using direct estimates, the Fay-Herriot model, and the household model with a noisier welfare aggregate. Household model diagnostics are displayed below the results. In the household model, the additional noise causes the marginal R2 to fall to about 0.2 in both cases. This is substantially lower than the values reported when predicting the non-monetary welfare index, which were 0.27 for Sri Lanka and 0.3 for Tanzania (Table 7). However, the conditional R2 remains high, and even slightly above what is reported for the non-monetary welfare measure in Table 7. This reflects the additional variance in the area-specific component, which is captured by the mean of the sample residuals for each area and therefore incorporated into the small area predictions by the empirical best estimator.
34 The estimated poverty rate for Ratnapura district is zero in the HIES sample. The CV is therefore not defined for this district after benchmarking and excluded from the average.

Table 14: Robustness check for benchmarking
                                  Sri Lanka                                            Tanzania
                                  Mean     Mean   Mean   Corr   CR                     Mean     Mean   Mean   Corr   CR
                                  poverty  MSE    CV                                   poverty  MSE    CV
Baseline estimates                3.7      4.6    32.0   87.6   84.3                   18.6     11.1   17.5   87.6   74.8
Benchmark point estimates only    3.7      4.2    29.9   87.6   79.5                   18.6     9.7    16.9   87.6   71.1
No benchmarking                   3.7      4.2    39.1   88.1   82.2                   18.6     9.7    17.5   83.7   69.8
Notes: Columns represent mean poverty rates, mean squared error times 10,000, coefficient of variation, correlation with the actual census value (Corr), and coverage rate (CR). Top row reports estimates when point estimates and MSEs are benchmarked to survey results for each super-area. Bottom row shows results without benchmarking.

The overall results, with a few exceptions, are similar to the main results for the non-monetary welfare measure reported in Tables 8 and 9. The household-level model produces the most precise estimates, as judged by mean squared error. The MSE reduction achieved using the household-level model in Sri Lanka is smaller than in the baseline results, reflecting the challenge of accurately predicting the bottom tail of the welfare distribution with a noisier welfare measure. The average CV rises greatly in Sri Lanka, due to large positive outliers, when using the noisier measure. These outliers are areas with very low predicted poverty rates, yet high mean squared error, due to the greater variability in the area effect in these simulations. This illustrates that average CV should be interpreted with caution in settings with low poverty rates and highly variable area effects. Nonetheless, even with a noisier welfare aggregate, the reduction in mean squared error in Sri Lanka compared with the direct estimates is roughly equivalent to doubling the size of the sample. In Tanzania, where the poverty rate is much higher, the household-level model leads to a 78 percent reduction in the mean squared error, a factor comparable to that reported in Table 8. The analogous reduction in average CV is 30 percent, which is also comparable to roughly doubling the sample size.
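Because the variance of a direct estimate shrinks roughly in proportion to the sample size, the implied "effective sample size" multiplier can be read off as the ratio of mean squared errors (or the squared ratio of CVs). A small illustrative calculation using the average MSEs reported in Table 15 below; treating the MSE ratio as an exact sample-size multiplier is of course only an approximation.

```python
# Implied sample-size multiplier from the average MSEs in Table 15 (noisy welfare measure).
# Assumes the direct-estimate MSE shrinks proportionally with sample size.
mse_direct = {"Sri Lanka": 15.9, "Tanzania": 82.3}   # Horvitz-Thompson direct estimates
mse_model = {"Sri Lanka": 7.3, "Tanzania": 18.1}     # household-level EB model

for country in mse_direct:
    multiplier = mse_direct[country] / mse_model[country]
    print(country, round(multiplier, 1))  # roughly 2.2 for Sri Lanka, 4.5 for Tanzania
```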
Turning to accuracy and coverage, exploiting sub-area variation using the household model substantially improves accuracy, as measured by the correlation with the true noisier welfare measure. In Sri Lanka, the correlation is over 4 percentage points higher for the household-level model than for the Fay-Herriot model, while in Tanzania the comparable difference is over 10 percentage points. The correlation in absolute terms remains respectable in Sri Lanka (0.80) and high in Tanzania (0.84) when using the noisier welfare measure. Finally, coverage rates for the household-level model decline only modestly in Tanzania and increase significantly in Sri Lanka in comparison with the direct estimates. In Sri Lanka, the coverage rate reported in Table 15 for the noisy welfare measure is remarkably similar to the one for non-monetary poverty reported in Table 8 (84.9% vs. 83.7%). In Tanzania, the coverage rate is higher for the noisy welfare measure (78.0% vs. 73.6%). Accuracy and coverage are particularly important measures to consider because, unlike estimated precision, they are only observed when a “true” census population is available. Overall, the results presented in Table 15 indicate that the accuracy and coverage of the household-level model remain high, in comparison with both the Fay-Herriot and direct estimates, when additional noise is introduced into the welfare measure.

Table 15: Robustness check for noisy welfare measure
                                         Sri Lanka                                         Tanzania
                                         Mean     Mean   Mean   Corr   CR                  Mean     Mean   Mean   Corr   CR
                                         poverty  MSE    CV                                poverty  MSE    CV
Direct estimates (H-T approximation)     4.0      15.9   65.3   59.0   74.6                20.0     82.3   45.4   77.2   83.0
Area-level Fay-Herriot model             4.0      10.5   54.1   75.9   80.7                16.8     53.2   39.5   72.8   88.1
Household-level EB model                 3.7      7.3    68.1   79.5   84.9                17.6     18.1   31.7   85.2   79.9
Household model diagnostics
  Marginal R2                            0.196                                             0.208
  Conditional R2                         0.304                                             0.329
  Skewness of area effect                0.14                                              -0.04
  Kurtosis of area effect                2.99                                              2.61
  Shapiro-Wilk P-value                   0.40                                              0.33
  Skewness of household error            -0.03                                             -0.06
  Kurtosis of household error            3.12                                              3.08
  Variance of estimated area effect      0.110                                             0.122
  Variance of estimated household residual  0.702                                          0.671
  Percentage of variance of error term due to area effect  15.7%                           18.2%

VI. Simulations
The results presented in the preceding two sections are based on small area estimation using one sample of households for each country. 35 The samples that were used are particularly relevant, since they reflect the actual enumeration areas sampled by the National Statistics Offices for their household budget surveys. Nonetheless, it is useful to check that the strong performance of the household-level model holds when estimated in simulation settings, as is commonly done in the literature. 36
35 MSEs in the baseline estimates were estimated by simulating different populations while maintaining the same sample composition in each parametric bootstrap replication.
a. Model-based simulation
We begin by conducting a Monte Carlo simulation study, following the basic structure outlined in Das and Chambers (2017). We perform 100 simulations, drawing a new simulated population each time, in 80 areas. Each of these areas contains between 15 and 29 subareas, with the exact number chosen randomly from a uniform distribution. Each subarea contains between 15 and 29 households, also chosen randomly from a uniform distribution. Each household is assumed to contain one member. At the sub-area level, we draw two X variables, X1sa and X2sa, from a joint normal distribution.
All households in the same sub-area are assigned the same random values of X1sa and X2sa. X1sa is distributed with mean 0 and variance 2. The mean of X2sa is set equal to 0.05 times the area index, which varies from 1 to 80, with a variance of 2. The covariance of X1sa and X2sa is set to 0.5. We draw area effects (\eta_a), subarea effects (u_{sa}), and an idiosyncratic error term for each household (\varepsilon_i). These error terms are drawn from normal distributions with mean zero and variances of 1 for the area effect, 0.5 for the sub-area effect, and 2 for the household idiosyncratic error. We then generate the population welfare measure as

(6)  y_i = 6 + 0.5 \, X1_{sa} - 0.6 \, X2_{sa} + \eta_a + u_{sa} + \varepsilon_i

The variance structure of the two X variables and variance components was selected to achieve an approximate R2 of 0.25 when regressing household welfare on X1sa and X2sa. From this population, we set the poverty line at the 25th percentile of the welfare distribution and calculate population poverty rates for each area. We then draw a two-stage sample of 3 subareas per area in the first stage and 10 households per sampled subarea in the second. We construct sample weights as the inverse of the probability of being sampled, and take the weighted 25th percentile of the sample distribution of welfare as the poverty line. Finally, we estimate a small area estimation model in the sample using the modified EMDI package, utilizing the population values of X1sa and X2sa to simulate welfare for all households in the population. The resulting simulated welfare aggregate is then compared with the poverty line derived from the sample, and averaged over households for each area to obtain area headcount poverty rates and estimated mean squared errors. 37 This entire process, starting with generating the simulated population, is then repeated 100 times. Therefore, the final output from the procedure is a set of 8,000 population poverty rates and small area estimates, derived from 100 simulations of 80 areas.
Table 16 displays the results. The small area estimates reduce the mean squared error by a factor of 10 and the average CV by a factor of roughly 3 compared with the traditional EA-clustered direct estimates, while maintaining a high coverage rate close to 95 percent. The mean poverty is nearly exactly 25 percent, and the correlation with poverty rates in the population increases from 0.84 for the direct estimates to 0.98 for the small area estimates. The root mean squared error of the predictions falls by a factor of three. The household-level model performs well in this stylized setting because the residuals are distributed normally by construction, and a large percentage of variation in simulated welfare is explained by the two predictor variables. It is reassuring, though, that the small area estimator performs well in a favorable controlled environment with populations that vary across simulations.
36 Evaluations of small area estimation methods using model and/or design-based simulations include Das and Chambers (2017), Demombynes et al (2007), Elbers, Lanjouw and Leite (2008), Tarozzi and Deaton (2009), Tarozzi (2011) and Tzavidis et al (2018), among many others.
37 The results are not benchmarked to super-areas, because super-areas are not defined in this simulation.

Table 16: Results of model-based simulations (Monte Carlo simulations)
                              Mean      Mean   Mean   Corr    ARB      RMSE    Coverage
                              poverty   MSE    CV                              rate
Direct survey estimates
  H-T approximation           24.9      51.8   31.5   0.840   -0.005   0.060   95.7
  EA-clustering               24.9      44.4   29.8   0.840   -0.005   0.060   94.0
Household-level EB model      25.1      4.7    9.2    0.978   0.012    0.020   96.4
Model diagnostics
  Marginal R2                               0.236
  Conditional R2                            0.237
  Skewness of area effect                   0.080
  Kurtosis of area effect                   3.00
  Shapiro-Wilk P-value of area effect       0.481
  Skewness of household error               -0.002
  Kurtosis of household error               3.00
  Shapiro-Wilk P-value of household error   0.483
Notes: Results reflect averages across 8,000 estimates, representing 100 simulations of 80 simulated areas. Diagnostic results reflect averages over 100 simulations. H-T indicates the Horvitz-Thompson approximation of variance.

b. Design-based simulation
While model-based simulations are useful to explore the performance of the estimator in different populations, the results of model-based simulations inevitably depend on the set of assumptions used to construct the simulated data.
To verify that the household-level model works well when estimated in multiple samples, this section describes the results of a design-based simulation. To reduce computation time, the simulation is limited to three regions in Northeast Tanzania near the Kenyan border: Arusha, Kilimanjaro, and Tanga. Between them, these three regions contain, according to the census, 24 districts, 2,078 villages, 10,964 enumeration areas, and nearly 1.17 million households. 38 The 2018 Household Budget Survey sampled 1,217 households in 97 enumeration areas from these three regions, for an average sample of 12.5 households per enumeration area. The three regions are disproportionately poor, as the non-monetary poverty rate is 46.5 percent when using a poverty line set at the 20th percentile of the national distribution.
38 We selected three regions to use for this exercise because the lasso procedure did not select any variables when using only one or two regions, when mimicking the sample size of the budget survey. We selected these particular three regions because they are disproportionately poor and because they were the first three regions in numeric order following the capital of Dodoma.
39 The household sample weights are the product of two ratios: the number of census EAs in each district divided by 4, and the number of households in each sampled EA divided by 13. These are multiplied by household size to obtain population weights.
40 As before, we use the plugin method for selecting lambda in the lasso.
To conduct the design-based simulations, we repeat the following seven-step procedure 250 times:
1. Draw a simple random sample of 4 EAs for each of the 24 districts from the census, which selects a total of 96 EAs from the population of 10,964 census EAs contained in the three regions.
2. Draw a simple random sample of 13 households for each selected EA to generate a sample of 1,248 households.
3. Construct sample weights for each household equal to the inverse of their selection probability. 39
4. Use the ordered quantile normalization to construct normalized non-monetary welfare in the sample.
5. Using the household sample from step 2, the sample weights from step 3, and the normalized welfare measure from step 4, estimate a lasso regression of normalized non-monetary welfare on all candidate geospatial indicators. 40
6. Using the same inputs as the previous step, calculate the 46.5th percentile of the sample distribution of normalized non-monetary welfare to use as the poverty line.
7.
Use the modified EMDI package to generate point and MSE estimates for each district based on the census data of 1.17 million households containing all model prediction variables and household size for these regions. The model estimates derived from the sample are not weighted, and the dependent variable is the normalized welfare calculated in step 4. Census households are weighted by household size when estimating district-level headcount poverty rates. As above, we use 100 Monte Carlo simulations to generate point estimates of headcount poverty and 100 bootstrap replications to estimate the mean squared error.
After repeating this seven-step process 250 times, we can calculate average evaluation statistics of the type reported in Tables 7 and 8 across the 24 districts and 250 simulations. 41 Since the Fay-Herriot model would not be expected to perform well in a sample of only 24 districts, we only report average evaluation statistics for the two types of direct estimates and the household-level unit model. Results for the household-level model are reported with and without benchmarking. As above, the benchmarked point estimates were rescaled to match the direct estimates for each of the three regions in the population, for each sample drawn in each simulation, and the mean squared errors were multiplied by the square of the same scaling factor. The direct estimates are computed in each simulation by calculating the weighted share of sample households that are poor in each district, using the weights calculated in step 3 above. We again compare the direct estimates obtained using the Horvitz-Thompson approximation with results from standard enumeration area clustering, which give identical point estimates but differ in their estimated variances.
41 Poverty rates for each district are calculated based on a poverty line set at the 46.5th percentile of the national distribution of non-monetary welfare.
Table 17 displays the results, which confirm that the household model continues to perform well when using repeated samples from the three selected regions. The household model in these simulations demonstrates a moderate amount of bias, with a mean of 42.4 percent prior to benchmarking as compared with 46.5 percent in the population. This bias results from the distribution of the residuals deviating substantially from normality at the tails. 42 The estimates produced by the household-level model, however, are far more precise than the Horvitz-Thompson direct estimates, with average mean squared errors that are about a quarter as large and average CVs about half the size. The household model estimates are only moderately more precise than the standard direct estimates, reflecting a high degree of correlation in household welfare across EAs within districts. The standard direct estimates ignore this correlation and therefore greatly underestimate uncertainty, which is reflected in a coverage rate of 48.7 percent. In contrast, the coverage rate of the benchmarked estimates from the household model is 68 percent, which is moderately below the coverage rate for all of Tanzania using the HBS sample reported in Table 9 (74.8%). The coverage rate of the household model, despite being much more precisely estimated, is only modestly below the Horvitz-Thompson direct estimates. This is because the household model estimates are more accurate, as seen in the higher correlation and lower root mean squared error.
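A minimal sketch of steps 1 through 4 of one replication (two-stage sampling, inverse-probability weights, and a rank-based normalization), assuming the census is held in a DataFrame with district, EA, household size, and welfare columns. The rank-based transform stands in for the ordered quantile normalization, and all names are illustrative rather than the authors' actual code.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def draw_design_sample(census, n_ea=4, n_hh=13, seed=0):
    """Steps 1-4 of one design-based replication: sample n_ea EAs per district,
    n_hh households per sampled EA, build inverse-probability weights, and
    normalize welfare in the sample with a rank-based transform."""
    rng = np.random.default_rng(seed)
    pieces = []
    for district, d in census.groupby("district"):
        eas = rng.choice(d["ea"].unique(), size=n_ea, replace=False)      # step 1
        for ea in eas:
            ea_hh = d[d["ea"] == ea]
            hh = ea_hh.sample(n=n_hh, random_state=int(rng.integers(1_000_000_000)))  # step 2
            # step 3: weight = (EAs in district / n_ea) * (households in EA / n_hh),
            # multiplied by household size to obtain population weights (footnote 39)
            w = (d["ea"].nunique() / n_ea) * (len(ea_hh) / n_hh)
            pieces.append(hh.assign(weight=w * hh["hh_size"]))
    sample = pd.concat(pieces, ignore_index=True)
    # step 4: rank-based normalization of sample welfare
    ranks = sample["welfare"].rank(method="average")
    sample["norm_welfare"] = norm.ppf((ranks - 0.5) / len(sample))
    return sample
```

The remaining steps, fitting the lasso model, setting the poverty line at the 46.5th percentile of the weighted sample distribution, and generating district estimates with the modified EMDI package, proceed as described in the numbered list above.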
To sum up, when using repeated samples from three poor Tanzanian regions, incorporating remote sensing data in a household-level model dramatically improves precision relative to the Horvitz-Thompson direct estimates. Adding the remote sensing data reduces mean squared error by about 75 percent and CV by about half, which is roughly comparable to quadrupling the sample size. Because the small area estimates are much more accurate than the direct estimates, the increased precision comes at only a modest cost in terms of coverage rate. Compared with the standard direct estimates, which assume independence across enumeration areas, the small area estimates are not only slightly more precise but also markedly more accurate, with substantially higher coverage rates. Overall, the main results derived for the full country, based on the specific sample drawn for the budget survey, are maintained when focusing on three Tanzanian regions and generating estimates of the area poverty rates and their mean squared errors from 250 different samples.

42 As shown at the bottom of Table 17, the kurtosis of the residuals is 2.5 for the area effect and 3.7 for the household errors.

Table 17: Results of design-based simulations, Arusha, Kilimanjaro, and Tanga regions

                                 Mean Poverty   ARB      Mean MSE   Mean CV   Corr    RMSE    Coverage rate
Direct survey estimates:
  H-T approximation              47.0           -0.031   135.4      29.9      0.673   0.212   72.1
  EA-clustered standard errors   47.0           -0.031   52.0       21.0      0.673   0.212   48.7
Household-level EB model:
  Benchmarked                    45.7           0.028    29.9       14.0      0.828   0.112   68.0
  Non-benchmarked                42.4           -0.046   26.0       14.0      0.803   0.083   66.6

Model diagnostics:
  Marginal R2: 0.324
  Conditional R2: 0.361
  Skewness of area effect: 0.13
  Kurtosis of area effect: 2.45
  Shapiro-Wilk P-value of area effect: 0.59
  Skewness of hh error: -0.19
  Kurtosis of hh error: 3.66
  Shapiro-Wilk P-value of hh error: 0.002

Notes: Results reflect averages across 6,000 estimates, representing 250 simulations of 24 districts. The Fay-Herriot model is not included due to the small number of districts in the sample. Diagnostic results reflect averages over 250 simulations.

VII. Conclusion

Policy makers worldwide are increasingly demanding more granular data to inform decisions. This paper examines the extent to which supplementing a synthetic sample drawn from the census with geospatial data at the subarea level in Sri Lanka and Tanzania generates more precise and accurate estimates of non-monetary poverty. The paper employs empirical best predictor models because they are more accessible to most statisticians in National Statistics Offices than hierarchical Bayesian models. In addition, empirical best models have a long history in small area estimation, are similar to other commonly used methods, and have been implemented in multiple publicly available software packages. The analysis examines non-monetary poverty in order to evaluate the estimates against census data, which sheds light on the performance of different methodological approaches and software packages.

The results are encouraging. The correlation between the small area estimates and non-monetary poverty rates calculated from the full census is roughly 88 percent in both countries, substantially greater than the analogous correlations for the direct survey estimates, which are 73 and 77 percent in Sri Lanka and Tanzania, respectively. The household-level model, compared with standard clustered survey estimates, reduces the mean squared error by about two-thirds in Sri Lanka and four-fifths in Tanzania, which is roughly equivalent to tripling and quintupling the effective sample size, respectively.
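The rough equivalence between mean squared error reductions and effective sample size multiples can be read off from the approximate inverse relationship between the variance of an estimator and its effective sample size:

\[
\mathrm{MSE} \;\propto\; \frac{1}{n_{\mathrm{eff}}}
\qquad\Longrightarrow\qquad
\frac{n_{\mathrm{eff}}^{\,\mathrm{new}}}{n_{\mathrm{eff}}^{\,\mathrm{old}}} \;\approx\; \frac{\mathrm{MSE}_{\mathrm{old}}}{\mathrm{MSE}_{\mathrm{new}}},
\]

so cutting the mean squared error to one-third of its original value (a two-thirds reduction) corresponds to roughly tripling the effective sample size, and cutting it to one-fifth (a four-fifths reduction) to roughly quintupling it.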
The results are robust to alternative implementation options, to adding noise to the welfare measure, and to both a model-based Monte Carlo simulation and a design-based simulation using three regions in northeastern Tanzania. When rescaling mean squared errors to equalize coverage rates, the household-level model remained more efficient than a Fay-Herriot model, especially in Sri Lanka. The strong performance of the household-level EBP model relative to the area-level Fay-Herriot model is consistent with simulation results reported in Hidiroglou and You (2016). That study concluded, on the basis of a model-based simulation with a linear model and unit-level auxiliary data, that unit-level models achieve lower bias and higher coverage rates than area-level models.

The household-level EBP model moderately underestimates uncertainty, leading to sizeable drops in coverage rates compared with estimates from the area-level Fay-Herriot model, especially in Tanzania. This occurs because the model does not incorporate sample weights and assumes that survey data are independent within areas when conditioning the random effect on the sample residuals. In addition, neither the household-level nor the area-level model accounts for uncertainty in the estimated variance components. Nonetheless, the coverage rate of the household model in both cases is comparable to that of standard survey estimates that cluster on enumeration areas. Furthermore, the household model performs significantly better than the Fay-Herriot model in terms of accuracy and efficiency when additional noise is added to the welfare measure, especially in Tanzania.

The financial cost of this type of small area estimation procedure is generally low and falling rapidly. Much of the auxiliary data used for the small area estimation, such as estimates of built-up area, nighttime lights, and vegetation, are freely available. There are two notable exceptions. First, the calculation of spatial features at a national scale requires constructing a cloud-free mosaic of Sentinel imagery, considerable computing power, and the expertise to implement the software that derives the features. Second, the data on Tanzanian building footprints are proprietary. However, building footprint data are increasingly being released in the public domain, and it is not difficult to envision this information becoming freely available in the coming years, potentially through OpenStreetMap as it continues to improve in accuracy and coverage. Finally, access to subarea survey identifiers and shapefiles, or to EA geocoordinates, is necessary to link survey data to geospatial indicators. This is not feasible in all contexts, but the growing popularity of geospatial analysis and CAPI data collection is making geospatial survey analysis more common. A conservative estimate is that the time and expertise required to generate these types of estimates costs $50,000 to $150,000. This is minor compared to the value created by even doubling the effective sample size of nationally representative household surveys, which often cost at least a million dollars to field.

The results could be improved by further refinements in software and methods. One avenue for further research is to explore the performance of a subarea-level model, an extension of the Fay-Herriot model specified at the subarea level with an area-level mixed effect (Torabi and Rao, 2014).
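To fix ideas, a subarea-level model of this kind might take roughly the following form for subarea j of area i (a stylized sketch; the exact specification and estimation details in Torabi and Rao, 2014, differ):

\[
\hat{\theta}_{ij} \;=\; \theta_{ij} + e_{ij},
\qquad
\theta_{ij} \;=\; \mathbf{x}_{ij}'\boldsymbol{\beta} + v_i + u_{ij},
\]

where \(\hat{\theta}_{ij}\) is the direct estimate of subarea poverty, \(e_{ij}\) is its sampling error with a known or smoothed design variance, \(v_i\) is an area-level random effect, and \(u_{ij}\) is a subarea-level random effect. Area-level estimates would then be obtained by aggregating the predicted \(\theta_{ij}\) across the subareas within each area.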
Estimating such a model at the subarea level makes it easier to properly account for sample design effects. However, modeling poverty rates as a linear function of predictors may generate less accurate estimates, especially when poverty rates are low. It would be useful to better understand how the results of a properly specified subarea model compare with those of a household-level model using subarea predictors.

A second line of research could focus on improving the household-level model by estimating uncertainty more accurately. The estimated variance of the area random effect, for example, could be adjusted to account for sample design effects among the residuals, which would increase the estimated mean squared errors and improve coverage rates. 43 In addition, benchmarking could be incorporated directly into the bootstrap procedure to obtain more accurate estimates of benchmarked mean squared errors. Third, it should be fairly straightforward to incorporate the ordered quantile normalization transformation directly into the R EMDI package, which already allows for Box-Cox and log-shift transformations. This would enable the software to estimate poverty gaps and severity, in addition to headcount rates, when using the ordered quantile normalization transformation.

This study only considered a measure of non-monetary poverty, in order to evaluate the area predictions against actual census data. This raises the important question of the extent to which the results generalize to predicting monetary poverty rates. The types of geospatial indicators considered in this study are moderately less predictive of monetary welfare than of non-monetary welfare. For example, when predicting district-level monetary poverty in Tanzania using the same geospatial indicators considered above, the variables selected by a stepwise procedure explained 23.5 percent of the variation in household per capita consumption, versus 30 percent for a lasso-selected model of non-monetary welfare. 44 This reflects the difficulty of predicting transient household welfare shocks and measurement error, both of which are present to a greater degree in measures of monetary poverty than in measures of non-monetary poverty. It is encouraging that the household-level EBP model continues to perform well when noise is added to non-monetary welfare. The efficiency gain from using the household model was notably smaller in Sri Lanka when using the noisier welfare measure, but even in that context incorporating geospatial data led to an increase in precision roughly equivalent to doubling the sample size. These results are only suggestive, however, and an important outstanding research agenda is to compare small area estimates of monetary poverty obtained using geospatial data with estimates derived directly from standard census-based poverty maps. This would help confirm that the encouraging results presented above apply to monetary poverty.

A final caveat, which applies to small area estimation in general, is the challenge of measuring local transient shocks. In the context of geospatial data, key indicators used to generate model predictions, such as building counts, are not necessarily affected by local economic shocks. This also poses an issue for traditional census-based poverty maps, as census predictions tend to rely heavily on indicators such as educational attainment and household size that may also fail to track local shocks.
Empirical best methods partly address this issue by incorporating welfare information from the sample. Nonetheless, future research is needed to explore which geospatial indicators better reflect local economic shocks.

Overall, the results from an evaluation exercise using data from two countries demonstrate major efficiency gains from combining survey data with geographically comprehensive geospatial data to generate small area estimates of non-monetary poverty. These efficiency gains came at no cost to coverage in Sri Lanka and at a moderate cost to coverage rates in Tanzania, and the latter can be addressed through further refinements to the methodology. The financial cost of incorporating geospatial data is low relative to the cost of achieving similar efficiency gains by surveying additional households. While there is room for further methodological improvement, these techniques can currently be applied at modest cost to generate more granular and informative estimates of non-monetary welfare.

43 Das and Chambers (2017) propose a similar strategy of adjusting traditional ELL estimates to correct for biased estimates of mean squared error.
44 See Belghith et al. (2020). This corresponds to the marginal R2 reported in Table 6.

References

Ayush, K., Uzkent, B., Burke, M., Lobell, D., & Ermon, S. (2020). Generating Interpretable Poverty Maps using Object Detection in Satellite Images. arXiv preprint arXiv:2002.01612.
Bartlett, Maurice S. "The use of transformations." Biometrics 3.1 (1947): 39-52.
Battese, George E., Rachel M. Harter, and Wayne A. Fuller. "An error-components model for prediction of county crop areas using survey and satellite data." Journal of the American Statistical Association 83.401 (1988): 28-36.
Belghith, Nadia Belhaj Hassine; Karamba, R. Wendy; Talbert, Elizabeth Ann; De Boisseson, Pierre Marie Antoine. 2020. Tanzania - Mainland Poverty Assessment 2019: Part 1: Path to Poverty Reduction and Pro-Poor Growth (English). Washington, D.C.: World Bank Group.
Belloni, Alexandre, and Victor Chernozhukov. "High dimensional sparse econometric models: An introduction." Inverse Problems and High-Dimensional Estimation. Springer, Berlin, Heidelberg, 2011. 121-156.
Belloni, Alexandre, and Victor Chernozhukov. "Least squares after model selection in high-dimensional sparse models." Bernoulli 19.2 (2013): 521-547.
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. "Inference on treatment effects after selection among high-dimensional controls." The Review of Economic Studies 81.2 (2014): 608-650.
Boonstra, H. J. "hbsae: Hierarchical Bayesian small area estimation." R package version 1 (2012).
Box, George E.P., and David R. Cox. "An analysis of transformations." Journal of the Royal Statistical Society: Series B (Methodological) 26.2 (1964): 211-243.
Chandra, H., U.C. Sud, and V.K. Gupta (2013). Small Area Estimation under Area Level Model Using R Software.
Corral Rodas, Paul, William Seitz, João Pedro Azevedo, and Minh Cong Nguyen. "FHSAE: Stata module to fit an area level Fay-Herriot model." (2018).
Corral Rodas, Paul Andres, Isabel Molina, and Minh Cong Nguyen. "Pull Your Small Area Estimates up by the Bootstraps." World Bank Policy Research Working Paper 9256 (2020).
Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.
"Robust mean‐squared error estimation for poverty estimates based on the method of Elbers, Lanjouw and Lanjouw." Journal of the Royal Statistical Society: Series A (Statistics in Society) 180.4 (2017): 1137-1161. Demombynes, G., Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2007). How good a map? Putting small area estimation to the test. The World Bank. Diallo, Mamadou S., and J. N. K. Rao. "Small area estimation of complex parameters under unit‐level models with skew‐normal errors." Scandinavian Journal of Statistics 45.4 (2018): 1092-1116. Elbers, Chris, Jean O. Lanjouw, and Peter Lanjouw. "Micro–level estimation of poverty and inequality." Econometrica 71.1 (2003): 355-364. 43 Elbers, Chris, Peter Lanjouw, and Phillippe George Leite. Brazil within Brazil: Testing the poverty map methodology in Minas Gerais. The World Bank, 2008. Engstrom, Ryan, Jonathan Hersh, and David Newhouse. "Poverty from space: Using high-resolution satellite imagery for estimating economic well-being." (2017). Engstrom, Ryan, David Newhouse, and Vidhya Soundararajan (2019a). Estimating Small Area Population Density Using Survey Data and Satellite Imagery: An Application to Sri Lanka. The World Bank, 2019. Engstrom, Ryan, Pavelsku, Dan, Tomomi, Tanaka, Wambile, A. (2019b). Mapping Poverty and Slums Using Multiple Methodologies in Accra , Ghana. Joint Urban Remote Sensing Conference 2019, 1–4. Engstrom, R, Harrison, R., Mann, M., & Fletcher, A. (2019c). Evaluating the relationship between contextual features derived from very high spatial resolution imagery and urban attributes: A case study in Sri Lanka. 2019 Joint Urban Remote Sensing Event, JURSE 2019. https://doi.org/10.1109/JURSE.2019.8809041 Erciulescu, Andreea L., Nathan B. Cruze, and Balgobin Nandram. "Model‐based county level crop estimates incorporating auxiliary sources of information." Journal of the Royal Statistical Society: Series A (Statistics in Society) 182.1 (2019): 283-303. Fay III, Robert E., and Roger A. Herriot. "Estimates of income for small places: an application of James- Stein procedures to census data." Journal of the American Statistical Association 74.366a (1979): 269- 277. González-Manteiga, Wenceslao, et al. "Bootstrap mean squared error of a small-area EBLUP." Journal of Statistical Computation and Simulation 78.5 (2008): 443-462. Guadarrama, María, Isabel Molina, and J. N. K. Rao. "A comparison of small area estimation methods for poverty mapping." Statistics in Transition new series 1.17 (2016): 41-66. Haslett, Stephen J. "Small area estimation using both survey and census unit record data." Analysis of Poverty Data by Small Area Estimation (2016): 325-348.. Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015. Head, Andrew, et al. "Can human development be measured with satellite imagery?." ICTD. 2017. Hidiroglou, Michael A., and Yong You. "Comparison of unit level and area level small area estimators." Survey Methodology 42.41-61 (2016). Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663-685 Huang, Xin, Liangpei Zhang, and Pingxiang Li. "Classification and extraction of spatial features in urban areas using high-resolution multispectral imagery." IEEE Geoscience and Remote Sensing Letters 4.2 (2007): 260-264. James, W., and C. Stein. "Proc. Fourth Berkeley Symp. Math. Statist. Probab." Estimation with quadratic loss. Vol. 1. 
Univ. California Press, 1961. 361-379.
Jean, Neal, et al. "Combining satellite imagery and machine learning to predict poverty." Science 353.6301 (2016): 790-794.
Kreutzmann, Ann-Kristin, et al. "The R package emdi for the estimation and mapping of regional disaggregated indicators." Journal of Statistical Software (2018).
Hall, Peter, and Tapabrata Maiti. "On parametric bootstrap methods for small area prediction." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.2 (2006): 221-238.
Jarvis, A., et al. "SRTM 90m Digital Elevation Database v4.1." (2008).
Jiang, Jiming, and Partha Lahiri. "Mixed model prediction and small area estimation." Test 15.1 (2006).
Halbmeier, Christoph, et al. "The fayherriot command for estimating small-area indicators." The Stata Journal 19.3 (2019): 626-644.
Kottek, Markus, et al. "World map of the Köppen-Geiger climate classification updated." Meteorologische Zeitschrift 15.3 (2006): 259-263.
Lange, Simon, Utz Johann Pape, and Peter Pütz. Small Area Estimation of Poverty under Structural Change. The World Bank, 2018.
Marhuenda, Yolanda, et al. "Poverty mapping in small areas under a twofold nested error regression model." Journal of the Royal Statistical Society: Series A (Statistics in Society) 180.4 (2017): 1111-1136.
Matsuura, Kenji, and Cort J. Willmott. "Terrestrial precipitation: 1900-2017 gridded monthly time series." Electronic. Department of Geography, University of Delaware, Newark, DE 19716 (2018).
Mehrotra, Rajiv, Kameswara Rao Namuduri, and Nagarajan Ranganathan. "Gabor filter-based edge detection." Pattern Recognition 25.12 (1992): 1479-1494.
Miller, Rupert G. "The jackknife - a review." Biometrika 61.1 (1974): 1-15.
Molina, Isabel, and J. N. K. Rao. "Small area estimation of poverty indicators." Canadian Journal of Statistics 38.3 (2010): 369-385.
Molina, Isabel, and Yolanda Marhuenda. "sae: An R package for small area estimation." The R Journal 7.1 (2015): 81-98.
Morris, Carl N. "Parametric empirical Bayes inference: theory and applications." Journal of the American Statistical Association 78.381 (1983): 47-55.
Myint, Soe W., Victor Mesev, and Nina Lam. "Urban textural analysis from remote sensor data: Lacunarity measurements based on the differential box counting method." Geographical Analysis 38.4 (2006): 371-390.
Nguyen, Minh C., et al. Small Area Estimation: An Extended ELL Approach. Mimeo, 2017.
Pesaresi, Martino, Andrea Gerhardinger, and François Kayitakire. "A robust built-up area presence index by anisotropic rotation-invariant textural measure." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 1.3 (2008): 180-192.
Peterson, Ryan A., and Joseph E. Cavanaugh. "Ordered quantile normalization: a semiparametric transformation built for the cross-validation era." Journal of Applied Statistics (2019): 1-16.
Pfeffermann, Danny, Anna Sikov, and Richard Tiller. "Single- and two-stage cross-sectional and time series benchmarking procedures for small area estimation." Test 23.4 (2014): 631-666.
Pokhriyal, Neeti, and Damien Christophe Jacques. "Combining disparate data sources for improved poverty prediction and mapping." Proceedings of the National Academy of Sciences 114.46 (2017): E9783-E9792.
Prasad, N.G. Narasimha, and Jon N.K. Rao. "The estimation of the mean squared error of small-area estimators." Journal of the American Statistical Association 85.409 (1990): 163-171.
Sandborn, Avery, and Ryan N. Engstrom. "Determining the relationship between census data and spatial features derived from high-resolution imagery in Accra, Ghana." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9.5 (2016): 1970-1977.
"Determining the relationship between census data and spatial features derived from high-resolution imagery in Accra, Ghana." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9.5 (2016): 1970-1977. StataCorp, L. P. "Stata LASSO reference manual." College Station, TX: StataCorp LP (2019). Steele, Jessica E., et al. "Mapping poverty using mobile phone and satellite data." Journal of The Royal Society Interface 14.127 (2017): 20160690. Tak, Hyungsuk, Joseph Kelly, and Carl N. Morris. "Rgbp: an R package for gaussian, poisson, and binomial random effects models with frequency coverage evaluations." arXiv preprint arXiv:1612.01595 (2016). Tarozzi, Alessandro, and Angus Deaton. "Using census and survey data to estimate poverty and inequality for small areas." The review of economics and statistics 91.4 (2009): 773-792. Tarozzi, Alessandro. "Can census data alone signal heterogeneity in the estimation of poverty maps?." Journal of Development Economics 95.2 (2011): 170-185. Torabi, Mahmoud, and J. N. K. Rao. "On small area estimation under a sub-area level model." Journal of Multivariate Analysis 127 (2014): 36-55. Tzavidis, Nikos, et al. "From start to finish: a framework for the production of small area official statistics." Journal of the Royal Statistical Society: Series A (Statistics in Society) 181.4 (2018): 927-979. Van der Weide, Roy. GLS estimation and empirical bayes prediction for linear mixed models with Heteroskedasticity and sampling weights: a background study for the POVMAP project. The World Bank, 2014. Unsalan, C., and Kim L. Boyer. "Classifying land development in high-resolution panchromatic satellite images using straight-line statistics." IEEE Transactions on Geoscience and Remote Sensing 42.4 (2004): 907-919. Van der Waerden, B. L. "Order tests for the two-sample problem and their power." Indagationes Mathematicae (Proceedings). Vol. 55. North-Holland, 1952. Wang, Junyuan, Wayne A. Fuller, and Yongming Qu. "Small area estimation under a restriction." Survey methodology 34.1 (2008): 29. Wood‐Sichra, Ulrike, Alison B. Joglekar, and Liangzhi You. Spatial Produc‐tion Allocation Model (SPAM) 2005: Technical Documentation. HarvestChoice Working Paper, 2016. Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., ... & Burke, M. (2020). Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature communications, 11(1), 1-11. Zhao, Qinghua. "User manual for povmap." World Bank. http://siteresources. worldbank. org/INTPGI/Resources/342674-1092157888460/Zhao_ ManualPovMap. pdf (2006). 46 Annex A: Software packages for household-level EBP models As noted in the text, there are five main differences between the different packages, which are listed in Table 4. The first important difference is that the stata SAE package implements a heteroscedasticity correction, following Elbers, Lanjouw and Lanjouw (2003). This correction estimates an “alpha model” that regresses the squared residuals from an OLS regression on subarea characteristics to obtain an estimate of the variance. The estimated variance is then used in a GLS regression to obtain the parameters of the “beta model”, which are then used for simulation. The second main difference, as noted above, involves the type of bootstrap method. 45 Version 2 of the Stata ELL package uses a traditional non-parametric bootstrap approach, which samples clusters randomly with replacement for each bootstrap replication. 
The R sae package takes a parametric bootstrap approach, which holds the composition of the sample constant across replications. To do this, the method bootstraps the population rather than the sample. For each replication, a bootstrap population is constructed by drawing from the estimated distributions of the area random effect η and the household idiosyncratic error term ε. The EBP model is then re-estimated using the same households that were present in the original sample, thus maintaining the same set of sample households in every replication. Values for η and ε are drawn 100 times from a normal distribution to generate point estimates of poverty in the population, and the bootstrap process is repeated 100 times to estimate the mean squared error, for a total of 10,000 simulations. The rationale for this parametric bootstrap procedure is that, when performing small area estimation, the sample has already been realized; its composition is therefore known and should be held constant across replications. This point is elaborated in greater detail in Diallo and Rao (2018).

A third difference between the EMDI and Stata SAE packages involves the model fitting method. The Stata SAE package uses either traditional GLS or Henderson's method three to estimate the parameters of the random effects model. In the simulations below, we use Henderson's method three to fit the model in the Stata SAE packages. The R EMDI package uses the maximum likelihood estimation implemented in the lme function of the R nlme package (Pinheiro et al., 2020). In most settings, the choice of fitting method should not make a large difference to the model estimates.

A fourth difference is that the EMDI package, like the R sae package, appends the sample to the census before aggregating poverty estimates, while the Stata SAE package does not at this time. This should have only a minor impact on the results in cases where the sample is a small fraction of the census population.

A fifth difference is that empirical best estimation, along with the assumption of normal errors, is required in both version 3 of the Stata SAE package and the R EMDI package, while it is optional in version 2 of the Stata SAE package. The latter, accordingly, allows for non-parametric draws of the error terms when empirical best is not used, which relaxes the assumption of normality. We use version 2 of the Stata SAE package to estimate the model without empirical best estimation as a robustness check, to verify that empirical best methods are much more efficient in this context.

45 No software to our knowledge implements the Prasad and Rao (1990) approximation to obtain mean squared error estimates.
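As a complement to this description, the following is a minimal R sketch of the nested-error model fit and of a single parametric bootstrap replication. The data frame and variable names (smp, census, welfare_norm, area, and the placeholder predictors x1 and x2) are illustrative assumptions, and the emdi and sae packages implement these steps internally with additional refinements; the code is meant only to make the logic concrete, not to reproduce either package.

library(nlme)

# Illustrative inputs: sample data frame `smp` and census data frame `census`, each with
# an `area` identifier; `welfare_norm` is the normalized welfare measure (sample only);
# x1 and x2 stand in for the lasso-selected geospatial predictors.
xvars <- c("x1", "x2")
fml   <- reformulate(xvars, response = "welfare_norm")
z     <- as.numeric(quantile(smp$welfare_norm, probs = 0.465))  # poverty line on the normalized scale

# Fit the nested-error model y = X*beta + eta_area + eps by maximum likelihood,
# as the EMDI package does via nlme::lme
fit       <- lme(fixed = fml, random = ~ 1 | area, data = smp, method = "ML")
beta      <- nlme::fixef(fit)
sigma_eta <- as.numeric(VarCorr(fit)["(Intercept)", "StdDev"])  # sd of the area random effect
sigma_eps <- fit$sigma                                          # sd of the household error

# One parametric bootstrap replication: regenerate welfare for the full census population
# from the fitted model, holding the set of sample households fixed across replications
Xpop   <- model.matrix(reformulate(xvars), data = census)
areas  <- unique(census$area)
eta    <- setNames(rnorm(length(areas), 0, sigma_eta), areas)
y_boot <- as.numeric(Xpop %*% beta) + eta[as.character(census$area)] +
          rnorm(nrow(census), 0, sigma_eps)

# "True" bootstrap poverty rates by area. A full implementation would re-estimate the model
# on the sample rows of this bootstrap population, regenerate the EBP estimates, and
# accumulate squared errors over B replications to obtain the MSE.
pov_boot <- tapply(y_boot < z, census$area, mean)

Holding the sample composition fixed while redrawing the population is what distinguishes this parametric bootstrap from the cluster-resampling bootstrap used in version 2 of the Stata ELL package.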