Poverty from Space: Using High-Resolution Satellite Imagery for Estimating Economic Well-Being

Can features extracted from high spatial resolution satellite imagery accurately estimate poverty and economic well-being? This paper investigates this question by extracting object and texture features from satellite images of Sri Lanka, which are used to estimate poverty rates and average log consumption for 1,291 administrative units (Grama Niladhari divisions). The features that were extracted include the number and density of buildings, prevalence of shadows, number of cars, density and length of roads, type of agriculture, roof material, and a suite of texture and spectral features calculated using a nonoverlapping box approach. A simple linear regression model, using only these inputs as explanatory variables, explains nearly 60 percent of poverty headcount rates and average log consumption. In comparison, models built using night-time lights explain only 15 percent of the variation in poverty or income. The predictions remain accurate when restricting the sample to poorer Gram Niladhari divisions. Two sample applications, extrapolating predictions into adjacent areas and estimating local area poverty using an artificially reduced census, confirm the out-of-sample predictive capabilities.


Introduction
Despite the best efforts of national statistics offices and the international development community, local area estimates of poverty and economic welfare remain rare. Between 2002 and 2011, as many as 57 countries conducted zero or only one survey capable of producing poverty statistics, and data are scarcest in the poorest countries (Serajuddin et al., 2015). But even in countries where data are collected regularly, household surveys are typically too small to produce reliable estimates below the district level. Generating welfare estimates for smaller areas requires both a household welfare survey and contemporaneous census data, and the latter are typically available once per decade at best. Furthermore, safety concerns may prohibit survey data collection in many conflict areas altogether.
Satellite imagery has generated considerable enthusiasm as a potential supplement to household data that can help fill these severe data gaps. In recent years, private companies such as DigitalGlobe and Airbus have rapidly expanded the coverage and availability of high spatial resolution imagery (HSRI), driving down commercial prices. Planet (formerly Planetlabs) currently operates more satellites than any organization other than the U.S. and Russian governments, and just recently, successfully launched 88 dove satellites that will allow for coverage of the entire globe with imagery resolution of 3 to 5 m per pixel on a daily basis. Continued technological advances will increasingly allow social scientists to benefit from this type of imagery, which has been utilized intensively by the intelligence and military communities for decades.
This paper investigates the ability of object and texture features derived from HSRI (High Spatial Resolution Imagery) to estimate and predict poverty rates at local levels. The area of our study covers 3,500 square kilometers in Sri Lanka, which contain 1,291 villages (Grama Niladhari (GN) divisions). For each village, we extract both object and texture features to use as explanatory variables in poverty prediction models. Object features extracted include the number of cars, number and size of buildings, type of farmland (plantation or paddy), the type of roofs, the share of shadow pixels (building height proxy), road extent and road material, along with textural measures. These features are identified using a combination of deep learning-based Convolutional Neural Networks (CNN) and classification of spectral and textural characteristics. These satellite derived features were then matched to household estimates of per capita consumption imputed into the 2011 Census for the 1,291 GN divisions.
We investigate five main questions: 1) To what extent can variation in GN economic well-being --poverty rates defined at the 10 and 40 th percentiles of national income and average GN consumption --be explained by high spatial-resolution features? 2) Which features are most strongly correlated with these measures of well-being? 3) Do these features predict equally well in poor and rich GNs? In urban and rural GNs? 4) Can these models predict into geographically adjacent areas? and 5) Are predictions robust to the use of a smaller sample of training data?
We find that: i) satellite features are highly predictive of economic well-being and explain about 60 percent of the variation in both village average consumption and estimated poverty headcount rates; ii) Built-up area and roof type strongly correlate with welfare. Car counts and building height are strong correlates in urban areas, while the share of paved roads and agricultural type are strong correlates in rural areas; iii) Accuracy declines only slightly in the poorest decile of villages (average consumption of $4.67 per day). Models are less accurate in urban areas than rural ones; iv) Predicting into adjacent areas produces less accurate poverty measures, but ranking between true and predicted rates is moderately high; and v) Using a 1 percent sample of the census based ground truth, designed to mimic the sampling strategy of the Household Income and Expenditure Survey, has little impact on the accuracy of the prediction.
This paper contributes to a growing literature exploring how remotely sensed data may be used to assess welfare. Traditionally, the most popular remotely sensed measure for economic applications has been night-time lights (NTL), which measures the intensity of light captured passively by satellite. Strong correlations between NTL and GDP appear at the country level (Henderson et al., 2009, Pinkovskiy andSala-I-Martin, 2016) although within a country NTL appears more strongly correlated with density than welfare. The relationship between lights and wages or other measures of income appears weak (Mellander et al., 2013), casting doubt on its reliability as a proxy for small area estimates of welfare. Additionally, NTL is ill-suited for identifying variation in welfare within small areas because of its low spatial resolution. Even the most advanced NTL satellite, the Visible Infrared Imaging Radiometer Suite VIIRS, has a spatial resolution at nadir of approximately 1.0 km 2 . 5 Indeed, we find that NTL captures only 15% of the variation in poverty or income in the same area where high resolution spatial features capture 60 percent of the variation.
Daytime imagery has recently emerged as a practical source of information on welfare, in large part due to new developments in computer vision algorithms. Advances in Deep Learning such as Convolutional Neural Networks (CNN) have the capability to algorithmically classify objects such as cars, building area, roads, crops and roof type (Krizhevsky, Sutskever, and Hinton, 2012). These objects may be more strongly correlated with local income and wealth than NTL. Furthermore, textural and spectral algorithms provide a simpler alternative to analyzing HSRI that does not rely on object classification (Graesser et al. 2012, Sandborn and Engstrom 2016. In this approach, the spatial and spectral variations in imagery are calculated over a neighborhood of pixels to characterize the local scale spatial pattern of the objects observed in the imagery. These measures, which we refer to as "texture" or "spectral" measures, capture information about an area that may not be clear from object recognition alone. This paper also contributes to a literature exploring how supervised learning techniques from machine learning may be applied to unstructured data to reveal information about human welfare (Athey, 2017). Glaeser, Kominers, Luca, and Naik (2015) apply texture-based machine vision classification to images that are captured from Google Street View, trained using subjective ratings of the images on the basis of the perceived safety. They estimate a support vector machine model and show the fitted model can reliably predict block level income in New York City. Jean et al. (2016) employ an innovative transfer learning approach, in which a set of 4,096 unstructured features are extracted from the penultimate layer of a convolutional neural network that uses Google Earth daytime imagery to predict the luminosity of NTL. These 4,096 features are then used to predict the average per capita consumption of enumeration areas (villages), taken from living standard measurement surveys using ridge regression to prevent overfitting. The resulting model predicts well and explains an average of 46 percent of the variation in 4 village per capita consumption, out of sample, across the four countries in which it was trained. While this innovative use of daytime imagery substantially improves on the use of night time lights alone, there are two issues with its applicability to poverty measurement. First, it is not clear that the transfer learning method generalizes well to other contexts where population density is low. Extensions of this approach in Haiti and Nepal (Head et al., 2017) show declines in predictive power, suggesting the NTL step in the transfer learning process may be ill suited for poor, low-density areas. Second, the transfer learning method is not necessarily optimal for predicting very poor areas. When the top two quintiles are excluded from their sample, restricting the sample to those below twice the international poverty line, the falls precipitously, to about 0.12. This illustrates the challenges this method faces in distinguishing welfare among the poorest of the poor, who in the African context most likely live in relative dark. 6 This study utilizes imagery features that are based either on recognizable objects or "texture" algorithms developed for computer vision applications, derived from High Spatial Resolution Imagery (HSRI). This method offers several advantages for the estimation of poverty rates. First, it eliminates reliance on NTL, which is a coarse measure of welfare, to identify relevant features for model development. Second, it provides a more transparent understanding of the underlying factors that explain geographic variation in welfare in different contexts. Third, features developed from HSRI, such as roads and the extent of built-up area, are useful for policy analysis in other areas, such as transport and urban planning. Finally, a feature-based approach can easily be extended to alternative welfare indicators, such as headcount poverty rates measured at different thresholds.
The paper proceeds as follows: Section 2 summarizes how the data were created and presents brief summary statistics. Section 3 presents the statistical methodology. Section 4 examines the predictive power of high resolution satellite features (HRSF) to estimate poverty in small areas at the village level. Section 5 examines out-of-sample performance using two applications from estimating local area economic well-being. Section 6 concludes.

Data Description
Our analysis is restricted to a sample area of approximately 3,500 km 2 in Sri Lanka. National coverage was not feasible due to the high cost and partial availability of high-resolution imagery; however, these data are rapidly becoming more available and less expensive as companies such a Planet and DigitalGlobe expand their archives and launch newer, more precise satellites with more frequent revisit rates. We sampled DS divisions conditional on HSRI being available, drawing areas from urban, rural, and estate sectors. 7 According to the 2012 census, population by sector in Sir Lanka is rural (77.4%), urban (18.2%) and estate (4.4%) (Sri Lanka Department of 5 Census and Statistics, 2012). Population by sector in our sample is rural (45.9%), urban (46.2%) and estate (7.8%).

2.1
Details on Satellite Imagery The satellite imagery consists of 55 unique "scenes" purchased from Digital Globe, covering areas specified in our sample area. Each "scene" is an individual image captured by a particular sensor at a particular time. Images were acquired by three different sensors: Worldview 2, GeoEye 1, and Quickbird 2. These sensors have a spatial resolution of 0.46m 2 , 0.41m 2 , and 0.61m 2 , respectively in the panchromatic band and 1.84m 2 , 1.65m 2 , and 2.4m 2 , respectively, in the multi-spectral bands. Pre-processing of imagery included pan-sharpening, ortho-rectification, and image mosaicking. Most imagery was captured in either 2011 or 2012, although some imagery from 2010 was also used.

2.2
Details on Poverty Data Ideally village poverty and consumption statistics would be generated directly from the 2012/13 Household Income and Expenditure Survey (HIES), a detailed survey that measures the consumption patterns of 25,000 households on approximately 400 consumption items. The survey contains an average of 8.4 households per GN division in the 47 sampled DS divisions, making the HIES insufficient to generate consistent poverty estimates at the GN division level without supplementary data. We therefore draw on the most common method to impute welfare estimates (Elbers, Lanjouw, Lanjouw, 2003) into the 2011 Census of Population and Housing, which is identical to the method used to generate official poverty estimates at the DS division level (Department of Census and Statistics and World Bank, 2015). For each household in the census, per capita consumption was estimated based on models developed from the HIES, using household indicators that are common to both the Census and the HIES. 8 We derive GN headcount poverty rates using the standard Foster-Greer-Thornbecke method (Foster, et al., 1984), for two poverty lines: poverty line 1 at the 10th percentile of the national per capita consumption distribution, and poverty line 2 at the 40th percentile. This is equivalent to $3.00 and $5.13 per day, respectively, in 2011 PPP terms, which compares to an extreme poverty line in 2011 prices of $1.90 per day.
Imputing welfare into the census requires an assumption of spatial homogeneity within small areas. This assumption "may severely underestimate the variance of the error in predicting welfare estimates at the local level in the likely presence of small-area heterogeneity in the conditional distribution of expenditure or income" (Tarozzi and Deaton, 2009). To test the extent of spatial heterogeneity in practice, small area estimates of poverty have been compared to census-based measures in Mexico and Brazil, which each collect income information in their census. Considerable spatial heterogeneity is present in Mexico. 9 In contrast, Elbers et al (2009) find significantly less in Minas Gerais, Brazil. The effect of spatial heterogeneity on the results presented below is unclear. We are not aware of any empirical estimate of the extent to which the spatial heterogeneity assumption leads to biased poverty headcount estimates at the local level. To the extent any additional noise in the poverty estimates due to uncaptured heterogeneity in the coefficients is independent across neighboring households within a GN, this noise would be significantly reduced after averaging over a large number of households.

2.3
Comparison of GN Poverty Rates and Mean NTL Reflectance A simple visual comparison between mean NTL and village poverty rates illustrates why NTL provides limited information on sub-national welfare. Figure 2 presents a panel of three images for the Divisional Secretariat of Seethawaka: mean raw NTL (left), poverty rates derived from the 10% national income threshold (middle), and log of mean population density (right). Comparing the left and middle panels, there is only a small association between villages that have low NTL reflectance and those that are high in poverty. Problems of overglow (Henderson et al., 2012) mean that poor villages adjacent to wealthy ones will be misclassified as non-poor. 10 While NTL tracks the general contours of poverty for the DS -lower poverty areas in the Northwest and higher poverty areas in the Southeast -this coarse association is only of limited use for public policy applications such as poverty targeting or budget allocations.
NTL appears to give a more accurate approximation of the population density of the underlying GN divisions, which is consistent with Mellander et al. (2013). Comparing the right and left panels shows a strong association between high NTL areas and areas with a high population density. We take this to suggest that the information content contained within NTL related to human welfare is limited. While lights at night may indicate gross associations, it is a highly imperfect measure of 7 welfare. We therefore investigate whether the much richer set of information contained in HSRI daytime imagery translates into more accurate welfare predictions.

Feature Extraction from High-Resolution Satellites
The derived high-resolution spatial features fall into seven broad categories: (1) Agricultural Land, (2) Cars, (3) Building Density and Vegetation, (4) Shadows (building height proxy) (5) Road and Transportation; (6) Roof Type; and (7) Textural and Spectral characteristics. In addition to the satellite features, we use two geographic attributes of the GN division: Whether it is administratively classified as an urban area, and its area in square kilometers. Table 1 presents summary statistics for these variables.
Deep learning-based object classification was used for classifying the share of the GN division that is built-up (i.e. consists of buildings), the number of cars in the GN, and the share of pixels in the GN that were identified as shadow pixels (proxy for building heights), and crop type. The classification method used is similar to Krizhevsky, Sutskever, and Hinton (2012), which utilizes convolutional neural networks (CNN) to build object predictions from raw imagery. Roof type, paved and unpaved roads of different widths, and railroads were classified using a combination of Trimble eCognition and Erdas Imagine software, utilizing a combination of support vector machines and visual identification. Classifier accuracy is great than 90 percent for all of the objects recognized. Details on the extraction and classification process are provided in detail in the online appendix, which includes an example ROC curve for buildings.

Object Classification Details
The agricultural land variables consist of the fraction of GN agriculture identified as paddy (rice cultivation) or plantation (cash crops such as tea). These sum to 100 percent for GNs with agricultural land, so the excluded category in subsequent regressions is GN divisions with no agricultural land. We also calculated the fraction of total GN area that is either paddy, plantation, or any agriculture. Figure 4 shows an example of a developed area building classification, with raw image shown at the top and CNN classification accuracy shown below. On the bottom panel, true positives are highlighted green, with false positives highlighted red. Figure 5 shows a sample car classification. Cars that are positively identified are shown circled in blue. False negatives are most prevalent where there is considerable tree masking of pixels.

Figure 3: Example Developed Area (Buildings) Classification
Notes: above image shows raw (left) and classified (right) for developed area building classifier from raw satellite imagery. Areas in green show are true positive building classifications. Images in red are false positives: erroneously classified areas as buildings.

Figure 4: Example Car Classification
Notes: Cars identified by the convolutional neural network shown in blue. Three car-related variables were calculated -the log total number of cars in a GN, total cars divided by total road length, and cars per square kilometer of the GN. The average GN division in our sample contains 50 cars. However, there is wide dispersion, as the 99 th percentile of the car count distribution is equal to 577 cars and the maximum value is 4,000 cars. On the left side of the distribution, 136 of 1,291 GNs contain no cars. Because the distribution is skewed, we take the log of the car count, while imposing a smooth function for GNs with zero or few cars. 11 Building density variables include the fraction of an area covered by built-up area and the number of roofs identified, Built-up area captures any human settlements -buildings, homes, etc.regardless of use or condition. These are grouped with two measures of the Normalized Difference Vegetation Index. Although technically a spectral characteristic, the presence of vegetation in urban areas indicates development such as parks, trees, or lawns (i.e., area that is not built up) within the urban environment. In the rural environment it also indicates undeveloped areas, and the values can aid in describing variations in agricultural type and productivity depending on the timing of the image acquisition. The fifth category is two indicators that capture shadows of buildings: the log of the number of pixels classified as shadow as well as the fraction of shadows in a GN. The shadow variables use the angle of the sun as it shines on a building, and the shadows it displaces, to estimate the presence of shadows. 12 The road variables we calculate are the log of total road length, fraction of roads that are paved, and length of airport runway and length of railroad identified. For roof type, we calculate the fraction of roofs in a village that are either clay, aluminum, asbestos, with the omitted category being roofs that are identified as none of the above, the vast majority being gray cement roofs. Roof type can be identified through remote sensing by using hyperspectral imaging, or using reflectance from several contiguous spectral bands. Different roof materials exhibit different spectral properties, particularly in the sub-visible bands of the spectrum. The roofs in our sample are clay (36.5%) aluminum (14.08%), asbestos (7.8%) or gray concrete (41.6%).

Details of Textural and Spectral Features
We calculate six separate types of spectral and textural features: Fourier transform, Gabor filter, Histogram of Oriented Gradients (HoG), Line support regions (LSR), Pantex, and Speed-Up Robustness Features (SURF). These are often used in machine vision problems to decompose an image. They are intended to capture aspects of a neighborhood that are not so easily identified directly, including the presence of characteristics associated with slums such as many irregular building lines or high density. These features may be considered outputs from a dimension reduction technique, in that they are reduced dimensionality descriptions of a complex 2-D satellite imagery.
Because these measures may be novel to readers without backgrounds in remote sensing, further description may be helpful. We consider Pantex here to be a measure of human settlements. It is a 11 spatial similarity index, where each cell is compared to adjacent cells in all directions. Forests will have a low Pantex level, since cells in all directions have similar contrast, as will cells with straight roads. Cities dense with many buildings will have high Pantex values. HOG captures "local intensity gradients or edge directions" (Dalal and Triggs, 2005) and in the context here captures intensity of lines of development or agriculture. Local binary patterns (LBPM) captures local spatial patterns and gray scale contrast. SURF detects local features used for characterizing grid patterns, and measures orderliness of building development, the opposite of which is typically referred to as a slum. Areas with right angles, corners, or areas with regular grid patterns, will have larger SURF values relative to areas with chaotic or irregular spacing. For more detail on imagery and the feature extraction process, we refer the reader to the online appendix.

Statistical Methodology
We estimate linear regressions in which the dependent variable is estimated poverty or log welfare (per capita household consumption) at the level of the GN divisions, derived from the census. Since these are linear models, the fitted values are not constrained to lie between zero and one, but this is a minor issue in the sample. 13 The error term is assumed to be clustered at the level of the DS division, and the standard errors are robust to heteroscedasticity.
Given the list of available covariates, variable choice is not obvious. Estimating a model with the full set of candidate variables in table 1 would likely produce predictions that are overfit, in the sense that they perform much better in-sample than out-of-sample (Athey and Imbens, 2015). One attractive method for variable selection among a large selection of covariates is Lasso regularization. Lasso is a regularized regression that estimates a regression model with an added constraint that enforces parsimony (Tibshirani, 1996). The motivation for the shrinkage estimator is that by reducing the parameters of the model, one increases bias at the expense of lower variance.
Our baseline model is a "Post-Lasso" estimator (Belloni and Chernuzhukov, 2013). This two-step estimator first estimates a Lasso model over the full set of coefficients, followed by an OLS model over the set of non-zero coefficients from the Lasso step. The model we estimate in the Lasso step is defined as (1) argmin Where the poverty rate in a GN is given by and 0 is a parameter that penalizes the absolute values of the coefficients. At the extreme, full relaxation of the penalization factor, that 12 is setting to zero, yields unconstrained OLS estimates. Thus as → ∞ , → . As → ∞, the penalty increases and converges to the zero vector. Lasso regressions are useful as a variable selection methodology because the sharp ℓ metric shrinks variables exactly to zero if they prove unuseful in decreasing the sum of squared errors. This creates a type of variable selection. However, simultaneously the Lasso "shrinks" the magnitude of coefficients towards zero, even for those that remain non-zero (Varian, 2014). Thus, by subsequently estimating an OLS model in the second stage, we ensure the coefficient estimates are unbiased. To choose the appropriate value of , we apply 10-fold cross validation, and choose the value of that minimizes root-mean squared error (RMSE) across folds. GLM versions of the model, which ensures that predicted values lie between zero and one, do not change the results qualitatively and are available by request.
Inferential standard errors are typically absent from Lasso models. Because of the Oracle property of the Lasso estimator (Fan and Li, 2001), we use the standard errors from the OLS model in the second stage as our measures of population inference. The Oracle property ensures that inference in the second stage using the reduced set of variables selected in the first stage is consistent with inference were we to use a single stage estimation strategy using only the selected variables present in the true data-generating process (Belloni and Chernuzhukov, 2013). Table 2 presents the estimates from the main specification for the full sample. The first two columns show the model where GN poverty is defined at the lower poverty rate, the next two present the higher poverty rate models, and the next two present average GN consumption dependent variable models. Many extracted satellite features have high explanatory power, including agriculture type, length of roads and fraction of roads paved, number and density of buildings, NDVI, roof type, shadows (building height proxy) and two spatial features, LBPM, and Fourier transform. The models explain a high amount of the variation in poverty, summarized in the in-sample R-squared values between 0.608 and 0.618. Out-of-sample Rsquared, estimated using ten-fold cross-validation, varies between 0.588 and 0.605. We conclude from the results that the models are not likely to be overfit to the data.

Results
The results suggest that, in words, a simple linear model that includes only the geographic size of the GN division, whether it is urban, and remotely sensed information explains 61 percent of the variation across GNs in headcount poverty rates. Figure 6 plots predicted against true average GN consumption, with colors assigned by province in which the GN is located. A LOWESS smoothing line is shown with associated confidence interval. A perfect model would have predictions exactly on the 45-degree line. While there is noise, the predictions tend to straddle the 45° line, indicating a high degree of agreement between the predicted and true welfare values. However, the model has a tendency to under-predict for wealthier GNs.

Marginal Effects of Satellite Features
While the primary objective of this exercise is to obtain accurate predictions, the model coefficients also shed light on the nature and magnitude of the conditional correlations between imagery features and poverty. The coefficients may be difficult to interpret for two reasons: First, the independent variables are often measured in different units. Second, in some cases multiple independent variables are based on the same underlying feature. In these cases, it is meaningless to evaluate the conditional correlation of one variable while holding the others constant. To understand the magnitudes of the coefficients, we group independent variables and consider the marginal effect of a one standard deviation increase of the underlying satellite feature. 14 Table  3 presents these marginal effects tables. For some dependent variables, the reported marginal effects reflect a combination of multiple underlying indicators, while for others they reflect single variables, as indicated in the right most column.
The size of the GN, in square kilometers, is more strongly correlated with headcount or average consumption. This suggests that households in the bottom decile are disproportionately found in larger GN divisions. The presence of agricultural land is weakly and negatively associated with poverty, controlling for other characteristics of the GN, although the result is not statistically significant. Of the indicators related to the distribution of paddy vs. plantation land, The LASSO procedure selected three of the indicators for 10 and 40 percent poverty incidence models, and two for the log consumption model. 15 The results indicate a discernible but fairly weak negative relationship between the presence of paddy agricultural land and poverty, which is consistent with the socioeconomically disadvantaged nature of the tea plantation sector in Sri Lanka.
14 Except for percent of GN agriculture that is plantation, for which a one standard deviation decrease is considered. 15 Since an increase in paddy land implies a reduction in agricultural land, for those GNs with agricultural land, the latter is subtracted instead of added when calculating the marginal effect. Compared with land type, the association between poverty and cars is mildly stronger. A one standard deviation increase in log cars (to an average value of 3.1) is associated with a 2.2 percentage point decline in poverty at the higher poverty line, and a 0.035 increase in predicted log per capita consumption. A one standard deviation increase in all three car related variables is associated with a 1.2 percentage point decline in poverty at the lower percent rate. Road characteristics are moderately associated with local poverty rates. Length of roads, fraction of roads paved, and runways are negatively associated with poverty, though only the first two are statistically significant, while GNs with more railways are poorer. A one standard deviation change in total length of road is associated with a 1.9 percentage point decline in the lower poverty line, a 2.5 percentage point decline where poverty is defined at the higher line, and a .031 increase in log consumption. The magnitudes of the marginal effects for fraction of roads that are paved are broadly similar, though a one standard deviation increase is only associated with a weaker 1 percentage point decline at the lower poverty line.
Measures of building density are strongly associated with log welfare and poverty. A one standard deviation increase in these two variables is associated with a 2.7 percentage point decline at the lower poverty rate, an 8.1 percentage point decline at the higher poverty rate, and a 0.16 increase in log consumption. In the lower poverty rate model, a one standard deviation increase is associated with a smaller 2.7 percentage point decline in poverty. Vegetation is moderately associated with poverty. A one standard deviation reduction in vegetation is associated with a 2.9 percentage point reduction at the higher poverty line, and a .04 increase in mean per capita consumption, which is comparable to cars or the fraction of roads that are paved. For the lower poverty line model, both NDVI measures are selected. The higher poverty line and log welfare models only include NDVI calculated over blocks of 64 pixels, suggesting that very high spatial resolution imagery may not be critical for generating informative measures of NDVI for prediction.
Two measures of shadows, a proxy for building height, are selected: the share of valid area covered by shadows, and the log number of shadow pixels. A one standard deviation increase in both measures is associated with a 3 percentage point increase in poverty at the lower poverty line, an 8 percentage point increase in poverty at the higher one, and a 0.13 decrease in mean log per capita consumption. For roof type, the Lasso procedure selects both the fraction of roofs classified as clay and aluminum, for all three models, and includes the fraction classified as asbestos for the lower poverty line model. The signs on clay and aluminum in the poverty regressions are positive, suggesting that these are generally inferior compared to the omitted category of grey concrete. This appears to be consistent with an analysis in Kenya that documents that roofs with greater luminosity, like aluminum, are associated with lower levels of poverty (Suri et al., 2015). The marginal effect of a standard deviation in clay and aluminum roofs are, respectively, 1.7 and 0.6 percentage points for the lower poverty line model, and .06 and .03 for mean log per capita consumption. These magnitudes are stronger than roads and vegetation, but considerably less than those for building density and shadows.
Of the texture variables, five out of seven are selected for the 10 percent model (LBPM, LSR, Gabor, Fourier, and SURF). Of these, only LBPM and SURF are selected for the 40 percent and log per capita consumption model. In general, the estimated marginal effects for these variables are modest. The main exception is the mean of the Fourier transform, which is positively associated with poverty in the lower poverty line model, though the coefficient is not statistically significant. A one standard deviation increase in SURF is associated with a one percentage point decline in the lower poverty line model and a 0.03 increase in log per capita consumption. This is consistent with wealthier areas being laid out in a more orderly way, with more "right angles" in housing layouts.
Figure 7 <Note to check -figure 7 is not included here.> presents a map showing the true welfare measures in the left panel, against the predicted welfare measures in the right, for a particular DS division, Seethawaka. The top panel shows predicted welfare from the OLS model against actual welfare. The model is able to distinguish the poorer eastern areas from the richer western ones. Even poor GNs adjacent to richer ones can be distinguished; although the smallest GNs are less than a half mile across, the HRSF model is able to distinguish with considerable accuracy the variation in average consumption. The middle panel shows predicted and true poverty rates defined at the lower poverty line. Again, the predicted model approximates the true poverty rates with considerable accuracy. The lower poverty regions in the south and northeast are replicated in the predicted values. The model tends to under-predict poverty in the lowest poverty areas in the mid-west, suggesting that two-step or zero-inflated Poisson models may perform better.

Figure 6: Predicted Versus True Welfare Measures, Average Consumption (top), 10% Poverty (middle) 40% Poverty (bottom)
In sum, predictive models based on an urban indicator, the size of the GN, and a host of features derived from satellite imagery predict poverty rates and mean log per capita consumption remarkably well. Greater numbers of cars are associated with lower poverty, although the relationship is not statistically significant, as is a denser road network and a larger share of paved roads. The indicators most strongly associated with poverty are building density and shadows. Shadows are positively associated with poverty, which suggests they are capturing variation in tree cover that is inversely related to building density. Consistent with this, areas characterized by more and lusher vegetation tend to be poorer. Clay and aluminum roofs, compared to grey roofs, are associated with greater levels of poverty. Of the spatial features, SURF exhibits a fairly strong association with poverty at the lower poverty line, suggesting that neighborhoods laid out in a more orderly way tend to be less poor. The following sections consider the robustness of these main findings.

Decomposition of Satellite Feature Explanatory Power
The results presented above indicate that features derived from satellite imagery explain a large portion of village income or poverty, and that associations are particularly strong for measures of building density and shadows. However, these results do not address the question of which indicators account for the model's predictive power. To address this issue, we decompose the using a Shapley decomposition (Shorrocks, 2013;Huettner and Sunder, 2012;Israeli, 2007). This procedure calculates the marginal of a set of explanatory variables, as the amount by which declines when removing that set from the set of candidate variables. In other words, for a model with sets of explanatory variables, the procedure will estimate 2 models and average the marginal obtained for each set of independent variables across all estimated models. This ensures that the variable's contribution to is independent of the order in which it appears in the model. Table 4 presents the decomposition. The results confirm that measures of building densitybuilt up area, number of buildings, shadow pixels, and to a lesser extent vegetation, are powerful contributors to predictive power. Collectively, these three sets of variables account for 39 to 45 percent of the model's explanatory power. However, a number of other variables are moderately important. GN area, urban classification, road characteristics, roof type, and the texture variables each explain 8 to 12 percent of the variation. The car and agricultural variables explain a bit less than that, between 5 and 7 percent each. In short, while broad measures of building density explain a large share of the variation, virtually all sets of indicators contribute substantial predictive power to the model.

Comparisons to Night Time Lights
How does the predictive power of indicators derived from daytime imagery compare with night time lights? To shed light on this, Table 5 presents OLS models covering the same sample area using night time lights as the independent variable. The first three columns present poverty and per capita consumption models. Aggregate night time lights is positively correlated with welfare and negatively correlated with poverty, however the total explanatory power is low: values for the three regressions are between 0.10 and 0.147, with performance lowest for the 10 percent headcount measure and highest for log consumption per capita. Adding higher order polynomials up to a quartic only increases it to 0.15. Models built using high resolution satellite indicators capture around four times as much variation in poverty or welfare as NTL. Columns 4-6 of table 4 show estimates that include DS division fixed effects. Night time lights is no longer significant in any of the specifications, indicating that within DS divisions, NTL is weakly correlated with welfare.
Given the prevalence, ease of use and familiarity with night time lights, one might also ask how much more explanatory power do night time lights provide in addition to the indicators extracted from daytime imagery? Table 6 answers that question, by adding night time lights to the Shapley decomposition. The night time lights category includes average, squared, cubed, and average standard deviation of NTL. The night time lights variables explain between 7 and 12 percent of the variance in per capita consumption or poverty according to the decomposition, meaning there is roughly a 90 percent additional variation in poverty or income that is captured through high resolution satellite predictions. Furthermore, adding night time lights marginally increases the overall of the regression, by about 0.01. In this context, NTL is not a particularly accurate proxy for poverty and welfare, and adds very little explanatory power to the set of available daytime indicators.

Urban and Rural Linear Models
How does the relationship between indicators and welfare differ in urban and rural areas? Table 7 shows model estimates estimated separately for 393 urban villages and the 898 rural ones, based on Sri Lanka's official definition of urban and rural areas. 16 Variables were again selected through Lasso estimation. The urban model selects fewer variables -13 of the candidate variables in the urban model are selected versus 16 for the rural model. R-squared values are slightly higher in rural areas (0.656) and significantly lower in urban areas (0.445). 17 For the urban model, log number of cars, built-up development, and shadow pixels are important. In rural models, agricultural variables, roof type, shadow pixels, NDVI, Pantex and LBPM are important. The association between cars and poverty is significantly stronger in urban areas. In addition, the association between NDVI and poverty is strongly negative in rural areas, as rural areas with more vegetation and less built-up area are poorer. The coefficient on NDVI in urban areas, meanwhile, is positive and not statistically significant, suggesting that if anything wealthier urban GNs are characterized by a greater prevalence of lush vegetation. Notes: Tables gives estimated marginal effect of a one standard deviation change in variable or variables listed in right column. For example, the combined marginal effect of a one standard deviation in all three cars variables on the 10 percent relative poverty rate is a reduction of 1.2 percentage points. Variables excluded from 40 percent poverty and log consumption models, as shown in Table 2, are also excluded when calculating marginal effects for those dependent variables. For agricultural land, % of GN that is plantation is subtracted from the sum of % GN agriculture that is paddy and % total GN area that is paddy. * p<0.05, ** p<0.01, *** p<0.001

4.5
Correcting for Spatial Autoregression One unaddressed concern is whether the presence of either spatial autocorrelation or spatial heterogeneity leads the standard errors to be underestimated. Spatial autocorrelation can occur in the presence of geographic spillovers or interactions (Anselin, 2013), and considering the village-level observations one could develop plausible stories by which poverty is influenced by this mechanism. A Moran's I test for the presence of such disturbances according to Anselin (1996) rejects the null hypothesis that there is no spatial autocorrelation present. To correct for the spatial autocorrelation, we model explicitly the spatial autoregression (SAR) process and allow for SAR disturbances, a so-called SARAR model. This is implemented via a generalized spatial two-stage least-squares (GS2SLS) as shown in Drukker et al. (2013). The results presented in table 9 show that after correcting for spatial autocorrelation, most high-resolution spatial features remain significant predictors of local area poverty. Although there is some presence of autocorrelation, it is not sufficient to alter the joint significance of the spatial variables.

4.6
Using an Alternative Measure of Simulated Welfare The results so far have demonstrated that indicators derived from satellite imagery are strongly predictive of variation in welfare and poverty, measured using imputed welfare into the 2011 census. Imputing into the 2011 census is necessary because the analysis is carried out at the GN division level, and the HIES survey alone does not contain sufficient data to accurately estimate welfare at that level. The baseline method uses, as the dependent variable, the average welfare or poverty rates, taken across both households in the village and the 100 simulations of predicted residuals. This average is then regressed on various satellite features. Because the dependent variable is an average taken over 100 simulations, it is a measure of expected poverty and welfare across both simulations and GN households. It incorporates the full distribution of potential outcomes into the measure of poverty and welfare. Averaging over the 100 simulations per household also reduces the variance of the stochastic component of welfare by a factor of 100, which raises the explanatory power of the satellite indicators.
An alternative would be to compare satellite-based predictions against simulated poverty and welfare. This would entail regressing the estimated poverty rate of the GN for each of the 100 simulations against the independent variables, and then averaging the resulting R-squared statistics across the 100 simulations. In each case, the dependent variable is a single simulated value of welfare rather than an average across simulations. This provides a lower bound estimate of the explanatory power of the satellite indicators, were the census to include consumption data. If actual consumption data were collected in the census, its unexplained portion of actual consumption data may be correlated with the satellite indicators. In reality, the unexplained portion of the prediction is constructed be pure noise that cannot be explained by the imagery indicators.
The top row of Table 10 reports the in-sample R 2 when using expected welfare as the dependent variable, which is identical to the results reported in Table 2. The bottom row reports the average R 2 when using the simulated welfare approach. Using each individual simulation of welfare as the dependent variable reduces the in-sample R 2 for the base regression reported in Table 2 from 0.610 to 0.406 when predicting the 10 percent poverty rate, from 0.618 to 0.496 when predicting the 40 percent poverty rate, and from 0.608 to 0.515 when predicting log per capita consumption. Incorporating all 100 draws of the residual from the census prediction regression has a stronger impact on the accuracy of the resulting poverty measure at the lower poverty threshold. This is because a greater share of households have predicted welfare that fall above the 10 percent threshold than the 40 percent threshold, meaning that the residuals from the predicted welfare function play a larger role in determining poverty at the lower threshold. Overall, the independent variables continue to predict much of the variation across GNs in welfare and poverty, even when the GN estimate of welfare and poverty is based on one draw from the residual of the prediction regression.

Do High Resolution Satellite Features Explain the Poverty Gap?
The poverty gap is a useful supplement to the headcount rate for understanding poverty because it takes the depth of poverty into account. The poverty gap or metric measures poverty depth by considering how far the poor are from a given poverty line. 18 We compute the average poverty gap for each village, and use this measure as a dependent variable in a regression where the right-hand side includes the size of the GN, a dummy indicating urban classification, and the features created from high resolution satellite imagery. We consider again poverty lines defined at the 10 th and 40 th percentiles of national consumption per capita. Table 11 presents the results estimated via OLS. The coefficients can be interpreted as a unit change in the distance between the poverty gap and the poverty line for the average village. As was the case for headcount rates, high resolution features explain the poverty gap well, with adjusted values between 0.588 and 0.609. Not surprisingly, building density and shadow variables are also strong correlates of the poverty gap.

Estimating Poverty Using a Reduced Census Training
A key motivation for this analysis is to understand how HRSF compliments traditional surveys to generate small area estimates. The standard small area estimation technique used to model poverty combines a smaller household survey with a population census. Conducting a population census is expensive, but needed to cover the full range of individuals within a country. Can satellite features be combined with a smaller household survey alone to produce sufficiently precise small area estimates? To assess this, we examine whether the predictive power of satellite imagery remains when it is calibrated using a census extract, of approximately the size of the Household Income and Expenditure Survey, rather than a full census.
We produce an alternative version of each of the three dependent variables (either per capita consumption or GN poverty rate) using artificial subsamples of the census intended to mimic the size of a household survey. This involves sampling 20% of GNs, and 5% of the actual households in that GN, from the predicted welfare measure imputed into the household-level census data. We use these artificial samples to build a model of poverty or consumption using HRSF, which produces estimates of poverty that can then compared to actual estimates based on the full census data. This sheds slight on the trade-off between reducing the size of the training sample, which saves considerable money, and the resulting loss of precision of the estimated poverty rates. Figure 9 plots the results of the simulation exercise, where we have plotted R-squared values between predicted welfare rates and true welfare rates, both in-sample (GNs within the subsample) and out-of-sample (GNs excluded from the subsample), and mean absolute error. Average R-squared values between predicted and true values do not depreciate significantly when using the sample consisting of 20 percent of GNs and 5 percent of households within those GNs. R-squared values decline modestly with the smaller sample size, but R-squared values remain well above 0.5 even when sampling artificially many fewer households. The second panel of figure 9 shows the same exercise with mean absolute error used as a metric, and the results confirm there is negligible difference when using many fewer households to train the model. These results suggest that it is possible to generate reasonably reliable estimates of economic well-being by combining household survey data with features extracted from satellite imagery.

Poverty Estimation via Geographic Extrapolation
Another major motivation for using satellite imagery is to extrapolate poverty estimates into areas where survey data on economic well-being do not exist. While many poor countries are unable to conduct regular surveys, several other countries collect welfare data regularly but omit selected regions, due to political turmoil, violence, animosity towards the central government, or prohibitive expense. For example, from 2002 through 2009/10, Sri Lanka's HIES failed to cover certain districts in the North and Eastern parts of the country due to civil conflict, and Pakistan's HIES exclude the Federally Administered Tribal Areas, Jammu and Kashmir.
To assess how well a model "travels" to a different geographic area, we fit a series of models, where in each model we exclude a single Divisional Secretariat (DS), a larger administrative area, from the model, and use the estimated model to predict into that excluded area. This is a form of "leave-one-out cross-validation" (LOOCV), a common method used to infer statistical out-of-sample performance (Gentle et al., 2012). We estimate both linear models and random forest models 19 to predict out of sample to determine if more flexible model specifications perform better out-of-sample.
Our approach differs from the standard case in that for each estimation we exclude, or "leave out", an entire Divisional Secretariat (DS), an administrative sub-unit at the level immediately 19 For each random forest model, we use 1,000 decision trees, sampling 13 of the predictors with replacement. below the district. Table 12 shows model performance at predicting into adjacent areas, comparing normalized root mean squared error, normalized mean absolute error, and the correlation between predicted and true welfare rates using both random forest and linear models to fit HRSF models. The adjacent prediction error rates are larger than when predicting randomly out of sample using cross-validation. Normalized error rates divide average error by the average value of welfare, therefore the error rates can be interpreted as fractions of average welfare. Mean absolute error is estimated at 2.5% of log household consumption, 45% of the average poverty rate at the lower poverty line, and 30% of the average poverty rate at the higher poverty line using linear models to estimate and predict into adjacent areas. The error rates are lower when using random forests to estimate and predict into adjacent areas. When predicting using random forest models, the mean absolute error declines to 1.5% of log household consumption, 38% of the average poverty rate at the lower poverty line, and 25% of the higher poverty line.
While these error rates imply adjacent predictions will be too imprecise for producing welfare measures intended as official statistics, they may be sufficient for generating rank ordering of villages by poverty or income. The rank correlation between the predicted and the true values results in a Spearman's ρ estimated at between 0.67 and 0.7 for the linear models, and between 0.74 and 0.76 using the random forest models. We conclude from these results that HRSF cannot yet be used to predict accurately into adjacent areas for official statistics, but the accuracy may be acceptable for targeting or other applications, and is likely to improve as better machine learning methods are employed.

Conclusion
Traditionally, given the prohibitive cost of conducting surveys sufficiently large to provide accurate statistics for small areas, generating small area poverty estimates requires pairing a welfare survey with a census or inter-census survey. Census and inter-census data are expensive to collect and therefore produced relatively infrequently. The data are also usually disseminated with a significant lag, making it difficult to rapidly assess changes in local living standards. The results above show that indicators derived from high spatial resolution imagery, when paired with survey data, generate accurate predictions of local-level poverty and welfare, and that by and large the conditional correlations are of sensible signs and magnitudes. Furthermore, predictions based on specific features accurately predict mean per capita consumption throughout the welfare distribution. While the welfare consequences of more frequent measures of poverty and inequality are unknown, they may be large given the many applications of frequent local measures of economic well-being, ranging from impact evaluation, to budget allocation to social transfers.
How well do indicators derived from satellite imagery predict poverty and which indicators are most important? We investigate these questions using a sample of 1,291 villages in Sri Lanka, linking measures of economic well-being with features derived from high resolution satellite imagery. The results indicate that the correlation between satellite derived indicators and economic well-being is remarkably strong. Simple linear models explain 60 to 61 percent of the variation in poverty and average log per capita consumption. In both rural and urban areas, variables measuring building density, built-up area, and shadows are the strongest predictors of variation in poverty. As expected, the extent and lushness of vegetation is negatively correlated with welfare in rural areas, and mildly positively correlated with incomes in urban areas, suggesting that trees and other vegetation are a luxury in urban areas.
While these results are very encouraging, additional analysis suggests caution when extrapolating predictions into geographically adjacent areas. The normalized error rates range from a quarter to one-half of the poverty rates, depending on the incidence of poverty. The likely impediment to extrapolation is geographic heterogeneity in the relationship between indicators and welfare. Using models that learn from geographic heterogeneity, such as ensemble methods or deep learning methods, will likely improve performance for this task. Another factor is the time differences at which the satellite images were taken, which can contribute to noise in the independent variables across geographic regions. This could impact particular indicators such as car counts, which can vary greatly according to the day of the week the imagery was obtained. Measures of agriculture also exhibit considerable seasonal variation, which may also confound extrapolation to adjacent areas. This suggests that some indicators may particularly contribute to bias when extrapolating across space, and that the date of the image is an important consideration when considering spatial extrapolation using satellite-based indicators. We suspect this will improve as the revisit rates of satellites improves. Planet, which has 190 imagery satellites in orbit, already claims daily revisit rates for all the earth's land mass, sometimes giving revisit rates as frequently as every hour.
These findings raise a host of questions for further work, and contribute to an ongoing discussion regarding the use of predictive methods in public policy (Athey, 2017). The most immediate of these is whether satellite indicators can substitute for census data in different contexts and for different indicators. Does the strong correlation between satellite-based indicators and economic well-being extend to wage income measured directly from an expenditure survey? Second, it is important to better understand the extent to which these results generalize to different social and ecological environments, such as Africa, the Middle East, and other parts of Asia. There is no guarantee that the predictive power of building density, shadows, and other features documented above will hold in all environments.
A second line of research could explore whether changes in satellite imagery could be used to forecast changes in economic well-being across space and time. Poverty surveys are typically collected every three years and the most recent global estimates are produced with a three-year lag. Therefore, the ability to "now-cast" measures of economic well-being by combining frequently updated satellite imagery with the most recent survey-based measures of poverty has great potential. Secondly, additional research can shed light on identifying the best way to predict into adjacent areas not covered by surveys. Overall, the inevitable increase in the availability of imagery and feature identification algorithms, in conjunction with the encouraging results from this study, implies that satellite imagery will become an increasingly valuable tool to help governments and stakeholders better understand the spatial nature of poverty.