Estimating Small Area Population Density Using Survey Data and Satellite Imagery: An Application to Sri Lanka

Country-level census data are typically collected once every 10 years. However, conflict, migration, urbanization, and natural disasters can cause rapid shifts in local population patterns. This study uses Sri Lankan data to demonstrate the feasibility of a bottom-up method that combines household survey data with contemporaneous satellite imagery to track frequent changes in local population density. A Poisson regression model based on indicators derived from satellite data, selected using the least absolute shrinkage and selection operator, accurately predicts village-level population density. The model is estimated in villages sampled in the 2012/13 Household Income and Expenditure Survey to obtain out-of-sample density predictions in the nonsurveyed villages. The predictions approximate the 2012 census density well and are more accurate than other bottom-up studies based on lower-resolution satellite data. The predictions are also more accurate than most publicly available population products, which rely on areal interpolation of census data to redistribute population at the local level. The accuracies are similar when estimated using a random forest model, and when density estimates are expressed in terms of population counts. The collective evidence suggests that combining surveys with satellite data is a cost-effective method to track local population changes at more frequent intervals.


Policy Research Working Paper 8776
Country-level census data are typically collected once every 10 years. However, conflict, migration, urbanization, and natural disasters can cause rapid shifts in local population patterns. This study uses Sri Lankan data to demonstrate the feasibility of a bottom-up method that combines household survey data with contemporaneous satellite imagery to track frequent changes in local population density. A Poisson regression model based on indicators derived from satellite data, selected using the least absolute shrinkage and selection operator, accurately predicts village-level population density. The model is estimated in villages sampled in the 2012/13 Household Income and Expenditure Survey to obtain out-of-sample density predictions in the nonsurveyed villages. The predictions approximate the 2012 census density well and are more accurate than other bottom-up studies based on lower-resolution satellite data. The predictions are also more accurate than most publicly available population products, which rely on areal interpolation of census data to redistribute population at the local level. The accuracies are similar when estimated using a random forest model, and when density estimates are expressed in terms of population counts. The collective evidence suggests that combining surveys with satellite data is a cost-effective method to track local population changes at more frequent intervals. This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/research. The authors may be contacted at vidhyasrajan@iimb.ac.in.

Introduction
Up-to-date estimates of population density in small areas is a valuable input for policymakers (Stevens et al., 2015;Wardrop et al., 2018). They could, for example, facilitate efficient delivery of public goods and services and infrastructure projects (Guiteras et al., 2018); track net migration patterns, especially in response to civilian conflict, political upheavals, and climate tragedies; and help better understand the impact of geographically-targeted economic policy interventions such as Special Economic Zones. Traditional population data sources do not meet these requirements, as censuses provide local population measurements infrequently, typically decennially. Although household surveys can yield more frequent population estimates, they are not representative at small administrative levels. The challenge of tracking population is particularly exacerbated in low-and lower-middle-income countries where population growth rates are high and net migration patterns are rapid ( Figure 1).
Satellite imagery, in combination with survey data, has the potential to fill this gap by generating more frequent estimates of population that are representative in small areas. Satellites offer several advantages as data collection instruments. They systematically and universally capture large geographic areas, produce imagery at reasonably regular intervals, and are unaffected by topography, political climate, or conflict. The technology and market for both collecting and processing satellite-based imagery are developing rapidly, and with ever more satellites being launched, high-spatial resolution imagery is increasingly available at reasonable cost.
Although existing publicly available population products incorporate satellite data, this does not imply that they are up to date. Existing products typically rely on dasymetric mapping approaches to distribute population from the census to lower administrative units, either using equal weights, or weights based on satellite indicators and advanced statistical techniques. These "top-down" methods may not provide accurate or up-to-date population estimates for two reasons. First, the accuracy of distributing population relies heavily on the input population data, that is, on the census (Wardrop et al., 2018). However, the census itself may quickly become outdated due to frequent migration patterns, and rapid and uneven population growth. 1 Second, these methods often rely on Night Time Lights (NTL) or other low-spatial resolution or coarse built-up area measures from satellite and other sources to distribute census population, and hence are limited at fine spatial scales (Vogel et al., 2018).
This study seeks to overcome these limitations by proposing a "bottom-up" method which combines survey and satellite data to generate local estimates of population density, and by extension, population counts. Instead of relying on the census, which becomes outdated over time, the proposed method exploits the updated demographic information in surveyed areas from existing periodic household surveys, and the widespread coverage and granular information offered by satellite imagery. The method predicts population density in non-surveyed areas, and in between-census years by employing updated survey and satellite data. This technique is applied in the context of Sri Lanka using the Household Income and Expenditure Survey (HIES), a nationally-representative household survey. The satellite indicators include those derived from both low-and high-resolution satellite imagery for the entire country, and additionally, object and contextual features derived from very-high resolution imagery for a randomly selected portion of the country. These indicators are used to predict population density at the Gram Niladhari (GN) division, the lowest administrative level in Sri Lanka. The GN division is similar in size to a village in many developing country settings, and we henceforth, for ease of exposition, refer to GN divisions as "villages".
To motivate this approach, we begin by documenting the inconsistency of existing "topdown" population products at the village level, both with each other and with the census. We then address three questions that shed light on the ability of indicators derived from satellite imagery to predict population density. First, how accurately do satellite data predict census population density at the local level? Second, does including indicators derived from high and very high-resolution satellite imagery substantially improve the predictive power of the model? Third, how does the size of the training data affect prediction accuracy? We then implement the "bottom-up" approach, and finally ask how accurate are out-of-sample predictions that are derived from the HIES and satellite-imagery-based model, compared to the existing "top-down" products?
The "top-down" products considered include, WorldPop for the years 2010 and 2015, the Global Human Settlement Layer (GHSL) for 2014, High Resolution Settlement Layer (HRSL) created by Facebook for 2015, the Center for International Earth Science Information Network's (CIESIN) Gridded Population of the World (GPW) for 2010 and 2015, and LandScan for 2010. We use simple statistical measures of association to confirm that top-down estimates are poorly correlated with each other and with the census at the village level. WorldPop 2015 and Facebook are the exceptions, because they use high-resolution satellite imagery and are calibrated to the latest census data at geographically fine levels. 2 However, since even the most accurate population products use the census to redistribute population, they may quickly become outdated as the census ages, necessitating "bottom-up" methods to track changes more frequently.
We obtain satellite indicators from a variety of sources, which we categorize into four types, based on their resolution, availability, and accessibility. The publicly available indicators obtainable for the entire country are categorized into, (i) low-resolution indicators, based on imagery resolution ranging from about 0.5km per pixel to about 30m per pixel; and (ii) highresolution imagery whose resolution ranges from about 30m per pixel to 12m per pixel. The proprietary indicators are taken from Engstrom et al. (2017), and are based on very-high resolution imagery acquired from DigitalGlobe covering 55 Divisional Secretariat divisions (one level higher than the village, henceforth referred to as sub-districts). These are divided into two categories: (iii) contextual feature indicators; (iv) object-based indicators quantifying objects such as roofs and cars.
Our results that use Poisson regressions based on variables selected by the least absolute shrinkage and selection operator (LASSO) indicate that satellite indicators accurately predict village-level census population density. Models using publicly available satellite indicators in a full national sample predict density with an out-of-sample R 2 of 0.75. Including the very-high resolution indicators and estimating the model in the 55 sub-district sample, the out-of-sample R 2 rises to 0.83. Although the public indicators perform well in predicting density in rural as compared to urban areas, the urban-sector prediction accuracy increases substantially (about 0.13 points) when object and contextual features from Engstrom et al. (2017) are added, reflecting in part the heterogeneity in the relationship between population density and builtup area in urban areas. 3 The accuracy of the predictions does not meaningfully change when the size of the training sample is reduced. Results from random forest models produce similar results to the LASSO-based regressions, and corroborate the results across different types of high-resolution resolution-types and sectors.
Finally, we obtain results from a similar Poisson model that uses estimates of population density obtained from the 2012/13 HIES instead of the census. We then compare the out-ofsample predictions from this model with the actual census population densities in villages not covered by the HIES. This yields an out-of-sample R 2 of 0.79, Spearman Rank Correlation (SRC) of 0.91, a mean Relative Error (RE) of 37%, and a median RE of 28%. Based on R 2 and mean RE, our model is more accurate than the top-down measures: GPW (both years), GHSL, LandScan, and the 2012 WorldPop, and not as accurate as Facebook and WorldPop 2015. 4 The results on these relative accuracies extend to village-level population-count predictions derived by multiplying density predictions by village area. The prediction accuracy remains similar if only indicators that are publicly available are used in the model. Two aspects of these results are particularly noteworthy. First, the survey used to calibrate the model covers only 17% of the villages in Sri Lanka. The fact that the combination of satellite imagery with this type of small survey predicts population density as accurately as the estimates that use an entire census demonstrates the cost-effectiveness of this approach. 5 In fact, simulations indicate that combining survey data with satellite data generates sub-district estimates of population density that are as precise as a survey that samples 80 percent of the villages nationwide. Second, although our estimates are not as accurate as the most accurate top-down estimates, the model gives comparable or better performance than most others. However, because surveys are collected much more frequently than censuses, bottom-up methods that combine satellite imagery with surveys have the crucial advantage of remaining up to date even as the census ages. 6 This study fits into a rapidly growing literature on using satellite imagery to predict population counts and density in local areas. Wardrop et al. (2018) provide an overview of these studies. Top-down approaches to distribute population remain popular in the literature. For example, Stevens et al. (2015) use a dasymetric approach to redistribute population counts from the census using a Random Forest model-based weighting scheme in Cambodia, Vietnam, and Kenya. Using a combination of regression and tree-based methods, Anderson et al. (2014) predict population density for districts in Peru where no direct samples were available, using two coarse satellite-based covariates: Normalized Difference Vegetation Index (NDVI) and the daytime Land Surface Temperature (LST). 7 Our contribution to this literature is threefold. First, relatively few studies use "bottomup" techniques (Wardrop et al., 2018). Early studies using bottom-up style methods either use simple area measures and/or coarse satellite imagery, or mostly focus on developed countries where there are fewer data limitations (Sutton et al., 2001;Biljecki et al., 2016;Li and Weng, 2005). 8 Second, we compare prediction accuracies between imageries of various resolutions to ascertain if, and by how much, higher resolution imageries are better at predicting density compared to lower resolution imageries. Prior studies have not examined the trade-offs across 5 It is important here to clarify that even though we do not directly rely on the census, the census remains essential to provide the sampling frame for subsequent and periodic surveys. Our method, hence, does not reduce the importance of the census.
6 For instance, as we will show in section 4, WorldPop 2010 that utilizes the 2001 Sri Lankan census (a gap of nine years) for redistribution, performs much worse than WorldPop 2015 that uses the 2012 census (a gap of three years). 7 We also speak to a parallel literature that uses other sources of big data such as mobile phones to map population distribution (Deville et al., 2014) or estimate population maps based on age structure (Alegana et al., 2015), and to an emerging literature that uses satellite data and machine learning techniques to predict human welfare and demographic variables, and urban market boundaries (McBride and Nichols, 2016;Athey, 2017;Engstrom et al., 2017;Jean et al., 2016;Vogel et al., 2018). Third, we validate out-of-sample predictions derived from a household survey against the census. Such validations are not commonly performed by studies using "bottom-up" methods (Wardrop et al., 2018), except Biljecki et al. (2016) and Harvey (2002). The estimates are more accurate than previous studies that make use of two-dimensional satellite data. Our out-of-sample R 2 of 0.78 is higher than the 0.72 reported for Australia (Harvey, 2002). Our median RE at 28% is lower than those reported by Biljecki et al. (2016) in the Netherlands, which ranges from 42% to 85.4%. These higher errors could be attributed to their usage of fewer satellite indicators, linear (rather than non-linear) regressions and simple (rather than stratified) random sampling for the survey. 9 Overall, the results demonstrate the feasibility of national statistics offices utilizing geospatial data in combination with frequently conducted surveys to produce accurate local population estimates in the between-census years and in areas where surveys are not conducted. Improved availability of such statistics would provide useful inputs into a wide variety of policy decisions. The rest of the paper is organized as follows: section 2 describes the data, section 3 presents the model, section 4 presents the results, and section 5 finally concludes the paper and discusses policy implications.
2 Data sources and description

Population sources
Our main measures of population density at the village level are the Census of Population and Housing, Sri Lanka, 2012, and the 2012-13 Sri Lankan Household Income and Expenditure Survey (HIES). The HIES is a detailed survey of about 25,000 households in all the districts and sub-districts of Sri Lanka, but only covers about 2,421 of the total 14,103 villages of the country. The HIES follows a two-stage stratification process: the census blocks form the Primary Sampling Units (PSUs), and the households form the Secondary Sampling Units (SSUs) or Final Sampling Units (FSUs).
The PSUs used in the HIES are census blocks, which are portions of villages. 10 Therefore, the HIES can yield direct density estimates only at the PSU-level, and not at the villagelevel. Unfortunately, it is not possible to model population density at the PSU-level due to the lack of a PSU-level Geographical Information System (GIS) boundary file. We therefore indirectly estimate the village population density using the HIES survey weights, under the assumption that the survey weights reflect the inverse probability of housing unit being selected for the sample. The steps used to estimate village level population density from the HIES are described in Appendix A.

Publicly available areal estimates of population density
We validate several publicly available gridded population data sets against each other and with the census, and compare them with our survey-based population density predictions. We examine the following five sources:

Satellite data
We use various sources of geo-spatial data in our models. The low-resolution public indicators are from three sources: (1) Night time lights from the Visible Infrared Imaging Radiometer Suite (VIIRS) at a resolution of 750 m per pixel. We use the maximum and mean intensity of two months, namely March and September, 2014; (2) Global Forest Change data based on Hansen et al. (2013), from which we use mean tree cover in 2000, and gain and loss in forest area between 2000 and 2014, at a resolution of 30 meters per pixel; (3) Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER)'s elevation and slope data at a resolution of 30 meters per pixel.
We also use publicly available built-up area measures based on higher resolution imagery. The Global Urban Footprint (GUF) (year 2012) and Global Urban Footprint plus (GUF+) (year 2015) provide built-up estimates at a resolution of 12 meters per pixel. 12 Global Human Settlement Layer's (GHSL) built-up estimates are at approximately 38 meters per pixel. We also use Facebook's High Resolution Settlement Layer (HRSL) built-up area measure, available at a resolution of 30 meters per pixel.
Additionally, we use contextual and object features from Engstrom et al. (2017) which used proprietary imagery for 1,360 villages within 55 sub-districts obtained from DigitalGlobe. These images cover 3,500 km 2 of area in Sri Lanka, and are mostly obtained for the years 2011 and 2012, although some images were also captured for the year 2010. The list of 55 sub-districts is available in Appendix B. Object features extracted include the number of cars, building count and size, roof type, shadow pixels, road length and type, type of farm land (paddy or plantations). 13 Further, seven contextual features are calculated: Fourier transform, Gabor filter, Histogram of Oriented Gradients (HoG), Line Support Regions (LSR), Pantex, Normalized Difference Vegetation Index (NDVI), and Speed-Up Robustness Features (SURF). 14 More details on these objects and features can be found in Engstrom et al. (2017). Table 1 concisely presents the key variable types, their source, time frame, resolution type, and geographical coverage. Tables 2 and 3 present the summary statistics for the population, geographic, and satellite imagery based variables for the national sample and the 55 subdistricts, respectively.

Predicting population density
We use Poisson regressions to model population density at the village level. This takes the following form: where P v is the population density (persons/km 2 ) of village v. X v consists of the set of satellite imagery-based indicators defined for village v. We include the natural logarithm of village area (Ln area v ), and an indicator for urban villages (ρ S ). 15 η d represents binary indicator 12 While GUF was made with TerraSAR-X/TanDEM-X, which are radar data sets, GUF+ adds Sentinel and Landsat which are optical remotely sensed data which substantially improves the estimation.
13 Supplementary Figure S1 shows examples of developed area building classification, with the raw image (left) and CNN based building classification (right) in Panel A, and a sample car classification in Panel B. In a test sample, the average precision of the building classification was 71 percent, indicating a moderate share of false positives.
14 The contextual and object features are identified using both deep learning-based Convolutional Neural Networks (CNN) and classification of spectral and textural characteristics (Engstrom et al., 2017). 15 While the official sectoral classifications in Sri Lanka include three categories: the urban, rural, and estate sectors, we combine the second and the third to represent two categories: (1) Urban (2) Rural and estate.
variables for Sri Lanka's districts.
To prevent model over-fitting in using the entire set of satellite imagery in X v , we employ the least absolute shrinkage and selection operator (LASSO) regularization which estimates a regression model with an added constraint that enforces parsimony. We follow a two-step procedure. First, a full set of variables is included to conduct LASSO regularization to choose variables. Second, the final Poisson model is estimated using the chosen variables. Equation 2 presents the objective function for LASSO regularization in a model using intercept β 0 and the vector of predictor coefficients τ (of length p) in step 1.
where l is the value of the log likelihood function of the Poisson model using parameters β 0 and τ , and λ is a non-negative regularization parameter. While setting λ = 0 yields unconstrained Poisson regression estimates, a large λ penalizes the absolute values of the coefficients, β. 16 Fivefold cross validation is applied to choose the value of λ that minimizes the root-mean squared error (RMSE) across the folds. The in-sample R 2 from this step indicates the goodness of the model fit. Since LASSO also generally shrinks the magnitude of all coefficients towards zero (Varian, 2014), we avoid biased predictors by running a simple Poisson model based on the Lasso selected variables. From this, we obtain the out-of-sample R 2 , mean absolute error (MAE), and RMSE, all of which indicate out-of-sample predictive accuracy. First, we implement the above model using the village level census population density as the dependent variable. We conduct this exercise for the entire country separately using different imagery resolution types: (1) low-resolution publicly available indicators; (2) higher resolution publicly available indicators plus the indicators mentioned in (1). Doing this enables us to compare prediction accuracy between higher versus lower resolution data in the national sample. In the 55 sub-districts where the features from Engstrom et al. (2017) are available, we implement a model using: (3) spectral and texture-features plus indicators used in (2); and (4) object-features plus the indicators used in (3). For the 55 sub-districts, we compare prediction accuracy successively from including the set of imagery-based indicators in (1) 16 For estimations, we use the glmnet package in R that estimates the model parameters by minimizing the penalized log likelihood function. Glmnet provides an option to penalize the minimand based on the sum of the absolute values of the parameters (LASSO), or based on the sum of squares of the parameters (ridge), or a combination of both (Hastie and Qian, 2018). Glmnet's minimization function is presented in Equation 3. We set α = 1 to use the LASSO method.
through (4). We also repeat the analysis separately for urban areas and for rural and estate areas (grouped together) to check if prediction accuracy varies across sectors. Finally, focusing on the 55 sub-districts, we reduce the training sample size by half and quarter, to examine if the prediction accuracy is sensitive to the size of the training sample. As a robustness check, we also implement a flexible random forest algorithm using population density as the response variable, using the four sets of satellite indicators described above. For each model, we report out-of-bag R 2 , RMSE, and MAE. 17 Next, we focus on an overlapping sample between the sampled villages in the HIES and those in the 55 sub-districts for which proprietary features are available. These amount to 1,360 villages. In 414 villages of this sample, we model the HIES population density as a function of the all public and proprietary satellite indicators, and apply the procedure described above. Since this is a survey-based model, we use the inverse of village population for village v as weights, which is given by 1 V illage P opulationv . In addition, there could be factors correlated with satellite-derived indicators that affect the probability of selection of the village into the HIES. We therefore adjust the weights based on the predicted probability of the village being sampled in HIES, based on satellite indicators available for all villages (Horvitz and Thompson, 1952;Wooldridge, 2002). This probability is obtained from the following probit model: where IN HIES v is a binary indicator for whether village v is sampled into HIES, and lasso v represents all LASSO-selected variables from the Poisson model. We estimate this model using data for all 1,360 villages in the 55 sub-districts. The predicted probability of the village selection into the HIES, IN HIES v is used for correcting the weights in the following way : Using these above two sets of weights, we estimate the model in HIES-villages. Then, we predict out-of-sample densities for the remaining 946 non-HIES villages and report outof-sample accuracy measures by comparing them with the actual census density. To provide additional context, we compare the accuracy of this model to predictions from a similar censusbased model and with density estimates from the publicly available "top-down" population products. Further, we calculate population count predictions by multiplying these density predictions with village-level area, and report their accuracy with respect to the census population.

Results
The sub-sections below present (a) the results comparing publicly available "top-down" population estimates with each other and with the census; (b) the predictive accuracy of satellite imagery-based indicators for census density, using LASSO-based Poisson regression models; and (c) out-of-sample density predictions based on a model using the survey (HIES), and comparing their accuracies with other estimates.

Validation of population products
The validation exercise indicates that publicly available population count estimates are accurate at the sub-district level but not necessarily at the village level. At the sub-district level, except LandScan and WorldPop 2010, the R 2 for all other products with the census are above 0.9 (panel A in Table 4). At the village level, panel B in Table 4 shows that WorldPop 2015 has the highest association with the census (R 2 of 0.988), followed by Facebooks HRSL (0.841), GHSL (0.672), GPW 2015 (0.595), GPW 2010 (.404), WorldPop 2010 (0.107) and LandScan (0.0006) in that order. These results qualitatively hold if we use correlation coefficients as a measure of association (see supplementary tables S2 and S3). Clearly, the products using the most recent census and using higher resolution imagery-based covariates are performing better than the others. For example, GPW and WorldPop 2010 estimates use the 2001 census for redistribution, and hence it is not surprising that they perform poorly in comparison to the 2012 census. 18 However, even the best performing products may themselves become inaccurate with the passage of time as their source data in the census becomes outdated. 19 18 Similarly, an earlier unpublished version of Facebook's population data which we worked with, only yielded a correlation coefficient of 0.5 with the 2012 Sri Lankan census at the village level because Facebook initially used the 2001 Sri Lankan census for calibration. Their estimates were later updated by using the most recent census, which now gives an R 2 of 0.841 and a correlation coefficient of 0.917. 19 We also assess the consistency of four publicly available built-up area measures, namely, GUF, GUF+, GHSL, and Facebook, against each other, and the validate the built-up area estimates for the 55 sub-districts by Engstrom et al. (2017) against Facebook's. All estimates of built-up area at the village level are reasonably consistent with each other. At the national level, the R 2 between estimates from Facebook and each of the first three sources are 0.76 or greater (supplementary Table S4). In the 55 sub-districts, the R 2 between Facebook estimates and those from Engstrom et al. (2017) is also high at 0.794 (supplementary Table S5). These patterns also hold if correlation coefficients are employed for the analysis (see supplementary tables S6 and S7).

Accuracy of population density predictions based on census population counts
Poisson regression results using the census data for all villages, based on the variables selected from LASSO regularization indicate that satellite indicators have exceptionally strong predictive power in predicting village level population density (Table 5). At the national level, publicly available indicators explain a large amount of variation in village population density. In this case, the value added in using high-resolution indicators (out-of-sample R 2 is 0.75) as opposed to low resolution indicators (out-of-sample R 2 is 0.702) is minimal. In the 55 sub-districts, adding proprietary contextual features to public imagery-based models does not improve the prediction accuracy (out-of-sample R 2 remains at around 0.7). However, adding object classifiers to the model improves the out-of-sample R 2 by 10 points (0.83). 20 The results differ across sectors. Using the full national sample, the publicly available satellite indicators explain more variation in population density in the rural and estate sectors (0.808) than the urban sector (0.569) (panel A in Table 6). In the sample with 55 sub-districts, the predictive power of urban population density increases tremendously when adding the proprietary contextual and object features, relative to only including publicly available indicators (0.804 compared to 0.525). Adding proprietary indicators leads only to a limited improvement in rural areas (0.869 compared to 0.838) (panel B in Table 6). This result is consistent with greater uniformity in the relationship between built-up area and population density in rural areas, where simple measures of built-up area are sufficient to predict population density. In urban areas, due to the complex nature of the relationship between population density and buildings where the latter can be both commercial and residential, advanced measures such as object classifiers and contextual measures are required for better prediction.
Random forest (RF) models provide similar results overall and across sector-types (Table 7). The out-of-bag R 2 , RMSE, and MAE obtained from the RF models are comparable to the corresponding statistics obtained from the post-LASSO Poisson models. The accuracy pattern across resolution types seen in the LASSO results is also maintained in the RF results. Strikingly, the tremendous improvement in predictive accuracy from using very high resolution indicators in urban areas is corroborated by the RF results. These results collectively indicate that LASSO selection performs as well as random forest models in this setting. 21 20 The out-of-sample R 2 obtained in Table 5 are similar to those obtained in models that do not include log village area (supplementary Table S8), especially in models using higher-resolution indicators. The marginal effects of LASSO-selected variables from the model originally using all public and proprietary indicators are provided in supplementary Table S9. This refers to the model which uses the 55-sub-district sample and all imagery in Table 5. 21 While random forest models are known to outperform LASSO models in some contexts (Mullainathan and Spiess, 2017), these results are based on linear LASSO models. The use of Poisson regression in our context may negate the predictive advantage of random forest models, which are traditionally better able to account The size of the training data does not significantly affect the accuracy of the predictions (Table 8). Using a sample of 1,178 villages in the 55 sub-districts, the out-of-sample R 2 using all the available satellite data is 0.842, as seen earlier in Table 5. The accuracy increases marginally to 0.843 if we reduce the training sample by half, with a mild reduction to 0.832 if we further reduce the sample size by another one-half. There is only a small increase in the MAE and RMSE with the reduced sample sizes. Overall, these results indicate that changing the training sample size does not significantly change the prediction accuracy.

Survey-Based Predictions and Validation with the Census
While it is encouraging that satellite data accurately predict census population density, they do not conclusively show that survey data can be used to accurately approximate population density. We now turn to examining this by testing the accuracy of a survey-based model in the 55 sub-districts sample where we estimate a Poisson model of HIES population density in HIES-villages, and obtain out-of-sample density predictions in the non-HIES villages.
The results from a model using both public and proprietary satellite imagery-based indicators are provided in panel A, columns (1)-(3), in Table 9. The R 2 , Spearman Rank Correlation (SRC), and Mean Absolute Error (MAE) between the density predictions and the census density are 0.79, 0.91, and 664, respectively. This model uses the inverse of village population as weights. If the weights are corrected based on probability of the village being selected into the HIES, we obtain similar results: 0.78, 0.92, and 665 for the same measures. Considering the R 2 and SRC, our model performs equally well compared to the census based model, and better than all "top-down" estimates except Facebook and Worldpop 2015 (panel C). 22 In panel B, we present the same results based on a model using only publicly available satellite imagery. Interestingly, similar out-of-sample prediction accuracies are obtained using only public imagery (0.75, 0.92, and 663, and 0.77, 0.92, and 657 for the three measures using the two types of weights).
We multiplied these density predictions with village-level area to obtain population count predictions, which are reported in columns (4)-(6). The relative performance of the population count predictions, in comparison to the top-down estimates, is similar to those of the population density predictions. The R 2 and SRC are generally lower for population counts compared to density predictions, but the MAEs are also lower.
for non-linear relationships. Further, studies predicting population counts and density have specifically shown mixed results on comparative performances between random forest and LASSO-based predictions (Anderson et al., 2014). The mean and the median relative errors (which are same for population density and count predictions) are reported in columns (7) and (8), respectively. 23 The mean and median relative errors of the model are 37% and 28%, and 35% and 29%, without and with the additional weight correction. Using only public imagery, the mean and median REs are 34% and 28%, and 34% and 28% without and with weights correction, respectively. These mean REs are lower than most top-down estimates, except Facebook and WorldPop 2015. The median REs are higher than only Facebook and Worldpop 2015, and comparable to GHSL and GPW 2015 and much lower compared to GPW, Worldpop, and Landscan (all in 2010). 24 The coefficient of variation (CV) of predicted density estimates for each sub-district based on the HIES model in combination with satellite imagery is low, at about 11%. This is four times lower than the CV of the survey estimates themselves, which stands at 45.2%. The increase in precision due to the incorporation of satellite imagery is tantamount to increasing the share of sampled villages in the population from 17%, which is currently the case, to a remarkable 80%. 25 This is striking but not surprising because the model predicts with an out-of-sample R 2 of about 0.80 and the imagery covers all villages.

Discussion and Conclusion
Existing methods of estimating local population use a "top-down" dasymetric mapping approach to distribute census data based on a set of covariates derived from satellite imagery. These are only as accurate as the source data-the census-and hence their accuracy declines as the census itself ages. We propose a "bottom-up" technique by pairing survey data with satellite imagery using Poisson regressions employing variables selected based on LASSO regularization. This model can yield population density predictions in areas where the survey was not conducted, and can be repeated frequently using periodic survey and satellite data to frequently track population changes. We apply the method in the context of Sri Lanka to predict population density at the lowest administrative level, the Gram Niladhari (GN) division (village-level). Our model uses the Sri Lankan Household Income and Expenditure Survey (HIES), and predicts density out-of-sample in the non-HIES villages.
Two aspects of our results stand out. First, our performance is better than the only 23 The relative error with respect to the census is the same for population density and population count because the only difference between the two is the multiplicative factor, village-area. 24 We conducted a similar exercise using population count as the dependent variable in the model, from which we directly obtained population predictions. The accuracies of these predictions are reported in Table S10. Examining the R 2 , SRC, and Mean RE, it is evident that the population predictions derived indirectly using a density-based model by multiplying density predictions with village-area (as in Table 9) performs much better than obtaining population count predictions from a population-based model. 25 The CV of 45.2% was obtained by manually calculating the mean CVs from a synthetic 80% sample of the census. two other studies that report out-of-sample validation of "bottom-up" population estimations using similar satellite data (Harvey, 2002;Biljecki et al., 2016). Second, our predictions outperform many "top-down" estimates with the exception of Facebook and WorldPop 2015. Although ours is not the most accurate, it crucially does not directly rely on the census data.
The main reason why this method is useful is because surveys are more frequent and less expensive than censuses, and satellite data can be acquired routinely and with complete spatial coverage, and hence can potentially be more efficiently utilized to track population changes in small areas. Frequent up-to-date small area population estimations are important inputs for governments in applications such as tracking disaster management, delivering policy programs, understanding migration patterns, and many more. The Sri Lankan Department of Census and Statistics conducts the HIES every three years; the most recent was fielded in 2016, and the one before that was in 2012-13, and the next one is planned for 2019. Furthermore, Sri Lanka runs a continual Labor Force Survey that could also be used for this purpose. Therefore, the results support the case for the Department of Census and Statistics to keep careful track of the number of buildings identified in the relisting phase, and use that information to generate revised local population estimates.      Note: GPW=Gridded Population of the World; GHSL=Global Human Settlement Layer. Discrepancy is the absolute value of the difference between population estimates and the census at the village level. Percent relative to the census is (absolute-discrepancy/census-population) X 100. Note: The results are based on a Poisson regression model whose dependent variable is the census village population density, and the independent variables are selected based on LASSO regularization among the following: Relevant satellite imagery sources as described in panel B in Table 1, district fixed effects, a binary indicator for urban villages, and log-village area. The models with "No satellite imagery" use only the district fixed effects, an indicator for urban villages, and log village-area as independent variables. "Lowresolution public" indicators refer to the low-resolution and public indicators as defined in Table 1. "All public" indicators refers to all the public imagery based indicators, including the high-resolution imagery based indicators described in Table 1. "Proprietary texture" indicators include very high-resolution texture features based on Digital Globe from Engstrom et al. (2017) as defined in Table 1. "All" imagery refers to all public and proprietary texture indicators, and object identifiers, such as cars, roofs etc., as defined as in Table 1. The out-of sample R 2 is obtained from stata's crossfold command using five-fold cross-validation. Note: The results are based on a Poisson regression model whose dependent variable is census village population density, and the independent variables are selected based on LASSO regularization among the following: Relevant satellite imagery sources as described in panel B in Table 1, district fixed effects, a binary indicator for urban villages, and log village area. The models with "No satellite imagery" use only the district fixed effects, an indicator for urban villages, and log village-area as independent variables. "Low-resolution public" indicators refer to the low-resolution and public indicators as defined in Table 1. "All public" indicators refers to all the public imagery based indicators, including the high-resolution imagery based indicators described in Table 1. "Proprietary texture" indicators include very high-resolution texture features based on Digital Globe from Engstrom et al. (2017) as defined in Table 1. "All" imagery refers to all public and proprietary texture indicators, and object identifiers, such as cars, roofs etc., as defined as in Table 1. The out-of sample R 2 is obtained from stata's crossfold command using five-fold cross-validation.    probability that the PSU is selected is equal to the share of census housing units in that PSU (Department of Census and Statistics, 2015). To estimate population at the village level, we utilize the following application of Bayes rule: This decomposition is a mathematical identity rather than a description of the actual sample design. It is useful because the first term in Equation 3 can be rewritten as follows: The first term on the right-hand side of Equation 6 is the sample weight for household h in a particular village. The second is the number of households in that village that were sampled in the survey. The final term is the probability that the village is in the sample, which needs to be estimated using publicly available indicators on the population of each village in the 2011 census. 27 We merge census population with the HIES data at the village level and estimate the probability of village selection as the following: 26 93.5% of the PSUs in the sample are located in single PSU-villages. 27 The village population estimates are available at LankaStatMap (http://www.map.statistics.gov.lk/) The probability that a village was selected for the sample is equal to one minus the probability that the village was not selected. The probability that a village was not selected is equal to the product, across all other villages in the sample, of that village not being selected each time a PSU is drawn for the sample. Since the sample is without replacement, in each draw the probability that a village is not selected is equal to one minus the share of remaining This procedure generates estimates that correspond reasonably closely to the census. However, accuracy could be improved by obtaining direct counts of the number of housing units in each sample PSU, rather than indirectly obtaining estimates from the sample weights. This would eliminate sources of approximation error, such as dropping three percent of the villages that contain multiple PSUs. Therefore, the estimated accuracy of predictions using the HIESbased density measure can be interpreted as a lower bound. Because these approximation errors are largely uncorrelated with the satellite-based independent variables in the model, using the HIES-based density approximation as the dependent variable only slightly decreases the accuracy of the predictions reported in Table 9. This suggests that household survey weights, under the proper conditions, can be combined with readily available census-based population counts to obtain reasonable estimates of population density at fine geographic levels in surveyed areas.