Policy Research Working Paper 11058

Dynamic, High-Resolution Wealth Measurement in Data-Scarce Environments

Zhuo Zheng, Timothy Wu, Richard Lee, David Newhouse, Talip Kilic, Marshall Burke, Stefano Ermon, David B. Lobell

Development Economics
Development Data Group
February 2025

Abstract

Accurate and comprehensive measurement of household livelihoods is critical for monitoring progress toward poverty alleviation and targeting social assistance programs for those who most need it. However, the high cost of traditional data collection has historically made comprehensive measurement a difficult task. This paper evaluates alternative satellite-based deep learning approaches using detailed household census extracts from four African countries to accelerate progress toward comprehensive, fine-scale, and dynamic measurement of asset wealth at scale. The results indicate that transformer architectures solve multiple open measurement problems, by providing the most accurate measurement of local-level variation in household asset wealth across countries and cities, as well as changes in household asset wealth over time. Experiments that artificially restrict data availability show the model’s ability to achieve high performance with limited data. The proposed approach demonstrates the promise of combining satellite imagery, publicly available geo-features, and new deep learning architectures for hyperlocal and dynamic measurement of wealth in data-scarce environments.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dnewhouse@worldbank.org and tkilic@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Dynamic, High-Resolution Wealth Measurement in Data-Scarce Environments

Zhuo Zheng (a), Timothy Wu (a), Richard Lee (b, c), David Newhouse (d), Talip Kilic (d), Marshall Burke (c, e), Stefano Ermon (a), and David B. Lobell (b, c)

(a) Department of Computer Science, Stanford University, Stanford, 94305, CA, USA
(b) Department of Earth System Science, Stanford University, Stanford, 94305, CA, USA
(c) Center on Food Security and the Environment, Stanford University, Stanford, 94305, CA, USA
(d) Development Economics Data Group, World Bank Group, Washington DC, 20433, DC, USA
(e) Department of Environmental Social Science, Stanford University, Stanford, 94305, CA, USA

ARTICLE INFO

Keywords: Economic well-being; High-resolution; Poverty mapping; Satellite image; Deep learning

JEL codes: C45, I32

Accurate, up-to-date, and highly resolved measurements of economic well-being are essential for monitoring and achieving international goals of poverty alleviation. These goals include the United Nations’ Sustainable Development Goal 1 of “No Poverty,” which is nearing its original 2030 deadline, as well as countless other international and regional poverty targets. Granular estimates of household poverty and wealth are critical for understanding whether these goals are being met, as well as for targeting and evaluating anti-poverty interventions in regions where progress is lagging [6].

Official poverty measurement in low- and middle-income countries has long relied on household surveys, an indispensable but time-consuming tool for livelihood measurement. Given the technical capacity needed for reliable survey measurement and the substantial logistical difficulties in carrying out nationally or sub-nationally representative livelihood surveys, such surveys are often infrequently completed in much of the world, rendering comprehensive and timely measures of poverty and related outcomes unavailable for many periods in many regions [6, 9]. Meanwhile, survey data are typically based on samples meant to be representative at larger spatial scales and are thus usually inadequate for generating reliable estimates at the village or neighborhood level – the level at which anti-poverty interventions often need to be targeted. Consequently, there is a pressing need for more cost-effective and scalable alternatives to local-level livelihood measurement that can complement and scale existing household survey–based efforts.

In recent years, the abundance of publicly available remote sensing data and recent advances in machine learning have transformed the livelihood measurement landscape, progressively shifting from national censuses and related household surveys to efforts to combine this information with information from satellites and other sensors. Early studies used coarse, publicly available satellite imagery and/or mobile phone data, combined with early machine learning and deep learning architectures, to show how these new sources of information could be used to support broad-scale measurement of wealth and poverty [4, 17, 29]. Subsequent studies introduced further refinements that used publicly available or proprietary geospatial data to improve satellite-based wealth measurement [8, 12, 2]. These advances confirmed that leveraging satellite images and machine learning can be an accurate, inexpensive, and scalable solution to estimate wealth [6, 21, 23].

Here we assemble a large-scale, multi-resolution, and multi-temporal wealth dataset using national censuses or extracts obtained from National Statistics Offices and multi-spectral satellite imagery from multiple public and private sensors of varying resolutions. Our dataset comprises over 12 million households in four African countries (Malawi, Mozambique, Burkina Faso, and Madagascar) and, uniquely, contains precisely georeferenced measurements within two Malawian cities as well as repeated measurements of the same locations over time – two features lacking in prior studies.

We use these data to make four contributions relative to earlier work. First, we directly test a new type of deep learning model – specifically, vision transformers – against earlier deep learning architectures based on convolutional neural networks (CNNs) that are common in the literature, as well as against simpler models that use geospatial features and a tabular machine learning approach (XGBoost) for prediction. Specifically, we design a conditioning module that enables our transformer model to handle multi-modal inputs, integrating both satellite imagery and geospatial features simultaneously (see “Methods”). We test models that use satellite imagery from Landsat (30m/pixel), PlanetScope (3m/pixel), and/or SkySat (0.5m/pixel) sensors. We then compare these more sophisticated methods and inputs with simpler methods that rely solely on predefined geospatial features from a range of sources. These comparisons are important because simpler approaches that rely on publicly available data could be both easier and cheaper to implement at scale, particularly for public organizations interested in their widespread application, and so understanding performance tradeoffs across model architectures and inputs is critical to understanding how to scale promising new measurement approaches.

Second, a key advantage in our setting is the use of accurate, high-resolution data from national censuses or extracts for model training and evaluation. In contrast to earlier work that relied primarily on publicly available household survey data characterized by spatially imprecise location data and limited household samples, our data cover a much larger set of households in a given location and in some cases are precisely georeferenced. Comparisons against such “gold standard” data allow us to understand whether model prediction errors are a result of inaccurate predictions or noise in the measure of ground truth – an understanding that was often elusive in earlier work in developing countries [6, 29]. In addition, it allows us to consider a wide range of sample sizes to assess the minimum training data requirements for advanced machine learning methods to produce accurate estimates. To quantify the importance of training sample size for performance, we extensively test the extent to which additional training data affects model performance across multiple settings.

Third, our high-resolution census data enable a novel understanding of how satellite imagery and other geospatial data can be used to predict variation in livelihoods within urban areas in Africa – a capability that was again hard to evaluate in previous settings given limited samples and spatial noise in training data. This could be particularly consequential in urban environments that exhibit substantial spatial variability in livelihoods even within small spatial domains. Using comprehensive and precisely georeferenced census data from two cities in Malawi, we are able to train and test models using different resolutions of satellite data, and we find that the models are surprisingly accurate in predicting street- and neighborhood-level variation in wealth within these cities.

Fourth, the censuses and extracts allow us to evaluate whether imagery-based models can make accurate predictions of changes in wealth over time. Previous efforts were again substantially constrained by ground data that did not repeatedly sample the same locations in different surveys [29]. As a result, it remains unclear whether an imagery-based model trained largely to predict spatial variation in wealth or consumption would be able to predict temporal variation, as the latter is typically both smaller and potentially driven by changes that are harder to detect in imagery. Repeated census data from the same locations 10 years apart in Malawi and Mozambique allow us to evaluate whether models can indeed extract information from imagery capable of predicting temporal variation in asset wealth.

Results

Performance on prediction of country-level wealth

For country-level wealth prediction, we train each model on each country individually and conduct country-wise five-fold cross-validation for each model. The CNN model uses only Landsat satellite imagery as input. XGBoost utilizes geospatial features (geo-features) either alone or combined with satellite image statistical features. All models were trained to predict the asset wealth index (AWI) [29]. Estimates of AWI were generated and linked to imagery at fine administrative levels in Madagascar, Malawi, and Mozambique. In Burkina Faso, AWI estimates are only available for 334 communes (Table 1). This reduced the effective sample size of the training data in Burkina Faso, as seen in the clumped pattern of survey-measured asset wealth shown in the right panel of Figure 1d. This difference is reflected in the results. As shown in Figure 1a, a naive transformer model that does not condition on geo-features consistently outperforms other models across the Malawi, Mozambique, and Madagascar datasets. In those countries, predictions using the transformer model achieve R² values of 0.83, 0.70, and 0.62, respectively, when trained on the full census extract. In Burkina Faso, because of the smaller effective sample size, XGBoost using satellite imagery and geospatial features achieves the best average performance among the models (62.9% of variation explained). A naive transformer, when using only satellite imagery, remains competitive (57.4% of variation explained). We also limit the number of training samples to 1%, 5%, 10%, 25%, and 50% of the original training dataset to analyze how model performance varies with training sample size. Based on the results from these four countries, we empirically identify 10% as a critical inflection point for model performance, below which the accuracy of the estimates deteriorates rapidly.

Another key factor in reducing training sample collection costs for wealth prediction is the number of households aggregated per sample. To analyze this factor, we randomly sample 10 households per administrative area to construct the training sample, yielding a “10-household” training dataset for each country. We then train a naive transformer model on this “10-household” training set while still evaluating its performance on the original full household test set. The results (Figure 1b) indicate that our transformer models trained with the 10-household data exhibit comparable performance to those trained with data on all households. Using Malawi as an example, the 10-household data only include approximately 23% of all surveyed households. The performance gap between models trained on all households and the 10-household sample is only 3 percentage points (82% versus 79%). In contrast, when reducing the number of training samples, the performance gap between models trained on full training samples and those trained on 25% of the samples is as large as 12 percentage points (82% versus 70%).

These results offer significant insights into data-efficient wealth measurement. When at least 10 households are available per image, surveying more enumeration areas takes precedence over surveying additional households within the enumeration area for predicting wealth with our transformer model.

Geospatial features, recognized as valuable auxiliary data for improving economic measurement [21], are widely used in wealth prediction. Here we design a conditioning module (see “Methods”) for our transformer model, enabling the efficient fusion of geospatial features and deep visual features. The results, as shown in Figure 1c, suggest that geospatial features significantly improve the model performance across all countries, especially in Burkina Faso due to the lower effective sample size resulting from linking images to survey data at the commune level. Geospatial features appear particularly beneficial when the training sample size is smaller, indicating that the model can struggle to learn optimal visual representations from raw imagery at smaller sample sizes, at which point geospatial features serve as a valuable supplement for wealth prediction. Of the methods shown in Figure 1, the transformer model with geo-features shown in Figure 1c yields estimates with the highest R² in all cases except one, when training using the full census in Mozambique, for which the naive transformer model is slightly more accurate.

Figure 1: Performance of country-level asset wealth index predictions. a. Performance comparison for four countries across four different machine learning methods trained on various fractions of the census extract. Negative R² values are not shown. Transformer results do not integrate geo-features and are trained using asset wealth constructed from all sample households. b. Performance comparison between transformer models trained with asset wealth constructed from all households versus 10 households per administrative area. c. Performance comparison between transformer models with and without integrating geospatial features. d. Scatterplot of survey-measured asset wealth against predicted wealth from the best-performing fold.

The four gridded wealth maps in Figure 2, with a 4.8 km/pixel resolution, are generated solely using our transformer model and Landsat imagery. Without the need for geospatial feature preparation, the entire mapping process can be completed within an hour using 8 NVIDIA RTX A4000 GPUs. This means that our approach has great potential to accelerate granular wealth measurement at a national scale.

Figure 2: Maps of country-level predicted asset wealth index. a. Country-level wealth asset map for Malawi in 2018. b. Country-level wealth asset map for Mozambique in 2017. c. Country-level wealth asset map for Madagascar in 2018. d. Country-level wealth asset map for Burkina Faso in 2019. AWI values are generated from country-specific models and are therefore not comparable across countries.

Performance on prediction of change in country-level wealth

We further evaluate country-level wealth change prediction for each country via five-fold cross-validation. Following [29], the CNN model uses concatenated bitemporal Landsat images along the channel dimension as input. Our transformer model processes each of the bitemporal images individually through a weight-shared, single-image encoder and concatenates encoded features along the channel dimension before feeding into the final regression network. Unlike the previous setting, XGBoost only takes bitemporal satellite images as input since no geospatial features are available for Malawi in 2008 and Mozambique in 2007. The results (Figure 3a) show that deep learning models trained on the full sample can capture a remarkable 52% of the variation in Malawi and 42% in Mozambique. The deep models outperform XGBoost when given the same input data, which implies that representation also matters in the spatiotemporal measurement of wealth. Our transformer model slightly outperforms the commonly used CNN in estimating decadal wealth changes in Mozambique and achieves comparable performance in Malawi. This difference may be attributed to variations in training sample sizes and model complexities. Mozambique has approximately 10× more training samples than Malawi, which could explain why the more flexible transformer model outperforms CNN estimates in Mozambique. As with the cross-sectional results above, we simulate two scenarios of data scarcity for predicting change: (i) restricting the number of sampled enumeration areas; and (ii) reducing the number of households aggregated per sample to 10. The results (Figure 3c) suggest that reducing the number of sampled locations degrades accuracy more than reducing the number of households aggregated per sample, consistent with experimental results of country-level wealth prediction.

Figure 3: Performance of country-level predictions of decadal change in asset wealth index. a. Performance comparison for two countries across three machine learning methods trained on various fractions of the census extract. Negative R² values are not shown. b. Country-level wealth asset change map for Malawi from 2008 to 2018. c. Performance comparison between transformer models trained with asset wealth constructed from all households per administrative area and 10 households per administrative area. d. Country-level wealth asset change map for Mozambique from 2007 to 2017. e. Scatterplot of survey-measured asset wealth against predicted wealth from the best-performing fold.

We further demonstrate the scalability of our transformer model on decadal wealth change mapping of Malawi (Figure 3b) and Mozambique (Figure 3d). Unlike single-temporal wealth maps, bitemporal wealth change maps provide deeper insights into the dynamics of economic development, allowing for the identification of regions experiencing significant growth or decline over time. We find that the southern part of Malawi exhibits more negative changes, indicating a decline in wealth over the decade, whereas the northern and some central areas are relatively neutral or slightly positive. In Mozambique, most regions show an overall increase in wealth, with southern regions showing more wealth gains compared to the northern regions. There are a few isolated areas, notably a blue region near the northern part, which experienced a decrease in wealth. In both countries, the wealth distribution changes are not uniform, suggesting that certain areas are benefiting more from economic growth while others are falling behind. This wealth disparity could provide insights into economic policies, development programs, or external factors such as climate impacts that have influenced these changes.

Performance on prediction of city-level wealth

The Landsat-based wealth prediction models above produced wealth maps with a spatial resolution of 4.8 km. For some applications, such as targeting aid within urban areas, finer-resolution wealth maps are of interest. To that end, we utilized household-level census data from two cities to test wealth prediction using high-resolution satellite imagery (PlanetScope and SkySat). Following the same settings as in the two subsections above, we evaluate the CNN, transformer, and XGBoost models, as shown in Figure 4a. The CNN and transformer models consistently perform better than XGBoost (with or without geospatial features) across both cities. This highlights the importance of deep visual representation from high-resolution satellite imagery for city-level wealth prediction, which constitutes the main gap between deep learning–based models and XGBoost. Our transformer model outperforms the CNN by a noticeable margin in Blantyre, especially when utilizing the full dataset, demonstrating greater scalability with increased data while also achieving comparable performance to the CNN in Lilongwe. Across both cities, all models exhibit significant improvements as the training data fraction increases; however, performance gains begin to plateau after approximately 25% to 50% of the data, resulting in diminishing gains beyond that threshold.

Figure 4: Performance of city-level asset wealth index prediction. a. Performance comparison for two cities across four different machine learning methods trained on various fractions of the census. b. Performance comparison between transformer models estimated using SkySat imagery versus PlanetScope imagery. c. Performance comparison between transformer models with and without integrating geospatial features. d. Scatterplot of survey-measured asset wealth against predicted wealth from the best-performing fold. e. Satellite-based national wealth asset map for Malawi. f. Country-level wealth asset map for Lilongwe. g. City-level wealth asset map for Lilongwe. h. Country-level wealth asset map for Blantyre. i. City-level wealth asset map for Blantyre.

We compare two kinds of commonly used proprietary high-resolution, multi-spectral satellite imagery, i.e., PlanetScope (3m) and SkySat (0.5m), as shown in Figure 4b. The results indicate that SkySat consistently outperforms PlanetScope across various training data fractions in both cities. Both sensors capture 4-band (red, green, blue, and near-infrared) satellite imagery, but they differ in spatial resolution and swath width. This highlights the importance of urban spatial detail in accurately measuring wealth using our transformer model. While PlanetScope demonstrates lower average performance than SkySat, its broader swath width and high revisit frequency yield more comprehensive satellite imagery, facilitating large-scale wealth mapping with commendable accuracy (Figures 4g and 4i). Consequently, SkySat is well-suited for wealth measurement in local areas with high accuracy requirements, and PlanetScope is more suitable for large-scale wealth mapping to obtain macro insights.

While city-level wealth prediction is promising, training a city-level transformer still requires sufficient samples that are expensive to collect. We also evaluate whether integrating geospatial features can reduce the required training samples for city-level wealth prediction. As presented in Figure 4c, we find that unlike for country-level predictions, geospatial features generally reduce model performance. This is because geospatial features are always derived from low-resolution satellite imagery, e.g., Landsat (30m), to achieve global coverage. These low-resolution geospatial features introduce spatial errors into high-resolution satellite image–based wealth prediction. For example, a coarse land cover map can ignore small but important spatial details in urban areas. In this case, our transformer model can learn more accurate wealth representations solely from satellite images, even when trained on varying sample sizes.

Overall, we demonstrate accurate large-scale, city-level wealth mapping in two cities in Malawi, i.e., Lilongwe and Blantyre. Compared to country-level wealth maps at a 4.8km resolution (Figures 4f and 4h), our 0.3km resolution city-level wealth maps (Figures 4g and 4i) provide an unprecedentedly granular spatial distribution of wealth across these two cities, with strong performance explaining up to 76% of the variation for Lilongwe and up to 67% for Blantyre (Figure 4d).

Discussion

This paper proposes and evaluates the use of a vision transformer architecture to solve multiple open problems pertaining to combining survey and satellite data to produce wealth estimates at fine spatial scales. For cross-sectional wealth predictions, careful evaluations using georeferenced census extracts from four countries show that estimates from the transformer model perform well. When paired with Landsat imagery, R² values for transformer models that incorporate geo-features outperform commonly used CNN and XGBoost models for asset index prediction in all four countries, for all sample sizes considered. Across all countries, accuracy degrades rapidly when using less than 10% of the census extract for training. In Mozambique and Madagascar, estimates produced using transformer models explained approximately 20 to 30 percentage points more of the variation in wealth than estimates produced using XGBoost and geo-features. Transformer models also outperform CNNs in all four countries, by amounts up to approximately 5 percentage points in Madagascar and Mozambique. Incorporating geo-features into the transformer architecture improves performance by 5 to 10 percentage points at small sample sizes in Mozambique, Madagascar, and Burkina Faso.

Transformer models also perform well when predicting variation within cities at 0.3km scales, achieving R² up to 0.76 in Lilongwe and 0.67 in Blantyre. However, incorporating geo-features at this scale reduced performance, because they are constructed from lower-resolution Landsat imagery. Finally, the transformer model also generates more accurate estimates than CNNs and XGBoost when predicting decadal changes in the asset wealth index in Mozambique and Malawi. Model predictions achieve R² values of 0.57 in Malawi and 0.42 in Mozambique at fine spatial levels, despite the lack of available geo-features. This is a large improvement over the 0.15 to 0.17 R² reported by [29] in comparable settings, and demonstrates the feasibility of combining transformer models with imagery to estimate wealth changes at granular levels, given sufficient training data.

The results point to the benefits of applying transformer models that incorporate geospatial features to generate high-resolution predictions of asset wealth. This in turn underscores the importance of developing tools, documentation, and training materials to make estimation feasible for national statistics offices, international organizations, and other data providers. In addition, developing and evaluating methods for estimating the uncertainty associated with predictions is crucial to facilitate implementation.

Table 1
National census data details.

Country               # Images   # Admin areas   # Households surveyed
Malawi (2008)            3,432          12,412                 572,764
Malawi (2018)            3,432          18,700                 796,925
Mozambique (2007)       37,325          45,244               4,797,372
Mozambique (2017)       37,325          67,218               6,119,847
Burkina Faso (2019)     13,142             334                 875,872
Madagascar (2018)       31,182          14,328               4,518,322

The results also highlight the importance of having access to a critical mass of training data to estimate predictive models. In general, when the number of images we used to train the model fell below 10% of the population, predictive performance deteriorated rapidly. However, the results were far more robust to restricting the size of the sample used to generate the training data labels. Furthermore, prediction accuracy remained high in Burkina Faso, despite a reduced effective sample size of the training data due to linking satellite images to survey data at a much higher geographic level.
Future work could investigate methods to further improve performance when training transformer models using the types of small samples typically collected for household surveys.

Finally, the results demonstrate the potential of using transformer models to predict changes in wealth and household well-being more generally. Future work can examine the extent to which the parameters in change models are stable across time and/or space. This could point the way toward the use of geospatial data to generate approximate micro estimates of welfare change in settings where survey data are unavailable.

Methods

We describe the details of our large-scale, multi-resolution, and multi-temporal wealth dataset, wealth and its change prediction approaches, and evaluation methods.

Multi-resolution and multi-temporal wealth dataset. We utilize data from four low-income countries (Malawi, Mozambique, Burkina Faso, and Madagascar) and two cities in Malawi as study areas in Africa. These countries were selected due to the availability of location identifiers in available census extracts. This allows us to pinpoint models on the most comprehensive scale to date, to tune a general model to specific countries and even cities, and to robustly simulate the impact of data scarcity on model performance across three spatiotemporal scenarios: country-level wealth level prediction, decadal country-level wealth change prediction, and city-level wealth level prediction.

Asset wealth index (AWI). We construct the asset wealth index using data from the national census questionnaire. We utilize full census data in Mozambique and Madagascar, and census extracts in Malawi and Burkina Faso, resulting in a total dataset comprising over 12 million households across four countries, with more than 700,000 households represented in each country’s dataset. In contrast, previous studies [29, 23] leveraging DHS data included approximately 500,000 households across 23 African countries, while LSMS data measured about 9,000 households across five countries. Table 1 provides a full description of the size of the datasets.

From the census questionnaire, we rank seven housing characteristics (housing type, wall material, roof material, floor material, water source, toilet type, and energy source) on a scale of 1 to 5. Additionally, we assess the presence of six assets (radio, television, landline, car, motorbike, and bicycle) using a binary classification (ownership/non-ownership). These data are then standardized and used to construct a principal components analysis (PCA) model, from which the first principal component is extracted as the asset wealth index [13, 25, 29]. Asset wealth index labels are then aggregated from the household to the administrative area level, and then each pixel of the images is labeled based on the administrative area to which it belongs. Finally, we average the pixel-wise asset wealth index map to obtain a scalar value as the ground truth for each image. Since a different PCA is constructed for each country, a value of 0 in one country does not correspond to the same level of wealth as a 0 in another country.

Satellite imagery. For country-level wealth and its change prediction, we collected daylight Landsat (30m/pixel) satellite imagery for Malawi, Mozambique, Madagascar, and Burkina Faso, where Malawi and Mozambique have bitemporal image pairs. Our Landsat imagery dataset was constructed using a 3-year median of cloud-free pixels, centered around the census year for each country. For countries with two census periods, imagery from Landsat 5 and Landsat 7 was used for the earlier census, while Landsat 8 was utilized for the more recent census. Each Landsat image has a fixed size of 150×150 pixels, resulting in each image covering 20.25 km² (4.5×4.5 km). These images have six bands: red, green, blue, near-infrared, short-wave infrared 1, and short-wave infrared 2. For city-level wealth prediction, we collect PlanetScope (3m/pixel) and SkySat (0.5m/pixel) multi-spectral satellite imagery to cover each administrative area in Lilongwe and Blantyre. Based on the average size of administrative areas, we empirically define the size of each grid as 0.3×0.3 km, which results in each PlanetScope image having 100×100 pixels and each SkySat image having 600×600 pixels. Despite using 2018 household-level census data, we utilized available PlanetScope and SkySat imagery acquired in April 2023. These images contain red, green, blue, and near-infrared bands.

Geospatial features. In addition, we supplement satellite imagery with publicly available processed geospatial features, which we refer to as geo-features. These geo-features capture population, developmental, and environmental statistics. These features are population structure [18], population density [28], annual rainfall [1], minimum and maximum temperature [1], nighttime lights [3, 11], Terra net primary production [10], Aqua net primary production [24], cellphone tower count [22], impervious surface change year [15], land cover type [5], GHSL [26], building counts [27], building areas [27], soil pH [16], and soil organic carbon [16]. A visual representation of one datapoint is provided in Figure 5; note that final asset wealth labels are combined into a single scalar value.

Training wealth measurement models

The comparisons include a tree-based model, namely extreme gradient boosting (XGBoost) [7], and two advanced deep learning models (convolutional neural networks and transformers) based on an encoder-linear architecture [29]. Based on empirical and systematic observations from [14], we choose SwinV2-T [19] as a representative backbone for the transformer model.

[...] (MLP) layers are used to predict AWI. For wealth change prediction, the main difference lies in feature extraction. We concatenate the bitemporal images along the channel axis and feed the result into a CNN to extract deep features.

[...] transformer architecture, we first adopt SwinV2-T as the backbone to extract deep hierarchical features. As with the CNN, two MLP layers are appended to predict AWI. To integrate geospatial features into this transformer model, we provide a conditioning mechanism that adopts a standard cross-attention layer to incorporate geospatial features into deep features in a learnable way. SwinV2-T produces four deep hierarchical features; therefore, we adopt four cross-attention layers for conditioning. After these four conditioning steps, the final deep feature is well integrated with geospatial features. The last deep feature is used for wealth regression based on the above two MLP layers. For wealth change prediction, we employ a Siamese network architecture that shares a SwinV2-T backbone across the bitemporal images, i.e., we extract deep features for each image with SwinV2-T independently. We then concatenate these two deep feature sets along the channel axis and feed the resulting tensor into two MLP layers for predicting wealth change.

Implementation details of deep models. We train all deep models using the same configuration. All models are trained end-to-end by minimizing the mean square error loss with the AdamW optimizer [20]. Each model is trained for 100 epochs. A total batch size of 32, a constant learning rate of 1e-4, and a weight decay of 1e-2 are used. Training data augmentation adopts D4 dihedral group transformations to alleviate overfitting.

Figure 5: An example of satellite image with geospatial features and asset wealth index label. This is a case of a country-level training sample.

Model evaluation

Data splits and cross-validation. To ensure a robust evaluation of model performance, we employed five-fold cross-validation,
training five distinct models for each country or city. Each model is trained on four folds and tested on the remaining one. The fold XGBoost. An XGBoost regression model is used in this paper. splits were created by uniformly sampling administrative areas, to Through experimentation, we determined that simply inputting the keep all the images within the same administrative area together, image-level channel moments as XGBoost input features resulted such that all five folds have approximately the same number of in the best performance. We used three moments (mean, standard images. The R2 is used as the metric for both level and change deviation, and skew) when using satellite imagery only and one prediction. moment (mean only) when using satellite imagery and geospatial features. For wealth change predictions, we input the moments for Simulating data scarcity. To investigate the model’s performance all channels for both years into a single XGBoost model trained to under data scarcity, we simulate two scenarios of limited data directly predict the AWI change. availability. (i) Restricting the number of images. We reduce the number of images that we sample into our training data. When Convolutional neural network (CNN). Following [29], we build a restricting data, we uniformly and randomly sample images within CNN model with ResNet-18 for wealth level prediction and each of the four training folds. We conducted experiments using change prediction. For level prediction, we first use ResNet-18 to 1%, 5%, 10%, 25%, 50%, and 100% of the samples in the full extract deep features and then compute an embedding vector via training set. (ii) Restricting the number of households within the global average pooling layer. Two multilayer perceptron images. We include an alternative "10-household asset wealth" (MLP) layers are appended on the last deep feature to predict label. For the creation of our 10-household AWI labels, AWI. 
For wealth change prediction, the main difference lies in households were sampled uniformly from each enumeration area. feature extraction. We concatenated the bitemporal images along After this, the creation of the 10-household AWI labels was the channel axis and fed the result into a CNN to extract deep identical to the creation of full AWI labels described above. These features. labels only use data from 10 households per enumeration area, while our full asset wealth labels use hundreds to thousands of Vision transformer and its multi-modal variant. For the households per enumeration area. Page 9 of 11 Acknowledgments We thank Lina Cardona, Carlos Da Maia, Francis Mulangu, Mario Negre, Soudiki Soubeiga, and Michael Weber for their help obtaining data; Haishan Fu, Olivier Dupriez, Craig Hammer, and Jed Friedman for their support; and Brian Amaro, Nahum Maru, and Rohan Sikand for help with initial analysis. This project was partially funded by the Knowledge for Change Program’s Phase IV-funded programmatic research project “Understanding Trends in Sub-National Differences in Economic Well-Being in Low- and Middle-Income Countries" and by the Keck Foundation. Author Contributions DN, TK, MB, SE, and DL conceived of the project and designed analysis; RL processed data; ZZ and TW led the analysis; ZZ, DL, DN, and MB wrote the paper. Code and Data Availability Code to conduct analysis and generate figures is available at https://github.com/Z-Zheng/dynamic_highres_poverty. We do not currently have permission from country national statistics offices to share the household level data or image-level labels. Page 10 of 11 References [1] Abatzoglou, J.T., Dobrowski, S.Z., Parks, S.A., Hegewisch, K.C., 2018. Ter- [21] Newhouse, D., 2024. Small area estimation of poverty and wealth using raclimate, a high-resolution global dataset of monthly climate and climatic geospatial data: What have we learned so far? Calcutta Statistical Association water balance from 1958–2015. 
Scientific Data 5, 170191. doi:10.1038/ Bulletin 76, 7–32. sdata.2017.191. [22] OpenCelliD, 2024. Opencellid: The world’s largest open database of cell [2] Ayush, K., Uzkent, B., Burke, M., Lobell, D., Ermon, S., 2020. Generating towers. Accessed: 2024-06-08. URL: https://opencellid.org. data governed by interpretable poverty maps using object detection in satellite images. arXiv Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). preprint arXiv:2002.01612 . [23] Pettersson, M.B., Kakooei, M., Ortheden, J., Johansson, F.D., Daoud, A., [3] Baugh, K., Elvidge, C.D., Ghosh, T., Ziskin, D., 2010. Development of a 2023. Time series of satellite imagery improve deep learning estimates of 2009 stable lights product using dmsp-ols data, in: Proceedings of the Asia- neighborhood-level poverty in africa., in: IJCAI, pp. 6165–6173. Pacific Advanced Network 30, p. 114. [24] Running, S., Zhao, M., 2021. Modis/aqua net primary production gap-filled [4] Blumenstock, J., Cadamuro, G., On, R., 2015. Predicting poverty and wealth yearly l4 global 500m sin grid v061 [data set]. URL: https://doi.org/10. from mobile phone metadata. Science 350, 1073–1076. 5067/MODIS/MYD17A3HGF.061. accessed 2024-06-08. [5] Buchhorn, M., Lesiv, M., Tsendbazar, N.E., Herold, M., Bertels, L., Smets, [25] Sahn, D.E., Stifel, D., 2003. Exploring alternative measures of welfare in the B., 2020. Copernicus global land cover layers—collection 2. Remote absence of expenditure data. Review of income and wealth 49, 463–489 Sensing 12. URL: https://www.mdpi.com/2072-4292/12/6/1044, doi:10.3390/ [26] Schiavina, M., Melchiorri, M., Pesaresi, M., 2023. Ghs-smod r2023a rs12061044. - ghs settlement layers, application of the degree of urbanisation [6] Burke, M., Driscoll, A., Lobell, D.B., Ermon, S., 2021. Using satellite methodology (stage i) to ghs-pop r2023a and ghs- built-s r2023a, imagery to understand and promote sustainable development. Science 371, multitemporal (1975-2030). URL: http://data. 
europa.eu/89h/a0df7a6f- eabe8628. 49de-46ea-9bde-563437a6e2ba, doi:10.2905/ A0DF7A6F-49DE-46EA-9BDE- [7] Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system, in: 563437A6E2BA. [Dataset] Proceedings of the 22nd acm sigkdd international conference on knowledge [27] Sirko, W., Kashubin, S., Ritter, M., Annkah, A., Bouchareb, Y.S.E., Dauphin, discovery and data mining, pp. 785–794. Y.N., Keysers, D., Neumann, M., Cissé, M., Quinn, J., 2021. Continental- [8] Chi, G., Fang, H., Chatterjee, S., Blumenstock, J.E., 2022. Microestimates of scale building detection from high resolution satellite imagery. CoRR wealth for all low-and middle-income countries. Proceedings of the National abs/2107.12283. URL: https://arxiv.org/abs/2107.12283, arXiv:2107.12283. Academy of Sciences 119, e2113658119. [28] WorldPop, School of Geography and Environmental Science, University of [9] Dang, H.A.H., Serajuddin, U., 2020. Tracking the sustainable development Southampton, Department of Geography and Geosciences, University of goals: Emerging measurement challenges and further reflections. World Louisville, Departement de Geographie, Universite de Namur, Center for Development 127, 104570. International Earth Science Information Network (CIESIN), Columbia [10] Didan, K., 2021. Modis/terra vegetation indices 16-day l3 global 500m sin University, 2018. Global high resolution population denominators project grid v061 [data set]. URL: https://doi.org/10.5067/MODIS/MOD13A1.061. - funded by the bill and melinda gates foundation (opp1134076). URL: accessed 2024-06-08. https://dx.doi.org/10.5258/SOTON/WP00674. accessed: 2024-06-08. [11] Elvidge, C.D., Baugh, K., Zhizhin, M., Hsu, F.C., Ghosh, T., 2017. Viirs [29] Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., Ermon, S., night-time lights. International Journal of Remote Sensing 38, 5860–5879. Burke, M., 2020. 
Using publicly available satellite imagery and deep learning [12] Engstrom, R., Hersh, J., Newhouse, D., 2022. Poverty from space: Using to understand economic well-being in africa. Nature communications 11, 2 high resolution satellite imagery for estimating economic well-being. The World Bank Economic Review 36, 382–412. [13] Filmer, D., Pritchett, L.H., 2001. Estimating wealth effects without expen- diture data—or tears: an application to educational enrollments in states of india. Demography 38, 115–132. [14] Goldblum, M., Souri, H., Ni, R., Shu, M., Prabhu, V.U., Somepalli, G., Chattopadhyay, P., Ibrahim, M., Bardes, A., Hoffman, J., Chellappa, R., Wilson, A.G., Goldstein, T., 2023. Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks, in: Thirty- seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [15] Gong, P., Li, X., Wang, J., Bai, Y., Chen, B., Hu, T., Liu, X., Xu, B., Yang, J., Zhang, W., Zhou, Y., 2020. Annual maps of global artificial impervious area (gaia) between 1985 and 2018. Remote Sensing of Environment 236, 111510. URL: https://www.sciencedirect.com/science/article/pii/ S0034425719305292, doi:https://doi.org/10.1016/j.rse.2019.111510. [16] Hengl, T., Miller, M.A.E., Križan, J., Shepherd, K.D., Sila, A., Kilibarda, M., Antonijević, O., Glušica, L., Doğan, I., Shutcha, M.N., Leenaars, J.G.B., Wolf, J.W., van den Bosch, R., Kempen, B., de Jesus, J.M., Ribeiro, E., MacMillan, R.A., 2021. African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning. Scientific Reports 11, 6130. doi:10.1038/s41598-021-85639-y. [17] Jean, N., Burke, M., Xie, M., Davis, W.M., Lobell, D.B., Ermon, S., 2016. Combining satellite imagery and machine learning to predict poverty. Sci- ence 353, 790–794. [18] Linard, C., Gilbert, M., Snow, R.W., Noor, A.M., Tatem, A.J., 2012. 
Popula- tion distribution, settlement patterns and accessibility across africa in 2010. PLoS ONE 7, e31743. doi:10.1371/journal.pone.0031743. [19] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al., 2022. Swin transformer v2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12009–12019. [20] Loshchilov, I., 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 . Page 11 of 11
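
Supplementary sketch: image footprint arithmetic. The image sizes quoted in the Satellite imagery paragraph of Methods follow from the ground resolution and the pixel count. The short check below reproduces that arithmetic. It is an illustration only: the function and variable names are hypothetical and are not taken from the authors' code.

```python
def footprint_km2(pixels_per_side: int, resolution_m: float) -> float:
    """Ground area in km^2 covered by a square image."""
    side_km = pixels_per_side * resolution_m / 1000.0
    return side_km * side_km


def pixels_per_side(tile_side_km: float, resolution_m: float) -> int:
    """Pixels along one side of a square tile at a given resolution."""
    return round(tile_side_km * 1000.0 / resolution_m)


# Values quoted in the text: 150x150 Landsat pixels at 30 m/pixel, and
# 0.3 km tiles for PlanetScope (3 m/pixel) and SkySat (0.5 m/pixel).
landsat_area_km2 = footprint_km2(150, 30.0)
planetscope_pixels = pixels_per_side(0.3, 3.0)
skysat_pixels = pixels_per_side(0.3, 0.5)
```

The three results match the 20.25 km² Landsat footprint and the 100×100 and 600×600 pixel city-level tiles stated in the text.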
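
Supplementary sketch: asset wealth index. The Asset wealth index (AWI) paragraph of Methods describes standardizing household indicators, taking the first principal component, and aggregating to administrative areas. A minimal pure-Python sketch of those steps is given below under several assumptions that the text does not confirm: the toy households are invented, the component is found by power iteration rather than a library PCA, its sign is oriented so that higher scores mean more assets, and area aggregation is a simple mean.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column; constant columns are left at zero."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def first_component(z, iterations=500):
    """Leading eigenvector of the covariance of z, via power iteration."""
    n, d = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / n for j in range(d)] for i in range(d)]
    w = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # The sign of a principal component is arbitrary; orient it so that
    # higher scores mean more assets (an assumed convention).
    return w if sum(w) > 0 else [-x for x in w]


# Hypothetical toy data:
# (area id, [roof score 1-5, floor score 1-5, owns radio 0/1]).
households = [
    ("A", [1, 1, 0]), ("A", [2, 1, 0]), ("A", [2, 2, 1]),
    ("B", [4, 5, 1]), ("B", [5, 4, 1]), ("B", [5, 5, 1]),
]

z = standardize([features for _, features in households])
loadings = first_component(z)
scores = [sum(a * b for a, b in zip(row, loadings)) for row in z]

# Aggregate household scores to the administrative-area level (mean assumed).
by_area = defaultdict(list)
for (area, _), score in zip(households, scores):
    by_area[area].append(score)
area_awi = {area: mean(values) for area, values in by_area.items()}
```

Because the indicators are standardized within one dataset, the scores are centered on zero for that dataset only, which is why the text cautions that a value of 0 is not comparable across countries.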
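
Supplementary sketch: channel moments as tabular features. The XGBoost paragraph of Methods uses image-level channel moments (mean, standard deviation, and skew) as input features, with both years' moments combined for change prediction. The sketch below shows one way such features could be computed. The skewness definition (third central moment divided by the 1.5 power of the variance), the flat-list image layout, and all names are assumptions; the XGBoost model itself is not shown.

```python
import math


def channel_moments(channel):
    """Mean, standard deviation, and skewness of one channel's pixel values."""
    n = len(channel)
    mu = sum(channel) / n
    m2 = sum((v - mu) ** 2 for v in channel) / n
    m3 = sum((v - mu) ** 3 for v in channel) / n
    std = math.sqrt(m2)
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0  # flat channel: define skew as 0
    return mu, std, skew


def image_features(image, mean_only=False):
    """Concatenate per-channel moments; image is a list of flat channels."""
    features = []
    for channel in image:
        mu, std, skew = channel_moments(channel)
        features.extend([mu] if mean_only else [mu, std, skew])
    return features


def change_features(image_t0, image_t1):
    """Moments for all channels of both years in one feature vector."""
    return image_features(image_t0) + image_features(image_t1)


# Hypothetical two-channel, four-pixel image.
toy_image = [[1, 2, 3, 4], [0, 0, 0, 4]]
level_features = image_features(toy_image)
mean_features = image_features(toy_image, mean_only=True)
bitemporal_features = change_features(toy_image, toy_image)
```

With six Landsat bands, this layout would give 18 features per image for level prediction with imagery only, and 36 for change prediction.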
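
Supplementary sketch: fold assignment by administrative area. The Data splits and cross-validation paragraph of Methods keeps all images from one administrative area in the same fold while keeping fold sizes roughly equal. The text does not say how the balancing is done; the sketch below assumes a shuffled, greedy assignment of each area to the currently smallest fold, and every name in it is hypothetical.

```python
import random
from collections import Counter


def grouped_folds(image_areas, n_folds=5, seed=0):
    """Assign each administrative area to one fold.

    image_areas holds the administrative area id of every image.
    Returns the area-to-fold mapping and the image count per fold.
    """
    counts = Counter(image_areas)
    areas = sorted(counts)
    random.Random(seed).shuffle(areas)  # random order of areas
    fold_sizes = [0] * n_folds
    area_to_fold = {}
    for area in areas:
        fold = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        area_to_fold[area] = fold
        fold_sizes[fold] += counts[area]
    return area_to_fold, fold_sizes


# Hypothetical example: 200 images spread over 23 administrative areas.
image_areas = [f"area_{i % 23}" for i in range(200)]
area_to_fold, fold_sizes = grouped_folds(image_areas)
image_folds = [area_to_fold[area] for area in image_areas]
```

Each of the five models would then train on the images whose fold index differs from the held-out fold and test on the rest.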
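
Supplementary sketch: D4 dihedral augmentation. The Implementation details paragraph of Methods states that training augmentation uses D4 dihedral group transformations. That group contains eight elements: four rotations by multiples of 90 degrees, each with and without a reflection. The sketch below enumerates them for a single-band grid stored as nested lists. It is not the authors' implementation, and the function names are hypothetical.

```python
def rot90(grid):
    """Rotate a 2D grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def flip_horizontal(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def d4_transforms(grid):
    """All eight D4 variants: four rotations, each with and without a flip."""
    variants = []
    for start in (grid, flip_horizontal(grid)):
        current = [list(row) for row in start]
        for _ in range(4):
            variants.append(current)
            current = rot90(current)
    return variants


# Hypothetical asymmetric tile, so that all eight variants are distinct.
tile = [[1, 2], [3, 4]]
variants = d4_transforms(tile)
```

For a multi-band image the same transformation would be applied to every band. The scalar wealth label is unchanged, since rotating or mirroring a tile does not alter its average asset wealth.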