Policy Research Working Paper              11058




     Dynamic, High-Resolution Wealth
  Measurement in Data-Scarce Environments
                           Zhuo Zheng
                           Timothy Wu
                           Richard Lee
                         David Newhouse
                            Talip Kilic
                         Marshall Burke
                          Stefano Ermon
                         David B. Lobell




Development Economics
Development Data Group
February 2025


  Abstract
 Accurate and comprehensive measurement of household livelihoods is critical for monitoring
 progress toward poverty alleviation and targeting social assistance programs for those who most
 need it. However, the high cost of traditional data collection has historically made comprehensive
 measurement a difficult task. This paper evaluates alternative satellite-based deep learning
 approaches using detailed household census extracts from four African countries to accelerate
 progress toward comprehensive, fine-scale, and dynamic measurement of asset wealth at scale. The
 results indicate that transformer architectures solve multiple open measurement problems, by
 providing the most accurate measurement of local-level variation in household asset wealth across
 countries and cities, as well as changes in household asset wealth over time. Experiments that
 artificially restrict data availability show the model's ability to achieve high performance with
 limited data. The proposed approach demonstrates the promise of combining satellite imagery,
 publicly available geo-features, and new deep learning architectures for hyperlocal and dynamic
 measurement of wealth in data-scarce environments.




 This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the
 World Bank to provide open access to its research and make a contribution to development policy discussions around the
 world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may
 be contacted at dnewhouse@worldbank.org and tkilic@worldbank.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
Dynamic, High-Resolution Wealth Measurement in Data-Scarce Environments
Zhuo Zheng,a Timothy Wu,a Richard Lee,b,c David Newhouse,d Talip Kilic,d Marshall Burke,c,e
Stefano Ermon,a and David B. Lobellb,c
aDepartment  of Computer Science, Stanford University, Stanford, 94305, CA, USA
bDepartment  of Earth System Science, Stanford University, Stanford, 94305, CA, USA
cCenter on Food Security and the Environment, Stanford University, Stanford, 94305, CA, USA
dDevelopment Economics Data Group, World Bank Group, Washington DC, 20433, DC, USA
eDepartment of Environmental Social Science, Stanford University, Stanford, 94305, CA, USA



ARTICLE INFO
Keywords:
Economic well-being
High-resolution
Poverty mapping
Satellite image
Deep learning

JEL codes:
C45, I32




    Accurate, up-to-date, and highly resolved measurements of economic well-being are essential for
monitoring and achieving international goals of poverty alleviation. These goals include the United
Nations' Sustainable Development Goal 1 of "No Poverty," which is nearing its original 2030
deadline, as well as countless other international and regional poverty targets. Granular estimates
of household poverty and wealth are critical for understanding whether these goals are being met,
as well as for targeting and evaluating anti-poverty interventions in regions where progress is
lagging [6].
    Official poverty measurement in low- and middle-income countries has long relied on household
surveys, an indispensable but time-consuming tool for livelihood measurement. Given the technical
capacity needed for reliable survey measurement and the substantial logistical difficulties in
carrying out nationally or sub-nationally representative livelihood surveys, such surveys are often
infrequently completed in much of the world, rendering comprehensive and timely measures of poverty
and related outcomes unavailable for many periods in many regions [6, 9]. Meanwhile, survey data
are typically based on samples meant to be representative at larger spatial scales and are thus
usually inadequate for generating reliable estimates at the village or neighborhood level – the
level at which anti-poverty interventions often need to be targeted. Consequently, there is a
pressing need for more cost-effective and scalable alternatives to local-level livelihood
measurement that can complement and scale existing household survey–based efforts.
    In recent years, the abundance of publicly available remote sensing data and recent advances in
machine learning have transformed the livelihood measurement landscape, progressively shifting from
national censuses and related household surveys to efforts to combine this information with
information from satellites and other sensors. Early studies used coarse, publicly available
satellite imagery and/or mobile phone data, combined with early machine learning and deep learning
architectures, to show how these new sources of information could be used to support broad-scale
measurement of wealth and poverty [4, 17, 29]. Subsequent studies introduced further refinements
that used publicly available or proprietary geospatial data to improve satellite-based wealth
measurement [8, 12, 2]. These advances confirmed that leveraging satellite images and machine
learning can be an accurate, inexpensive, and scalable solution to estimate wealth [6, 21, 23].
    Here we assemble a large-scale, multi-resolution, and multi-temporal wealth dataset using
national censuses or extracts obtained from National Statistics Offices and multi-spectral satellite
imagery from multiple public and private sensors of varying resolutions. Our dataset comprises over
12 million households in four African countries (Malawi, Mozambique, Burkina Faso, and Madagascar)
and, uniquely, contains precisely georeferenced measurements within two Malawian cities as well as
repeated measurements of the same locations over time – two features lacking in prior studies.
    We use these data to make four contributions relative to earlier work. First, we directly test a
new type of deep learning model – specifically, vision transformers – against earlier deep learning
architectures based on convolutional neural networks (CNNs) that are common in the literature, as
well as against simpler models that use geospatial features and a tabular machine learning approach
(XGBoost) for prediction. Specifically, we design a conditioning module that enables our transformer
model to handle multi-modal inputs, integrating both satellite imagery and geospatial features
simultaneously (see "Methods"). We test models that use satellite imagery from Landsat (30m/pixel),
PlanetScope (3m/pixel), and/or SkySat (0.5m/pixel) sensors. We then compare these more
sophisticated methods and inputs with simpler methods that rely


solely on predefined geospatial features from a range of sources.
    These comparisons are important because simpler approaches that rely on publicly available data
could be both easier and cheaper to implement at scale, particularly for public organizations
interested in their widespread application, and so understanding performance tradeoffs across model
architectures and inputs is critical to understanding how to scale promising new measurement
approaches.
    Second, a key advantage in our setting is the use of accurate, high-resolution data from
national censuses or extracts for model training and evaluation. In contrast to earlier work that
relied primarily on publicly available household survey data characterized by spatially imprecise
location data and limited household samples, our data cover a much larger set of households in a
given location and in some cases are precisely georeferenced. Comparisons against such "gold
standard" data allow us to understand whether model prediction errors are a result of inaccurate
predictions or noise in the measure of ground truth – an understanding that was often elusive in
earlier work in developing countries [6, 29]. In addition, it allows us to consider a wide range of
sample sizes to assess the minimum training data requirements for advanced machine learning methods
to produce accurate estimates. To quantify the importance of training sample size for performance,
we extensively test the extent to which additional training data affects model performance across
multiple settings.
    Third, our high-resolution census data enables a novel understanding of how satellite imagery
and other geospatial data can be used to predict variation in livelihoods within urban areas in
Africa – a capability that was again hard to evaluate in previous settings given limited samples
and spatial noise in training data. This could be particularly consequential in urban environments
that exhibit substantial spatial variability in livelihoods even within small spatial domains. Using
comprehensive and precisely georeferenced census data from two cities in Malawi, we are able to
train and test models using different resolutions of satellite data, and we find that the models are
surprisingly accurate in predicting street- and neighborhood-level variation in wealth within these
cities.
    Fourth, the censuses and extracts allow us to evaluate whether imagery-based models can make
accurate predictions of changes in wealth over time. Previous efforts were again substantially
constrained by ground data that did not repeatedly sample the same locations in different surveys
[29]. As a result, it remains unclear whether an imagery-based model trained largely to predict
spatial variation in wealth or consumption would be able to predict temporal variation, as the
latter is typically both smaller and potentially driven by changes that are harder to detect in
imagery. Repeated census data from the same locations 10 years apart in Malawi and Mozambique allows
us to evaluate whether models can indeed extract information from imagery capable of predicting
temporal variation in asset wealth.

Results
Performance on prediction of country-level wealth
For country-level wealth prediction, we train each model on each country individually and conduct
country-wise five-fold cross-validation for each model. The CNN model uses only Landsat satellite
imagery as input. XGBoost utilizes geospatial features (geo-features) either alone or combined with
satellite image statistical features. All models were trained to predict the asset wealth index
(AWI) [29]. Estimates of AWI were generated and linked to imagery at fine administrative levels in
Madagascar, Malawi, and Mozambique. In Burkina Faso, AWI estimates are only available for 334
communes (Table 1). This reduced the effective sample size of the training data in Burkina Faso, as
seen in the clumped pattern of survey-measured asset wealth shown in the right panel of Figure 1d.
This difference is reflected in the results. As shown in Figure 1a, a naive transformer model that
does not condition on geo-features consistently outperforms other models across the Malawi,
Mozambique, and Madagascar datasets. In those countries, predictions using the transformer model
achieve R² values of 0.83, 0.70, and 0.62, respectively, when trained on the full census extract. In
Burkina Faso, because of the smaller effective sample size, XGBoost using satellite imagery and
geospatial features achieves the best average performance among the models (62.9% of variation
explained). A naive transformer, when using only satellite imagery, remains competitive (57.4% of
variation explained). We also limit the number of training samples to 1%, 5%, 10%, 25%, and 50% of
the original training dataset to analyze how model performance varies with training sample size.
Based on the results from these four countries, we empirically identify 10% as a critical inflection
point for model performance, below which the accuracy of the estimates deteriorates rapidly.
    Another key factor in reducing training sample collection costs for wealth prediction is the
number of households aggregated per sample. To analyze this factor, we randomly sample 10 households
per administrative area to construct the training sample, yielding a "10-household" training dataset
for each country. We then train a naive transformer model on this "10-household" training set while
still evaluating its performance on the original full household test set. The results (Figure 1b)
indicate that our transformer models trained with the 10-household data exhibit comparable
performance to those trained with data on all households. Using Malawi as an example, the
10-household data only include approximately 23% of all surveyed households. The performance gap
between models trained on all households and the 10-household sample is only 3 percentage points
(82% versus 79%). In contrast, when reducing the number of training samples, the performance gap
between models trained on full training samples and those trained on 25% of the samples is as large
as 12 percentage points (82% versus 70%).
    These results offer significant insights into data-efficient wealth measurement. When at least
10 households are available per image, surveying more enumeration areas takes precedence over
surveying additional households within the enumeration area for predicting wealth with our
transformer model. Geospatial features, recognized as valuable auxiliary data for improving economic
measurement [21], are widely used in wealth prediction. Here we design a conditioning module (see
"Methods") for our transformer model, enabling the efficient fusion of geospatial features and deep
visual features. The results, as shown in Figure 1c, suggest that geospatial features significantly
improve the
Figure 1: Performance of country-level asset wealth index predictions. a. Performance comparison for four countries across four different machine
learning methods trained on various fractions of the census extract. Negative R² values are not shown. Transformer results do not integrate geo-features
and are trained using asset wealth constructed from all sample households. b. Performance comparison between Transformer models trained with asset
wealth constructed from all households versus 10 households per administrative area. c. Performance comparison between Transformer models with
and without integrating geospatial features. d. Scatterplot of survey-measured asset wealth against predicted wealth from the best-performing fold.
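
The evaluation protocol described in this section (country-wise five-fold cross-validation, training on a restricted fraction of the samples, and scoring by R²) can be illustrated with a short sketch. The Python below is not the authors' code: the function names, the random seeds, and the choice to subsample within each training fold are assumptions made for illustration. R² is taken here to be the coefficient of determination, which is consistent with the caption's note that negative values can occur.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold k takes every n_folds-th shuffled index, starting at position k.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def subsample(train_idx, fraction, seed=0):
    """Keep a random `fraction` of the training indices (at least one)."""
    keep = max(1, round(fraction * len(train_idx)))
    return random.Random(seed).sample(train_idx, keep)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    The value is negative when predictions are worse than always
    predicting the mean of y_true.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical usage: 100 labeled areas, 10% of each training fold retained.
splits = list(five_fold_splits(n_samples=100))
train_idx, test_idx = splits[0]
small_train = subsample(train_idx, fraction=0.10)
```

In this sketch the test fold is never subsampled, so models trained on different fractions are scored against the same held-out areas.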


model performance across all countries, especially in Burkina Faso due to the lower effective sample
size resulting from linking images to survey data at the commune level. Geospatial features appear
particularly beneficial when the training sample size is smaller, indicating that the model can
struggle to learn optimal visual representations from raw imagery at smaller sample sizes, at which
point geospatial features serve as a valuable supplement for wealth prediction. Of the methods shown
in Figure 1, the transformer model with geo-features shown in Figure 1c yields estimates with the
highest R² in all cases except one, when training using the full census in Mozambique, for which the
naive transformer model is slightly more accurate. The four gridded wealth maps in Figure 2, with a
4.8 km/pixel resolution, are generated solely using our transformer model and Landsat imagery.
Without the need for geospatial feature preparation, the entire mapping process can be completed
within an hour using 8 NVIDIA RTX A4000 GPUs. This means that our approach has great potential to
accelerate granular wealth measurement at a national scale.
Figure 2: Maps of country-level predicted asset wealth index. a. Country-level wealth asset map for Malawi in 2018. b. Country-level wealth asset
map for Mozambique in 2017. c. Country-level wealth asset map for Madagascar in 2018. d. Country-level wealth asset map for Burkina Faso in 2019.
AWI values are generated from country-specific models and are therefore not comparable across countries.
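
The "10-household" experiment reported in Figure 1b can be sketched as follows. This is an illustrative sketch rather than the authors' procedure: it assumes that the label for an administrative area is the mean of a household-level index, and the function name `area_awi`, its arguments, and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def area_awi(households_by_area, max_households=None, seed=0):
    """Average household-level index values within each administrative area.

    households_by_area maps an area id to a list of household index values.
    If max_households is set, at most that many households are drawn at
    random (without replacement) from each area before averaging.
    """
    rng = random.Random(seed)
    labels = {}
    for area, values in households_by_area.items():
        if max_households is not None and len(values) > max_households:
            values = rng.sample(values, max_households)
        labels[area] = mean(values)
    return labels


# Synthetic census: 50 areas with 200 households each (made-up values).
data_rng = random.Random(1)
census = {a: [data_rng.gauss(a * 0.1, 1.0) for _ in range(200)] for a in range(50)}

full = area_awi(census)                     # labels built from all households
ten = area_awi(census, max_households=10)   # "10-household" training labels
```

Under this assumption the 10-household labels are noisier versions of the full labels for the same areas, which is the trade-off the experiment compares against dropping whole areas from the training set.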

Performance on prediction of change in country-level wealth

We further evaluate country-level wealth change prediction for each country via five-fold
cross-validation. Following [29], the CNN model uses concatenated bitemporal Landsat images along
the channel dimension as input. Our transformer model processes each of the bitemporal images
individually through a weight-shared, single-image encoder and concatenates encoded features along
the channel dimension before feeding into the final regression network. Unlike the previous setting,
XGBoost only takes bitemporal satellite images as input since no geospatial features are available
for Malawi in 2008 and Mozambique in 2007. The results (Figure 3a) show that deep learning models
trained on the full sample can capture a remarkable 52% of the variation in Malawi and 42% in
Mozambique. The deep models outperform XGBoost when given the same input data, which implies that
representation also matters in the spatiotemporal




Figure 3: Performance of country-level predictions of decadal change in asset wealth index. a. Performance comparison for two countries across
three machine learning methods trained on various fractions of the census extract. Negative R² values are not shown. b. Country-level wealth asset
change map for Malawi from 2008 to 2018. c. Performance comparison between Transformer models trained with asset wealth constructed from all
households per administrative area and 10 households per administrative area. d. Country-level wealth asset change map for Mozambique from 2007
to 2017. e. Scatterplot of survey-measured asset wealth against predicted wealth from the best-performing fold.
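
The two ways of handling bitemporal imagery described in this subsection, stacking both dates along the channel dimension before encoding versus encoding each date with a weight-shared encoder and concatenating the features, can be contrasted in a toy sketch. Everything here is a stand-in chosen for illustration: the "encoder" is a per-band mean scaled by a weight, not a transformer or CNN, and all function names and numbers are hypothetical.

```python
def encode(image, weights):
    """Toy single-image encoder: per-band mean pixel value times a per-band weight.

    image is a list of bands; each band is a flat list of pixel values.
    """
    return [w * sum(band) / len(band) for band, w in zip(image, weights)]


def late_fusion_features(img_t1, img_t2, shared_weights):
    """Encode each date with the SAME weights, then concatenate the features."""
    return encode(img_t1, shared_weights) + encode(img_t2, shared_weights)


def early_fusion_input(img_t1, img_t2):
    """Stack both dates along the channel (band) dimension before any encoding."""
    return img_t1 + img_t2


def regress(features, head_weights, bias=0.0):
    """Linear regression head mapping fused features to a predicted change."""
    return bias + sum(f * w for f, w in zip(features, head_weights))


# Two tiny 2-band "images" of the same location at two dates (made-up values).
img_2008 = [[1.0, 3.0], [2.0, 2.0]]
img_2018 = [[5.0, 7.0], [2.0, 4.0]]
shared = [1.0, 2.0]

feats = late_fusion_features(img_2008, img_2018, shared)
change = regress(feats, head_weights=[-1.0, -1.0, 1.0, 1.0])
stacked = early_fusion_input(img_2008, img_2018)  # 4 bands for a 2C-channel encoder
```

With shared weights the encoder has the same number of parameters as a single-image model, whereas the stacked input requires an encoder whose first layer accepts twice as many channels.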


measurement of wealth. Our transformer model slightly outperforms the commonly used CNN in
estimating decadal wealth changes in Mozambique and achieves comparable performance in Malawi. This
difference may be attributed to variations in training sample sizes and model complexities.
Mozambique has approximately 10× more training samples than Malawi, which could explain why the more
flexible Transformer model outperforms CNN estimates in Mozambique. As with the cross-sectional
results above, we simulate two scenarios of data scarcity for predicting change: (i) restricting the
number of sampled enumeration areas; and (ii) reducing the number of households aggregated per
sample to 10. The results (Figure 3c) suggest that reducing the number of sampled locations degrades
accuracy more than reducing the number of households aggregated per sample, consistent with
experimental results of country-level wealth prediction.
    We further demonstrate the scalability of our transformer model on decadal wealth change mapping
of Malawi (Figure 3b) and Mozambique (Figure 3d). Unlike single-temporal wealth maps, bitemporal
wealth change maps provide deeper insights into the dynamics of economic development, allowing for
the identification of regions experiencing significant growth or decline over time. We find that the
southern part of Malawi exhibits more negative changes, indicating a decline in wealth over the
decade, whereas the northern and some central areas are relatively neutral or slightly positive. In
Mozambique, most regions show an overall increase in wealth, with southern regions showing more
wealth gains compared to the northern regions. There are a few isolated areas, notably a blue region
near the northern part, which experienced a decrease in wealth. In both
Figure 4: Performance of city-level asset wealth index prediction. a. Performance comparison for two cities across four different machine learning
methods trained on various fractions of the census. b. Performance comparison between Transformer models estimated using SkySat imagery versus
PlanetScope imagery. c. Performance comparison between Transformer models with and without integrating geospatial features. d. Scatterplot of survey-
measured asset wealth against predicted wealth from the best-performing fold. e. Satellite-based national wealth asset map for Malawi. f. Country-level
wealth asset map for Lilongwe. g. City-level wealth asset map for Lilongwe. h. Country-level wealth asset map for Blantyre. i. City-level wealth asset
map for Blantyre.
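
For a sense of scale across the sensors and map resolutions reported in this paper, the arithmetic below converts a map cell size and a sensor's ground sampling distance into pixels per cell side. It is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes one square image tile per map cell, which this section does not state.

```python
def pixels_per_cell(cell_km, metres_per_pixel):
    """Number of pixels along one side of a square map cell."""
    return round(cell_km * 1000 / metres_per_pixel)


landsat = pixels_per_cell(4.8, 30)   # country-level cell from Landsat (30 m/pixel)
planet = pixels_per_cell(0.3, 3)     # city-level cell from PlanetScope (3 m/pixel)
skysat = pixels_per_cell(0.3, 0.5)   # city-level cell from SkySat (0.5 m/pixel)

# SkySat covers the same 0.3 km cell with (600 / 100) ** 2 = 36 times as many pixels.
pixel_ratio = (skysat / planet) ** 2
```

Under this assumption a 0.3 km cell is 100 pixels wide in PlanetScope imagery and 600 pixels wide in SkySat imagery, which gives a rough sense of how much more urban detail the finer sensor resolves.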




countries, the wealth distribution changes are not uniform,             spatial details in urban areas. In this case, our transformer model
suggesting that certain areas are benefiting more from economic         can learn more accurate wealth representations solely from satellite
growth while others are falling behind. This wealth disparity           images, even when trained on varying sample sizes.
could provide insights into economic policies, development                   Overall, we demonstrate accurate large-scale, city-level wealth
programs, or external factors such as climate impacts that have         mapping in two cities in Malawi, i.e., Lilongwe and Blantyre.
influenced these changes.                                               Compared to country-level wealth maps at a 4.8km resolution
                                                                        (Figures 4f and 4h), our 0.3km resolution city-level wealth
                                                                        maps (Figures 4g and 4i) provide an unprecedentedly granular
Performance on prediction of city-level wealth                          spatial distribution of wealth across these two cities and with strong
The Landsat-based wealth prediction models above produced               performance explaining up to 76% of the variation for Lilongwe
wealth maps with a spatial resolution of 4.8 km. For some               and up to 67% for Blantyre (Figure 4d).
applications, such as targeting aid within urban areas, finer-
resolution wealth maps are of interest. To that end, we utilized        Discussion
household-level census data from two cities to test wealth
prediction using high-resolution satellite imagery (PlanetScope         This paper proposes and evaluates the use of a vision transformer
and SkySat). Following the same settings with the above two             architecture to solve multiple open problems pertaining to
subsections, we evaluate the CNN, transformer, and XGBoost              combining survey and satellite data to produce wealth estimates at
models, as shown in Figure 4a. The CNN and Transformer models           fine spatial scales. For cross-sectional wealth predictions, careful
consistently perform better than XGBoost (with or without               evaluations using georeferenced census extracts from four
geospatial features) across both cities. This highlights the            countries show that estimates from the transformer model perform
importance of deep visual representation from high-resolution           well. When paired with Landsat imagery, R² values for
satellite imagery for city-level wealth prediction, which constitutes   transformer models that incorporate geo-features outperform
the main gap between deep learning–based models and XGBoost.            commonly used CNN and XGBoost models for asset index
Our transformer model outperforms the CNN by a noticeable               prediction in all four countries, for all sample sizes considered.
margin in Blantyre, especially when utilizing the full dataset,         Across all countries, accuracy degrades rapidly when using less
demonstrating greater scalability with increased data while also        than 10% of the census extract for training. In Mozambique and
achieving comparable performance to the CNN in Lilongwe.                Madagascar, estimates produced using transformer models
Across both cities, all models exhibit significant improvements as      explained approximately 20 to 30 percentage points more of the
the training data fraction increases; however, performance gains        variation in wealth than estimates produced using XGBoost and
begin to plateau after approximately 25%-50% of the data,               geo-features. Transformer models also outperform CNNs in all
resulting in diminishing gains beyond that threshold.                   four countries, by amounts up to approximately 5 percentage
     We compare two kinds of commonly used proprietary high-            points in Madagascar and Mozambique. Incorporating geo-
resolution, multi-spectral satellite imagery, i.e., PlanetScope (3m)    features into the transformer architecture improves performance
and SkySat (0.5m), as shown in Figure 4b. The results indicate          by 5 to 10 percentage points at small sample sizes in
that SkySat consistently outperforms PlanetScope across various         Mozambique, Madagascar, and Burkina Faso.
training data fractions in both cities. Both sensors capture 4-band         Transformer models also perform well when predicting variation
(red, green, blue, and near-infrared) satellite imagery, but they       within cities at 0.3km scales, achieving R² up to 0.76 in
differ in spatial resolution and swath width. This highlights the       Lilongwe and 0.67 in Blantyre. However, incorporating geo-
importance of urban spatial detail in accurately measuring wealth       features at this scale reduced performance, because they are
using our transformer model. While PlanetScope demonstrates             constructed from lower-resolution Landsat imagery. Finally, the
lower average performance than SkySat, its broader swath width          transformer model also generates more accurate estimates than
and high revisit frequency yield more comprehensive satellite           CNNs and XGBoost when predicting decadal changes in the asset
                                                                        wealth index in Mozambique and Malawi. Model predictions
imagery, facilitating large-scale wealth mapping with
                                                                        achieve R² values of 0.57 in Malawi and 0.42 in Mozambique at
commendable accuracy (Figures 4g and 4i). Consequently,
                                                                        fine spatial levels, despite the lack of available geo-features. This
SkySat is well-suited for wealth measurement in local areas with
                                                                        is a large improvement over the 0.15 to 0.17 R² reported by [29]
high accuracy requirements, and PlanetScope is more suitable for
                                                                        in comparable settings, and demonstrates the feasibility of
large-scale wealth mapping to obtain macro insights.
                                                                        combining transformer models with imagery to estimate wealth
     While city-level wealth prediction is promising, training a
                                                                        changes at granular levels, given sufficient training data.
city-level transformer still requires sufficient samples that are
                                                                            The results point to the benefits of applying transformer
expensive to collect. We also evaluate whether integrating
                                                                        models that incorporate geospatial features to generate high-
geospatial features can reduce the required training samples for
                                                                        resolution predictions of asset wealth. This in turn underscores the
city-level wealth prediction. As presented in Figure 4c, we find
                                                                        importance of developing tools, documentation, and training
that unlike for country-level predictions, geospatial features
                                                                        materials to make estimation feasible for national statistics
generally reduce model performance. This is because geospatial
                                                                        offices, international organizations, and other data providers. In
features are al- ways derived from low-resolution satellite
                                                                        addition, developing and evaluating methods for estimating the
imagery, e.g., Landsat (30m), to achieve global coverage. These         uncertainty associated with predictions is crucial to facilitate
low-resolution geospatial features introduce spatial errors into
                                                                        implementation.
high-resolution satellite imageâ€“based wealth prediction. For                 The results also highlight the importance of having access to a
example, a coarse land cover map can ignore small but important
                                                                                                                              Page 7 of 11
critical mass of training data to estimate predictive models. In general, when
the number of images we used to train the model fell below 10% of the population,
predictive performance deteriorated rapidly. However, the results were far more
robust to restricting the size of the sample used to generate the training data
labels. Furthermore, prediction accuracy remained high in Burkina Faso, despite a
reduced effective sample size of the training data due to linking satellite
images to survey data at a much higher geographic level. Future work could
investigate methods to further improve performance when training transformer
models using the types of small samples typically collected for household
surveys.

Finally, the results demonstrate the potential of using transformer models to
predict changes in wealth and household well-being more generally. Future work
can examine the extent to which the parameters in change models are stable across
time and/or space. This could point the way toward the use of geospatial data to
generate approximate micro estimates of welfare change in settings where survey
data are unavailable.

Methods
We describe the details of our large-scale, multi-resolution, and multi-temporal
wealth dataset, our approaches for predicting wealth levels and changes, and our
evaluation methods.

Multi-resolution and multi-temporal wealth dataset. We utilize data from four
low-income countries (Malawi, Mozambique, Burkina Faso, and Madagascar) and two
cities in Malawi as study areas in Africa. These countries were selected due to
the availability of location identifiers in available census extracts. This
allows us to pinpoint models on the most comprehensive scale to date, to tune a
general model to specific countries and even cities, and to robustly simulate the
impact of data scarcity on model performance across three spatiotemporal
scenarios: country-level wealth level prediction, decadal country-level wealth
change prediction, and city-level wealth level prediction.

Asset wealth index (AWI). We construct the asset wealth index using data from the
national census questionnaire. We utilize full census data in Mozambique and
Madagascar, and census extracts in Malawi and Burkina Faso, resulting in a total
dataset comprising over 12 million households across four countries, with more
than 700,000 households represented in each country's dataset. In contrast,
previous studies [29, 23] leveraging DHS data included approximately 500,000
households across 23 African countries, while LSMS data measured about 9,000
households across five countries. Table 1 provides a full description of the size
of the datasets.

Table 1
National census data details.
  Country               # Images   # Admin areas   # Households surveyed
  Malawi (2008)            3,432          12,412                 572,764
  Malawi (2018)            3,432          18,700                 796,925
  Mozambique (2007)       37,325          45,244               4,797,372
  Mozambique (2017)       37,325          67,218               6,119,847
  Burkina Faso (2019)     13,142             334                 875,872
  Madagascar (2018)       31,182          14,328               4,518,322

From the census questionnaire, we rank seven housing characteristics (housing
type, wall material, roof material, floor material, water source, toilet type,
and energy source) on a scale of 1 to 5. Additionally, we assess the presence of
six assets (radio, television, landline, car, motorbike, and bicycle) using a
binary classification (ownership/non-ownership). These data are then standardized
and used to construct a principal components analysis (PCA) model, from which the
first principal component is extracted as the asset wealth index [13, 25, 29].
Asset wealth index labels are then aggregated from the household to the
administrative area level, and then each pixel of the images is labeled based on
the administrative area to which it belongs. Finally, we average the pixel-wise
asset wealth index map to obtain a scalar value as the ground truth for each
image. Since a different PCA is constructed for each country, a value of 0 in one
country does not correspond to the same level of wealth as a 0 in another
country.

Satellite imagery. For country-level prediction of wealth and its change, we
collected daylight Landsat (30m/pixel) satellite imagery for Malawi, Mozambique,
Madagascar, and Burkina Faso, where Malawi and Mozambique have bitemporal image
pairs. Our Landsat imagery dataset was constructed using a 3-year median of
cloud-free pixels, centered around the census year for each country. For
countries with two census periods, imagery from Landsat 5 and Landsat 7 was used
for the earlier census, while Landsat 8 was utilized for the more recent census.
Each Landsat image has a fixed size of 150×150 pixels, resulting in each image
covering 20.25 km² (4.5 km × 4.5 km). These images have six bands: red, green,
blue, near-infrared, short-wave infrared 1, and short-wave infrared 2. For
city-level wealth prediction, we collect PlanetScope (3m/pixel) and SkySat
(0.5m/pixel) multispectral satellite imagery to cover each administrative area in
Lilongwe and Blantyre. Based on the average size of administrative areas, we
empirically define the size of each grid as 0.3 km × 0.3 km, which results in
each PlanetScope image having 100×100 pixels and each SkySat image having 600×600
pixels. Although the household-level census data date from 2018, we used the
available PlanetScope and SkySat imagery, which was acquired in April 2023. These
images contain red, green, blue, and near-infrared bands.

Geospatial features. In addition, we supplement satellite imagery with publicly
available processed geospatial features, which we refer to as geo-features. These
geo-features capture population, developmental, and environmental statistics.
These features are population structure [18], population density [28], annual
rainfall [1], minimum and maximum temperature [1], nighttime lights [3, 11],
Terra net primary production [10], Aqua net primary production [24], cellphone
tower count [22], impervious surface change year [15], land cover type [5], GHSL
[26], building counts [27], building areas [27], soil pH [16], and soil organic
carbon [16]. A visual representation of one datapoint is provided in Figure 5;
note that final asset wealth labels are combined into a single scalar value.

Training wealth measurement models
The comparisons include a tree-based model, namely extreme gradient boosting
(XGBoost) [7], and two advanced deep learning models (convolutional neural
networks and transformers) based on an encoder-linear architecture [29]. Based on
empirical and
systematic observations from [14], we choose SwinV2-T [19] as a representative
backbone for the transformer model.

Figure 5: An example of a satellite image with geospatial features and an asset
wealth index label. This example is a country-level training sample.

XGBoost. An XGBoost regression model is used in this paper. Through
experimentation, we determined that simply inputting the image-level channel
moments as XGBoost input features resulted in the best performance. We used three
moments (mean, standard deviation, and skew) when using satellite imagery only
and one moment (mean only) when using satellite imagery and geospatial features.
For wealth change predictions, we input the moments for all channels for both
years into a single XGBoost model trained to directly predict the AWI change.

Convolutional neural network (CNN). Following [29], we build a CNN model with
ResNet-18 for wealth level prediction and change prediction. For level
prediction, we first use ResNet-18 to extract deep features and then compute an
embedding vector via the global average pooling layer. Two multilayer perceptron
(MLP) layers are appended on the last deep feature to predict AWI. For wealth
change prediction, the main difference lies in feature extraction. We concatenate
the bitemporal images along the channel axis and feed the result into a CNN to
extract deep features.

Vision transformer and its multi-modal variant. For the transformer architecture,
we first adopt SwinV2-T as the backbone to extract deep hierarchical features. As
with the CNN, two MLP layers are appended to predict AWI. To integrate geospatial
features into this transformer model, we provide a conditioning mechanism that
adopts a standard cross-attention layer to incorporate geospatial features into
deep features in a learnable way. SwinV2-T produces four deep hierarchical
features; therefore, we adopt four cross-attention layers for conditioning. After
these four conditioning steps, the final deep feature is well integrated with
geospatial features. The last deep feature is used for wealth regression based on
the above two MLP layers. For wealth change prediction, we employ a Siamese
network architecture that shares a SwinV2-T backbone across the bitemporal
images, i.e., each image is passed through the same SwinV2-T independently to
extract its deep features. We then concatenate these two deep feature sets along
the channel axis and feed the resulting tensor into two MLP layers for predicting
wealth change.

Implementation details of deep models. We train all deep models using the same
configuration. All models are trained end-to-end by minimizing the mean square
error loss with the AdamW optimizer [20]. Each model is trained for 100 epochs.
We use a total batch size of 32, a constant learning rate of 1e-4, and a weight
decay of 1e-2. Training data augmentation adopts D4 dihedral group
transformations to alleviate overfitting.

Model evaluation
Data splits and cross-validation. To ensure a robust evaluation of model
performance, we employed five-fold cross-validation, training five distinct
models for each country or city. Each model is trained on four folds and tested
on the remaining one. The fold splits were created by uniformly sampling
administrative areas, to keep all the images within the same administrative area
together, such that all five folds have approximately the same number of images.
The R² is used as the metric for both level and change prediction.

Simulating data scarcity. To investigate the model's performance under data
scarcity, we simulate two scenarios of limited data availability. (i) Restricting
the number of images. We reduce the number of images that we sample into our
training data. When restricting data, we uniformly and randomly sample images
within each of the four training folds. We conducted experiments using 1%, 5%,
10%, 25%, 50%, and 100% of the samples in the full training set. (ii) Restricting
the number of households within images. We include an alternative "10-household
asset wealth" label. To create these labels, we sample 10 households uniformly
from each enumeration area. After this, the creation of the 10-household AWI
labels is identical to the creation of full AWI labels described above. These
labels only use data from 10 households per enumeration area, while our full
asset wealth labels use hundreds to thousands of households per enumeration area.
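
The fold construction described under "Data splits and cross-validation" can be
sketched as follows. This is a minimal illustration, not the authors' code: the
function name assign_folds, the assumption that each image maps to a single
administrative area, the seeded shuffle, and the greedy balancing rule are all
hypothetical choices made here to keep areas intact while giving the folds
roughly equal image counts.

```python
import random
from collections import defaultdict


def assign_folds(image_to_area, n_folds=5, seed=0):
    """Assign every administrative area, with all of its images, to one fold.

    image_to_area: dict mapping an image id to its administrative-area id
    (assumption: one area per image). Returns a dict image id -> fold index.
    """
    # Group images by administrative area so an area is never split.
    area_images = defaultdict(list)
    for image_id, area_id in image_to_area.items():
        area_images[area_id].append(image_id)

    # Visit areas in a seeded random order (assumption: this stands in for
    # the uniform sampling of administrative areas described in the text).
    areas = sorted(area_images)
    random.Random(seed).shuffle(areas)

    fold_sizes = [0] * n_folds
    image_to_fold = {}
    for area_id in areas:
        # Greedy balancing (assumption): place the area in the fold that
        # currently holds the fewest images.
        fold = fold_sizes.index(min(fold_sizes))
        for image_id in area_images[area_id]:
            image_to_fold[image_id] = fold
        fold_sizes[fold] += len(area_images[area_id])
    return image_to_fold
```

Each of the five models would then train on the images whose fold index differs
from the held-out fold and be tested on the rest, so no administrative area
contributes images to both sides of a split.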
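
Scenario (i) of "Simulating data scarcity" and the R² metric can be illustrated
with the short sketch below. It is written under stated assumptions rather than
taken from the released code: the helper names, the seeded sampler, and the rule
of keeping at least one image per fold are hypothetical. The text does not say
whether R² is the coefficient of determination or a squared correlation; this
sketch assumes the former.

```python
import random

# Training-set fractions reported in the text.
FRACTIONS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]


def subsample_training_images(fold_to_images, test_fold, fraction, seed=0):
    """Uniformly sample `fraction` of the images within each training fold.

    fold_to_images: dict mapping a fold index to a list of image ids.
    """
    rng = random.Random(seed)
    selected = []
    for fold, images in sorted(fold_to_images.items()):
        if fold == test_fold:
            continue  # the held-out fold is left untouched
        # Assumption: keep at least one image per training fold.
        k = max(1, round(fraction * len(images)))
        selected.extend(rng.sample(images, k))
    return selected


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed metric)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Looping over FRACTIONS with a fixed test fold reproduces the shape of the
experiment: the test images stay the same while the training pool shrinks.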
Acknowledgments

We thank Lina Cardona, Carlos Da Maia, Francis Mulangu, Mario
Negre, Soudiki Soubeiga, and Michael Weber for their help
obtaining data; Haishan Fu, Olivier Dupriez, Craig Hammer, and
Jed Friedman for their support; and Brian Amaro, Nahum Maru,
and Rohan Sikand for help with initial analysis. This project was
partially funded by the Knowledge for Change Program's Phase
IV-funded programmatic research project "Understanding Trends
in Sub-National Differences in Economic Well-Being in Low- and
Middle-Income Countries" and by the Keck Foundation.


Author Contributions
DN, TK, MB, SE, and DL conceived of the project and designed
analysis; RL processed data; ZZ and TW led the analysis; ZZ, DL,
DN, and MB wrote the paper.

Code and Data Availability
Code to conduct analysis and generate figures is available at
https://github.com/Z-Zheng/dynamic_highres_poverty. We do not
currently have permission from country national statistics offices
to share the household level data or image-level labels.




References
 [1] Abatzoglou, J.T., Dobrowski, S.Z., Parks, S.A., Hegewisch, K.C., 2018.
     TerraClimate, a high-resolution global dataset of monthly climate and
     climatic water balance from 1958–2015. Scientific Data 5, 170191.
     doi:10.1038/sdata.2017.191.
 [2] Ayush, K., Uzkent, B., Burke, M., Lobell, D., Ermon, S., 2020. Generating
     interpretable poverty maps using object detection in satellite images.
     arXiv preprint arXiv:2002.01612.
 [3] Baugh, K., Elvidge, C.D., Ghosh, T., Ziskin, D., 2010. Development of a
     2009 stable lights product using DMSP-OLS data, in: Proceedings of the
     Asia-Pacific Advanced Network 30, p. 114.
 [4] Blumenstock, J., Cadamuro, G., On, R., 2015. Predicting poverty and wealth
     from mobile phone metadata. Science 350, 1073–1076.
 [5] Buchhorn, M., Lesiv, M., Tsendbazar, N.E., Herold, M., Bertels, L., Smets,
     B., 2020. Copernicus global land cover layers—collection 2. Remote Sensing
     12. URL: https://www.mdpi.com/2072-4292/12/6/1044, doi:10.3390/rs12061044.
 [6] Burke, M., Driscoll, A., Lobell, D.B., Ermon, S., 2021. Using satellite
     imagery to understand and promote sustainable development. Science 371,
     eabe8628.
 [7] Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system, in:
     Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
     Discovery and Data Mining, pp. 785–794.
 [8] Chi, G., Fang, H., Chatterjee, S., Blumenstock, J.E., 2022. Microestimates
     of wealth for all low- and middle-income countries. Proceedings of the
     National Academy of Sciences 119, e2113658119.
 [9] Dang, H.A.H., Serajuddin, U., 2020. Tracking the sustainable development
     goals: Emerging measurement challenges and further reflections. World
     Development 127, 104570.
[10] Didan, K., 2021. MODIS/Terra vegetation indices 16-day L3 global 500m SIN
     grid V061 [data set]. URL: https://doi.org/10.5067/MODIS/MOD13A1.061.
     Accessed: 2024-06-08.
[11] Elvidge, C.D., Baugh, K., Zhizhin, M., Hsu, F.C., Ghosh, T., 2017. VIIRS
     night-time lights. International Journal of Remote Sensing 38, 5860–5879.
[12] Engstrom, R., Hersh, J., Newhouse, D., 2022. Poverty from space: Using high
     resolution satellite imagery for estimating economic well-being. The World
     Bank Economic Review 36, 382–412.
[13] Filmer, D., Pritchett, L.H., 2001. Estimating wealth effects without
     expenditure data—or tears: an application to educational enrollments in
     states of India. Demography 38, 115–132.
[14] Goldblum, M., Souri, H., Ni, R., Shu, M., Prabhu, V.U., Somepalli, G.,
     Chattopadhyay, P., Ibrahim, M., Bardes, A., Hoffman, J., Chellappa, R.,
     Wilson, A.G., Goldstein, T., 2023. Battle of the backbones: A large-scale
     comparison of pretrained models across computer vision tasks, in:
     Thirty-seventh Conference on Neural Information Processing Systems Datasets
     and Benchmarks Track.
[15] Gong, P., Li, X., Wang, J., Bai, Y., Chen, B., Hu, T., Liu, X., Xu, B.,
     Yang, J., Zhang, W., Zhou, Y., 2020. Annual maps of global artificial
     impervious area (GAIA) between 1985 and 2018. Remote Sensing of Environment
     236, 111510. URL:
     https://www.sciencedirect.com/science/article/pii/S0034425719305292,
     doi:10.1016/j.rse.2019.111510.
[16] Hengl, T., Miller, M.A.E., Križan, J., Shepherd, K.D., Sila, A., Kilibarda,
     M., Antonijević, O., Glušica, L., Doğan, I., Shutcha, M.N., Leenaars,
     J.G.B., Wolf, J.W., van den Bosch, R., Kempen, B., de Jesus, J.M., Ribeiro,
     E., MacMillan, R.A., 2021. African soil properties and nutrients mapped at
     30 m spatial resolution using two-scale ensemble machine learning.
     Scientific Reports 11, 6130. doi:10.1038/s41598-021-85639-y.
[17] Jean, N., Burke, M., Xie, M., Davis, W.M., Lobell, D.B., Ermon, S., 2016.
     Combining satellite imagery and machine learning to predict poverty.
     Science 353, 790–794.
[18] Linard, C., Gilbert, M., Snow, R.W., Noor, A.M., Tatem, A.J., 2012.
     Population distribution, settlement patterns and accessibility across
     Africa in 2010. PLoS ONE 7, e31743. doi:10.1371/journal.pone.0031743.
[19] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y.,
     Zhang, Z., Dong, L., et al., 2022. Swin transformer v2: Scaling up capacity
     and resolution, in: Proceedings of the IEEE/CVF Conference on Computer
     Vision and Pattern Recognition, pp. 12009–12019.
[20] Loshchilov, I., 2017. Decoupled weight decay regularization. arXiv preprint
     arXiv:1711.05101.
[21] Newhouse, D., 2024. Small area estimation of poverty and wealth using
     geospatial data: What have we learned so far? Calcutta Statistical
     Association Bulletin 76, 7–32.
[22] OpenCelliD, 2024. OpenCelliD: The world's largest open database of cell
     towers. Accessed: 2024-06-08. URL: https://opencellid.org. Data governed by
     Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
[23] Pettersson, M.B., Kakooei, M., Ortheden, J., Johansson, F.D., Daoud, A.,
     2023. Time series of satellite imagery improve deep learning estimates of
     neighborhood-level poverty in Africa, in: IJCAI, pp. 6165–6173.
[24] Running, S., Zhao, M., 2021. MODIS/Aqua net primary production gap-filled
     yearly L4 global 500m SIN grid V061 [data set]. URL:
     https://doi.org/10.5067/MODIS/MYD17A3HGF.061. Accessed: 2024-06-08.
[25] Sahn, D.E., Stifel, D., 2003. Exploring alternative measures of welfare in
     the absence of expenditure data. Review of Income and Wealth 49, 463–489.
[26] Schiavina, M., Melchiorri, M., Pesaresi, M., 2023. GHS-SMOD R2023A - GHS
     settlement layers, application of the Degree of Urbanisation methodology
     (stage I) to GHS-POP R2023A and GHS-BUILT-S R2023A, multitemporal
     (1975-2030). URL:
     http://data.europa.eu/89h/a0df7a6f-49de-46ea-9bde-563437a6e2ba,
     doi:10.2905/A0DF7A6F-49DE-46EA-9BDE-563437A6E2BA. [Dataset]
[27] Sirko, W., Kashubin, S., Ritter, M., Annkah, A., Bouchareb, Y.S.E., Dauphin,
     Y.N., Keysers, D., Neumann, M., Cissé, M., Quinn, J., 2021.
     Continental-scale building detection from high resolution satellite
     imagery. CoRR abs/2107.12283. URL: https://arxiv.org/abs/2107.12283,
     arXiv:2107.12283.
[28] WorldPop, School of Geography and Environmental Science, University of
     Southampton, Department of Geography and Geosciences, University of
     Louisville, Departement de Geographie, Universite de Namur, Center for
     International Earth Science Information Network (CIESIN), Columbia
     University, 2018. Global high resolution population denominators project -
     funded by the Bill and Melinda Gates Foundation (OPP1134076). URL:
     https://dx.doi.org/10.5258/SOTON/WP00674. Accessed: 2024-06-08.
[29] Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., Ermon,
     S., Burke, M., 2020. Using publicly available satellite imagery and deep
     learning to understand economic well-being in Africa. Nature Communications
     11, 2