Policy Research Working Paper 10252

Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan

Emily L. Aiken, Guadalupe Bedoya, Joshua E. Blumenstock, Aidan Coville

Development Economics, Development Impact Evaluation Group
December 2022

Abstract

Can mobile phone data improve program targeting? By combining rich survey data from the baseline of a "big push" anti-poverty program in Afghanistan implemented in 2016 with detailed mobile phone logs from program beneficiaries, this paper studies the extent to which machine learning methods can accurately differentiate ultra-poor households eligible for program benefits from ineligible households. The paper shows that machine learning methods leveraging mobile phone data can identify ultra-poor households nearly as accurately as survey-based measures of consumption and wealth, and that combining survey-based measures with mobile phone data produces classifications more accurate than those based on a single data source.

This paper is a product of the Development Impact Evaluation Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at gbedoya@worldbank.org or acoville@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan*

Emily L. Aiken† Guadalupe Bedoya‡ Joshua E. Blumenstock† Aidan Coville‡

Keywords: Targeting; Machine Learning; Mobile Phone Data; Afghanistan
JEL: I32, I38, O12, O38, C55

* We thank Seungmin Lee, Maria Camila Ayala, and Thomas Escande for excellent research assistance. This work was supported by DARPA and NIWC under contract N66001-15-C-4066, the NSF under grant IIS-1942702, and by the World Bank's Knowledge for Change Program. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense, the U.S. Government, the World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
† School of Information, University of California, Berkeley
‡ Development Impact Evaluation Department, World Bank

1 Introduction

Each year, hundreds of billions of dollars are spent on targeted social protection programs.
The importance of these programs increased dramatically in the past 18 months: in 2020, global extreme poverty increased for the first time in two decades, and most countries expanded their social protection programs, with more than 1.1 billion new recipients receiving government-led social assistance payments (Gentilini et al., 2020). Determining who should be eligible for program benefits — targeting — is a central challenge in the design of these programs (Hanna & Olken, 2018; Lindert et al., 2020). In high-income countries, targeting frequently relies on tax records or other administrative data on income. In low- and middle-income countries (LMICs), where a large fraction of the workforce is informal, programs often require primary data collection. The difficulty and cost of collecting data, and the variable quality of what gets collected, can introduce significant errors in the targeting process (Deaton, 2016; Jerven, 2013; Grosh et al., in press). These issues are exacerbated in fragile and conflict-affected countries, where two-thirds of the world's poor are expected to reside by 2030 (Corral et al., 2020).

This paper evaluates the extent to which non-traditional administrative data, processed with machine learning, can be used for program targeting. Specifically, we match call detail records (CDR) from a large mobile phone operator in Afghanistan to household survey data from the Afghan government's Targeting the Ultra-Poor (TUP) anti-poverty program. Eligibility for the TUP program was determined through a hybrid targeting method, combining a community wealth ranking (CWR) and a short follow-up survey. Our analysis assesses the accuracy of three counterfactual targeting approaches at identifying the actual beneficiaries of the TUP program: (i) our CDR-based method, which applies machine learning to data from the mobile phone company; (ii) an asset-based wealth index, which uses asset ownership to approximate poverty; and (iii) consumption, a common benchmark for measuring poverty in LMICs.

Our analysis produces three main results. First, by comparing errors of inclusion and exclusion using the program's hybrid method as a benchmark, we find that the CDR-based method is nearly as accurate as the commonly employed asset- and consumption-based methods for identifying the phone-owning ultra-poor households. Second, we find that methods combining CDR data with measures of assets and consumption are more accurate than methods using any single data source. Third, we find that when non-phone-owning households are included in the analysis, the CDR-based method remains accurate if non-phone-owning households are classified as ultra-poor; however, targeting performance is quite poor if households without phones are ineligible for benefits. After presenting these main results, we compile data from several existing targeting programs to give an indication of the substantial reduction in marginal costs associated with CDR-based targeting.

These results connect two distinct strands of prior work. The first is the literature on program targeting, which studies the effectiveness of different mechanisms for identifying program beneficiaries. In LMICs, research has focused on the performance of proxy means tests (PMTs) (Grosh & Baker, 1995; Filmer & Pritchett, 2001; Brown et al., 2018), community-based targeting strategies (CBTs) (Alatas et al., 2012; Fortin et al., 2018), and related approaches (Banerjee et al., 2007; Karlan & Thuysbaert, 2019; Premand & Schnitzer, 2020).
A meta-analysis by Coady et al. (2004), which includes 8 PMTs and 14 community-based programs, finds little difference in targeting accuracy between the two methods — but notes that targeting is regressive in a quarter of the programs reviewed. In addition to issues with targeting accuracy, the current methods available for poverty targeting in LMICs are time- and resource-intensive, and may be infeasible in fragile or conflict-affected areas or in contexts where social interaction is limited, such as during a pandemic.

The second body of work explores the extent to which non-traditional sources of data, in conjunction with machine learning, might help address data gaps in LMICs (e.g., Blumenstock, 2016; Burke et al., 2021). Much of this work focuses on estimating the geographic distribution of poverty at fine spatial granularity, using data from satellites (Jean et al., 2016; Engstrom et al., 2017), mobile phones (Blumenstock et al., 2015; Hernandez et al., 2017), social media (Fatehkia et al., 2020; Sheehan et al., 2019), or some combination of these data sources (Steele et al., 2017; Pokhriyal & Jacques, 2017; Chi et al., 2022).

Most relevant to our current analysis, two prior papers investigate whether mobile phone use can approximate the wealth of individual mobile subscribers. Blumenstock et al. (2015) show that CDR data are predictive of an individual-level asset-based wealth index among a nationally representative sample of 856 Rwandan mobile phone owners (r = 0.68). Blumenstock (2018b) finds similar results with a sample of 1,234 male heads of households in the Kabul and Parwan districts of Afghanistan. While these results show that phone data can be used to predict poverty levels, they do not evaluate whether those poverty estimates are of sufficient quality for real-world policy applications.

Our paper connects these two literatures by rigorously assessing the extent to which phone-based estimates of poverty can help with program targeting (Blumenstock, 2020). We believe the analysis will be especially relevant to the increasing number of interventions that rely on mobile money to distribute cash payments (Gentilini et al., 2020), and the growing number of contexts where mobile phone data are being made available for humanitarian purposes (Milusheva et al., 2021). For example, in just the past few years, mobile money was used to make cash transfer payments in countries including Bangladesh (Ali & May, 2021), Ghana (Karlan et al., 2021), Liberia (USAID, 2021), and Malawi (Paul et al., 2021). Mobile phone data have been used to guide cash transfers in Colombia (Gentilini et al., 2020), the Democratic Republic of Congo (Gentilini et al., 2021), Pakistan (Gentilini et al., 2020), and Togo (Aiken et al., 2021).1

1 The anti-poverty program implemented in Togo and described by Aiken et al. (2021) was based on the methods developed and evaluated in this paper. Due to the time-sensitive nature of the COVID-19 response described in Aiken et al. (2021), the two academic articles are in circulation concurrently.

The context of our empirical analysis — identifying ultra-poor households in Afghanistan — is a particularly challenging environment for data collection and program targeting, as 62% of the households classified as not ultra-poor still fall below the national poverty line. In such environments, when traditional options for targeting are not feasible, these methods may provide a viable alternative for identifying households with the greatest need. Given the policy relevance of these results, we conclude our analysis by discussing important ethical and logistical considerations that may influence how CDR methods are used to support targeting efforts in practice.
2 Data and Methods

2.1 Targeting the 'Ultra-Poor'

Our empirical analysis relies on survey data collected as part of the Targeting the Ultra-Poor (TUP) program implemented by the government of Afghanistan with support from the World Bank. The TUP program was a "big push," providing multi-faceted benefits to 7,500 ultra-poor households in six provinces of Afghanistan between 2015 and 2018 (Bedoya et al., 2019). Our analysis uses data from the baseline and targeting surveys from an impact evaluation of the TUP program conducted in Balkh province.

Ultra-Poor Designation. Eligibility for the TUP program was determined based on geographic criteria,2 followed by a two-step process including a community wealth ranking (CWR) and a follow-up in-person survey. CWRs were conducted separately in each village, coordinated by a local NGO and village leaders, in collaboration with the government team. The CWR was followed by an in-person survey, coordinated by the NGO and government representatives, to determine whether nominated households met a set of qualifying criteria based on a measure of multiple deprivation. For a household to be designated as ultra-poor, and therefore eligible for program benefits, it had to be considered extreme-poor in the CWR (43% of households), and also meet at least three of six criteria:

1. Financially dependent on women's domestic work or begging
2. Owns less than 800 square meters of land or lives in a cave
3. Primary woman under 50 years old
4. No adult male income earners
5. School-age children working for pay
6. No productive assets

Ultimately, 11% of the households classified as extreme-poor in the community wealth ranking step — 6% of the total population in the study villages — were classified as ultra-poor and were thus eligible for TUP benefits.

2 The poorest villages were identified by the availability of veterinary services, financial institutions, and social services, and being relatively accessible (Bedoya et al., 2019).

2.2 Household Surveys

To facilitate Bedoya et al.'s (2019) impact evaluation of the TUP program, household surveys were conducted in 80 of the poorest villages of Balkh province. A total of 2,852 households were surveyed, with ultra-poor households (N=1,173) oversampled relative to non-ultra-poor households (N=1,679).3 Surveys were conducted between February and April 2016, following the CWR and eligibility verification. This survey window was timed to occur in the late winter and early spring, a few months before the harvesting season for wheat in Balkh. The household survey was a long-form in-person survey that took approximately 3 hours for each household to complete. The survey covered a wide range of topics, including several modules related to household poverty and deprivation that feature in our analysis.

3 In our analysis, we restrict to the 2,814 households for which asset and consumption data are nonmissing.

Consumption. The consumption module of the TUP survey captured information on household food consumption for the week prior to the interview and non-food expenditures for the month or year prior to the interview. These are used to construct monthly per capita consumption values, as detailed in Bedoya et al. (2019).
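To make the aggregation concrete, the following is a minimal sketch of how the different recall periods might be combined into a single monthly per-capita figure. The column names are hypothetical placeholders; the actual TUP instrument and aggregation follow Bedoya et al. (2019).

```python
import numpy as np
import pandas as pd

def monthly_per_capita_consumption(df: pd.DataFrame) -> pd.Series:
    """Combine recall periods into one monthly per-capita aggregate (sketch)."""
    food_monthly = df["food_value_last_week"] * (365.25 / 7) / 12  # weekly -> monthly
    nonfood_monthly = (
        df["nonfood_value_last_month"]          # already a monthly recall
        + df["nonfood_value_last_year"] / 12    # annual recall -> monthly
    )
    return (food_monthly + nonfood_monthly) / df["household_size"]

# The paper's measure is the logarithm of this aggregate:
# df["log_cons_pc"] = np.log(monthly_per_capita_consumption(df))
```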
Based on these data, we measure the logarithm of per capita monthly consumption, using the same approach that the Afghan government used to determine the national poverty line. This monthly consumption aggregate thus captures a short-term (weekly) measure of food consumption during one of the planting seasons, as well as a medium-term (monthly and annual) measure of non-food expenditures (Deaton, 1997; Ravallion, 1998).

Asset Index. We use survey data on household assets to construct a wealth index for each household, which provides an indication of each household's wealth relative to others in the survey. Specifically, we calculate the first principal component of variation in household asset ownership based on the sixteen items listed in Table S1, across the 2,814 households with complete asset data, after standardizing each asset variable to zero mean and unit variance. This wealth index explains 25.3% of the variation in asset ownership. Figure S1 shows the distribution of the underlying asset index components and Table S1 shows the direction of the first principal component. Broadly, we expect that the asset index will provide an indication of each household's long-term economic status, relative to other households in the survey.
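A minimal sketch of this construction, assuming a dataframe with one column per asset item (sixteen columns in the paper's case; the actual Table S1 items are not reproduced here):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def asset_wealth_index(assets: pd.DataFrame) -> pd.Series:
    """First principal component of standardized asset-ownership variables."""
    z = StandardScaler().fit_transform(assets)  # zero mean, unit variance per item
    pca = PCA(n_components=1)
    index = pca.fit_transform(z)[:, 0]
    # pca.explained_variance_ratio_[0] gives the share of variation explained
    # (25.3% in the paper's data).
    return pd.Series(index, index=assets.index, name="wealth_index")
```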
Other Variables. The TUP surveys collected several other covariates that we use in subsequent analysis. These include a food security index (composed of variables relating to the skipping and downsizing of meals, separately for adults and children), a financial inclusion index (composed of access to banking and credit, knowledge of banking and credit, and savings), and a psychological well-being index for the primary woman (standardized weighted scores on the Center for Epidemiological Studies Depression scale, the World Values Survey happiness and satisfaction questions, and Cohen's Stress Scale) — see Bedoya et al. (2019). The survey also collected data from each household on mobile phone ownership. Nearly all (99%) households with a cell phone provided their phone numbers and consented to the use of their call detail records for this study.

Sample Representativity. Portions of our analysis are restricted to the 535 households from the TUP survey with phone numbers that match to our CDR (see Section 2.3). Tables 1 and S2 compare characteristics of these households to the full survey population. There are some systematic differences: the 535-household sample is wealthier, which is consistent with households in the subsample being required to own at least one phone. For instance, while 88% of non-ultra-poor households in the TUP survey own at least one phone, only 72% of ultra-poor households own at least one phone.

Comparing Survey-Based Measures of Well-Being and Deprivation. As shown in Table 1 and Figure S3, the two survey-based measures of well-being are only weakly correlated. In the full sample, the correlation between the asset index and consumption is just 0.37; in the matched subsample, the correlation is 0.34. These modest correlations may be due in part to the fact that, as discussed above, the consumption data capture short- and medium-term deprivation, whereas the asset index is a better indicator of long-term wealth. Measurement error may also weaken these empirical correlations.

Also notable is the weak relationship between the two survey-based measures of deprivation and the ground-truth ultra-poor designation: while the ultra-poor population makes up 27% of the overall subsample, less than half of the ultra-poor fall into the bottom 27% of the sample by wealth index or consumption. These differences may be partly attributable to measurement error, but they surely also arise from the fact that they are conceptually distinct constructs: while the consumption and asset indices focus primarily on economic flows and stocks, respectively, the ultra-poor designation was designed to be more holistic and multidimensional, informed in part by community perceptions of vulnerability (Sen, 1992; Alkire et al., 2015).

The fact that the ultra-poor designation is not strongly correlated with the survey measures of consumption and wealth has important implications for the targeting analysis presented below. In particular, it suggests — and our later results affirm — that a policy targeted solely on assets or consumption data will do a poor job of differentiating between ultra-poor and non-ultra-poor households. The relatively weak correlation between consumption and the asset index also hints at a later finding: targeting based on a combination of the two data sources performs better than targeting on either source in isolation.

Sample Weights. Since the TUP survey oversampled the ultra-poor (by a factor of roughly 12), portions of our analysis use sample weights to adjust for population representativeness. When sample weights are applied, it is explicitly noted; if not mentioned, no weights are applied. After sample weights are applied, the ultra-poor make up 5.98% of the overall population, and 4.63% of our matched subsample.

2.3 Mobile Phone Metadata

In a follow-up survey conducted in 2018, we requested informed consent from survey respondents to obtain their mobile phone CDR and match them to the survey data collected through the TUP project. CDR contain detailed information on:

• Calls: Phone numbers for the caller and receiver, time and duration of the call, and cell tower through which the call was placed
• Text messages: Phone numbers for the sender and recipient, time of the message
• Recharges: Time and amount of the recharge

For participants who consented, we match baseline survey data (collected November 2015 – April 2016) to CDR covering that same period, obtained from one of Afghanistan's main mobile phone operators. For households with multiple phones and a designated household head (N=65), we match to CDR for the phone belonging to the household head. For households where the household head does not have a phone and someone else does (N=17), we match to CDR for one of the household's phones selected at random. In total, for the 535 households in our sample, 629,543 transactions took place in the months of November 2015 to April 2016, broken down into 310,883 calls, 305,756 text messages, and 12,904 recharges.

From these CDR, we compute a set of 797 behavioral indicators that capture aggregate aspects of each individual's mobile phone use (de Montjoye et al., 2016). This set includes indicators relating to an individual's communications (for example, average call duration and percent of conversations initiated), their network of contacts (for example, the entropy of their contacts and the balance of interactions per contact), their spatial patterns based on cell tower locations (for example, the number of unique antennas visited and the radius of gyration), and their recharge patterns (including the average amount recharged and the time between recharges). The distributions of a sample of these indicators are shown in Figure S4.
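The indicators themselves come from the bandicoot toolbox (de Montjoye et al., 2016); the following pandas sketch of a few representative features is our own simplification, with hypothetical column names rather than bandicoot's actual schema.

```python
import numpy as np
import pandas as pd

def basic_cdr_features(txns: pd.DataFrame) -> pd.DataFrame:
    """Illustrative per-subscriber indicators from a transaction log with
    columns: subscriber, kind ('call'/'text'/'recharge'), direction
    ('in'/'out'), duration, contact, antenna_id, amount."""
    calls = txns[txns["kind"] == "call"]
    feats = calls.groupby("subscriber").agg(
        mean_call_duration=("duration", "mean"),
        pct_initiated=("direction", lambda d: (d == "out").mean()),
        n_contacts=("contact", "nunique"),      # size of the contact network
        n_antennas=("antenna_id", "nunique"),   # crude spatial footprint
    )

    def contact_entropy(contacts: pd.Series) -> float:
        # How evenly communication is spread across contacts.
        p = contacts.value_counts(normalize=True)
        return float(-(p * np.log(p)).sum())

    feats["contact_entropy"] = calls.groupby("subscriber")["contact"].apply(contact_entropy)
    recharges = txns[txns["kind"] == "recharge"]
    feats["mean_recharge"] = recharges.groupby("subscriber")["amount"].mean()
    return feats
```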
2.4 Machine Learning Predictions

CDR-Based Method. Extending the approach described in Blumenstock et al. (2015), we test the extent to which ultra-poor status can be predicted from CDR. This analysis uses the 535 households who match to CDR to train a supervised machine learning algorithm to predict ultra-poverty status from the mobile phone features. The intuition — also highlighted in Figure S4 — is that ultra-poor individuals use their phones very differently than non-ultra-poor individuals, and machine learning algorithms can use those differences to predict ultra-poor status. Our main analysis uses a gradient boosting model, which generally out-performs several other common machine learning algorithms for this task (see Table S3). The feature importances for the trained model are shown in Table S2. To limit the potential for overfitting, probabilistic predictions are generated via 10-fold cross-validation, with folds stratified to preserve class balance.4 Additional details on the machine learning methods are provided in Appendix A.

4 While cross-validation is a standard evaluation strategy in the machine learning literature, for robustness we present results using a basic single train-test split in Table S6.

Combined Methods. We also evaluate several approaches that use data from multiple sources to predict ultra-poor status. Our main combined method trains a logistic regression to classify the ultra-poor and non-ultra-poor households using the predicted ultra-poor probability from the CDR-based method (i.e., the output of the gradient boosting algorithm described above), as well as asset and consumption data collected in the TUP survey. For comparison, we similarly evaluate the performance of methods that combine only two of the available data sources (i.e., assets plus consumption, assets plus CDR, and consumption plus CDR). Predictions for each of the combined methods are pooled over 10-fold cross-validation.
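A minimal sketch of this pipeline, assuming a feature matrix X_cdr (the 797 indicators), a binary ultra-poor label y, and arrays asset_index and log_consumption; the exact model choice and tuning live in the paper's Appendix A and are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # stratified folds

# CDR-based method: out-of-sample predicted probability of ultra-poor status.
gb = GradientBoostingClassifier(random_state=0)
p_cdr = cross_val_predict(gb, X_cdr, y, cv=cv, method="predict_proba")[:, 1]

# Combined method: logistic regression on the CDR score plus survey measures.
X_comb = np.column_stack([p_cdr, asset_index, log_consumption])
lr = LogisticRegression(max_iter=1000)
p_comb = cross_val_predict(lr, X_comb, y, cv=cv, method="predict_proba")[:, 1]
```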
2.5 Targeting Accuracy Evaluation

Evaluation on Matched Subsample. Our main analysis focuses on the 535 households for which we observe both CDR and survey data, and evaluates whether machine learning methods leveraging CDR data can accurately identify households designated as ultra-poor by the TUP program (using the two-step hybrid approach described in Section 2.1). We compare the performance of the CDR-based method to the performance of methods based on the wealth index, consumption data, and combinations of these data sources.5 Each targeting method is evaluated based on classification accuracy, errors of exclusion (ultra-poor households misclassified as non-ultra-poor), and errors of inclusion (non-ultra-poor households misclassified as ultra-poor). We focus on the ultra-poor designation as the 'ground truth' status of the household, against which other methods are evaluated, since it is the most carefully vetted measure of well-being for this population, and the proxy that the government used to target TUP benefits.

5 The CDR-based method uses supervised learning to model the ultra-poverty outcome, whereas the asset- and consumption-based approaches do not. To assess the importance of this difference, we experiment with applying machine learning methods to the asset and consumption data to model the ultra-poverty outcome. In results shown in Table S4, we find that a machine-learned asset predictor provides slight improvements on the standard asset-based wealth index and consumption measures. We continue to use the standard asset and consumption measures as benchmarks in the remainder of the paper, however, as they are the targeting methods most frequently used in practice.

To evaluate the performance of the CDR-based and combined methods, we pool out-of-sample predictions across the ten cross-validation folds, so that every household in our dataset is associated with a CDR-based predicted probability of ultra-poor status that is produced out-of-sample.6 To account for class imbalance, we evaluate model accuracy using a "quota method": we select a cut-off threshold for ultra-poor qualification such that each method identifies the same proportion of ultra-poor households as exists in our subsample; this cut-off also balances inclusion and exclusion errors. This quota-based approach reflects a scenario in which a program has a fixed budget constraint; it is also frequently used in the targeting literature (Alatas et al., 2012; Schnitzer & Stoeffler, 2021). In our 535-household matched dataset this threshold is 27%; in other samples (see the following subsection), the percentage differs. We evaluate each method for precision (positive predictive value) and recall (sensitivity). To capture the trade-off between inclusion and exclusion errors for varying values of this threshold, we also construct receiver operating characteristic (ROC) and precision-recall curves for each method, and consider the area under the ROC curve (AUC) as a measure of targeting quality. For each evaluation metric (precision, recall, and AUC), we bootstrap 1,000 samples from the original dataset to calculate the standard deviation of the mean of the accuracy metric. Each bootstrapped sample is the same size as the original dataset, drawn with replacement.

6 In Table S6, we show that results are unchanged when we use a single train-test split, instead of 10-fold cross-validation.
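A sketch of the quota-based evaluation (variable names are ours). Note that when the quota equals the true ultra-poor share, the number of predicted positives matches the number of actual positives, so precision and recall coincide, as in Table 2.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def quota_evaluate(y_true: np.ndarray, score: np.ndarray, quota: float) -> dict:
    """Classify the top `quota` share of scores as ultra-poor, then evaluate."""
    cutoff = np.quantile(score, 1 - quota)  # e.g., quota = 0.27 in the matched sample
    y_pred = (score >= cutoff).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, score),  # threshold-free summary
    }
```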
Accounting for Households without Phones. In order to focus our attention on how differences in the data used for program targeting affect targeting performance, our main results are based on the sample of 535 households for whom we have both survey data and mobile phone data. We also present results that show how performance is affected when the analysis includes TUP households for whom we do not have mobile phone data (typically because they do not have a phone, or because they use a different phone network than the one that provided CDR). We provide analysis that targets such households (1) before households with CDR, or (2) after households with CDR (see Section 3.4). These results are evaluated on three different samples:

1. Matched Sample: The 535 households for whom we could match survey responses to CDR.

2. Balanced Sample: This sample includes the 535 matched households as well as the 472 households in the TUP survey who report not owning any phone. It excludes households that own a phone on a different phone network than the one that provided CDR. The motivation for this sample is to provide an indication of targeting performance in a regime in which CDR can be used to target all phone-owning households. In addition to applying sample weights from the survey, households that do not own a phone are downweighted so that the balance of phone owners to non-phone-owners (with sample weights applied) is the same as in the baseline survey as a whole (with sample weights applied, 84% phone owners).

3. Full Sample: All 2,814 households in the TUP baseline survey for which asset and consumption data are available, with sample weights applied.

Note that the quota used to evaluate targeting changes for each sample, based on the number of households that are ultra-poor in the sample. For the matched sample, the targeting quota is 27.29%; for the balanced sample and full sample, the quotas are 5.47% and 6.02%, respectively.

3 Results

3.1 Performance of Targeting Methods

Our first set of results evaluates the extent to which different targeting methods can correctly identify ultra-poor households. This analysis compares the performance of CDR-based targeting methods to asset-based and consumption-based targeting, using the sample of 535 households for which survey data and CDR data are both available.

An overview of these results is provided in Figure 1. Figure 1a shows the distribution of assets and consumption, as well as the distribution of predicted probabilities of being non-ultra-poor generated by the CDR-based and combined methods, separately for the ultra-poor and non-ultra-poor. The dashed vertical line indicates the threshold at which point 27% of households are classified as ultra-poor; we use this quota because 27% of households in this sample were designated as ultra-poor by TUP. Figure 1b provides confusion matrices that compare the true status (rows) against the classification made by each method (columns). These confusion matrices are also used to calculate the measures of precision and recall reported in Table 2, Panel A.

We find that the CDR-based method (precision and recall of 42%) is close in accuracy to methods relying on assets (precision and recall of 49%) or consumption (precision and recall of 45%). To evaluate the trade-off between inclusion errors and exclusion errors resulting from selecting alternative cut-off thresholds, Figure 1c shows the ROC curve associated with each classification method. The Area Under the Curve (AUC) scores for these curves, listed in Table 2, are comparable among methods, with assets (AUC=0.73) slightly superior to consumption (AUC=0.71) and the CDR-based method (AUC=0.68). The corresponding precision-recall curves are shown in Figure S5.

3.2 Comparison of Errors across Methods

To better understand the nature of the misclassification errors arising from the different datasets used for targeting, Table 3 compares the characteristics of correctly and incorrectly classified households for three different methods (targeting on assets, consumption, and CDR). Panel A highlights differences between ultra-poor households correctly classified as ultra-poor (True Positives) and ultra-poor households misclassified as non-ultra-poor (False Negatives, also referred to as exclusion errors).
Likewise, Panel B highlights differences between non-ultra-poor households correctly classified as non-ultra-poor (True Negatives) and non-ultra-poor households misclassified as ultra-poor (False Positives, or inclusion errors). This analysis uses the matched sample (see Table 2) to highlight differences that arise when switching from one targeting dataset to another, on a population of households that are observed in all three datasets.7

7 Similar analysis could also be performed using the balanced sample or the full sample; however, results would conflate differences caused by the targeting data (the current focus of Table 3) with the differences that arise from considering (or excluding) households without mobile phones (the current focus of Table 2).

Across methods, false negatives (exclusion errors) have higher levels of food security, financial inclusion, and psychological well-being than true positives — that is, all three targeting methods misclassify ultra-poor households as non-ultra-poor when those ultra-poor households are better-off, according to other observable characteristics not used in the targeting. Likewise, false positives (inclusion errors) tend to fare worse than true negatives across these same measures. The exact pattern of differences depends on the targeting method; for instance, asset-based targeting (first set of columns) tends to misclassify the ultra-poor as non-ultra-poor when they have assets (the difference of -2.21 is large), but errors are not systematically correlated with consumption (the difference of -0.19 is relatively small). The CDR-based method in particular tends to prioritize households that score low on these alternative measures of well-being. These patterns suggest that the CDR-based targeting method may capture aspects of well-being that are not captured by standard survey-based measures of poverty such as wealth and consumption.

To test for systematic misclassification of certain types of households, Table 4 displays the overlap in errors of exclusion and inclusion between methods. Our results suggest that the three classifiers misidentify the same households at a rate only slightly above random.8

8 The rates of overlap should be interpreted relative to the expected overlap in errors for random classifiers. Based on our selection of thresholds such that 27% of the sample is identified as ultra-poor, our three classifiers misidentify 15%-27% of the non-ultra-poor and 51%-65% of the ultra-poor. If these classifiers were random, we would expect approximately 20% overlap in inclusion errors and 55% overlap in exclusion errors.

3.3 Combining Targeting Methods

Since the different targeting methods identify different populations as ultra-poor, there may be complementarities between asset, consumption, and CDR data. As shown in Panel A of Table 2, we find that a combined method, which takes as input the wealth index, total consumption, and the output of the CDR-based method, performs better (AUC = 0.78) than methods using any one data source (AUC = 0.68 - 0.73). As shown in Table S5, the full method also outperforms methods based on any two data sources (AUC = 0.75 - 0.76). The method that combines CDR and asset data (AUC = 0.76) may, however, be more practical than the full combined method, since consumption data is difficult to collect for large populations.

3.4 Targeting Households without Phones

An important limitation of CDR-based targeting is that households without phones do not generate CDR. Here, we show how targeting performance is impacted when households without phones are included in the analysis.
This analysis uses two additional samples of TUP households to evaluate targeting performance: (i) the balanced sample, which adds all of the 472 households without phones to the sample of 535 for whom we have matched CDR; the balanced sample is intended to illustrate the performance of CDR-based targeting if CDR were available from all operators in Afghanistan — though it relies on the assumption that phone owners observed on our mobile network are representative of all phone owners in Afghanistan (an assumption that is not fully satisfied, as shown in Table 1); and (ii) the full sample, which includes all 2,814 households surveyed in the TUP baseline with complete asset and consumption data; this sample includes an additional 1,807 households who report owning a phone, but whose number does not match to any number in the CDR provided to us by the single mobile operator.9

9 These 1,807 households include households that report owning a phone on a different network (this network is estimated to have around 30% market share in Afghanistan), as well as phones on our network that were not active during the six-month period of CDR that we analyze.

Results in Panels B and C of Table 2 show the performance of each targeting approach on the balanced and full samples, respectively. Note that, as described in Section 2.5, different targeting quotas are applied for each panel based on the proportion of each sample that is ultra-poor. In the CDR-based and combined approaches, we report performance when the households without CDR are targeted first (i.e., households without CDR are targeted in a random order, and then the households predicted to be poorest are targeted until the quota is reached) as well as when households without CDR are targeted last (i.e., after the 535 households with phones are targeted, households without phones are included in a random order until the quota is reached).

Unsurprisingly, these results suggest that CDR-based targeting is not effective when a large portion of the target population does not own a phone (e.g., Panel C of Table 2, where only 16% of the sample has matching CDR). However, when we simulate more realistic levels of phone ownership in Panel B (84% of households own phones, based on our survey data), CDR-based targeting is once again comparable to asset- or expenditure-based targeting, particularly when households without phones are targeted first (AUC = 0.72, 0.70, and 0.68 for assets, consumption, and CDR, respectively). On the other hand, if households without phones are targeted last (for example, if program administrators base targeting wholly on CDR and provide no benefits to any household without a phone), the CDR-based method only improves marginally on random targeting.10

10 A key nuance in this analysis is that, for the CDR-based and combined methods where households without phones are targeted first, the precision and recall measures in Table 2 correspond to programs that only target households without phones (at random), as the number of households without phones exceeds the budget constraint of the program. The AUC score, on the other hand, is a summary statistic that represents targeting accuracy at all counterfactual targeting thresholds, and thus is not sensitive to the budget constraint — which explains the contrast between AUC and precision and recall in Table 2, Panels B and C. The ROC curves (Figure 1) and precision-recall curves (Figure S5) highlight how the budget constraint affects precision and recall.
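To make the two orderings concrete, here is a sketch of one way to encode them as a single ranking score (our illustration, not the paper's exact implementation): households without CDR receive random scores shifted above or below every model score, and benefits then flow to the top-quota share of the ranking, as in Section 2.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def targeting_scores(p_ultra_poor, has_cdr, phoneless_first=True):
    """Higher score = targeted earlier. Phoneless households are ranked in a
    random order, placed either before or after all model-scored households."""
    score = np.asarray(p_ultra_poor, dtype=float).copy()  # in [0, 1] where CDR exists
    noise = rng.uniform(size=score.size)                  # random order among phoneless
    offset = 2.0 if phoneless_first else -2.0
    score[~has_cdr] = noise[~has_cdr] + offset
    return score
```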
3.5 Additional Tests and Simulations

Our main analysis considers the household head to be the unit of analysis. As described in Section 2.3, this analysis is based on matching survey-based indices to phone data from the household head, which is consistent with the design of the TUP program and the TUP survey sample frame. An alternative approach matches survey data reported by the household head to all phone numbers associated with the household. As shown in Table S7, the predictive accuracy of these models is slightly attenuated relative to the benchmark results (Table S3).

We also explore the extent to which CDR can be used to predict other measures of socioeconomic status. Our main analysis focuses on the household's ultra-poor designation as the ground-truth measure of poverty, since this label was both carefully curated and the actual criterion used to determine TUP eligibility. In Table S8, we report the accuracy with which CDR (obtained from the household head, who is typically male) can predict consumption and asset-based wealth (elicited from the primary woman of each household).11 In general, these machine learning models trained to directly predict consumption or asset-based wealth do not perform well. This result contrasts with prior work documenting the predictive ability of CDR for measuring asset-based wealth (e.g., Blumenstock et al., 2015). We suspect a key difference in our setting — aside from the fact that we are matching CDR to socioeconomic status at the household rather than the individual level — is the homogeneity of the beneficiary population: whereas Blumenstock et al. (2015) use machine learning to predict the wealth of a nationally representative sample of Rwandan phone owners, our sample consists of 535 individuals from the poorest villages of a single province in Afghanistan, where even the relatively wealthy households are quite poor.

11 Due to the design of the TUP survey, which interviewed women in the household, we cannot avoid this mismatch between the survey respondent and the phone owner.

4 Discussion

Our key finding is that, in a sample of 535 phone-owning households in poor villages in Afghanistan, machine learning methods leveraging phone data are nearly as accurate at identifying ultra-poor households as standard asset- and consumption-based methods. Further, we find that methods combining survey data with CDR perform better than methods using a single data source. However, as we demonstrate empirically, low rates of phone ownership — or the inability to access data from all operators — can undermine the value of CDR-based targeting. In our setting, the CDR-based approach still works well if households without phones are targeted before the CDR-based algorithm selects the poorest households with phones. However, this approach may not be appropriate in other contexts where phone ownership is less predictive of wealth, or where potential beneficiaries have the ability to strategically underreport phone ownership (Björkegren et al., 2020).
As mobile phone penetration rates continue to rise in LMICs (GSMA, 2020), and as programs increasingly rely on mobile phones and mobile money to distribute benefits (cf. Gentilini et al., 2020), CDR-based targeting methods will likely play a more prominent role in the set of options considered by policy makers and program administrators — particularly in contexts like Afghanistan, where traditional targeting benchmarks are missing or unreliable. In just the past few years, for instance, data from mobile phone operators was used in the design of social assistance programs in Colombia, the Democratic Republic of Congo, Pakistan, and Togo (Gentilini et al., 2020, 2021; Aiken et al., 2021). We conclude by highlighting a few policy considerations important for CDR-based targeting.

Speed and cost. An advantage of CDR-based targeting is that it can be used in contexts where face-to-face contact is not feasible, dramatically reducing the time required to implement a targeted program. While it typically takes many months (or years) to implement a proxy means test (PMT), community-based targeting (CBT), or consumption-based targeting, a CDR-based model can be trained in just a few weeks (see Appendix C). Likewise, the marginal costs per household screened are substantially lower with CDR-based targeting than with CBT, PMT, or consumption-based targeting. For instance, Table S9 uses cost estimates obtained from the literature (and detailed in Table S10) to estimate targeting costs for the TUP program.12 Whereas the marginal costs of screening an individual with a CBT or PMT are estimated at $2.20 and $4.00, respectively, the marginal cost of screening with CDR is negligible (see Appendix C).13 For the entire TUP program, which screened around 125,721 households in six provinces, CBT and PMT would add an estimated $276,586 and $502,884, respectively, corresponding to 2.18% and 3.97% of the total program budget.

12 In our cost calculations we obtain estimates for a CBT, rather than the hybrid approach used in the TUP program, as there is more information available on CBT-only costs in the literature. However, as the CBT cost can be interpreted as a lower bound for the cost of a hybrid approach, our qualitative results also apply to a hybrid approach.

13 Marginal costs of CDR-based targeting are negligible because we assume no contact with screened individuals is required. In practice, it may be desirable to solicit informed consent to access CDR. If consent were collected in person, the marginal costs would approach those of a PMT; if collected over the phone, there would still be significant cost savings (see Appendix C).
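These totals follow directly from the per-household estimates; a back-of-the-envelope check (not the Appendix C calculation itself):

```python
households = 125_721
for method, unit_cost in [("CBT", 2.20), ("PMT", 4.00)]:
    print(f"{method}: ${households * unit_cost:,.0f}")
# CBT: $276,586
# PMT: $502,884
```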
Data access and privacy. Access to phone data is necessary for CDR-based targeting. As we show, targeting performance degrades considerably when CDR are not available for subsets of the population. Encouragingly, the past several years have been characterized by a trend toward public sector access to CDR, particularly in the context of the COVID-19 pandemic, during which mobile network operators shared CDR with governments, researchers, and NGOs for social protection purposes (cf. Gentilini et al., 2020, 2021; Aiken et al., 2021). CDR have also been shared with the public sector for public health and humanitarian aid applications (Milusheva et al., 2021). Access issues aside, CDR contain private and sensitive data, including phone numbers and location traces. While much has been written about enabling responsible use of CDR for humanitarian response (e.g., de Montjoye et al., 2018; Oliver et al., 2020), to date no consistent privacy standards exist. Informed consent can increase participant agency, but also complicates implementation logistics. Data minimization may provide a complementary pathway to privacy: as reflected in the feature importances in Table S2, our models rely primarily on only a fraction of the features we derive from mobile phone data — it may therefore be possible to restrict models to features that minimize privacy risk (such as statistics that do not involve contact networks or mobility patterns) without compromising model accuracy. Finally, there may be ways to incorporate differential privacy or other privacy-enhancing technologies into a CDR-based targeting system, but such privatization would likely decrease targeting accuracy (Hu et al., 2015).

Algorithmic transparency and strategic behavior. Using CDR to determine program eligibility may introduce incentives for people to manipulate if and how they use their phones. For instance, while a program that targeted households without phones first might make sense in the context of one-off emergency response, it could not be deployed in equilibrium, as it would introduce undesirable incentives for people to not use their phones. In less extreme settings, we might still expect strategic manipulation of how people use their phones, if they know such behavior is being monitored. These considerations are not unique to CDR, as degrees of manipulation have been documented in social programs that use proxy means tests and other traditional targeting mechanisms (Camacho & Conover, 2011; Banerjee et al., 2018). While complex machine learning algorithms like the one presented in this paper may obfuscate the logic behind targeting decisions and thus reduce the scope for manipulation, this is not a 'solution.' Society often demands transparency in algorithmic decision-making, as black-box decisions are difficult to audit or hold to account. There is therefore a tension between the goals of increasing transparency and reducing manipulation, though recent advances in machine learning explore mechanisms for pursuing both objectives at once (Björkegren et al., 2020).

Centralized vs. local knowledge. CDR-based methods enable a top-down, centralized, and standardized approach to program targeting, rather than a bottom-up approach that prioritizes local knowledge that can be elicited, for example, through community wealth rankings. While the empirical results in this paper indicate that the efficiency gains from CDR-based targeting are substantial, such centralization may reinforce existing power structures (Taylor, 2016; Blumenstock, 2018a; Abebe et al., 2021). Efficiency gains should also be considered within the context of evidence suggesting that participating communities may prefer community-based approaches (Alatas et al., 2012), but also may perceive them as less legitimate (Premand & Schnitzer, 2020).

To summarize, our results suggest that there is potential for using CDR-based methods to determine eligibility for economic aid or interventions, substantially reducing program targeting overhead and costs. Our results also indicate that CDR-based methods may complement and enhance existing survey-based methods.
We note, however, that the practical and ethical limitations to CDR-based targeting are significant. We emphasize the need to consider these limitations and the constraints of specific local contexts alongside the efficiency gains offered by CDR-based targeting.

References

Abebe, R., Aruleba, K., Birhane, A., Kingsley, S., Obaido, G., Remy, S. L., & Sadagopan, S. (2021). Narratives and counternarratives on data sharing in Africa. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 329–341). New York, NY: Association for Computing Machinery. doi: 10.1145/3442188.3445897

Aiken, E., Bellue, S., Karlan, D., Udry, C. R., & Blumenstock, J. (2021). Machine learning and mobile phone data can improve the targeting of humanitarian assistance (Working Paper No. 29070). National Bureau of Economic Research. doi: 10.3386/w29070

Alatas, V., Banerjee, A., Hanna, R., Olken, B., & Tobias, J. (2012). Targeting the poor: Evidence from a field experiment in Indonesia. American Economic Review, 102(4), 1206–1240.

Ali, S., & May, M. (2021). Bangladesh's COVID-19 response is taking digital finance to new levels. https://www.cgap.org/blog/bangladeshs-covid-19-response-taking-digital-finance-new-levels

Alkire, S., Foster, J., Seth, S., Santos, M. E., Roche, J. M., & Ballon, P. (2015). Multidimensional poverty measurement and analysis. Oxford University Press.

Banerjee, A., Duflo, E., Chattopadhyay, R., & Shapiro, J. (2007). Targeting efficiency: How well can we identify the poor? Institute for Financial Management and Research, Centre for Micro Finance, Working Paper Series No. 21.

Banerjee, A., Hanna, R., Olken, B. A., & Sumarto, S. (2018). The (lack of) distortionary effects of proxy-means tests: Results from a nationwide experiment in Indonesia (Working Paper No. 25362). National Bureau of Economic Research. doi: 10.3386/w25362

Bedoya, G., Coville, A., Haushofer, J., Isaqzadeh, M., & Shapiro, J. (2019). No household left behind: Afghanistan Targeting the Ultra Poor impact evaluation. World Bank Policy Research Working Paper, 8877.

Björkegren, D., Blumenstock, J. E., & Knight, S. (2020). Manipulation-proof machine learning. arXiv preprint arXiv:2004.03865.

Blumenstock, J. (2016). Fighting poverty with data. Science, 353, 753–754.

Blumenstock, J. (2018a). Don't forget people in the use of big data for development. Nature, 561, 170–172.

Blumenstock, J. (2018b). Estimating economic characteristics with phone data. American Economic Review: Papers and Proceedings, 108, 72–76.

Blumenstock, J. (2020). Machine learning can help get COVID-19 aid to those who need it most. Nature. doi: 10.1038/d41586-020-01393-7

Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone data. Science, 350, 1073–1076.

Brown, C., Ravallion, M., & van de Walle, D. (2018). A poor means test? Econometric targeting in Africa. Journal of Development Economics, 134, 109–124.

Burke, M., Driscoll, A., Lobell, D. B., & Ermon, S. (2021). Using satellite imagery to understand and promote sustainable development. Science, 371(6535).

Camacho, A., & Conover, E. (2011). Manipulation of social program eligibility. American Economic Journal: Economic Policy, 3(2), 41–65. doi: 10.1257/pol.3.2.41
Chi, G., Fang, H., Chatterjee, S., & Blumenstock, J. E. (2022). Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences, 119(3). doi: 10.1073/pnas.2113658119

Coady, D., Grosh, M., & Hoddinott, J. (2004). Targeting outcomes redux. The World Bank Research Observer, 19(1).

Corral, P., Irwin, A., Krishnan, N., & Mahler, D. G. (2020). Fragility and conflict: On the front lines of the fight against poverty. World Bank Publications.

Deaton, A. (1997). The analysis of household surveys: A microeconometric approach to development policy. World Bank Publications.

Deaton, A. (2016). Measuring and understanding behavior, welfare, and poverty. American Economic Review, 106(6), 1221–1243. doi: 10.1257/aer.106.6.1221

de Montjoye, Y., Gambs, S., Blondel, V., Canright, G., De Cordes, N., Deletaille, S., et al. (2018). On the privacy-conscientious use of mobile phone data. Scientific Data, 5(1), 1–6.

de Montjoye, Y., Rocher, L., & Pentland, A. (2016). bandicoot: A Python toolbox for mobile phone metadata. Journal of Machine Learning Research, 17, 1–5.

Engstrom, R., Hersh, J. S., & Newhouse, D. L. (2017). Poverty from space: Using high-resolution satellite imagery for estimating economic well-being. World Bank Policy Research Working Paper, 8284.

Fatehkia, M., Tingzon, I., Orden, A., Sy, S., Sekara, V., Garcia-Herranz, M., & Weber, I. (2020). Mapping socioeconomic indicators using social media advertising data. EPJ Data Science, 9(1), 22.

Filmer, D., & Pritchett, L. (2001). Wealth effects without expenditure data—or tears: An application to educational enrollments in states of India. Demography, 39, 115–132.

Fortin, S., Kameli, Y., Kone, K., Belem, B., Sangho, H., & Savy, M. (2018). Targeting vulnerable households in rural Mali: Effectiveness of a community-based methodology, with or without addition of a proxy-means test, 2016. Revue d'Épidémiologie et de Santé Publique, 66, S353. doi: 10.1016/j.respe.2018.05.317

Gentilini, U., Almenfi, M., Orton, I., & Dale, P. (2020). Social protection and jobs responses to COVID-19: A real-time review of country measures. World Bank Policy Brief. https://openknowledge.worldbank.org/handle/10986/33635

Gentilini, U., Khosla, S., & Almenfi, M. (2021). Cash in the city.

Grosh, M., & Baker, J. L. (1995). Proxy means tests for targeting social programs. Living Standards Measurement Study Working Paper, 118, 1–49.

Grosh, M., Leite, P., & Wai-Poi, M. (in press). A new look at old dilemmas: Revisiting targeting in social assistance. The World Bank.

GSMA. (2020). Mobile economy. https://www.gsma.com/mobileeconomy/wp-content/uploads/2020/03/GSMA_MobileEconomy2020_Global.pdf

Hanna, R., & Olken, B. (2018). Universal basic incomes versus targeted transfers: Anti-poverty programs in developing countries. Journal of Economic Perspectives, 32, 201–226.
Hernandez, M., Hong, L., Frias-Martinez, V., & Frias-Martinez, E. (2017). Estimating poverty using cell phone data: Evidence from Guatemala. World Bank Policy Research Working Paper, 7969.

Hu, X., Yuan, M., Yao, J., Deng, Y., Chen, L., Yang, Q., et al. (2015). Differential privacy in telco big data platform. Proceedings of the VLDB Endowment, 8(12), 1692–1703.

Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790–794. doi: 10.1126/science.aaf7894

Jerven, M. (2013). Poor numbers. Cornell University Press.

Karlan, D., Lowe, M., Osei, R., Osei-Akoto, I., Roth, B., & Udry, C. (2021). Cash transfers as COVID-19 relief: Evidence from Ghana. https://www.theigc.org/blog/cash-transfers-as-covid-19-relief-evidence-from-ghana/

Karlan, D., & Thuysbaert, B. (2019). Targeting ultra-poor households in Honduras and Peru. The World Bank Economic Review, 33(1), 63–94.

Lindert, K., Karippacheril, T. G., Caillava, I. R., & Chávez, K. N. (2020). Sourcebook on the foundations of social protection delivery systems. World Bank Publications.

Milusheva, S., Lewin, A., Gomez, T. B., Matekenya, D., & Reid, K. (2021). Challenges and opportunities in accessing mobile phone data for COVID-19 response in developing countries. Data & Policy, 3.

Oliver, N., Lepri, B., Sterly, H., Lambiotte, R., Deletaille, S., De Nadai, M., et al. (2020). Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle. Science Advances, 6(23).

Paul, B. V., Msowoya, C., Archibald, E., Sichinga, M., Peredo, A. C., & Malik, M. A. A. (2021). Malawi COVID-19 urban cash intervention process evaluation report. World Bank Publications.

Pokhriyal, N., & Jacques, D. (2017). Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences, 114, E9783–E9792.

Premand, P., & Schnitzer, P. (2020). Efficiency, legitimacy, and impacts of targeting methods: Evidence from an experiment in Niger. The World Bank Economic Review. doi: 10.1093/wber/lhaa019

Ravallion, M. (1998). Poverty lines in theory and practice (Vol. 133). World Bank Publications.

Schnitzer, P., & Stoeffler, Q. (2021). Targeting for social safety nets.

Sen, A. (1992). The political economy of targeting. World Bank, Washington, DC.

Sheehan, E., Meng, C., Tan, M., Uzkent, B., Jean, N., Lobell, D., et al. (2019). Predicting economic development using geolocated Wikipedia articles. In Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Steele, J., Sundsøy, P., Pezzulo, C., Alegana, V., Bird, T., Blumenstock, J., et al. (2017). Mapping poverty using mobile phone and satellite data. Journal of the Royal Society Interface, 14.

Taylor, L. (2016). No place to hide? The ethics and analytics of tracking mobility using mobile phone data. Environment and Planning D: Society and Space, 34(2), 319–336.

USAID. (2021). USAID's direct cash transfer program helps over 85,000 vulnerable Liberians cope with economic fallout from COVID-19. http://web.archive.org/web/20080207010024/http://www.808multimedia.com/winnt/kernel.htm

World Bank. (2020). Poverty and Shared Prosperity 2020: Reversals of Fortune. The World Bank. https://openknowledge.worldbank.org/handle/10986/34496 (License: CC BY 3.0 IGO)
Tables and Figures

Table 1: Summary statistics for different samples of survey respondents

                                  (1)              (2)            (3)            (4)
                                  Full sample      Matched        Unmatched,     Unmatched,
Outcome                           (all obs.)       subsample      owns phone     no phone
Panel A: Balance of Covariates
Ultra-Poor                        0.42 (0.49)      0.27 (0.45)    0.40 (0.49)    0.66 (0.47)
Asset Index                       0.00 (2.01)      1.36 (2.60)   -0.05 (1.76)   -1.35 (0.79)
Log Expenditures                  4.43 (0.71)      4.64 (0.70)    4.46 (0.70)    4.12 (0.65)
# Phones                          1.35 (1.18)      1.72 (1.33)    1.59 (1.04)    0.00 (0.00)
Food Security Index               0.30 (0.90)      0.35 (0.74)    0.34 (0.93)    0.10 (0.89)
Financial Inclusion Index         0.15 (1.27)      0.34 (1.39)    0.15 (1.32)   -0.05 (0.79)
Psychological Well-being Index    0.35 (1.01)      0.38 (1.00)    0.43 (0.97)   -0.02 (1.07)
CWR Group                         0.62 (0.90)      0.89 (1.02)    0.62 (0.88)    0.26 (0.66)
Panel B: Correlations Between Outcomes
Ultra-Poor ←→ Asset Index        -0.32            -0.30          -0.27          -0.14
Ultra-Poor ←→ Consumption        -0.39            -0.30          -0.39          -0.26
Asset Index ←→ Consumption        0.37             0.34           0.34           0.15
N                                 2,814            535            1,807          472

Notes: Table reports average characteristics, with standard deviations in parentheses, of TUP survey respondents. Each column represents a different sample of respondents: (1) all respondents in the TUP survey; (2) respondents who own a phone, where the phone number matches the CDR obtained from the mobile phone operator; (3) respondents who report owning a phone but whose phone number does not match the CDR obtained from the operator; (4) respondents who report they do not own a phone.

Table 2: Targeting simulation results

                                        (1)            (2)            (3)            (4)
Targeting Method                        AUC            Accuracy       Precision      Recall
Panel A: Matched Sample (N=535) - for whom we have survey and CDR data
Random                                  0.50 (0.028)   0.60 (0.025)   0.27 (0.038)   0.27 (0.038)
Asset Index                             0.73 (0.024)   0.72 (0.020)   0.49 (0.041)   0.49 (0.041)
Consumption                             0.71 (0.026)   0.69 (0.023)   0.45 (0.038)   0.45 (0.038)
CDR                                     0.68 (0.027)   0.69 (0.021)   0.42 (0.042)   0.42 (0.042)
Combined                                0.78 (0.022)   0.75 (0.020)   0.55 (0.039)   0.55 (0.039)
Panel B: Balanced Sample (N=1,007) - as above, plus households without phones
Random                                  0.50 (0.017)   0.90 (0.006)   0.05 (0.010)   0.05 (0.010)
Asset Index                             0.72 (0.026)   0.90 (0.006)   0.10 (0.013)   0.10 (0.013)
Consumption                             0.70 (0.028)   0.90 (0.006)   0.15 (0.025)   0.15 (0.025)
CDR (Target Phoneless First)            0.68 (0.030)   0.90 (0.006)   0.11 (0.035)   0.11 (0.035)
CDR (Target Phoneless Last)             0.51 (0.028)   0.90 (0.006)   0.12 (0.033)   0.12 (0.033)
Combined (Target Phoneless First)       0.74 (0.026)   0.90 (0.006)   0.11 (0.046)   0.11 (0.046)
Combined (Target Phoneless Last)        0.57 (0.022)   0.90 (0.006)   0.18 (0.007)   0.18 (0.007)
Panel C: Full Sample (N=2,814) - as above, plus households with phones on other networks
Random                                  0.50 (0.009)   0.89 (0.005)   0.06 (0.007)   0.06 (0.007)
Asset Index                             0.65 (0.017)   0.89 (0.005)   0.07 (0.014)   0.07 (0.014)
Consumption                             0.69 (0.015)   0.89 (0.006)   0.08 (0.031)   0.08 (0.031)
CDR (Target Phoneless First)            0.52 (0.008)   0.89 (0.005)   0.06 (0.008)   0.06 (0.008)
CDR (Target Phoneless Last)             0.48 (0.008)   0.89 (0.005)   0.08 (0.010)   0.08 (0.010)
Combined (Target Phoneless First)       0.52 (0.008)   0.89 (0.005)   0.06 (0.008)   0.06 (0.008)
Combined (Target Phoneless Last)        0.49 (0.008)   0.89 (0.005)   0.09 (0.009)   0.09 (0.009)

Notes: Four different measures of performance (columns) are reported for different targeting methods (rows), using different samples of survey respondents (panels). Standard deviations, calculated using 1,000 bootstrap samples, in parentheses. Panel A: the 535-household subsample that is matched to CDR. Panel B: the 535-household matched sample, plus the 472 households that do not have a phone; this is meant to approximate targeting performance if CDR from all mobile networks were available. Sample weights are applied as described in Section 2.5. Panel C: all 2,814 observations from the TUP survey, including households matched to CDR, households that own phones not matched to CDR, and households without phones, with sample weights applied. For Panels B and C, we simulate two types of CDR-based targeting: targeting households without phones first and targeting households without phones last.
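For readers interested in the mechanics, the sketch below illustrates (it is not the authors' code) how metrics of the kind in Table 2 can be computed. The classification threshold is chosen so that the number of predicted ultra-poor households equals the true number, which is why precision and recall coincide in each row of Table 2, and standard deviations are taken over 1,000 bootstrap draws. The inputs `y_true` and `scores` are hypothetical placeholder arrays of ultra-poor labels and predicted poverty scores (higher score = more likely ultra-poor).

```python
# A minimal sketch (not the paper's replication code) of the evaluation
# reported in Table 2, under the assumptions stated above.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def targeting_metrics(y_true, scores):
    # Classify the k highest-scoring households as ultra-poor, with k equal
    # to the true number of ultra-poor households; with equal numbers of
    # predicted and actual positives, precision and recall coincide.
    k = int(y_true.sum())
    cutoff = np.sort(scores)[::-1][k - 1]
    y_pred = (scores >= cutoff).astype(int)
    return (roc_auc_score(y_true, scores), accuracy_score(y_true, y_pred),
            precision_score(y_true, y_pred), recall_score(y_true, y_pred))

def bootstrap_sds(y_true, scores, n_boot=1000, seed=0):
    # Standard deviations of each metric across bootstrap resamples of the
    # evaluation sample, drawn with replacement.
    rng = np.random.default_rng(seed)
    n = len(y_true)
    draws = [targeting_metrics(y_true[idx], scores[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.asarray(draws).std(axis=0)
```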
Table 3: What types of households are misclassified?

Panel A: Ultra-Poor Households (Differences Between True Positives and False Negatives)

                                Asset Index                 Consumption                 CDR
                                TP      FN      Diff.       TP      FN      Diff.       TP      FN      Diff.
Ultra-Poor                      1.00    1.00    0.00        1.00    1.00    0.00        1.00    1.00    0.00
                               (0.00)  (0.00)  (0.00)      (0.00)  (0.00)  (0.00)      (0.00)  (0.00)  (0.00)
Asset Index                    -1.03    1.18   -2.21       -0.34    0.47   -0.81       -0.09    0.25   -0.34
                               (0.49)  (1.34)  (0.17)      (1.09)  (1.69)  (0.23)      (1.16)  (1.70)  (0.24)
Consumption                     4.21    4.40   -0.19        3.78    4.74   -0.96        4.29    4.32   -0.02
                               (0.70)  (0.62)  (0.11)      (0.32)  (0.56)  (0.07)      (0.60)  (0.71)  (0.11)
# Phones                        0.89    1.63   -0.74        1.02    1.48   -0.46        1.18    1.33   -0.16
                               (0.68)  (1.12)  (0.15)      (0.73)  (1.14)  (0.16)      (0.61)  (1.21)  (0.15)
Food Security Index            -0.59   -0.51   -0.08       -0.83   -0.32   -0.51       -0.51   -0.58    0.07
                               (1.13)  (1.10)  (0.18)      (1.19)  (0.99)  (0.18)      (1.14)  (1.09)  (0.19)
Financial Inclusion Index      -0.00    0.29   -0.29        0.10    0.19   -0.09        0.16    0.14    0.02
                               (0.79)  (1.02)  (0.15)      (0.80)  (1.02)  (0.15)      (0.98)  (0.88)  (0.16)
Psychological Well-being Index -0.35   -0.13   -0.22       -0.37   -0.12   -0.24       -0.31   -0.17   -0.14
                               (0.92)  (0.94)  (0.15)      (0.86)  (0.98)  (0.15)      (0.81)  (1.02)  (0.15)
CWR Group                       0.09    0.01    0.07        0.02    0.08   -0.06        0.06    0.04    0.03
                               (0.44)  (0.12)  (0.05)      (0.12)  (0.41)  (0.05)      (0.40)  (0.24)  (0.06)

Panel B: Non-Ultra-Poor Households (Differences Between True Negatives and False Positives)

                                Asset Index                 Consumption                 CDR
                                TN      FP      Diff.       TN      FP      Diff.       TN      FP      Diff.
Ultra-Poor                      0.00    0.00    0.00        0.00    0.00    0.00        0.00    0.00    0.00
                               (0.00)  (0.00)  (0.00)      (0.00)  (0.00)  (0.00)      (0.00)  (0.00)  (0.00)
Asset Index                     2.53   -1.08    3.61        2.06    0.94    1.12        1.94    1.43    0.51
                               (2.62)  (0.50)  (0.16)      (2.92)  (1.75)  (0.26)      (2.87)  (2.27)  (0.30)
Consumption                     4.82    4.57    0.25        4.97    3.98    0.99        4.78    4.74    0.04
                               (0.66)  (0.65)  (0.08)      (0.58)  (0.23)  (0.04)      (0.68)  (0.61)  (0.08)
# Phones                        2.11    0.96    1.15        1.98    1.52    0.46        1.91    1.80    0.11
                               (1.43)  (0.76)  (0.12)      (1.49)  (0.92)  (0.13)      (1.44)  (1.24)  (0.16)
Food Security Index             0.24   -0.16    0.40        0.24   -0.14    0.37        0.15    0.18   -0.02
                               (0.87)  (1.03)  (0.13)      (0.88)  (0.99)  (0.12)      (0.91)  (0.94)  (0.12)
Financial Inclusion Index       0.80   -0.01    0.82        0.77    0.18    0.59        0.78    0.17    0.61
                               (4.92)  (0.82)  (0.29)      (4.94)  (1.24)  (0.31)      (4.98)  (1.10)  (0.31)
Psychological Well-being Index  0.69    0.21    0.47        0.62    0.49    0.13        0.62    0.49    0.13
                               (0.97)  (0.75)  (0.10)      (0.98)  (0.80)  (0.11)      (0.95)  (0.93)  (0.12)
CWR Group                       1.30    0.84    0.46        1.23    1.13    0.10        1.26    1.01    0.25
                               (1.00)  (0.96)  (0.12)      (1.03)  (0.94)  (0.12)      (1.01)  (0.98)  (0.12)

Notes: Table shows the average characteristics, with standard deviations in parentheses, of households that are correctly and incorrectly classified by three different targeting approaches (indicated by column-group headers: Asset Index, Consumption, and CDR), using the matched sample. Panel A highlights differences between ultra-poor households correctly classified as ultra-poor (True Positives, TP) and ultra-poor households misclassified as non-ultra-poor (False Negatives, FN; i.e., exclusion errors). Panel B highlights differences between non-ultra-poor households correctly classified as non-ultra-poor (True Negatives, TN) and non-ultra-poor households misclassified as ultra-poor (False Positives, FP; i.e., inclusion errors).
Table 4: Overlap in targeting errors between methods

                 Asset Index    Consumption    CDR        Combined
Panel A: Overlap in Errors of Exclusion
Asset Index      100.00%        65.33%         57.33%     66.67%
Consumption      61.25%         100.00%        56.25%     62.50%
CDR              51.19%         53.57%         100.00%    63.10%
Combined         75.76%         75.76%         80.30%     100.00%
Panel B: Overlap in Errors of Inclusion
Asset Index      100.00%        26.67%         22.67%     48.00%
Consumption      25.00%         100.00%        16.25%     37.50%
CDR              20.24%         15.48%         100.00%    46.43%
Combined         54.55%         45.45%         59.09%     100.00%

Notes: Table measures the extent to which the targeting errors produced by each pair of targeting methods overlap, evaluated on the matched sample of 535 TUP respondents. Panel A: overlap between ultra-poor households misclassified as non-ultra-poor (errors of exclusion) under each targeting method. Panel B: overlap between non-ultra-poor households misclassified as ultra-poor (errors of inclusion).

Figure 1: Predicting ultra-poor status from CDR.

Notes: Panel A: Comparison of the predictive accuracy of asset-, consumption-, and CDR-based methods for identifying the ultra-poor in our 535-household matched sample. To adjust for class balance, thresholds for classification (shown as dashed black vertical lines) are selected such that the correct number of households is identified as ultra-poor. Panel B: Confusion matrices showing the targeting accuracy of each method shown in Panel A. Panel C: ROC curves for each of the four targeting methods. In the third subplot, the CDR-based and combined methods target non-phone-owning households first, as described in Section 2.5.

Online Appendix

A Machine learning methods and hyperparameters

Although our paper focuses on identifying the ultra-poor with CDR, we experiment with predicting four measures of ground-truth welfare from CDR features: ultra-poor status (binary), below the national poverty line (binary), asset index (continuous), and log consumption (continuous). For the binary measures, we experiment with four classification models: logistic regression (unregularized), logistic regression with an L1 penalty, a random forest, and a gradient boosting model. For the continuous measures, we experiment with four regression models: linear regression, LASSO regression, a random forest, and a gradient boosting model. The linear models and random forest are implemented in Python's scikit-learn package; the gradient boosting model is implemented with Microsoft's LightGBM. In each case, we produce predictions out-of-sample over 10-fold cross-validation. We use nested cross-validation, tuning the hyperparameters of each model over 5-fold cross-validation within each of the outer folds, to avoid any information leakage between folds. We report both the mean score across the 10 folds and the overall score when data from all folds are pooled together.
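As a concrete illustration, the following minimal sketch shows how such a nested cross-validation could be implemented for the gradient boosting classifier. It is not the paper's replication code: the grid values match the gradient boosting hyperparameters listed below, but the preprocessing hyperparameters (missing-data, variance, and winsorization thresholds) are omitted for brevity, and `X` and `y` are placeholder numpy arrays.

```python
# A minimal sketch of nested cross-validation with LightGBM and scikit-learn,
# under the assumptions stated above.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

GRID = {
    "n_estimators": [20, 50, 100],   # number of trees
    "min_child_samples": [5, 10],    # minimum data in leaf
    "num_leaves": [5, 10, 20],       # number of leaves
    "learning_rate": [0.05, 0.075],
}

def nested_cv(X, y, n_outer=10, n_inner=5, seed=0):
    """Out-of-fold predictions with hyperparameters tuned inside each outer fold."""
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
    fold_aucs, oof = [], np.zeros(len(y))
    for train, test in outer.split(X, y):
        # Tuning sees only the training folds, so no information leaks into
        # the held-out fold.
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
        search = GridSearchCV(LGBMClassifier(random_state=seed), GRID,
                              scoring="roc_auc", cv=inner)
        search.fit(X[train], y[train])
        oof[test] = search.predict_proba(X[test])[:, 1]
        fold_aucs.append(roc_auc_score(y[test], oof[test]))
    # Mean score across the outer folds, and the score when folds are pooled.
    return float(np.mean(fold_aucs)), roc_auc_score(y, oof)
```

The same skeleton applies to the other models, swapping in a different estimator and hyperparameter grid.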
For the linear models and random forest, missing data are mean-imputed and each feature is scaled to zero mean and unit variance before model fitting; these transformations are done separately for each fold, with parameters fitted only on that fold's training data. For the gradient boosting model, missing values are left as-is and features are not scaled. To report selected hyperparameters and feature importances, we re-fit each model on the entire dataset, again tuning hyperparameters over 5-fold cross-validation. We also report the top 5 features for each model, determined by the magnitude of the coefficients for the linear models, and by impurity reductions for the tree-based models. Hyperparameters are selected from the following grids for each model:

Linear/Logistic Regression
• Drop features where over X% of observations are missing data: X = {50%, 80%, 100%}
• Drop features with variance under: {0, 0.01, 0.1}
• Winsorization limit: {0%, 1%, 5%}

LASSO Regression
• Drop features where over X% of observations are missing data: X = {50%, 80%, 100%}
• Drop features with variance under: {0, 0.01, 0.1}
• Winsorization limit: {0%, 1%, 5%}
• L1 penalty: {0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100}

Random Forest
• Drop features where over X% of observations are missing data: X = {50%, 80%, 100%}
• Drop features with variance under: {0, 0.01, 0.1}
• Winsorization limit: {0%, 1%, 5%}
• Number of trees: {20, 50, 100}
• Maximum depth: {1, 2, 4, 6, 8, 10, 12}

Gradient Boosting Model
• Drop features where over X% of observations are missing data: X = {50%, 80%, 100%}
• Drop features with variance under: {0, 0.01, 0.1}
• Winsorization limit: {0%, 1%, 5%}
• Number of trees: {20, 50, 100}
• Minimum data in leaf: {5, 10}
• Number of leaves: {5, 10, 20}
• Learning rate: {0.05, 0.075}

B Abbreviations in Feature Names

Figure S4 and Tables S2, S7, and S8 use a set of abbreviations in CDR feature names. This appendix lists the relevant abbreviations; a toy sketch of how features of this kind are derived from raw call records follows the list.

• BOC: Balance of contacts
• CD: Call duration
• IPC: Interactions per contact
• IT: Interevent time
• NOI: Number of interactions
• PPD: Percent pareto durations (percentage of call contacts accounting for 80% of call time)
• PPI: Percent pareto interactions (percentage of contacts accounting for 80% of a subscriber's interactions)
• RD: Response delay
• RR: Response rate
• WD: Weekday
• WE: Weekend
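To make the naming convention concrete, the toy example below computes two features of this kind ("% Nocturnal WD Call" and "IT WD Call Mean") from a small synthetic call log with pandas. It is an illustration, not the paper's feature-extraction pipeline: the night-time window (7pm to 7am) and the column names are assumptions made for the example.

```python
# A toy illustration of deriving CDR features of the kind named above.
import pandas as pd

# Hypothetical call log; real CDR would come from the mobile operator.
calls = pd.DataFrame({
    "caller_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2016-03-07 20:15", "2016-03-08 06:30", "2016-03-08 11:00",
        "2016-03-07 09:00", "2016-03-09 21:45"]),
})
calls["hour"] = calls["timestamp"].dt.hour
wd = calls[calls["timestamp"].dt.dayofweek < 5]  # weekday (WD) records only

# "% Nocturnal WD Call": share of a subscriber's weekday calls made at night.
night = (wd["hour"] >= 19) | (wd["hour"] < 7)
pct_nocturnal_wd_call = night.groupby(wd["caller_id"]).mean()

# "IT WD Call Mean": mean interevent time (seconds) between weekday calls.
it_wd_call_mean = (wd.sort_values("timestamp")
                     .groupby("caller_id")["timestamp"]
                     .apply(lambda s: s.diff().dt.total_seconds().mean()))
```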
C Cost and Speed Calculations

In the discussion section we compare targeting methods on cost and speed, since part of the value of the phone-based targeting approach lies in how much cheaper and faster it is than asset-, consumption-, or CBT-based approaches. Administrative data on targeting costs were not collected as part of the TUP program, so we turn to other studies of program targeting to estimate the costs of CBT and asset-based (or PMT) methods. We treat the costs of an asset-index-based approach and a PMT as equivalent in this section, as both require comprehensive household surveys.14 We identify three studies that provide variable targeting costs for PMT and CBT methods: Alatas et al. (2012) provide variable costs for CBT and PMT-based targeting of a single program in Indonesia; Karlan & Thuysbaert (2019) provide variable costs for CBT and PMT-based targeting of an ultra-poor program in Honduras and one in Peru; and Schnitzer & Stoeffler (2021) provide variable costs for three CBT-based programs and four PMT-based programs in seven countries in Sub-Saharan Africa.15 Table S10 summarizes the cost estimates from each of these papers; we use the median per-household targeting cost for each method in our analysis ($2.20 per household for CBT and $4.00 per household for PMT). Using these global estimates to model targeting costs in Afghanistan is not ideal, but since no data on targeting costs are available from the TUP program or other anti-poverty programs in the country, these values are the best available basis for our cost analysis.

We are unable to find any papers that document the cost of consumption-based targeting, as consumption data are rarely used as a real-world targeting strategy. We therefore take the cost of targeting on consumption to be strictly greater than the cost of targeting on a PMT, since consumption modules take longer to collect than PMT data in household surveys. In practice, we expect the cost of targeting on consumption to be substantially greater than the cost of targeting on a PMT.

For phone-based targeting, we associate no cost with the collection and analysis of phone data. While in some cases phone data may need to be purchased from the operator, partnerships between mobile network operators and governments for social protection and public health applications have not, to date, involved payment (Milusheva et al., 2021). The fixed cost of mobile data analysis is non-negligible, but its contribution to marginal cost approaches zero as the number of screened households increases. A phone-based targeting method that collects informed consent from program applicants before analyzing their phone data would have nonzero marginal costs, though these would depend on the modality of consent collection: if consent were collected in person, costs would likely be only slightly lower than those of a PMT, since every household would need to be visited; if consent were collected over the phone via SMS or voice, costs would likely be significantly lower.

It is worth noting that our benchmark in this paper is the hybrid model combining a CBT with a verification component, but due to limited estimates in the literature we leave this strategy out of our cost analysis. We consider the CBT cost a lower bound for the hybrid strategy, so our results would be qualitatively unchanged if the hybrid strategy were also included in the cost comparison. Alatas et al. (2012) suggest that there are synergies between targeting approaches, so that combining approaches is less costly than the sum of the costs of the two approaches individually, though certainly more costly than CBT targeting alone.

Our cost analysis finally relies on administrative data from the TUP program. The TUP program in its entirety served 7,500 households across six provinces of Afghanistan.

14 In practice, an asset-based approach may be slightly cheaper than a PMT, as it does not require conducting a consumption module for a subset of surveys to train the PMT.
15 To our knowledge, no studies incorporate fixed targeting costs, as these are typically indistinguishable from the fixed costs of other components of program set-up.
While there are no data on the total number of households screened by the TUP program as a whole, the portion of the program in Balkh province that was enrolled in the RCT identified 1,235 ultra-poor households out of 20,702 households screened (Bedoya et al., 2019). Assuming similar eligibility rates across Afghanistan, we estimate that the TUP program as a whole screened around 125,721 households. We use this value to estimate total targeting costs for the TUP program under counterfactual targeting approaches. Eligible households received benefits totaling $1,688, including a productive asset, cash transfers, a health voucher, training, biweekly social worker visits, and a veterinarian visit once every two months during the year of the intervention. The total benefits disbursed by the program were therefore on the order of $12.7 million (although total program costs, including overhead, were closer to $15 million (Bedoya et al., 2019)); we use the total value of benefits to compare the costs of program targeting under our set of counterfactual targeting approaches to the direct costs of program benefits in Table S9. We find that targeting costs for a PMT or asset-based approach would represent approximately 3.97% of the total benefits delivered by the program, and costs for a CBT approach approximately 2.18%. In comparison, costs for the phone-based approach would be negligible.

When it comes to speed, in-person data collection for an asset-based (or PMT) targeting approach typically takes months or years to prepare and implement (World Bank, 2020). The CDR-based approach can be rolled out comparatively quickly, but there are still practical hurdles to implementation. First, training data for the CDR-based poverty prediction model must be collected, preferably shortly before program roll-out (Aiken et al., 2021). While in the TUP project training data were collected in person in a household survey, in other contexts training data collection has been expedited via phone surveys (Blumenstock et al., 2015; Aiken et al., 2021); even then, it will typically take several weeks to design a survey instrument and collect data. Second, the CDR-based method requires data from mobile network operators, and data-sharing agreements with operators take at minimum a few weeks to arrange, and substantially longer in the worst case (Milusheva et al., 2021). Third, and finally, training a CDR-based poverty prediction model is demanding in terms of memory, computing power, and human capacity, and will likely take several weeks to implement.
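The arithmetic behind these cost figures is simple enough to reproduce directly. The snippet below recomputes the screening estimate and the targeting-cost shares reported in Table S9 from the inputs given above (screening rates from the Balkh RCT and median per-household costs from Table S10); small rounding differences aside, it recovers the reported totals and percentages.

```python
# Back-of-the-envelope reproduction of the targeting-cost figures above.
eligible_rct, screened_rct = 1_235, 20_702    # Balkh RCT screening
beneficiaries, benefit_per_hh = 7_500, 1_688  # nationwide TUP program

screened_total = beneficiaries * screened_rct / eligible_rct  # ~125,721 households
total_benefits = beneficiaries * benefit_per_hh               # ~$12.7 million

for method, cost_per_hh in [("CBT", 2.20), ("PMT", 4.00), ("Phone", 0.00)]:
    total = cost_per_hh * screened_total
    print(f"{method}: ${total:,.0f} targeting cost "
          f"({100 * total / total_benefits:.2f}% of benefits)")
```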
Supplementary Tables and Figures

Figure S1: Histograms showing the distribution of each underlying asset used to construct the asset index.

Figure S2: Distributions of the asset index and log-transformed consumption, for the entire survey sample, separately for ultra-poor and non-ultra-poor households, and again separately for households in the subsample matched to CDR, households outside the matched subsample that report owning at least one mobile phone, and households outside the matched subsample that report not owning a mobile phone.

Figure S3: Correlation between the asset index and log-transformed consumption, separately for the entire survey sample and the matched subsample. We include the LOESS fit, along with a 95% confidence interval.

Figure S4: Kernel density estimates for 16 of the most important features for predicting ultra-poor status from CDR, with density estimates shown separately for ultra-poor and non-ultra-poor households. Since many features are near-redundant, rather than showing the raw top 16 features from Table S2, we show 16 selected features from the top 50. See Appendix B for abbreviations in feature names.

Figure S5: Precision-recall curves for each of the four targeting methods. In the third subplot, the CDR-based and combined methods target non-phone-owning households first, as described in Section 2.5.

Table S1: Direction of first principal component of asset ownership

Asset                          Magnitude
Radio/CD Player                0.04
TV                             0.37
TV Dish                        0.29
VCR/DVD Player                 0.15
Refrigerator                   0.25
Generator                      0.11
Mattress                       0.24
Mobile Phone                   0.31
Non-Mobile Phone               0.06
Iron                           0.36
Bed Frame                      0.29
Jewelry                        0.27
Mosquito Net                   0.26
Mosquito Repellent Candle      0.08
Fan                            0.37
Camera                         0.16

Notes: The asset index is calculated over the entire 2,814-household sample, without sample weights. We standardize each of the features to zero mean and unit variance before decomposition. The first principal component accounts for 25.28% of the variation in these standardized features.
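For reference, the sketch below shows the standard construction of such a PCA-based asset index; it is an illustration, with the `assets` DataFrame standing in as a synthetic placeholder for the real survey's 16 binary ownership indicators.

```python
# A sketch of the PCA-based asset index construction summarized in Table S1.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
assets = pd.DataFrame(rng.integers(0, 2, size=(2814, 16)),
                      columns=[f"asset_{i}" for i in range(16)])

Z = StandardScaler().fit_transform(assets)  # zero mean, unit variance
pca = PCA(n_components=1).fit(Z)
asset_index = pca.transform(Z)[:, 0]        # first principal component

# The component loadings correspond to the "Magnitude" column of Table S1
# (25.28% of variance explained in the paper). The sign of a principal
# component is arbitrary, so the index may need to be flipped so that higher
# values indicate greater wealth.
print(pca.explained_variance_ratio_[0])
print(dict(zip(assets.columns, pca.components_[0])))
```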
Table S2: Feature importances (gradient boosting model)

Feature                                  Importance   Feature                                Importance
CD WE Call Median                        8            IT Recharges Night Min                 3
% At Home WD Night                       7            BOC WD Call Median                     3
IPC WD Night Call Kurtosis               7            IT WE Call Min                         2
CD Day Call Median                       6            BOC WD Night Call Kurtosis             2
IT WE Call Mean                          6            IPC Day Call Kurtosis                  2
Churn Rate Mean                          5            % Nocturnal WD Call                    2
IPC Day Call Skew                        5            IT WD Day Text Mean                    2
IT Recharges WD Median                   5            CD Night Call Max                      2
% At Home Day                            4            IT WE Call Skew                        2
% Initiated Interactions Day Call        4            IPC WE Day Call Kurtosis               2
% Initiated Interactions WD Day Call     4            IT WE Text Median                      2
BOC WD Call Max                          4            % At Home WE Night                     2
% Initiated Interactions WD Night Call   3            Entropy Contacts WE Day Call           2
PPD Night Call                           3            # Recharges WD Day                     2
IT Recharges WD Night Min                3            Entropy Antennas WD                    2
IT Recharges Night Median                3            IPC Night Call Skew                    2
IPC WD Night Text Mean                   3            IT WE Night Call Mean                  2
IT Recharges Day Kurtosis                3            # Contacts Day Call                    2
IT Night Text Min                        3            CD WD Call Max                         2
IT WE Day Text Median                    3            IT Day Call Mean                       2
# Antennas WD                            3            IT WD Night Text Min                   2
CD WD Night Call Kurtosis                3            Entropy Antennas Day                   2
IPC Night Call Kurtosis                  3            % Initiated Interactions WE Day Call   2
IPC WE Night Call Kurtosis               3            CD WE Day Call Kurtosis                1
IPC WE Night Call Skew                   3            IPC Day Call Std                       1

Notes: For our selected machine learning model (the gradient boosting model used to predict ultra-poor status from CDR features), we display feature importances for the top 50 features. Feature importances for the gradient boosting model represent the total number of times a feature is used for a split in the entire ensemble of decision trees. We report feature importances when the model is trained on all 535 observations (rather than over cross-validation). See Appendix B for abbreviations in feature names.

Table S3: Details of machine learning models

Model                  AUC    Top Five Features
Logistic (No Penalty)  0.53   Reporting # Records, Active Days, Active Days Day, Active Days Night, Active Days WD
Logistic (L1 Penalty)  0.66   Reporting # Records, Active Days, Active Days Day, Active Days Night, Active Days WD
Random Forest          0.68   NOI Out Day Call, NOI Out WD Day Call, Nois Call, Entropy Contacts Night Call, NOI Out WE Call
Gradient Boosting      0.68   CD WE Call Median, % At Home WD Night, IPC WD Night Call Kurtosis, CD Day Call Median, IT WE Call Mean

Notes: Each row indicates the performance (AUC) of a different machine learning algorithm trained to predict ultra-poor status on the sample of 535 matched households. AUC is reported as the mean AUC score over 10-fold cross-validation. See Appendix B for details of features.

Table S4: Machine learning an asset index

Model                  AUC    Top Five Features
Logistic (L1 Penalty)  0.60   TV, TV Dish, Fridge, Mattress, Mobile Phone
Random Forest          0.73   Fridge, Iron, Bedframe, Mattress, TV Dish
Gradient Boosting      0.74   Mattress, Bedframe, Fridge, Mobile Phone, TV Dish

Notes: The asset index benchmark we use is constructed following standard procedures based on principal component analysis (see Table S1). However, it is possible that an alternative asset-based predictor, trained using machine learning to predict ultra-poor status directly from the 16 underlying components, could perform better. We test this hypothesis by adapting our machine learning pipeline for identifying the ultra-poor from CDR to the task of identifying the ultra-poor from asset ownership. As with the CDR-based prediction, we evaluate the model over nested cross-validation in our 535-household matched sample: the model's predictions are evaluated out-of-sample over 10-fold cross-validation, and within each fold hyperparameters are tuned over 5-fold cross-validation. We retrain the model on the entire dataset to report hyperparameters and feature importances. Hyperparameters are chosen from the same grids as for the CDR-based models. We display the AUC score and top features for each model.

Table S5: Performance using one, two, or three predictor datasets

Data Sources                    AUC
Assets                          0.73 (0.025)
Consumption                     0.71 (0.000)
CDR                             0.68 (0.028)
Assets + Consumption            0.76 (0.017)
Assets + CDR                    0.76 (0.025)
Consumption + CDR               0.75 (0.016)
Assets + Consumption + CDR      0.78 (0.019)

Notes: AUC scores for targeting methods using a single data source, a pair of data sources, and all three data sources together (in our 535-household matched sample). Standard deviations are calculated from 1,000 bootstrapped samples of the same size as the original sample, drawn with replacement.

Table S6: Targeting simulation results for one train-test split

                       (1)    (2)        (3)         (4)
Targeting Method       AUC    Accuracy   Precision   Recall
Random                 0.50   0.48       0.28        0.28
Asset Index            0.68   0.63       0.33        0.33
Consumption            0.74   0.74       0.53        0.53
CDR                    0.75   0.70       0.47        0.47
Combined               0.82   0.74       0.53        0.53

Notes: Reproduction of the main results (Table 2, Panel A) using a single train-test split (for our 535-household matched sample, with 10% of the observations in the test set).
Table S7: Matching households to multiple phone numbers

Model                  AUC    Top Five Features
Logistic (No Penalty)  0.50   Reporting # Records, Active Days, Active Days Day, Active Days Night, Active Days WD
Logistic (L1 Penalty)  0.65   Reporting # Records, Active Days, Active Days Day, Active Days Night, Active Days WD
Random Forest          0.67   NOI Call, NOI Out WE Call, IPC WD Night Call Kurtosis, IPC Night Call Kurtosis, IT Recharges WD Day Min
Gradient Boosting      0.66   Churn Rate Std, CD WE Call Median, IPC WD Night Call Kurtosis, IPC Day Call Skew, % Initiated Interactions Day Call

Notes: In our main analysis, for multi-phone households we use only the phone number belonging to the household head (or to a random household member where no household head is specified), leaving 535 household-level observations. Here we instead use machine learning methods to predict individual-level ultra-poverty, with a dataset of 634 individual phone numbers matched to the ground-truth wealth measures of the associated households. The individual-level models are slightly less accurate than the household-level models presented in the main paper; we focus on the household-level models because the household was the unit of targeting in the TUP program. See Appendix B for abbreviations in feature names.

Table S8: Predicting other measures of poverty from CDR

Panel A: Predicting below poverty line (binary)
Model                  AUC     Top Five Features
Logistic (No Penalty)  0.53    Reporting # Records, Active Days, Active Days Day, Active Days Night, Active Days WD
Logistic (L1 Penalty)  0.53    Reporting # Records, Active Days, Active Days Day, Active Days Night, Active Days WD
Random Forest          0.56    NOI Out Night Call, BOC Night Call Kurtosis, CD Day Call Skew, Nois Night Call, IT Night Call Kurtosis
Gradient Boosting      0.55    IT Night Call Kurtosis, IT Text Max, Radius Gyration WE Night, Entropy Antennas, NOI Out WD Call

Panel B: Predicting consumption (continuous)
Model                  R2      Top Five Features
Linear Regression      -0.21   % Pareto Recharges WE Night, % Pareto Recharges WE, % Pareto Recharges Night, Entropy Contacts WD Day Text, PPI WE Night Text
LASSO Regression       -0.00   Reporting # Records, PPI Text, PPI Day Text, PPI Night Call, PPI Night Text
Random Forest          -0.02   Churn Rate Mean, IPC WE Night Call Kurtosis, IT Recharges WE Day Skew, IPC WE Night Call Skew, CD WE Call Median
Gradient Boosting      -0.03   CD WD Night Call Skew, IPC WD Day Text Skew, IT WD Night Call Min, IT WD Night Call Max, IT WE Night Call Max

Panel C: Predicting asset index (continuous)
Model                  R2      Top Five Features
Linear Regression      -0.06   IPC Text Min, IPC WD Text Min, IPC WD Day Text Min, BOC WD Text Min, % Initiated Conversations WD
LASSO Regression       0.00    Active Days WE Day, Active Days WD, Active Days WE, Active Days, Active Days WD Day
Random Forest          0.00    IT Night Call Skew, IPC Text Min, IT WE Day Call Median, IT WE Call Median, Entropy Contacts WE Night Call
Gradient Boosting      -0.02   IT Text Median, Entropy Antennas WE, Entropy Antennas WD Night, Entropy Contacts WE Night Call, IT Recharges Night Min

Panel D: Predicting CWR group (continuous)
Model                  R2      Top Five Features
Linear Regression      0.01    PPI Night Text, IT Recharges Day Skew, IPC WE Call Min, Active Days WE Night, IT Recharges WD Day Skew
LASSO Regression       0.05    PPI Night Text, Active Days WE Day, Active Days WE Night, IT Recharges WD Day Skew, IT Recharges Day Skew
Random Forest          0.04    # Contacts WE Day Call, Entropy Contacts WD Night Call, IPC Night Call Kurtosis, # Contacts WE Call, IT Call Kurtosis
Gradient Boosting      0.03    IT Call Kurtosis, IT Recharges Day Skew, # Contacts WE Day Call, IT Recharges Day Kurtosis, IPC WD Night Call Kurtosis
Notes: Machine learning results for predicting: (A) below-poverty-line status, using consumption data and based on Afghanistan's national poverty line; (B) total consumption (log scale); (C) asset index; and (D) community wealth ranking. Performance is evaluated on the sample of 535 matched households. The binary outcome (A) is evaluated using the mean AUC score over 10-fold cross-validation; continuous outcomes (B-D) are evaluated using the mean R2 score over 10-fold cross-validation. See Appendix B for details of features.

Table S9: Variable costs of different targeting methods

                      Cost per        Total cost      Fraction of program costs
Targeting Method      HH screened     of targeting    spent on targeting
CBT                   $2.20           $276,586        2.18%
PMT                   $4.00           $502,884        3.97%
Consumption           >$4.00          >$502,884       >3.97%
Phone                 $0.00           $0              0.00%

Notes: Costs for the TUP program, based on cost estimates from the literature. The TUP program screened an estimated 125,721 households; benefits valued at $1,688 were provided to each of the 7,500 beneficiary households, for a total benefits distribution of approximately $12.7 million. The total value of benefits is used to express targeting costs as a percentage of total program costs. For the Phone option, we assume no contact with beneficiaries is required; if contact were required, for instance to collect informed consent, variable costs would increase accordingly.

Table S10: Costs for CBT and PMT targeting methods obtained from the literature

Source                          Location        Cost per household
Panel A: CBT
Alatas et al. (2012)            Indonesia       $1.20
Karlan & Thuysbaert (2019)      Honduras        $1.67
Karlan & Thuysbaert (2019)      Peru            $1.90
Schnitzer & Stoeffler (2021)    Burkina Faso    $5.60
Schnitzer & Stoeffler (2021)    Niger           $5.40
Schnitzer & Stoeffler (2021)    Senegal         $3.20
Median                                          $2.20
Panel B: PMT
Alatas et al. (2012)            Indonesia       $2.70
Karlan & Thuysbaert (2019)      Honduras        $2.62
Karlan & Thuysbaert (2019)      Peru            $3.05
Schnitzer & Stoeffler (2021)    Burkina Faso    $5.69
Schnitzer & Stoeffler (2021)    Chad            $9.50
Schnitzer & Stoeffler (2021)    Mali            $4.00
Schnitzer & Stoeffler (2021)    Niger           $6.80
Median                                          $4.00

Notes: Costs per household screened for two targeting methods, obtained from three papers in the targeting literature. Costs in Alatas et al. (2012) are provided per village; we use the average of 54 households per village to obtain per-household targeting costs. The cost for the CBT in Karlan & Thuysbaert (2019) is provided as part of the cost of a hybrid CBT-plus-verification approach; although an individual cost for the CBT alone is provided, it may exclude some of the costs shared between the two exercises and may therefore underestimate the cost of a CBT alone. We use the median of the distribution of targeting costs in our cost analysis.