Estimating Poverty Rates in Target Populations: An Assessment of the Simple Poverty Scorecard and Alternative Approaches

The performance of the Simple Poverty Scorecard is compared against the performance of established regression-based estimators. All estimates are benchmarked against observed poverty status based on household expenditure (or income) data from household socioeconomic surveys that span nearly a decade and are representative of subnational populations. When the models all adopt the same "one-size-fits-all" training approach, there is no meaningful difference in performance and the Simple Poverty Scorecard is as good as any of the regression-based estimators. The findings change, however, when the regression-based estimators are "trained" on "training sets" that more closely resemble potential subpopulation test sets. In this case, regression-based models outperform the nationally calculated Simple Poverty Scorecard in terms of bias and variance. These findings highlight the fundamental trade-off between simplicity of use and accuracy.


I. Introduction
The World Bank Group (WBG)'s twin goals, eliminating extreme poverty and boosting shared prosperity, have intensified interest in measuring poverty rates in the specific populations targeted by development programs worldwide. Private-sector firms and financial institutions (especially micro-finance institutions, NGOs, and agriculture and agribusiness enterprises) are also increasingly selling products and services to the poor and buying from them, and they seek to estimate poverty levels in specific market segments to inform their business strategies and operations. The most rigorous poverty estimation methodologies are, however, not necessarily practical for development practitioners or private sector firms. The best sources of poverty data are large-scale, nationally representative household surveys run by governments, which collect highly detailed socioeconomic household information, cost millions of dollars, and take years to design and implement (Benin and Randriamomonjy, 2008). However useful these survey instruments and data may be for many applications, they do not enable direct estimation of poverty rates in targeted idiosyncratic populations. Even if the relevant national data happen to be available, the alternative of using small area estimation or survey-to-survey imputation methods is very data intensive and requires a level of technical sophistication that makes it impractical for practitioners (Elbers et al., 2003; Christiaensen et al., 2012; Tarozzi and Deaton, 2009). The most popular solution for project-specific poverty estimation is the Simple Poverty Scorecard (SPS) described in Schreiner (2014a), which is implemented with a 10-question survey and (in its basic form) a statistical look-up table. 1 The SPS is based on a logistic regression, but it departs from established econometric approaches.
Despite the SPS's widespread use, there is little published academic literature assessing its performance when applied to subnational populations across countries and time periods, as it is actually utilized by researchers. 2 We seek to fill this gap in the literature by evaluating the SPS, performing several thousand statistical experiments across diverse populations and strata in nine separate countries, using a collection of surveys that are representative at the subnational level and span nearly a decade. These experiments assess the performance of the nationally calculated SPS against the performance of established regression-based estimators. 3 We benchmark all estimates against observed data on poverty status derived from government-run, subnationally representative socioeconomic household surveys. We find that established regression-based models such as ordinary least squares, logistic regression, and lasso regression (trained on "training set" data and tested on "test set" data) outperform the SPS in terms of bias and variance. In many of these experiments, our regression-based estimators perform better than the SPS because they are informed by additional, target-population-specific information, reflecting a basic but important point: the SPS does not take advantage of any information beyond its ten survey questions, even though researchers almost always have additional data at their disposal. Using all relevant available information is especially important when targeting groups that differ significantly from the national populations upon which SPSs are based, because the modeled relationship between household characteristics and poverty status in a given national population may not hold true for subpopulations in the country.

1 See Appendix 1 for a summary of the methodology.
2 As Schreiner (2015) states, "Like all predictive models, the scorecard here is constructed from a single sample and so misses the mark to some unknown extent when applied to a validation sample. Furthermore, it is biased when applied (in practice) to a different population….. (because the relationships between indicators and poverty change over time)."
3 It is important to clarify that this report evaluates the performance of the SPS in estimating poverty for a target population, not its performance in predicting the poverty status of an individual for targeting based on poverty status or for program eligibility. In fact, the SPS can be and has been used for targeting based on individual poverty status, but the evaluation of its targeting performance is outside the scope of this report.
We advise researchers attempting to estimate poverty rates in particular samples to adopt an approach based on the established regression-based techniques explored in our experiments. Our recommended approach is more accurate than the SPS, applicable to any population of interest, and able to exploit information beyond the ten SPS survey questions. Because we utilize national household surveys that are statistically representative for specific subnational strata, our approach can be used to derive reliable poverty rate estimates for these subnational groups.
This paper is organized as follows: Section II describes the data. Section III explains the methods and models under examination, including the SPS, and Section IV assesses their relative performance. Section V summarizes lessons learned and describes paths for future work.

II. Data
Our analysis draws from 14 national socioeconomic household surveys across 9 countries (see Table 1), which were implemented by the statistical agencies of the various governments. The surveys cover subnationally representative samples ranging in size from 3,579 households (in Sierra Leone, 2003) to 293,715 households (in Indonesia, 2010). Each survey contains a core questionnaire which consists of a household roster listing the sex, age, marital status, educational attainment, household income and/or expenditure information, and labor force experience of all household members. We use the expenditure or income data along with national poverty lines to determine the poverty status of each household. We then use the poverty status to calculate our benchmark poverty rates, referred to below as "observed" poverty rates, for the various populations of interest. Hundreds of questions are asked in each survey (e.g., 612 variables in total across the 21 data sets associated with the Bangladesh 2010 Household Income and Expenditure Survey), and the time and cost of implementation are a major reason that so many researchers have adopted the SPS approach. The SPS and the other models we examine below have been derived from these data sets, and are thus based upon the statistical relationships between observed poverty status and household characteristics. We also use these large-scale household surveys to test the models, assessing which models come closest to the observed poverty rates.

III. Methods and Notation
For each data set in our sample, we estimate the national poverty rate and stratum-specific poverty rates through various approaches. Here we present the notation and a general overview of the approaches used in our analysis. There is a set of households $N = \{1, \ldots, n\}$. Let $y_i$ denote household $i$'s observed poverty status (i.e., in poverty or not, 1 or 0) as determined by its expenditure (or income) in reference to the national poverty line. While there are 10 questions in each poverty scorecard, some of these questions have categorical responses with more than two categories, so some care is taken to preprocess the explanatory data for analysis. Let $k_q$ denote the number of possible categorical responses for each survey question $q = 1, 2, \ldots, 10$; for example, if question $q = 1$ had possible responses A, B, and C, then there are three response categories to that question and $k_1 = 3$. If a respondent were to be asked all 10 questions, it follows that the total number of binary indicators needed to account for all survey responses is $p = \sum_{q=1}^{10} k_q$. Hence, in our reformatted data, $x_{ij}$ is a binary indicator (1 or 0) of whether household $i$ provided response $j$, for each $j = 1, \ldots, p$.

A. Household-level Poverty Probabilities: Several Approaches

Simple Poverty Scorecard (SPS):
The SPS is a poverty estimation methodology developed by Mark Schreiner of Microfinance Risk Management, L.L.C. 4 Each SPS is designed for a particular country and a particular year. Design begins with a nationally representative household survey, which is taken as the basis for poverty estimation in that country at the time of the survey. For a given survey, half the data are used as a "training set" to develop (or train) the model that the SPS will use to estimate poverty rates, and the other half (the "test set") are used to validate the accuracy of the constructed model.
To train the model, the SPS developer repeatedly analyzes the training set, attempting to identify 10 questions from the national household survey that can reliably predict household-level poverty status. This is an iterative model-selection process that relies on both statistical methods (logistic regression) and professional judgment, in an effort to identify variables with high predictive power that can be easily collected and verified by surveyors in the field. Details on the development and calibration of the SPS for particular data sets in our sample can be found in Schreiner (2010, 2011a, b, 2012a, b, c, 2013a, b, 2014b). 5 SPSs have been estimated for at least 63 countries, and in many countries they have been periodically updated when new household surveys have become available. Once an SPS has been developed, actually calculating poverty rates for a given data set is straightforward. All that is needed is basic arithmetic, pencil and paper, and a look-up table that converts each household's survey result (the "poverty score") into an estimated probability that the household is below a specific poverty line. After converting the results into probabilities for all respondents, the average probability is adjusted by an additive bias-adjustment factor (typically a fraction of a percentage point), and the result is the poverty-rate estimate. Technology applications to do this are also widely available and used by the private sector. One limitation of the SPS is that its look-up table (which is derived from logistic regression results) is calibrated to deliver only certain discrete estimated probabilities for each household. In the case of the Bangladesh 2010 SPS, the look-up table converts any survey result into one of 18 discrete poverty probabilities. This SPS can estimate a household-level probability of 40.9%, or 50.4%, but nothing in between (see Figure 1).
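The scoring arithmetic described above can be sketched in a few lines. This is a minimal illustration, not Schreiner's actual scorecard: the questions, point values, score bands, and probabilities below are all hypothetical, and only two questions are shown instead of ten.

```python
from bisect import bisect_right

# Hypothetical scorecard: points per (question, response). Real scorecards
# have 10 questions; two are shown to keep the sketch short.
POINTS = {
    ("roof", "thatch"): 0,
    ("roof", "tin"): 6,
    ("owns_tv", "no"): 0,
    ("owns_tv", "yes"): 9,
}

# Hypothetical look-up table: lower bound of each score band and the
# poverty probability assigned to every household falling in that band.
BAND_STARTS = [0, 5, 10, 15]
BAND_PROBS = [0.80, 0.55, 0.30, 0.10]


def poverty_score(responses):
    """Sum the points attached to a household's responses."""
    return sum(POINTS[(question, answer)] for question, answer in responses.items())


def lookup_probability(score):
    """Map a score to the discrete probability of the band it falls in."""
    return BAND_PROBS[bisect_right(BAND_STARTS, score) - 1]


def sps_poverty_rate(households, bias_adjustment=0.0):
    """Average the looked-up probabilities, then add the (signed) bias adjustment."""
    probs = [lookup_probability(poverty_score(h)) for h in households]
    return sum(probs) / len(probs) + bias_adjustment


households = [
    {"roof": "thatch", "owns_tv": "no"},  # score 0, lowest band
    {"roof": "tin", "owns_tv": "no"},     # score 6, second band
    {"roof": "tin", "owns_tv": "yes"},    # score 15, highest band
]
rate = sps_poverty_rate(households)
```

Because every score in a band maps to the same probability, two households with scores of 5 and 9 are indistinguishable in this sketch, which is the coarsening that the look-up table introduces.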
Figure 1. Comparing the SPS and Fitted Values with a Simple Logit Model (Bangladesh, 2010).
This plot compares the household-level poverty probabilities from a logistic regression model (the horizontal axis) against those obtained from the SPS (the vertical axis) using data from the Bangladesh 2010 HIES. As a result of data coarsening in the SPS lookup table, the number of unique probabilities generated by SPS is considerably lower than logit. While there is a strong positive association between the results from both approaches, the figure demonstrates there is information sacrificed in the calculation of SPS probabilities.
SPS authors claim good properties for their estimator when it is applied to data that mirror the national household samples that formed the original SPS training sets. Indeed, the SPS's claims to reliability are, almost without exception, 6 limited to the SPS's application to the corresponding national population data. The problem with this claim is that the need for an SPS to estimate poverty rates in these populations is not at all obvious, given that the best and easiest approach would be to derive the numbers directly from the original national data sets (or to read the summary statistics published by government agencies). In practice, the SPS is typically used to estimate poverty rates in specific subnational groups. For example, microfinance institutions and other private sector actors frequently use the SPS for this purpose. Similarly, when the SPS is used by development practitioners to assess poverty rates in project areas, or to compare poverty rates before versus after projects or in treatment versus control samples, the target population is typically a specific subnational group of project participants. The SPS is sometimes used to estimate poverty rates in groups that are significantly poorer than the national average (e.g., female-headed agricultural households in a particular region of a country). The SPS authors are cognizant of the difference between the ideal conditions for SPS use and the way it is actually used by practitioners, and SPS documentation always includes a caveat to this effect, like the one quoted in footnote 4.

Ordinary Least Squares (OLS):
Using observed survey responses for each of the ten poverty scorecard questions, we estimate the following linear probability model:

$$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n. \tag{OLS}$$

Here $y_i$ is household $i$'s observed poverty status, $x_{ij}$ is the binary indicator for survey response $j$, and $\beta_j$ is the corresponding regression coefficient. In this context, each observation in our training sample is given equal weight in the model's estimation. As is well known, the regression coefficients are estimated so as to minimize the Sum of Squared Errors (SSE), i.e.,

$$\mathrm{SSE} = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2.$$

In the event the empirical distribution of the training sample is reflective of the researcher's population of interest, the OLS estimator may provide unbiased and consistent estimates of group-level poverty rates despite the fact that household-level poverty estimates may fall outside of the unit interval.
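As a rough illustration of the linear probability model, the sketch below recodes a single categorical question into binary indicators and fits the coefficients by least squares. The data are invented, only one question is used instead of ten, and NumPy is an assumed dependency rather than the software used in the analysis.

```python
import numpy as np

# Toy training data: one question with responses A/B/C (hypothetical values).
responses = np.array(["A", "A", "B", "B", "B", "C", "C", "C"])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=float)  # observed poverty status

# Recode the categorical response into one binary indicator per category.
categories = np.array(["A", "B", "C"])
X = (responses[:, None] == categories[None, :]).astype(float)

# Least-squares coefficients minimize the sum of squared errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta            # household-level fitted probabilities
group_rate = fitted.mean()   # estimated group-level poverty rate
sse = float(np.sum((y - fitted) ** 2))
```

With a single fully dummy-coded question, each coefficient is simply the share of poor households in that response category, and the average fitted value reproduces the sample poverty rate of the training data.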

Weighted Least Squares (WLS):
In contrast to OLS, WLS does not treat all observations in the training set as equally influential in the estimation procedure. This feature may be desirable the more a national survey is stratified, or if the empirical distribution of survey respondents does not closely approximate the proportion of households in the researcher's population of interest, or if the residuals of the classic least squares model are heteroskedastic (Greene, 2003). In this report, we weight the influence of each observation according to its estimated inverse-probability weight $w_i$, which is obtained from the national surveys. As in Equation (OLS), the WLS regression model assumes a household's probability of being impoverished is a linear function of observed covariates (i.e., survey responses). However, the WLS estimator minimizes the weighted sum of squared errors (WSSE), such that

$$\mathrm{WSSE} = \sum_{i=1}^{n} w_i \varepsilon_i^2 = \sum_{i=1}^{n} w_i \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2,$$

where $y_i$ is observed poverty status and the $x_{ij}$ are the binary response indicators.
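One common way to compute the WLS estimator is to rescale each row by the square root of its weight and then run ordinary least squares. The sketch below follows that route; the weights and data are hypothetical, and NumPy is an assumed dependency.

```python
import numpy as np


def wls_fit(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i . beta)^2 by rescaling rows with sqrt(w_i)."""
    sqrt_w = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return beta


# Two response categories, five households, hypothetical survey weights.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 0], dtype=float)
w = np.array([3, 1, 1, 1, 2], dtype=float)

beta_wls = wls_fit(X, y, w)

# Weighted average of the fitted values, using the same weights.
weighted_rate = float(np.sum(w * (X @ beta_wls)) / np.sum(w))
```

In this toy case each coefficient is the weighted share of poor households in its category, so heavily weighted households pull the estimate toward their own poverty status.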

Logistic Regression (Logit):
In addition to OLS and WLS, we implement a simple logistic regression model to compare against poverty rates generated by the poverty scorecard. We model poverty with the well-known functional form

$$\Pr(y_i = 1 \mid x_i) = \frac{\exp\big(\sum_{j=1}^{p} \beta_j x_{ij}\big)}{1 + \exp\big(\sum_{j=1}^{p} \beta_j x_{ij}\big)}.$$

Lasso Penalization:
The lasso is a form of penalized regression, similar to ridge regression, whereby regression coefficients are shrunk towards zero by "shrinkage factors" (Tibshirani, 1994; Hastie et al., 2009; James et al., 2013). The lasso is commonly used for feature selection in high-dimensional learning problems to decrease the variance of a particular classifier. For our ordinary least squares estimators, we apply the lasso at the training-set level such that we solve the following problem:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s, \tag{OLS Lasso}$$

where $s$ is a coefficient shrinkage factor, and $\beta_j$ is a linear estimate of the marginal influence of a survey response on poverty. Philosophically, this procedure is similar to an ordinary least squares regression in which the best model is the one that minimizes the in-sample sum of squared residuals, except that regression coefficients are penalized according to a prior rule (i.e., the shrinkage factor) on the minimum coefficient size a variable must have to be included in the final classification model. The conventional penalty used in the lasso is the $\ell_1$ penalty, which is defined by $\lVert \beta \rVert_1 = \sum_{j=1}^{p} |\beta_j|$. In Equation (OLS Lasso), the coefficient shrinkage factor is therefore defined as $s = \lambda \cdot \lVert \beta \rVert_1$, where $\lambda \in [0, 1]$ is a tuning factor. The optimal level of $\lambda$ is chosen through 10-fold cross-validation.
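To make the shrinkage behavior concrete, here is a minimal coordinate-descent sketch of the lasso. It solves the penalized (Lagrangian) form, which corresponds to the constrained form for some value of the bound on the coefficients; the penalty level `alpha`, the helper names, and the toy data are assumptions for illustration, and cross-validation is omitted.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it would cross zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize (1 / (2n)) * ||y - X b||^2 + alpha * ||b||_1 one coordinate at a time."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n  # assumes no all-zero columns
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with column j's current contribution added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha) / col_scale[j]
    return beta


# Toy data: one question with three response categories (hypothetical values).
X = np.array(
    [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
     [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]],
    dtype=float,
)
y = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=float)

beta_unpenalized = lasso_coordinate_descent(X, y, alpha=0.0)
beta_penalized = lasso_coordinate_descent(X, y, alpha=0.2)
```

With no penalty the routine recovers the least-squares fit, while the penalized fit sets the two weaker coefficients exactly to zero, which is why the lasso can act as a variable selector among the scorecard indicators.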

B. Training and Test Sets
For each national survey analyzed in this report, we divide the full survey sample into a training set and a test set (a random 50 percent of the data in each set), just as the SPS's developers do. The regression-based models (OLS, logit, etc.) described in Section A are trained (i.e., coefficients are estimated) exclusively on the training data set. To evaluate the performance of these models, we project the estimated model parameters (and their SPS equivalent, the poverty scores) onto the test set. By "project" we mean that we use the coefficients estimated on the training data sets and the values in the test sets to derive the predicted poverty rate in the test data sets. It is important to keep in mind that all the regression-based models we employ at the national and at the stratum level use, on purpose, the same 10 variables used by the SPS model. 7 All results presented in Section IV, including the stratum-specific analyses, use only the test set data.
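The split-then-project workflow can be sketched as follows. The "model" here is a deliberately simple stand-in (the share of poor households per response category, fitted on training rows only), and the function names, seed, and toy data are hypothetical.

```python
import random


def split_half(n_households, seed=0):
    """Randomly assign half of the household indices to each set."""
    indices = list(range(n_households))
    random.Random(seed).shuffle(indices)
    half = n_households // 2
    return indices[:half], indices[half:]


def train_category_rates(responses, poor, train_idx):
    """Stand-in model: share of poor households per response category."""
    counts, totals = {}, {}
    for i in train_idx:
        counts[responses[i]] = counts.get(responses[i], 0) + 1
        totals[responses[i]] = totals.get(responses[i], 0) + poor[i]
    overall = sum(poor[i] for i in train_idx) / len(train_idx)
    rates = {category: totals[category] / counts[category] for category in counts}
    return rates, overall


def project(rates, overall, responses, test_idx):
    """Apply trained rates to test households; unseen categories get the overall rate."""
    fitted = [rates.get(responses[i], overall) for i in test_idx]
    return sum(fitted) / len(fitted)


# Toy survey: 40 households, one question with responses A or B.
responses = ["A", "B"] * 20
poor = [1, 0, 1, 0, 0, 0, 1, 1] * 5

train_idx, test_idx = split_half(len(responses))
rates, overall = train_category_rates(responses, poor, train_idx)
estimated = project(rates, overall, responses, test_idx)
observed = sum(poor[i] for i in test_idx) / len(test_idx)
discrepancy = estimated - observed  # estimated minus observed test-set rate
```

The key point is that the test households never influence the fitted rates, so the discrepancy measures out-of-sample performance rather than in-sample fit.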

C. Group-Level Poverty Rates
The objective of the analysis is to estimate poverty rates at various levels. Here we distinguish the poverty rate of the survey sample from the poverty rate of the national population, and we distinguish both of these quantities from the poverty rate of a particular stratum of interest (i.e., a particular subset of the national population). The utility in estimating any of these quantities will depend precisely on how closely they map to the researcher's target population of interest. Population-level rates use the inverse-probability survey weights $w_i$ (Horvitz and Thompson, 1952). For stratum-specific poverty rates, $S$ indicates the set of households in a particular stratum of interest, where $S \subset N$, and $|S|$ denotes the number of households in the stratum. Readers should note that we do not directly estimate $w_i$ in this technical report, so the quality of our results that rely on these weights will depend on the quality of the weights derived by prior analysis and made available in these data sets.

[Table: columns are Poverty Rate of Interest, Observed Rate (in Data), and Estimated Rate (Fitted).]
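The three rates distinguished above can be sketched as simple averages. The exact estimators used in the analysis are not recoverable from the text, so the weighted forms below (weighted means normalized by the sum of the weights) are an assumption, as are the function names and toy values. The same functions can be applied to observed statuses or to fitted probabilities.

```python
def sample_rate(values):
    """Unweighted mean over the survey sample."""
    return sum(values) / len(values)


def weighted_rate(values, weights):
    """Weighted mean, with weights standing in for inverse-probability weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def stratum_rate(values, weights, strata, target):
    """Weighted mean restricted to households in the target stratum."""
    rows = [(v, w) for v, w, s in zip(values, weights, strata) if s == target]
    return weighted_rate([v for v, _ in rows], [w for _, w in rows])


# Hypothetical households: poverty status, survey weight, and stratum label.
y = [1, 0, 0, 1, 0]
w = [10.0, 30.0, 20.0, 20.0, 20.0]
strata = ["rural", "urban", "urban", "rural", "rural"]

raw = sample_rate(y)                            # survey-sample rate
national = weighted_rate(y, w)                  # population-weighted rate
rural = stratum_rate(y, w, strata, "rural")     # stratum-specific rate
```

In this toy example the three rates are 0.4, 0.3, and 0.6, which shows how much the answer can depend on which population the researcher actually has in mind.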

D. Measures of Uncertainty via Bootstrap
For each of the surveys, we obtain both the observed rates of poverty based on the raw national surveys ($y_i$) and the estimated poverty rates obtained through the approaches described in Section III ($\hat{y}_i$). For all confidence intervals presented in this analysis, bootstrapped standard errors of the mean (using 5000 bootstrapped samples) are calculated for the estimated poverty rates of interest (Efron and Tibshirani, 1998). Bootstrapped confidence intervals are similarly generated for the observed levels of poverty in the data, given that the observed rate of poverty is itself based on a sample and is therefore an estimate of the unobserved population rate.

7 It is quite plausible that not all of these 10 variables will be the best predictors of poverty in each different stratum. In fact, it is quite likely that the best predictors of poverty vary from stratum to stratum, which implies that the composition of the set of the 10 best poverty predictors is likely to vary across strata. For the purpose of keeping the analysis simple and fair to the SPS, we have decided to stick with the same 10 variables used by the SPS to predict poverty. One exception is the lasso, which is analogous to selecting prediction variables among the 10 variables used in the model (see James et al. 2013).
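A bootstrapped standard error of a poverty rate can be sketched as below. This is an unweighted resampling of households with a normal-approximation interval, which is one simple reading of the procedure rather than a confirmed description of it; the data, seed, and replicate count are hypothetical (fewer replicates than the 5000 used in the analysis, to keep the example fast).

```python
import random
import statistics


def bootstrap_se(values, n_boot=5000, seed=0):
    """Standard deviation of the mean across resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = [sum(rng.choices(values, k=n)) / n for _ in range(n_boot)]
    return statistics.stdev(boot_means)


# Hypothetical test set: 30 poor households out of 100.
poor = [1] * 30 + [0] * 70
rate = sum(poor) / len(poor)
se = bootstrap_se(poor, n_boot=2000)

# Normal-approximation 95% interval built from the bootstrapped standard error.
ci95 = (rate - 1.96 * se, rate + 1.96 * se)
```

The same function applies whether `values` holds observed poverty statuses or household-level fitted probabilities, which is what allows intervals for the observed and the estimated rates to be compared.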

IV. Results
In this section we assess the accuracy of the SPS by running multiple trials across different "test" sets, including national samples and subpopulations, to see how well the SPS poverty estimates can recover the observed poverty rates. We then subject the regression-based models to the same trials and compare SPS and regression results. We consider whether the SPS or other approaches are preferable for estimating poverty rates.

A. Estimating National Poverty Rates
We begin by exploring how effectively the SPS and other estimators can recover the observed poverty rate in a random sample drawn from national household survey data. This is a logical place to start because it is what the SPS is expressly designed to do. We run 14 tests in as many country-year data sets, with and without the observation-specific weights accompanying the national household surveys. 8 We show results for trials with the weights because they produce statistically representative national samples, but we also show results for trials run without weights (the "raw sample") because, in practice, the SPS is often applied to idiosyncratic samples that do not generally have observation weights. Figure 2 illustrates the findings from this exercise. The upper panel shows poverty estimates and bootstrapped 95% confidence intervals (Indonesia on the left, Peru on the right) applied to raw (unweighted) national "test sets" or validation sets (i.e., random samples from the national survey data that were not used to fit the regression-based models). The regression-based methods clearly dominate the SPS in the upper panel; of these, weighted least squares (WLS), which utilizes observation weights in the training set but not the test set, does the worst but still dominates the SPS. The lower panel shows how these results change when observation weights are applied to the test sets (creating nationally representative test sets). Here the SPS performs better than it did in the upper panel, but it is still dominated by the regression-based approaches (most clearly by WLS, which performs the best). Additionally, a general feature of these results is that applying the SPS's bias correction factor actually increases discrepancies, moving estimates away from the observed poverty rate.

Figure 2. Regression-Based Methods Improve Upon Simple Poverty Scorecard for National Poverty Rate Estimates.
The upper panel shows poverty estimates and bootstrapped 90% and 95% confidence intervals (Indonesia on the left, Peru on the right) applied to raw (unweighted) national "test sets" or validation sets (i.e., random samples from the national survey data that were not used to fit the models). Regression-based methods clearly dominate the SPS in the upper panel; of these, weighted least squares (WLS), which utilizes observation weights in the training set but not in the test set, does the worst but still dominates the SPS. The lower panel shows how these results change when observation weights are applied to the test sets (creating nationally representative test sets). Here the SPS performs better than in the upper panel, but it is still dominated by the regression-based approaches (most clearly by WLS, which performs the best). Additionally, a general feature of these results is that applying the SPS's bias correction factor increases the discrepancy, moving estimates away from the observed poverty rate.

B. Stratum-Specific Poverty Rates
Results in the previous section show the SPS to be fairly reliable in percentage-point terms (though not generally the most accurate estimator) for estimating poverty rates in samples mirroring the national household samples that formed the original SPS training sets. However, it is essential to test SPS and the other methods under realistic conditions, as they would actually be used by project leaders and researchers. To this end, we assess the performance of all estimators applied to subnational groups.
We first compare results for the SPS versus the logit-based estimator across geographic strata. In this exercise, the SPS and the logit models are both trained (or estimated) on national-level data. Figure 3 indicates how nationally representative poverty estimators, such as the SPS, perform relative to logistic regression models trained on the national data. Even though the Indonesia 2010 SUSENAS survey is only representative at the district level, we split the data into 934 strata such that each of the 498 districts is partitioned into urban and rural areas, and we measure the discrepancies between estimated poverty rates and actual poverty rates for each of these strata. Each of the small circles corresponds to the discrepancy for a given stratum (raw discrepancies in the upper panel and absolute discrepancies in the lower panel), with SPS results in blue and national-level logit results in gold. Locally weighted scatterplot smoothing (Cleveland and Devlin, 1988) defines best-fit curves drawn through the points. The green vertical line shows the national poverty rate. We implement bootstrapped Kolmogorov-Smirnov tests for the equality of the poverty scorecard estimates against the stratum-specific regressions (Praestgaard, 1995), and find that the results from the two estimators are statistically distinguishable, but both estimators perform about equally poorly: they overestimate poverty rates in the richer regions and underestimate poverty in the poorer regions. In the poorest strata, average SPS discrepancies are as high as 15-25 percentage points.
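The comparison of discrepancy distributions can be illustrated with a plain two-sample Kolmogorov-Smirnov statistic. The sketch below computes only the statistic (the largest gap between the two empirical distribution functions); the bootstrapped p-values used in the analysis are not reproduced, and the stratum rates are invented values in whole percentage points.

```python
def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical distribution functions."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        return sum(value <= x for value in sample) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Hypothetical stratum poverty rates, in whole percentage points.
observed_pct = [5, 10, 20, 40, 60]
sps_pct = [12, 15, 21, 33, 45]
logit_pct = [6, 9, 21, 41, 58]

# Raw discrepancy per stratum: estimated rate minus observed rate.
disc_sps = [est - obs for est, obs in zip(sps_pct, observed_pct)]
disc_logit = [est - obs for est, obs in zip(logit_pct, observed_pct)]

ks = ks_statistic(disc_sps, disc_logit)
```

In this toy example the first estimator overestimates in the low-poverty strata and underestimates in the high-poverty strata, so its discrepancies are far more spread out than the second estimator's and the statistic is large.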

Figure 3. "One Size Fits All" National Models Perform Poorly When Applied to Subnational Strata, SPS versus Logit (Indonesia 2010).
We split the Indonesia 2010 data into 934 strata such that each of the 498 districts is partitioned into urban and rural components, and we measure the discrepancies between the estimated poverty rates and the observed poverty rates. Each of the small circles corresponds to the discrepancy for a given stratum (raw discrepancies in the upper panel and absolute discrepancies in the lower panel), with SPS results in blue and national-level logit results in gold. Each estimator produces results that are statistically distinguishable from the other (see the statistically significant Kolmogorov-Smirnov p-values), but both estimators perform about equally poorly, overestimating poverty rates in the richer regions and underestimating poverty in the poorer regions. In the poorest strata, average SPS discrepancies are as high as 15-25 percentage points.

The next set of figures (Figures 4-7) compares the SPS against regression-based estimators trained or estimated separately using data from each geographic stratum for which the household survey is designed to yield representative estimates. As Table 1 indicates, many household surveys are designed to be representative for different geographic strata. The Peru and Bangladesh surveys, for example, are representative at the regional level, with Peru having 24 departments (regions) and Bangladesh 7 regions. The Indonesia survey, on the other hand, following the fiscal decentralization that took place in 2001, is designed to be representative for each of its 498 districts.
Regression-based approaches perform much better when trained on data specific to each geographic stratum. The upper subplot in Figure 4 shows that for the SUSENAS 2010 data set from Indonesia (which has the largest number of representative subnational groupings of all our data sets), no matter what the quantile, the magnitudes of discrepancies for regression-based estimators are a fraction of what they are for the SPS. For each estimator, we observe the district-level estimates of the poverty rate. We compare the relative absolute error (i.e., the absolute value of the estimated value minus the observed poverty rate in the test set, benchmarked against the error rate of the SPS) at each percentile of the absolute error rate. The lower half of this figure restricts the sample to the poorer districts (i.e., the districts with a poverty rate above the median district poverty rate), and reflects how applying the SPS to the poorest subgroups (as is often done in practice) may compare to other methods. Figure 5 shows the overlap of the 95% confidence interval of poverty rate estimates using the SPS and the 95% confidence interval of the true poverty rate estimate based on household consumption at the district level. 9 There tends to be spatial correlation among districts where the SPS overestimates the poverty rate and where it underestimates the poverty rate. In about 10% of the districts, the SPS underestimates the poverty rate, and in about 26% of the districts it overestimates the poverty rate. In comparison, the strata-specific logit estimator underestimates the poverty rate in about 2% of the districts and overestimates the poverty rate in about 2% of the districts.

Figure 5. The upper panel shows the overlap of the 95% confidence interval of poverty rate estimates using the SPS and the 95% confidence interval of the true poverty rate estimate based on household consumption at the district level. There tends to be spatial correlation among districts where the SPS overestimates the poverty rate and where it underestimates the poverty rate. In about 10% of the districts, the SPS underestimates the poverty rate, and in about 26% of the districts it overestimates the poverty rate. In comparison, the strata-specific logit estimator (lower panel) underestimates the poverty rate in about 2% of the districts and overestimates the poverty rate in about 2% of the districts.

In Figure 6 we split the Peru 2010 data into 24 regional strata (departments) and measure the discrepancies between the observed poverty rates and the poverty rates estimated by the stratum-specific logit and by the national SPS. Each of the small circles corresponds to the discrepancy for a given stratum. Raw discrepancies are in the upper panel and absolute discrepancies are in the lower panel, with SPS results in blue, strata-specific logit in red, and the green vertical line marking the average national poverty rate. Stratum-specific logit shows strong and consistent performance, with low average discrepancies across the domain of observed stratum poverty rates. Notice that, as before, the SPS overestimates poverty rates in the regions that are least poor. In the strata with the lowest poverty rates, the discrepancies are highest, averaging about 15 percentage points. Kolmogorov-Smirnov p-values show statistically significant differences between SPS and strata-specific logit results. The map in Figure 7 shows that the SPS overestimates the poverty rate for 6 regions (upper panel), whereas region-specific logit produces estimates that are within the 95% confidence interval of the true estimate in all cases (lower panel).
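One plausible way to produce the over- and underestimation shares reported for the maps is to compare, district by district, the confidence interval of an estimate with the confidence interval of the observed rate. The rule below (flag a district only when the two intervals do not overlap) is an assumed reading of the comparison, and the district names and interval values are hypothetical.

```python
def classify(estimate_ci, observed_ci):
    """Label a district by how the estimate's interval sits relative to the observed one."""
    est_lo, est_hi = estimate_ci
    obs_lo, obs_hi = observed_ci
    if est_lo > obs_hi:
        return "overestimate"
    if est_hi < obs_lo:
        return "underestimate"
    return "overlap"


# Hypothetical districts: (estimate interval, observed-rate interval).
districts = {
    "district_1": ((0.30, 0.36), (0.20, 0.28)),
    "district_2": ((0.10, 0.14), (0.12, 0.18)),
    "district_3": ((0.05, 0.09), (0.15, 0.22)),
    "district_4": ((0.40, 0.47), (0.42, 0.50)),
}

labels = {name: classify(est, obs) for name, (est, obs) in districts.items()}
share_over = sum(label == "overestimate" for label in labels.values()) / len(labels)
share_under = sum(label == "underestimate" for label in labels.values()) / len(labels)
```

Under this rule, wide intervals make it harder to flag a district, so the shares depend on sample sizes within districts as well as on the accuracy of the estimator.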

Figure 6. Regional Strata Poverty Rate Estimation: Strata-specific Logit Dominates the SPS for Peru 2010.
We split the Peru 2010 data into 24 regional strata (departments), and measure the discrepancies between estimated poverty rates and observed poverty rates. Each of the small circles corresponds to the discrepancy for a given stratum (raw discrepancies in the upper panel and absolute discrepancies in the lower panel), with SPS results in blue, strata-specific logit in red, and the green vertical line marking the average national poverty rate. Stratum-specific logit shows strong and consistent performance, with low average discrepancies across the domain of observed stratum poverty rates. SPS overestimates poverty rates in the regions that are least poor. In the strata with the lowest poverty rates, the discrepancies are highest, averaging about 15 percentage points. Kolmogorov-Smirnov p-values show statistically significant differences between SPS and strata-specific logit results.

Figure 7. Mapping the Discrepancies Across Regional Strata: Strata-specific Logit Dominates the SPS for Peru 2010.
The region-specific logit produces estimates within the 95% confidence interval of the true estimate in all cases, whereas the SPS overestimates the poverty rate for 6 regions.

The next set of figures (Figures 8-12) drills deeper than the lowest geographic level for which a survey may be representative. These figures compare the performance of the national SPS and the stratum-specific logit model, with the stratum now defined by the intersection of the region or district identifier and another key socio-economic variable, such as an indicator of whether the household head is male or female or whether the household head works in agriculture. One important caveat associated with these comparisons is that the "true" poverty rate estimate against which the SPS and the stratum-specific logit models are compared is likely to have a high variance, since the survey is designed to yield reliable estimates for the region or the district as a whole but not for any specific socio-economic group within these geographic areas.
We will describe each of these analyses in order. 10 The map in Figure 8 also shows the differences between the SPS estimates and the 14 district-specific logit estimates for Sierra Leone. The Sierra Leone Integrated Household Survey (SLIHS) in 2003 is only representative for the 4 regions of Sierra Leone, whereas the 2013 survey is representative for all 14 districts of the country. In this exercise we do not take into consideration the geographic representativity of the available survey, and we use the district identifiers to create the district strata in 2003 even though the survey is representative at the region level and not at the district level. The district-specific logit produces estimates within the 95% confidence interval of the poverty estimates in all districts, 11 whereas the SPS overestimates the poverty rate for one district (the area near Freetown, the capital) and underestimates the poverty rate in three districts.
Figure 9 shows the results for Bangladesh 2010, split into 14 strata defined by the intersection of the "agricultural head of household" and the 7 "region" dummy variables. Again, the stratum-specific logit shows strong and consistent performance, with low average discrepancies across the domain of observed stratum poverty rates. Furthermore, the SPS consistently overestimates the lower poverty rates and consistently underestimates the higher poverty rates, as in Indonesia and in Peru in the earlier set of estimates. In the poorest and richest strata, average SPS discrepancies may be as high as 15-20 percentage points. Kolmogorov-Smirnov p-values indicate that differences between SPS and strata-specific logit results are statistically significant.
Figure 10 shows results for Sierra Leone 2003 data, divided into 28 strata defined by the intersection of the district/region and the "agricultural head of household" dummy variables. The upper panel reveals that SPS is reliable (discrepancies are near zero) only if the strata poverty rates are in the immediate neighborhood of the national poverty rate; once again, the SPS overestimates poverty rates in the richer districts and severely underestimates poverty in the poorer districts, with the worst performance in the poorest districts. The Kolmogorov-Smirnov p-values in the upper panel show that strata-specific logit and SPS are not distinguishable when comparing performance based on raw discrepancies. In the lower panel, however, strata-specific logit dominates SPS (though not as much as in previous strata-specific comparisons); in both weighted and unweighted comparisons, the red line (strata-specific logit) has less absolute discrepancy than SPS for the entire range of observed stratum poverty rates. SPS discrepancies here are some of the worst across all data sets, averaging as much as 30 percentage points in the poorest regions.
Figure 11 shows results for Nepal 2010, using 28 strata defined by the intersection of the "female head-of-household" indicator and the 14 administrative zones of the country. Stratum-specific logit dominates SPS, which tends to underestimate the poverty rate across the entire domain of observed stratum-specific poverty rates. When the strata are idiosyncratic and deviate from the national sample, average SPS discrepancies may be as high as 10-20 percentage points.
Finally, Figure 12 shows results for the Indonesia 2010 data, divided into 934 strata defined by the intersection of "urban/rural" and the district identifier. Stratum-specific logit shows strong and consistent performance, with low average discrepancies across the domain of observed stratum poverty rates. SPS is reliable (discrepancies are near zero) only if the strata poverty rates are in the immediate neighborhood of the national poverty rate; once again, the SPS overestimates poverty rates in the richer districts and severely underestimates poverty in the poorer districts, with the worst performance in the poorest districts. When the strata are idiosyncratic and deviate from the national sample, average SPS discrepancies may be as high as 15-25 percentage-points. Kolmogorov-Smirnov test results are statistically significant here as well.
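To make the construction of these intersection strata concrete, the following sketch groups hypothetical households by a (region, agricultural head) key and compares a single nationally fitted logit with logits fitted separately within each stratum. It is a simplification under stated assumptions: the households, the single binary covariate `owns_asset`, and the plain gradient-ascent fitting routine are all invented for illustration, and the comparison is made in-sample rather than on the held-out test sets used in the paper.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(rows, steps=2000, lr=0.5):
    """Fit P(poor) = sigmoid(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, poor in rows:
            resid = poor - sigmoid(b0 + b1 * x)
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predicted_rate(model, rows):
    """Average predicted poverty probability over the households in rows."""
    b0, b1 = model
    return sum(sigmoid(b0 + b1 * x) for x, _ in rows) / len(rows)

# Hypothetical households: (region, agricultural_head, owns_asset, poor).
households = (
    [("North", 1, 0, 1)] * 8 + [("North", 1, 0, 0)] * 2
    + [("North", 1, 1, 1)] * 5 + [("North", 1, 1, 0)] * 5
    + [("South", 0, 0, 1)] * 4 + [("South", 0, 0, 0)] * 6
    + [("South", 0, 1, 1)] * 1 + [("South", 0, 1, 0)] * 9
)

# A stratum is the intersection of the region and the socio-economic indicator.
strata = defaultdict(list)
for region, agri, asset, poor in households:
    strata[(region, agri)].append((asset, poor))

# One "one-size-fits-all" model fitted on all households.
national = fit_logit([(asset, poor) for _, _, asset, poor in households])

discrepancy = {}
for key, rows in strata.items():
    observed = sum(poor for _, poor in rows) / len(rows)
    discrepancy[key] = {
        "national": predicted_rate(national, rows) - observed,
        "stratum": predicted_rate(fit_logit(rows), rows) - observed,
    }

for key, d in discrepancy.items():
    print(key, round(d["national"], 3), round(d["stratum"], 3))
```

With these made-up counts the national model underestimates poverty in the poorer stratum and overestimates it in the richer one by about 20 percentage points each, while the stratum-specific fits reproduce the observed rates. The size of the gap is an artifact of the toy data, not an estimate from any survey.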

Figure 8. Mapping Discrepancies Across District and Agricultural/Non-Agricultural Strata: Strata-specific Logit Dominates the SPS for Sierra Leone 2011.
The district-specific logit produces estimates within the 95% confidence interval of the true estimate in all cases, whereas the SPS overestimates the poverty rate for one district and underestimates the poverty rate for three districts.

Figure 9. Agricultural/Non-Agricultural Household and Regional Strata: Strata-specific Logit Dominates SPS for Bangladesh 2010.
We split the Bangladesh 2010 data into 14 strata, and measure the discrepancies between the estimated poverty rates and the observed poverty rates. Each of the small circles corresponds to the discrepancy for a given stratum: raw discrepancies in the upper panel and absolute discrepancies in the lower panel, with SPS results in blue, strata-specific logit in red, and the green vertical line marking the average national poverty rate. Stratum-specific logit shows strong and consistent performance, with low average discrepancies across the domain of observed stratum poverty rates. SPS consistently overestimates the lower poverty rates and consistently underestimates the higher poverty rates. In the poorest and richest strata, average SPS discrepancies may be as high as 15-20 percentage points. Kolmogorov-Smirnov p-values indicate that differences between SPS and strata-specific logit results are statistically significant.

Figure 10. Agricultural/Non-Agricultural Household and District Strata: Strata-specific Logit Dominates SPS, Sierra Leone 2003.
We split the Sierra Leone 2003 data into 28 strata defined by the intersection of the district and the "agricultural head of household" dummy variables, and measure the discrepancies between the estimated poverty rates and the observed poverty rates. Each of the small circles corresponds to the discrepancy between observed and estimated poverty rates for a given stratum: raw discrepancies in the upper panel and absolute discrepancies in the lower panel, with SPS results in blue, strata-specific logit in red, and the green vertical line marking the average national poverty rate. The upper panel reveals that SPS is reliable (discrepancies are near zero) only if the strata poverty rates are in the immediate neighborhood of the national poverty rate; once again, the SPS overestimates poverty rates in the richer districts and severely underestimates poverty in the poorer regions, with the worst performance in the poorest districts. The Kolmogorov-Smirnov p-values in the upper panel show that strata-specific logit and SPS are not distinguishable when comparing performance based on raw discrepancies. In the lower panel, however, strata-specific logit dominates SPS (though not as much as in previous strata-specific comparisons). In both weighted and unweighted comparisons, the red line (strata-specific logit) has less absolute discrepancy than SPS for the entire range of observed stratum poverty rates. SPS discrepancies here are some of the worst across all data sets, averaging as much as 30 percentage points in the poorest regions.

Figure 11. Female Head-of-Household/Regional Strata Poverty Rate Estimation: Strata-specific Logit Dominates SPS for Nepal 2010.
We split the Nepal 2010 data into 28 strata defined by the intersection of the "female head-of-household" and the administrative zone dummy variables, and measure the discrepancies between the estimated poverty rates and the observed poverty rates. Each of the small circles corresponds to the discrepancy for a given stratum: raw discrepancies in the upper panel and absolute discrepancies in the lower panel, with SPS results in blue, strata-specific logit in red, and the green vertical line marking the average national poverty rate. Stratum-specific logit dominates SPS, which tends to underestimate the poverty rate across the entire domain of observed stratum-specific poverty rates. When the strata are idiosyncratic and deviate from the national sample, average SPS discrepancies may be as high as 10-20 percentage points.

Figure 12. Strata-specific Logit Dominates SPS, Indonesia 2010.
We split the Indonesia 2010 data into 934 strata defined by the intersection of "urban/rural" and "district" dummy variables, and measure the discrepancies between the estimated poverty rates and the observed poverty rates. Each of the small circles corresponds to the discrepancy for a given stratum. Raw discrepancies are in the upper panel and absolute discrepancies are in the lower panel, with SPS results in blue, strata-specific logit in red, and the green vertical line marking the average national poverty rate. Stratum-specific logit shows strong and consistent performance, with low average discrepancies across the domain of observed stratum poverty rates. SPS is reliable if and only if the strata poverty rates are in the immediate neighborhood of the national poverty rate. When the strata are idiosyncratic and deviate from the national sample, average SPS discrepancies may be as high as 15-25 percentage points.

C. Testing Estimator Resilience over Time
National household data sets are not published every year and SPSs are not developed for every national household survey, so it is not uncommon for there to be a mismatch in time between the vintage of one's sample and the vintage of the available SPS. Of the 63 countries for which SPSs are currently available, 48 countries offer only one SPS, ten countries offer SPSs for two different years, and five countries offer SPSs for three different years. What can happen when, for example, someone attempts to estimate the poverty rate in a sample collected in 2012 by applying an SPS whose most recent version in the country was calibrated to a 2010 national household survey? 12 To explore this question, we took SPS and regression-based models trained on the Peru 2010 national data and tested those models using Peru 2011 and Peru 2012 test data.
The upper panel in Figure 13 shows poverty estimates and bootstrapped 95% confidence intervals based on models derived from Peru 2010 data (for Peru 2011 data on the left and Peru 2012 data on the right), applied to raw (unweighted) national "test sets" or validation sets (i.e., random samples from the national survey data that were not used to fit the models). Regression-based methods clearly dominate the SPS. Weighted least squares (WLS), which utilizes observation weights in the training set but not the test set, does the worst among the regression-based methods but still dominates the SPS. The lower panel shows how these results change when observation weights are applied to the test sets (creating nationally representative test sets). As with the concurrent SPS case, a common feature of these results is that applying the SPS "Bias Correction Factor" appears to increase the bias, moving estimates away from the observed poverty rate.

Figure 13. This figure illustrates the discrepancies that may occur when the poverty estimator models are not contemporaneous with the targeted sample. The upper panel shows poverty estimates and bootstrapped 95% confidence intervals based on models derived from Peru 2010 data (for Peru 2011 data on the left and Peru 2012 data on the right), applied to raw (unweighted) national "test sets" or validation sets (i.e., random samples from the national survey data that were not used to fit the models). Regression-based methods clearly dominate the SPS. Weighted least squares (WLS), which utilizes observation weights in the training set but not the test set, does the worst among the regression-based methods but still dominates the SPS. The lower panel shows how these results change when observation weights are applied to the test sets (creating nationally representative test sets). A common feature of these results is that applying the SPS "Bias Correction Factor" appears to increase the bias, moving estimates away from the observed poverty rate.
We do not claim that the results in Figure 13 generalize across countries and time-periods; they are merely indicative of what can happen when there is a mismatch in time between a training set and a test set. It is worth noting that mismatches in time can favor or hinder any estimator if economic trends cancel out discrepancies.
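The difference between unweighted and weighted test-set estimates, each with a bootstrapped 95% confidence interval, can be sketched as follows. This is an illustration under simplifying assumptions rather than the procedure used for Figure 13: it resamples households together with their weights (a simple pairs bootstrap that ignores the survey's sampling design), uses a percentile interval, and runs on randomly generated predicted probabilities and weights. The function names `weighted_rate` and `bootstrap_ci` are hypothetical.

```python
import random

def weighted_rate(pred, weights):
    """Weighted average of household-level predicted poverty probabilities."""
    return sum(p * w for p, w in zip(pred, weights)) / sum(weights)

def bootstrap_ci(pred, weights, reps=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the (weighted) poverty rate."""
    rng = random.Random(seed)
    n = len(pred)
    draws = []
    for _ in range(reps):
        # Resample households with replacement, keeping each weight with its household.
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(weighted_rate([pred[i] for i in idx],
                                   [weights[i] for i in idx]))
    draws.sort()
    tail = round(reps * (1 - level) / 2)   # 25 draws in each tail for 1,000 reps at 95%
    return draws[tail], draws[reps - 1 - tail]

# Hypothetical test set: predicted probabilities from an earlier-year model
# and survey observation weights (both randomly generated for illustration).
data_rng = random.Random(1)
pred = [data_rng.random() for _ in range(200)]
weights = [data_rng.uniform(0.5, 3.0) for _ in range(200)]

unweighted_ci = bootstrap_ci(pred, [1.0] * len(pred))   # raw test set
weighted_ci = bootstrap_ci(pred, weights)               # nationally representative test set
print(unweighted_ci, weighted_ci)
```

Setting all weights to one reproduces the raw (unweighted) case, so the same routine covers both panels of a comparison like the one described above.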

V. Concluding Remarks and the Way Forward
The accurate and efficient estimation of poverty rates is a concern for development practitioners and researchers alike. In this paper we demonstrate that an increasingly popular method for estimating poverty rates, the simple poverty scorecard, performs best when applied to the estimation of national poverty rates with nationally-representative samples. However, SPS-like procedures, by their very nature and emphasis on simple operational implementation, ignore information that is commonly available to surveyors in most applied settings. Analysts generally have rich household-level covariates, such as occupation and geographic or regional information, that can provide additional information and allow researchers and practitioners to more precisely estimate poverty rates in target populations of interest. We demonstrate that both SPS-type procedures and national-level regressions "perform well" in practice (in a training and test set paradigm) when applied to targeted strata with poverty rates near the national poverty rate. But as the populations of interest become more granular (e.g., regional) or more extreme on the income distribution, SPS-type procedures perform measurably worse than prosaic statistical models tuned at the stratum level. These results are also in accordance with the growing academic literature on small area estimation and poverty mapping that advocates the estimation of region- or district-specific consumption (or income) models as long as the household survey is representative at that level (e.g., Elbers et al., 2003; Tarozzi and Deaton, 2009; Tarozzi, 2011).
The findings in this report have important implications for practitioners in the field who want an estimate of the poverty rate in their target population. To begin with, it is important to understand that there is a fundamental tradeoff between simplicity of use and accuracy. Simple tools, like the SPS in its current form, are designed in favor of simplicity by estimating poverty for any possible target population using underlying parameters derived from the full sample of households in the national household survey. For example, suppose that one's goal is to estimate poverty among female-headed household program participants, in one region of a country, in a year in which a national household survey was administered and an SPS has been estimated and made publicly available. The practitioner could then collect data for the 10 questions required by the SPS from the target female population in the specific region and apply the SPS from the same year to estimate the poverty rate for the target female population. The poverty rates for the target population would then be based on the parameters that have been estimated using national data for male- and female-headed households from all different regions in the country. This implies that if male-headed households are much more prevalent in the national household survey, as is usually the case, or if other regions have a larger population than the region of interest, then the poverty rate estimated by the SPS for the target female population may be biased, in the sense that it would approximate the true poverty rate among females in the region of interest less well.
The analysis in this report implies that the prediction of poverty could be improved if the underlying parameters of the model used to predict poverty or to assign poverty scores were estimated based on the sub-sample of female headed households in that region (extracted from the full national household survey). However, it is important to acknowledge upfront that there are various challenges especially for the practitioner in implementing the methods employed here just for the sake of improving the accuracy of poverty predictions in the target population: (i) Potential users (researchers, practitioners, or both) would require access to sub-national representative household level data, which are not easily accessible nor ready for processing for the purposes of predicting poverty 13 ; (ii) Specialized statistical background and econometric expertise would be required; (iii) Even if (i) and (ii) are feasible, the sample size of the specific population of interest in the national survey may be insufficient; (iv) There may not be sufficient business appetite to individually allocate the resources needed to attain such improvements in analytical precision.
Assuming there is sufficient appetite, practitioners in the field who rely on the SPS or SPS-like methods to estimate poverty in their population of interest have relatively simple and low-cost ways of improving the predictions of poverty in target populations, using the data already available and currently used by the SPS. One practical option is to update the SPS method and its surrounding infrastructure by considering: (i) the use of regression-based methods such as those used in this report; and (ii) the incorporation of the intermediate and more practical step of estimating regression-based models separately for the geographic strata for which the national survey is designed to be representative. Poverty estimates for target populations based on strata-specific estimates of regression-based models certainly improve upon poverty estimates based on the nationally estimated SPS. It is quite likely that region- or district-specific estimates of the SPS, depending on the country, will improve the accuracy of the poverty estimates currently based on the nationally estimated SPS. 14 Therefore, we suggest that the international development and donor community take a lead in developing, refining, packaging and making available such models in a toolkit format that would be available to current users of the SPS.
Accuracy of SPS-based poverty likelihoods
14. Determine the score for each of the households in the validation subsample.
15. Draw a bootstrap sample of n households with replacement from the validation sample. (In the Bangladesh study n = 16,384.)
16. Calculate the true poverty likelihood in the bootstrap sample, that is, the share of households below a poverty line. (This needs to be calculated separately for each poverty line considered.)
17. For each score, compare this true poverty likelihood with the estimated poverty likelihood determined in step 13 (Scorecard and poverty likelihood correspondence). Record the difference.
18. Repeat 1,000 times, recording the difference between the true and estimated likelihoods for each score.
19. For each score, report the two-sided intervals containing the central 900, 950, and 990 differences between estimated and true poverty likelihoods (to get confidence intervals) to see how accurate the measure is for different poverty scores.
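Steps 14-19 can be sketched in a few lines. The sketch below is an interpretation under stated assumptions, not the scorecard's official implementation: the "true" likelihood is taken to be the share of poor households among those with a given score in each bootstrap sample, the difference is taken as estimated minus true, and only the central 90% interval is reported. The validation data, the lookup table `est_likelihood`, and the function name are hypothetical.

```python
import random
from collections import defaultdict

def score_diff_intervals(validation, est_likelihood, reps=1000, seed=0):
    """validation: list of (score, poor) pairs; est_likelihood: score -> estimated likelihood.
    Returns, per score, the interval holding the central 90% of bootstrap differences
    (estimated likelihood minus the share of poor households with that score)."""
    rng = random.Random(seed)
    n = len(validation)
    diffs = defaultdict(list)
    for _ in range(reps):
        # Step 15: draw n households with replacement from the validation sample.
        sample = [validation[rng.randrange(n)] for _ in range(n)]
        count, poor = defaultdict(int), defaultdict(int)
        for score, is_poor in sample:
            count[score] += 1
            poor[score] += is_poor
        # Steps 16-17: per-score true likelihood and its difference from the estimate.
        for score in count:
            diffs[score].append(est_likelihood[score] - poor[score] / count[score])
    intervals = {}
    for score, d in diffs.items():
        d.sort()
        tail = round(len(d) * 0.05)   # drop 5% in each tail (50 of 1,000 differences)
        intervals[score] = (d[tail], d[len(d) - 1 - tail])
    return intervals

# Hypothetical validation sample and scorecard look-up table (illustration only).
validation = ([(0, 1)] * 50
              + [(10, 1)] * 80 + [(10, 0)] * 20
              + [(50, 1)] * 40 + [(50, 0)] * 60
              + [(90, 1)] * 5 + [(90, 0)] * 95)
est_likelihood = {0: 1.0, 10: 0.75, 50: 0.45, 90: 0.10}

intervals = score_diff_intervals(validation, est_likelihood)
for score in sorted(intervals):
    print(score, intervals[score])
```

The 95% and 99% intervals follow by changing the tail share from 0.05 to 0.025 or 0.005.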
Accuracy of SPS-based poverty rate
20. To determine the poverty rate for a particular group, average the estimated poverty likelihood (from the scorecards) of all individuals in the group.
21. Calculate the true poverty rate for the 1,000 repetitions of n = 16,384 bootstrap samples.
22. Calculate the difference between the estimated poverty rate and the true poverty rate for each of the 1,000 repetitions.
23. The average difference between the estimated and the true poverty rates is the "bias correction factor."
24. The poverty rates then need to be adjusted by this "bias correction" to get the unbiased estimates. There is a unique bias correction factor for each poverty line. (In Bangladesh they range from +0.5 to -0.9 percentage points.)
25. Use the distribution of the true poverty rate estimates from the bootstrap samples to determine standard errors/confidence intervals. (I.e., the interval containing the central 900 poverty rate estimates is the 90% confidence interval.)
Determination of standard errors for estimated samples
26. To determine the standard errors for the scorecard-based poverty rates, the direct measurement standard error formula needs to be adjusted for the fact that the scorecard is not a direct measure of poverty. The correction factor is the ratio of the standard error derived empirically from the bootstrap samples to the standard error from the mathematical formula in the direct measurement case. (A value less than one implies that confidence intervals for the poverty scoring method are smaller than those from direct measurement, i.e., they are more precise, and a value greater than one implies that they are less precise.) The correction factor is derived by using bootstrap samples of various sizes to get empirical estimates of the confidence interval and comparing them to the analytical standard errors corresponding to the same sample size. The correction factor is the average of these ratios. (In the Bangladesh case the exercise is done for 7 different sample sizes ranging from n = 256 to n = 16,384.)
27. The standard error for point-in-time estimates of poverty rates via SPS is
$\sigma = \alpha \cdot \sqrt{\hat{p}(1-\hat{p})/n} \cdot \sqrt{(N-n)/(N-1)}$,
where $\alpha$ is the correction factor, $\hat{p}$ is the estimated poverty rate, N is the population size, and n is the sample size.
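As a worked illustration of steps 23-24 and 27, the sketch below computes a bias correction factor and a scorecard standard error. It rests on two assumptions that should be checked against the scorecard documentation: the bias is defined as estimated minus true (so the corrected rate subtracts it), and the standard error is the usual direct-measurement formula for a proportion, $\sqrt{\hat{p}(1-\hat{p})/n}$, multiplied by a finite population correction $\sqrt{(N-n)/(N-1)}$ and by the correction factor, here called `alpha`. All numbers are hypothetical.

```python
import math

def bias_correction_factor(estimated_rates, true_rates):
    """Average of (estimated - true) poverty rates across bootstrap repetitions."""
    diffs = [e - t for e, t in zip(estimated_rates, true_rates)]
    return sum(diffs) / len(diffs)

def corrected_rate(estimated_rate, bias):
    """Adjust an estimated rate by the bias (assumes bias = estimated - true)."""
    return estimated_rate - bias

def sps_standard_error(p_hat, n, N, alpha):
    """alpha * sqrt(p(1-p)/n) * sqrt((N-n)/(N-1)); alpha is the scorecard correction factor."""
    direct = math.sqrt(p_hat * (1.0 - p_hat) / n)   # direct-measurement standard error
    fpc = math.sqrt((N - n) / (N - 1))              # finite population correction
    return alpha * direct * fpc

# Hypothetical bootstrap results (poverty rates as proportions).
estimated = [0.31, 0.30, 0.32, 0.29]
true = [0.30, 0.30, 0.30, 0.30]
bias = bias_correction_factor(estimated, true)
adjusted = corrected_rate(0.31, bias)

# Hypothetical sample of 100 households from a population of 101, with alpha = 1.
se_small_population = sps_standard_error(0.5, 100, 101, 1.0)
# The same sample from a very large population, with alpha = 1.2.
se_large_population = sps_standard_error(0.5, 100, 10**9, 1.2)
print(bias, adjusted, se_small_population, se_large_population)
```

When the population is much larger than the sample, the finite population correction is close to one and the standard error is approximately `alpha` times the direct-measurement value.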
Estimate of change in poverty rates over time
28. A similar methodology can be used to derive the estimates of bias, precision, and the correction factor when using the (2010) SPS in other years. As above, the (2010) validation sample as well as the full sample from another year are used to generate bootstrap samples to obtain mean differences and standard errors between survey samples.