Policy Research Working Paper 10255

Fewer Questions, More Answers: Truncated Early Stopping for Proxy Means Testing

Tim Ohlenburg, Juul Pinxten, Daniel Fricke, and Fabio Caccioli

Poverty and Equity Global Practice
December 2022

Abstract

The assignment of social programmes to their target population, known as targeting, is key to effective policy implementation. Proxy means testing is a widely used targeting approach where means testing is infeasible due to economic informality. This paper proposes a novel, practically feasible assessment approach that aims to reduce average proxy means test data collection costs, or allow more extensive data collection within a given resource envelope. Combining variable selection and prediction intervals, it develops a household-level truncated early stopping algorithm, which can reduce average interview length while maintaining predictive accuracy close to a standard proxy means test baseline. Applying the approach to Indonesian data, simulation of a 40 percent population coverage programme shows that targeting questionnaires could be shortened by 61 percent while maintaining PMT-level accuracy. A case study of a large health insurance programme in an urban area suggests that the share of intended beneficiaries who are among the targeted population can potentially be increased from 65.6 percent to 78.3 percent if enumerators conducted more of the shorter surveys that the truncated early stopping algorithm generates.

This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at email@timohlenburg.com, jpinxten@worldbank.org, d.fricke@ucl.ac.uk, and f.caccioli@ucl.ac.uk.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Fewer Questions, More Answers: Truncated Early Stopping for Proxy Means Testing

Tim Ohlenburg*,1, Juul Pinxten2, Daniel Fricke1, and Fabio Caccioli1
1 University College London (UCL)
2 World Bank

JEL: I32, C83, O22
Keywords: proxy means testing, adaptive survey design, policy simulation

* This paper has benefitted from the patient guidance and comments of Utz Pape, Virgi Agita Sari, Gracia Hadiwijaja, and Sailesh Tiwari. We are also grateful to reviewers Takaaki Masaki, Matthew Wai-Poi, and Phillippe Leite, whose constructive feedback improved contents and presentation.

1 Introduction

Low- and middle-income countries spend an average of 1.5% of GDP on social assistance [World Bank, 2022].
These public programmes aim to protect the poor and vulnerable from shocks, help them cope with crises, support investment in human capital and enjoy a decent standard of living throughout the lifecycle [ILO, 2021]. Policymakers design a social protection programme with a specific target population in mind. A crucial aspect of policy implementation, commonly referred to as the targeting problem, is how to identify households or individuals who are members of the target population and therefore eligible for the programme. Due to limited fiscal resources, social assistance tends to be limited to the most economically disadvantaged segment of society. Consequently, socio-economic criteria are a prevalent eligibility category [Grosh et al., 2022], whether applied in isolation – such as in a food subsidy programme for the poor – or in combination with another eligibility category, such as conditional cash transfer programmes that support poor families with school-age children.

In settings where nearly all households' income is reported to the tax authority, means testing via income tax records can determine socio-economic eligibility. However, countries' average informal employment share is above 50% in Asia Pacific, South Asia and Sub-Saharan Africa, while the poverty rate of the informally employed is around six times higher than that of those in formal employment [Ohnsorge and Yu, 2021]. This paper is concerned with such settings, where means testing is unviable due to a lack of tax records among poorer households. There, proxy means testing (PMT) has become a widespread targeting approach since its description by Grosh and Baker [1995].

PMT is a statistical approach that relies on two large-scale surveys. The first, conducted at regular intervals in most countries, is a socio-economic population sample survey that serves as training data for the PMT predictive model. The second is a targeting survey that collects current data to determine household eligibility. Due to the large population coverage of many social protection programmes, targeting surveys typically need to be administered to a significant share of the population.1 Given the high cost of such extensive, country-wide data collection efforts, targeting surveys are typically conducted on a multiannual basis when carried out as a national survey sweep [Lindert et al., 2020]. Infrequent surveys delay assessment of new programme applicants and re-assessment of existing beneficiaries, resulting in programmes' irresponsiveness to changing household circumstances.

1 For example, the social registries that store targeting data covered 87%, 75% and 40% of Pakistan's, Brazil's and Indonesia's populations respectively in 2015-17 [Leite et al., 2017], amounting to total records of ca. 360 million people.

If budgets were not limited, the most accurate PMT approach would be to assess all households, census-like, and to apply an extensive questionnaire that asks all potentially relevant questions. In practice, administrative budgets of social protection programmes are limited, so that neither all households will be assessed nor all questions asked. When eligible households are missed from the assessment process, exclusion errors will arise no matter how accurate the PMT model. For Indonesia, World Bank [2012] identifies insufficiently broad assessment as a key area for improving targeting outcomes, and its simulation shows that a full survey would lower exclusion error significantly.
Since budget constraints normally preclude universal as- sessment, a practical option may be to streamline the data collection process by asking fewer questions. If this could be done without significantly com- promising model accuracy, more households could be assessed for the same budget, which would improve targeting outcomes at no additional cost. The aim of this paper is to develop such a method, achieving a similar level of targeting accuracy as a standard PMT while asking fewer questions. We propose an alternative to the PMT algorithm that applies a measure of predictive uncertainty to household consumption estimation. We use pre- diction intervals of a quantile regression-style estimator to identify those households for which consumption is most likely above or below the socio- economic eligibility threshold. By adjusting the intervals as data is collected, enumeration can be stopped early (in the sense that not all questions need to be asked) for a substantial share of households. In addition, questionnaire length is limited to the most predictive subset of questions. Our approach combines these mechanisms to expose a trade-off between predictive accu- racy and questionnaire length that policymakers can set according to their preferences or constraints. In contrast to more computationally demanding methods proposed for an adaptive poverty classification, its limited com- putational requirements make deployment feasible on the modest mobile devices commonly used for digital data collection in low- and middle income countries. The financial resources freed up by this approach could thus be directed to more regular or more extensive targeting surveys. In our empirical application for Indonesia, we find that the reduction in questionnaire length compared with the PMT is rather substantial. For a 40% coverage programme, the number of questions can be reduced by 61% (from an average of 22.8 to 8.9) when maintaining the exclusion error rate at the PMT baseline of 26.44%. Tentative estimates for an urban setting, namely Jakarta, in which we simulate a health insurance programme with 50% population coverage, suggest an increase in the proportion of eligible recipients from 65.6% under PMT-based assignment to 78.3% under the proposed method, assuming enumerators can survey additional households by deploying shorter, adaptive questionnaires. The paper is structured as follows. Section 2 reviews the relevant liter- ature, outlines the PMT process, and provides context for the data used in subsequent simulations. Section 3 describes the components and function- 3 ing of our novel targeting method. The results in Section 4 compare the performance of our algorithm against a policy baseline. The Jakarta case study in Section 5 presents an approach to maximize the inclusion of in- tended beneficiaries by balancing survey coverage and accuracy in an urban setting. Finally, Section 6 discusses the results and their implications. 2 Background This section describes the standard approach to PMT as the foundation of our proposed method. It sets the scene by linking the research to the literature, and by framing PMT as a predictive modelling task with specific design considerations. It then describes the PMT algorithm and provides some context on its use in Indonesia, including the data collected for its implementation. 
2.1 Literature

This paper proposes a novel algorithmic approach to targeting, and its general context lies within the literature on targeting social programmes, particularly the PMT approach. The computational method and emphasis on predictive modelling link it to the growing literature on machine learning methods in targeting. The final related area we touch on is the literature on adaptive designs for survey cost reduction.

The seminal paper on PMT as a targeting mechanism is the work of Grosh and Baker [1995], who described a statistical welfare estimation approach that had been developed in Chile in the 1980s. With PMT's increasing popularity, Coady et al. [2004] conducted a systematic cross-country assessment that suggested it to be the best option where means testing is infeasible, but that its effectiveness is often limited and very context-specific. More recently, Brown et al. [2018] found that PMT could stand for Poor Means Test, as targeting errors are often so high that untargeted benefit programmes or simpler scorecard methods work nearly as well across their country examples.

Indonesia has long been a productive place for targeting research, supported by progressive policymakers, their relationships with researchers and the international development community. An influential output of this collaboration was Alatas et al. [2012], who showed in a randomized controlled trial setting that PMT results in somewhat better targeting outcomes than community-based targeting, but only when a consumption measure rather than community perceptions of poverty was used as the eligibility yardstick. Relevant to survey design is an experiment by Banerjee et al. [2020], which suggested that the inclusion of certain assets in a targeting survey does not distort Indonesians' buying behaviour, but that there is evidence of strategic misreporting. Similarly, Camacho and Conover [2011] in Latin America and Niehaus et al. [2013] in South Asia find evidence of manipulation on the survey administration side. Tohari et al. [2019] made a case for considering the full set of programmes with socio-economic criteria when conducting targeting simulations via their extensive linking of Indonesian survey data.

A growing literature considers PMT from a predictive modelling perspective. McBride and Nichols [2018] highlighted the importance of using out-of-sample validation data in model selection, which has been adopted in Indonesia. A simplified scorecard approach supported by machine learning prediction was suggested by Kshirsagar et al. [2017], who showed impressive results with very limited data collection needs for Zambian data. A systematic evaluation of machine learning methods for construction of the predictive model in PMT was carried out by Areias and Wai-Poi [2022] with data from 12 African countries, but it found that accuracy gains tend to be limited and context specific; no clear machine learning method works best across the board. A similar conclusion emerged from a study of Indonesia's PMT model [Ohlenburg, 2020].

The model presented here has a household-specific stopping criterion as its adaptive component. Adaptive methods link to extensive literatures on the design of experiments, adaptive algorithms and active learning. As an example of the latter, Saar-Tsechansky et al. [2009] studied active feature acquisition when data collection is costly and variables differ in their information value.
They proposed a framework that aims to maximize the expected utility of each data item and tailors the sequence and length of data acquisition to each instance, achieving a cost reduction for a given level of accuracy. A line of research that focuses on the aspect of fairness in targeting emerged in the work of Noriega-Campero et al. [2020]. It puts an emphasis on achieving equitable accuracy rates across subgroups of the population, thereby avoiding systematic disadvantages in programme eligibility for spe- cific groups. In contrast to the evidence of limited accuracy improvements mentioned in the machine learning for targeting literature mentioned above, this work also suggests scope for meaningful improvements in targeting ac- curacy from changes in data preparation, such as the use of a deep feature embedding, and predictive modelling methods including neural networks. Bakker et al. [2019] pursued the fairness theme further in a paper that leverages reinforcement learning for selection of fair targeting questions that tailor the questionnaire sequence to each household according to group mem- bership and other characteristics. Closely related to this paper is Bakker et al. [2021], an adaptive design that computes a household-specific sequence of questions and stops when a predictive certainty threshold is met in clas- sifying households as poor or non-poor. Our approach differs in the use of a uniform question sequence and the prediction of a continuous consump- tion level. Estimation of the household consumption level enables us to rank households, which is important for adjusting eligibility to meet coverage tar- 5 gets and for the common case of multi-programme assessment highlighted by Tohari et al. [2019]. In view of the huge logistical scale of many targeting surveys, sample population surveys are perhaps the most closely related information col- lection exercise. Looking at their economic aspects, Groves and Heeringa [2006] proposed a responsive survey design to reduce survey cost while main- taining accuracy. Focusing on settings with cost and operational constraints on data collection, Pape and Mistiaen [2018] proposed an effective imputa- tion approach to estimate the population distribution of consumption rather than a full LSMS-style survey. A similar focus on a reduced length survey is Christiaensen et al. [2021], who investigated the use of components of consumption to estimate household values but find both theoretical and empirical problems with this approach. 2.2 PMT design Three essential design criteria, echoed in Grosh et al. [2022], shape PMT: • Accuracy. Identifying the intended beneficiaries accurately is the es- sential paradigm of a targeting mechanism. • Cost. The multiannual nation-wide PMT survey sweeps common in low- and lower-middle income countries require major fiscal outlays. The resource needs of on-demand registration, which countries with sufficient administrative capacity tend to invest in, are also high and imply the need for concise surveys that economize enumerator time. • Verifiability. Survey responses determine PMT outcomes, creating a monetary incentive for misreporting personal and household charac- teristics towards responses that are associated with poverty2 . As a result, targeting surveys are often restricted to variables that are ob- servable by enumerators, such as physical assets, or that are verifiable via documentary evidence. 
In comparison to other commonly used targeting methods such as ge- ographic targeting, community-based targeting, means testing or hybrid means testing, PMT uses easy-to-verify characteristics. It is most appro- priate when informality is high and when some form of household specific ranking is desired. At lower levels of informality in an economy, means or hybrid means tests would be more appropriate and yield higher accuracy than a PMT [Grosh et al., 2022]. 2 Banerjee et al. [2020] provide evidence of such strategic behaviour, which creates an unfair advantage for those willing to be dishonest. 6 2.2.1 PMT process To understand how a PMT aims to achieve these design considerations, we break the approach down into its components. A PMT is a combination of survey and statistical modelling work. It can be described as illustrated in fig. 1: Figure 1: PMT modelling stages 1. Population sample survey. The country’s statistics authority conducts a nationally representative survey to monitor socio-economic condi- tions, independently of the PMT. This survey has two elements from a PMT perspective: (a) a consumption module that is aggregated to the dependent variable, and (b) the remainder of the survey contains the set of potential independent variables. 2. Pre-selection & preparation. In variable pre-selection, a subset of the non-consumption variables is chosen. As per the design criteria, it should be both verifiable and also avoid perverse incentives for be- havioural changes that may ultimately be harmful, such as “do your children attend school regularly?” 3. Feature engineering. The creation of additional features, such as vari- able transformations, interaction terms, group markers etc. is an im- portant aspect of predictive modelling in data-constrained settings. 4. Variable selection. The combined set of potential features that emerges from step 3 is usually large, which would result in high survey costs if all were included in the targeting survey. The task of variable selection 7 is to identify those features – denoted as X – that offer most predictive power. Some approaches, such as stepwise regression, perform the selection simultaneously with model building, but the two tasks can also be split. 5. Modelling. The modelling step involves the development of a super- vised learning model f (X ) that predicts consumption per household member. Below we expand on the PMT approach to variable selection and modelling, and then propose an adjusted algorithm. 6. Targeting survey & prediction. Operationalizing a PMT requires the set of variables X to be queried from a list of potential beneficiaries. In addition to the features, this survey may include household char- acteristics that determine categorical programme eligibility, such as the number of school-age children, verification of household charac- teristics or identity via inspection of administrative documents. The model f (X ) is applied to produce consumption estimates for the set of potential beneficiaries. Combining categorical and socio-economic eligibility, the list of beneficiaries is determined either by a ranking of household values (in case of a beneficiary quota) or via an absolute threshold that assigns eligibility if a given household’s income falls below it. To consider the link between these steps and the design considerations outlined before, note that both cost and verifiability are influenced by the questionnaire design of the targeting survey. 
The number and detail of ques- tions and the associated verification routines are an important cost lever, especially in urban areas where travel times between households are short for enumerators. In terms of the PMT process shown in fig. 1, verifiability is considered in the pre-selection step. The variable selection aspect of the modelling step also influences costs by shrinking the volume of questions. The implicit policy objective of the PMT algorithm is to achieve the best possible accuracy for a set of pre-selected questions. As such, the emphasis of the standard PMT approach is accuracy, rather than on cost reduction. The focus of this paper is on modelling, but note that targeting survey design and eligibility determination are also impacted by modelling choices: the targeting survey consists primarily of variables selected, and the pre- dictive model is used to score eligibility. The key statistical challenge in PMT design is model selection, particularly to limit the number of indepen- dent variables without sacrificing accuracy. The number of possible variable combinations is enormous in most settings due to the wide extent of the sample population survey. It is computationally infeasible to try out any but a fraction of these combinations within a chosen predictive model, and PMT manages this challenge with the following algorithm. 8 2.3 PMT algorithm PMT uses a ‘greedy’ approach to break the intractable variable selection problem down into a sequence of manageable tasks. At each step, it selects the best currently available option, until a stopping criterion is triggered. Such a myopic procedure can result in a selection that is very different to the optimal subset. However, the lack of theoretical guarantees is outweighed by its feasibility and results that tend to be credible, as evidenced by widespread policy adoption. A further benefit from the perspective of this paper is that a greedy algorithm produces a sequence of nested variable sets, which is a requirement for the adaptive approach laid out below. The original PMT algorithm, implicit in Grosh and Baker [1995], uses stepwise regression as the basic algorithm. It consists of the following key components: • Predictive model. Ordinary least squares (OLS) regression is the workhorse of PMT. Starting with an intercept, one OLS model is trained for each un-queried variable. • Selection criterion. The convention in stepwise regression tends to be the use of the Bayesian or Akaike information criterion for in-sample model selection [Friedman et al., 2001]. In forward selection, each variable is considered in turn, and that which provides the largest increase in the information criterion is added to the current variable set. • Stopping criterion. Variable selection continues until the best avail- able model in the current step offers no improvement in the chosen information criterion. The PMT algorithm is described algorithm 1 below, where X is a design matrix with partitions, Xsel , Xtmp , and Xcand are column-wise partitions thereof, y is the consumption level per household member that PMT esti- mates to determine eligibility, f () is an OLS predictive model, AIC is the Akaike information criterion [Akaike, 1998]. 2.4 Data and background The Indonesia context Indonesia uses a social registry to implement its PMT for targeting mul- tiple programmes. Much of the country’s population lives clustered above the poverty line and despite marked improvements in welfare, vulnerability remains substantial. 
In 2018, the poverty rate was 9.8%, but 28% of Indonesians lived below 1.5 times the poverty line. As a result, a relatively small income shock can push around 20% of Indonesian households into poverty. Those living above this vulnerability line but below 3.5 times the poverty line comprise 46%, indicating that about 75% of the population live on incomes below middle- and upper-class levels. The consumption distribution in fig. 2 shows how tightly clustered the consumption level of a large part of the population is, in turn highlighting the challenge of targeting social assistance programs to the poorest.

Figure 2: Indonesia's distribution of household consumption per capita, 2018. Source: Holmemo et al. [2020], World Bank staff calculations from SUSENAS 2018

Algorithm 1 Standard PMT (forward selection)
Input: Training data (X, y)
  X_sel ← 1
  f() ← fit(X_sel, y)                              ▷ Predictive model
  ŷ_sel ← f(X_sel)
  (X_tmp, ŷ_tmp) ← (X_sel, ŷ_sel)
  while ∃ X_:,j ∈ X, ∉ X_sel do:
    for each X_:,j ∈ X, ∉ X_sel do:
      X_cand ← append(X_sel, X_:,j)
      f() ← fit(X_cand, y)
      ŷ_cand ← f(X_cand)
      if AIC(y, ŷ_cand) < AIC(y, ŷ_tmp) then:      ▷ Selection criterion
        (X_tmp, ŷ_tmp) ← (X_cand, ŷ_cand)
      end if
    end for
    if AIC(y, ŷ_tmp) >= AIC(y, ŷ_sel) then:        ▷ Stopping criterion
      break
    end if
    (X_sel, ŷ_sel) ← (X_tmp, ŷ_tmp)
  end while

The country's PMT was revised in 2015 to consist of 514 distinct district-level models, to account for the differences within a nation that spans geographically and economically diverse regions. Since the introduction of the Unified Database for Targeting (now referred to as DTKS), improved targeting outcomes were seen for social protection programmes that adopted it [World Bank, 2017]. The country's first unconditional cash transfer in 2005 covered more than a third of the population with the aim to protect them from a reduction in the fuel subsidy. Beneficiary incidence – the share of total beneficiaries found in a welfare grouping – in the poorest 20% of the population comprised 36%. When an unconditional cash transfer was launched for a third time in 2013, its coverage was about 40 percent, though beneficiary incidence increased to 40%. Similarly, a conditional cash transfer for families launched in 2007 saw an increase in beneficiary incidence in the poorest 20% from 39% to 44% between 2010 and 2018.3 Its targeting outcomes are on par with similar programs around the world, though exclusion errors remain a major issue. Whilst the programme allocates over 71% of total program benefits to the poorest 40%, about one third of eligible households in the poorest decile still do not receive the program due to being excluded from consideration or being misclassified as non-poor.

3 Based on Holmemo et al. [2020] and World Bank staff calculations from SUSENAS 2010/2018.

Survey questions

The primary data source for Indonesia's PMT and for this paper is the National Socio-Economic Survey (SUSENAS). Each year of this bi-annual population sample survey contains responses from around 300,000 individuals and is representative of the country's 514 districts. Stratified by year and district, the survey rounds for the five years of 2015-19 are split 80-20 into training and test sets of circa 1.2 and 0.3 million observations respectively. SUSENAS includes an extensive consumption module and a living conditions module from which the targeting survey questions are drawn.
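For concreteness, the stratified split described above could be implemented along the following lines. This is a minimal sketch on synthetic data; the column names, district count and sample size are placeholders of our own choosing rather than actual SUSENAS variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pooled survey rounds; the real data comprise
# roughly 1.5 million household observations across 514 districts.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "year": rng.choice(np.arange(2015, 2020), size=n),
    "district": rng.choice(np.arange(10), size=n),  # far fewer districts than SUSENAS
    "log_pc_consumption": rng.normal(size=n),
})

# 80-20 split, stratified jointly by survey year and district
strata = df["year"].astype(str) + "_" + df["district"].astype(str)
train_df, test_df = train_test_split(df, test_size=0.2, stratify=strata, random_state=0)
```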
In the modelling, the consumption data are adjusted for prices and expressed as a logarithm per household member. The targeting survey mirrors the DTKS social registry administered by the Ministry of Social Affairs. Table 1 displays the variables that comprise the feature set.

Table 1: Overview of Indonesian targeting survey variables

Type     | Theme        | Variables included
Core     | Demographics | Household size & HH size², number of people by age group and gender, urban/rural, family structure, family smart card
Optional | Housing      | Materials used in the floor, wall, roof; source of drinking water and cooking water; type of lighting & cooking fuel; toilet facilities; septic tank; floor space per capita; ownership status
Optional | Assets       | Household ownership of: motorcycle, car, computer, fridge, boat, motorboat, phone, water heater, air conditioning
Optional | Education    | Household members' total levels of educational attainment and enrolment
Optional | Employment   | Employment status, employment sectors

Baseline PMT

The Indonesian PMT implemented in the country's social registry provides the baseline method for the early stopping algorithm's performance. Its construction uses the data described above, and follows the canonical PMT approach in Algorithm 1 except for two significant changes. The first is in line with McBride and Nichols [2018], who pointed out that PMT is an out-of-sample prediction task and that a cross-validation approach to variable selection leads to better performance than in-sample measures such as information criteria. Accordingly, we substitute selection via an information criterion with 5-fold cross-validation and a mean squared error (MSE) selection criterion. The stopping criterion is triggered when the cross-validated error no longer declines. MSE is the preferred metric as it aligns with the objective function of the OLS predictive model.4 The second change is to start estimation from a set of core questions required for administrative reasons or known up front. The rationale is that the algorithm should leverage both mandatory and pre-existing data to achieve a higher starting accuracy, rather than to ignore available information and start estimation from scratch. Table 1 displays the assignment of the two variable types.

4 Areias and Wai-Poi [2022] show that optimizing for MSE does not necessarily minimize EER, which is coverage-level dependent, but it provides a consistent objective for multi-programme targeting across coverage levels.

A separate PMT model is built for each of the 514 districts. Although our baseline follows the current policy practice, there are also minor differences in its construction in terms of the data, and it should not be understood as equivalent to the official PMT. The adjustments include use of different years of SUSENAS from the current set, corresponding to data available when this project was initiated, as well as the simplification of the dataset. We eliminate a number of household demographic and work status interaction terms that were too numerous for the sequential approach proposed below, while offering only a marginal predictive gain. Despite these differences, the targeting metrics are of broadly comparable magnitude5, and we would not expect a change in the qualitative conclusions if the current PMT were used instead.

3 Methods

We already identified accuracy, cost and verifiability as the three major design considerations of a statistical targeting method.
An additional con- sideration for methods that require on-the-fly prediction in the resource- constrained, often geographically remote setting of PMT deployment is that they need to be computationally feasible. The approaches proposed here meet this criterion by virtue of being relatively simple to implement and de- ploy on hardware with modest storage, memory and computational power at inference time. If we consider verifiability to be embedded in the selec- tion of variables that can be queried reliably, accuracy and cost remain as desirable outcomes. A more extensive variable set improves the predictive power of household data, as long as the additional variables contain addi- tional information about its consumption level, but it also requires higher collection cost. Consequently, statistical targeting requires a choice along the length-accuracy trade-off in a resource-constrained setting. The standard PMT approach described above has become widespread since its publication by Grosh and Baker [1995] to design models that use a moderate number of explanatory variables that also achieve a suitable level of targeting accuracy. In its pure form of automated variable selection, it selects the point on the accuracy-vs-enumeration-cost spectrum that max- imizes model accuracy. This section revolves around the idea that other points – and especially earlier points – along the variable sequence should also be considered. A shorter questionnaire may be less accurate, yet it may result in lower overall exclusion due to the consideration of more households. We suggest methods which replace the single point with a range of options that allow policymakers to select a length-accuracy profile that is suitable for their conditions. 5 Klasen et al. [2016] estimated exclusion errors for a forward stepwise regression model of 27% for a simulated 50% program coverage. The forward stepwise model constructed mirrors closely the approach taken in generating PMT rankings in the update of the UDB/DTKS in 2015. Appendix A shows a PMT exclusion error rate of 21% at 50% coverage. 13 This section introduces three approaches. The first, truncated question- naires, is a family of methods that includes the standard PMT, and which varies questionnaire length uniformly for all respondents. The second, early stopping, employs a measure of predictive uncertainty as a household-level stopping criterion. Third, Truncated Early Stopping (TrESt), combines the other two methods to leverage their respective strengths. Before an explanation of the methods, the first subsection describes two modelling preliminaries. Modelling preliminaries Predictive model In line with standard econometric practice, OLS re- gression has been the predictive model of choice in the PMT literature. In contrast, this paper uses a gradient boosting machine [Friedman, 2001] as it provides greater predictive accuracy, deals efficiently with large datasets, offers variable interactions without the need for explicit interaction terms, and provides a flexible and coherent framework for all computational tools deployed below except the group lasso. 
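As an illustration of this modelling choice, a gradient boosting regressor for the conditional mean of log consumption could be fitted with the LightGBM framework referenced later in the paper. The data, feature set and hyperparameters below are placeholders for exposition, not the values used in the study.

```python
import numpy as np
import lightgbm as lgb

# Toy stand-in for the pre-selected household features and the target
# (log consumption per household member)
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 20))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=5_000)

# Gradient boosting machine for the conditional mean
gbm = lgb.LGBMRegressor(
    objective="regression",   # squared-error loss for the mean model
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
)
gbm.fit(X, y)
y_hat = gbm.predict(X)

# Split- and gain-based importances of the kind used for variable ordering below
split_importance = gbm.booster_.feature_importance(importance_type="split")
gain_importance = gbm.booster_.feature_importance(importance_type="gain")
```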
Although machine learning models are often considered less transparent than linear ones, the interpretation of feature importance for tree-based models – described as a variable se- lection approach below – and the option to use explanation tools such as SHAP [Lundberg and Lee, 2017] that is grounded in Shapley values provide a similar degree of transparency, especially when considering that a causal interpretation of model parameters is unsuitable in a predictive modelling context (see Shmueli [2010]). Grouped variables Given our emphasis on enumeration cost, the group- ing of variables needs to be considered in more detail than in a standard setting. The first of two related aspects concerns derived features that are generated from other variables, such as squared terms, dummy/ one-hot en- codings. Although they may enter the predictive model as a stand-alone feature, the underlying information is queried through a question for an- other variable. Such items should be treated as a single feature, both from a questionnaire length and a variable selection perspective. In this vein, we adjust the various methods below to treat grouped variables as single items. The second aspect, mentioned in section (section 2.4), is the informa- tion that needs to be collected from any and all respondent households to a targeting survey. Social protection programmes with an economic wellbeing eligibility criterion verified via PMT are often targeted to specific population subgroups. For this purpose, the households are classified most commonly by demographic criteria, such as family structure or age group. The veri- fication of such categorial eligibility implies the collection of a core set of relevant variables before income proxies are queried. Combined with items known up-front, such as geographic features, any data needed for adminis- trative or statistical purposes makes up a core questionnaire administered 14 to all respondents at the beginning of enumeration. We refer to the remain- der of variables, which amount to 37 items in our Indonesian dataset, as optional. In the subsequent discussion and simulations, we take the core questions as given and refer to optional questions when discussing items such as questionnaire length. Having been selected for administrative reasons, the short core question- naire has limited predictive power for household consumption. At the other extreme, a full socio-economic population sample survey contains rich house- hold information that confers greater predictive accuracy, but is unsuitable for administration to a large section of the population. The following three subsections describe methods that provide a menu of choices in between, and expose the relationship between survey cost and accuracy. 3.1 Truncated questionnaires A simple approach is to order the optional features from most to least predic- tive. A truncated questionnaire, i.e. one that is limited to a certain number of questions, arises from adding the desired number of most predictive fea- tures to the core set. Computation of the predictive accuracy for each of the iteratively growing variable sets provides the cost-accuracy trade-off. In addition to a predictive model, this approach requires an ordering of fea- tures by their predictive power for household consumption. We test several variable selection methods to perform such an ordering and thus generate a sequence that can be truncated. • Stepwise selection. The classic mechanism of PMT modelling6 . 
To generate a complete variable sequence instead of a particular set of regressors, we adjust algorithm 1 by eliminating the stopping criterion and recording the order in which variables are added to the set of predictors.

• Variable importance. Tree-based predictive models perform variable selection during the construction of their tree structure. Our gradient boosting model, which is based on decision trees, generates two measures of variable importance. One is the proportion of trees in which each variable is used, the other is the sum of increases in the trees' objective function produced by the splits of each variable. We translate both the split and gain measures into two separate variable orderings. The first is to rank variables in descending order of either the number of splits, or of the gain they produce, when the full set of variables is considered jointly. The second uses a backward selection approach, recursively eliminating the variable with the lowest split or gain, respectively.

• Group lasso. A variation of the Lasso, or Least Absolute Shrinkage and Selection Operator [Tibshirani, 1996], the group lasso's [Yuan and Lin, 2006] objective function allows regularization over grouped variables:

$$\min_{\beta}\; \frac{1}{2}\,\Big\lVert y - \sum_{l=1}^{m} X^{(l)}\beta^{(l)} \Big\rVert_2^2 \;+\; \lambda \sum_{l=1}^{m} \sqrt{p_l}\,\big\lVert \beta^{(l)} \big\rVert_2$$

where, following the exposition of Simon et al. [2013], X^(l) is the sub-matrix of X with columns corresponding to the predictors in group l, β^(l) is the coefficient vector of that group and p_l is the length of β^(l). As in the basic lasso [Tibshirani, 1996], λ remains a free parameter that induces sparsity along the regularization path, only in this case jointly over groups and single variables. From the sets of groups selected for increasing values of λ, we construct a nested sequence of group sets that results in the variable sequence.

6 Alternative methods have begun to gain traction among practitioners [Grosh et al., 2022].

Note that we separate the variable selection approach that generates a question sequence from the predictive model that is used to predict along each sequence. The uniform prediction model used in this paper enables a comparison of the differences in length-accuracy trade-offs of the various sequences, as the starting accuracy of the core questionnaire and the final accuracy of the full variable set are identical across sequences. This clarifies the variable sequence that achieves the highest accuracy for a given number of questions (for the average household). Truncation is designed to achieve the best average result, but – unlike the following method – it fails to exploit the household-specific information that becomes available during enumeration.

3.2 Early stopping

In machine learning the term early stopping is commonly used to refer to a learning algorithm that stops training at a point where additional iterations yield no more benefit or lead to overfitting (e.g. Goodfellow et al. [2016]). In our setting, we re-purpose the term to refer to an interruption of household-level data collection when additional questions are expected to lack meaningful information about a household's consumption level. Like truncated questionnaires, this approach draws on variable sequence selection and gradient boosting as the predictive model. The key difference lies in the prediction of conditional quantiles instead of a conditional mean for each step along the variable sequence.
This change is accomplished by a change of objective function from mean squared error to quantile regression loss [Koenker and Bassett Jr, 1978]. By setting a symmetric prediction interval of a percentile θ and (1 − θ), a pair of models estimates a plausible quantile range for a household’s consumption 16 level in view of the information accumulated at the current point of the ques- tion sequence. Enumeration proceeds until the prediction interval suggests that a household is either eligible or ineligible. Eligibility is assumed when the upper bound of the prediction interval lies below the programme’s eligi- bility threshold, and ineligibility is assumed when the lower bound lies above. When early stopping is triggered in this way, the conditional mean7 predic- tion for a model trained on the variables queried until this point is recorded as the household’s final consumption estimate. Algorithm 2 summarizes the procedure, where Xcore , Xsel and X:,s denote column-wise partitions of de- sign matrix X , f () is a prediction model for the conditional mean, qθ is a quantile regression model for quantile θ. Figure 3 illustrates the process for a single household. In this case, the prediction interval shifts down and becomes narrower as more information is collected. After 12 questions, the upper bound falls below the eligibility threshold, at which point the conditional mean of the household’s consump- tion level is logged as the final estimate (dashed line), and enumeration ends. The trade-off between cost and accuracy is achieved via the width of the prediction interval determined by the percentile value θ. As θ is decreased, the prediction interval becomes narrower and is more likely to exclude the eligibility threshold. Early stopping will be triggered for a greater share of households, leading to a reduction in the average number of questions per household. A lower θ can be interpreted as a less certain prediction8 that reduces enumeration cost at the expense of less precise consumption estimates. Figure 3: Illustration of early stopping for an example household 7 The conditional median would be a point estimate alternative, and in a similar vein [Brown et al., 2018] showed that the use of quantile regression set to the intended coverage quantile of the target population can result in an exclusion error reduction. 8 The predictive model does not calibrate the prediction intervals precisely to the true population quantile. As a result, a θ value of, say, 0.05 does not correspond accurately to a 95% probability of the true value being below the upper bound of the prediction interval, but it can be thought of as an approximation. 17 For truncated questionnaires, we noted that the ordering of variables by predictive power leads to diminishing accuracy gains as more items are added. Similarly, the prediction intervals that determine early stopping behaviour shift progressively less. This implies that households for which early stopping is not triggered early on are unlikely to breach the stopping threshold at any point. The result is a full questionnaire for those households close to the eligibility threshold, which raises average questionnaire length with little accuracy benefit. The following approach can overcome this issue. 
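Algorithm 2 below formalizes the early stopping procedure just described. As a rough sketch of its household-level stopping rule, a pair of quantile models can bracket the consumption estimate, and enumeration stops once the resulting interval lies entirely on one side of the eligibility threshold. The example uses LightGBM's quantile objective; all names, data and threshold values are illustrative rather than the paper's own settings.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 12))                     # variables queried up to this step
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=5_000)

theta = 0.10    # interval percentile; a smaller theta widens the interval
gamma = -0.5    # illustrative eligibility threshold on log consumption

# Lower/upper quantile models plus a mean model whose prediction is
# recorded as the household's consumption estimate when stopping occurs
lower = lgb.LGBMRegressor(objective="quantile", alpha=theta).fit(X, y)
upper = lgb.LGBMRegressor(objective="quantile", alpha=1 - theta).fit(X, y)
mean = lgb.LGBMRegressor(objective="regression").fit(X, y)

def check_stop(x_row, gamma):
    """Stop early if the prediction interval excludes the threshold gamma."""
    lo, hi = lower.predict(x_row)[0], upper.predict(x_row)[0]
    if hi < gamma or lo > gamma:                 # household looks clearly (in)eligible
        return True, float(mean.predict(x_row)[0])
    return False, None                           # otherwise keep asking questions

stopped, estimate = check_stop(X[:1], gamma)
```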
Algorithm 2 Early stopping
Input: Training set (X, y), inference set X_nfr, core variables X_core, variable sequence s, eligibility threshold γ
  X_sel ← X_core                                   ▷ Pre-compute models
  for each s ∈ s do:
    X_sel ← append(X_sel, X_:,s)
    f^s() ← fit(X_sel, y)
    q^s_θ() ← fit(θ, X_sel, y)
    q^s_(1−θ)() ← fit(1 − θ, X_sel, y)
  end for
  for each x_i ∈ X_nfr do:                         ▷ Infer consumption
    x_sel ← x_i,core
    for each s ∈ s do:
      query(X_i,s)
      x_sel ← append(x_sel, X_i,s)
      if q^s_(1−θ)(x_sel) < γ or q^s_θ(x_sel) > γ then:
        ŷ_i ← f^s(x_sel)
        break
      end if
    end for
    ŷ_i ← f^s(x_sel)
  end for

3.3 Truncated early stopping (TrESt)

TrESt performs early stopping, but with a limit on the maximum number of questions that households can be asked. This approach provides a household-specific stopping criterion for cases where eligibility status appears clear, while capping questionnaire length to balance cost and accuracy for more ambiguous cases. Figure 4 illustrates the concept, showing the EER-average length result for a range of truncations that emerge from the early stopping points for selected interval widths. Setting truncation to the full number of optional variables recovers the early stopping solution, whereas a maximum width prediction interval would yield the truncated questionnaire solution.

Truncated questionnaires and early stopping both expose the cost-accuracy trade-off via a single hyperparameter. For the former, this is the number of optional questions, and for the latter it is the width of prediction intervals. TrESt requires both hyperparameters, which can work in opposite directions; a longer questionnaire can be counteracted with a narrower prediction interval, and vice versa, which results in overlapping outcomes for different hyperparameter pairs. The TrESt solution consists of those hyperparameter pairs that generate the lower bound of accuracy-length outcomes (shown as a red line in fig. 8 in the results section). We use a separate validation set to identify the lower bound hyperparameters without overfitting, as cross-validation was already deployed on the training set in the generation of the variable sequence.9

9 A more computationally intensive approach along the lines of Cawley and Talbot [2010] may be preferable in settings with limited data.

Figure 4: Illustrative truncation sequences for selected early stopping predictions. Colours represent different prediction intervals.

4 Results

This section presents the simulation results for each of the three proposed approaches, for a social protection programme with 40% population coverage in Indonesia. A PMT baseline provides a comparison with current policy practice. Whereas the PMT baseline consists of district-specific models, the other approaches are trained at national level with a district identifier in the core questionnaire that enables generation of locally adapted models. Our main outcome of interest is the number of questions that an approach requires versus the exclusion error rate. We favour the exclusion error rate (EER) as the evaluation metric most closely aligned with the targeting policy objective of identifying the eligible poor. Additional targeting metrics for a broader range of programme population coverages from 10% to 50% can be found in appendix A. The graphical results show the number of optional questions that are required for a given accuracy, measured by EER.
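To make the headline metric concrete, one way of operationalising the exclusion error rate at a given coverage level is sketched below. This is our own illustrative definition (truly eligible households, ranked by actual consumption, that are missed by the predicted ranking); the paper's implementation may differ in details such as weighting or tie handling.

```python
import numpy as np

def exclusion_error_rate(y_true, y_pred, coverage=0.4):
    """Share of truly eligible households (poorest `coverage` share by actual
    consumption) not covered when the poorest `coverage` share by predicted
    consumption is assigned to the programme."""
    n = len(y_true)
    k = int(round(coverage * n))
    truly_eligible = set(np.argsort(y_true)[:k])
    covered = set(np.argsort(y_pred)[:k])
    return len(truly_eligible - covered) / k

# Toy example with a noisy consumption estimate
rng = np.random.default_rng(0)
y_true = rng.normal(size=1_000)
y_pred = y_true + rng.normal(scale=0.7, size=1_000)
print(exclusion_error_rate(y_true, y_pred, coverage=0.4))
```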
This ordering implies that policymakers choose a minimum required accuracy with reference to a benchmark method, rather than assigning a question budget per household. Recall that the full questionnaire starting point and the core questionnaire-only end point are shared, as the predictive model is identical across sequences. A method that is closer to the origin is preferable as it offers a superior cost-accuracy trade-off. The complete questionnaire that mirrors Indonesia's targeting survey of 37 optional questions results in a simulated EER of 26%.10 At the other extreme, a gradient boosting model trained only on core questionnaire variables would achieve an EER of 37.78%. The range between core and full questionnaire spans almost half the EER's lower end value, implying a major – and likely unacceptable – decline in targeting accuracy when minimizing question numbers.

10 The full questionnaire would be selected when a gradient boosting model is used in the PMT algorithm instead of OLS on nation-wide data.

4.1 Truncated questionnaires

The first set of results is for questionnaires of uniform length administered to all households. We consider outcomes for sequences based on stepwise selection, the group lasso11, and variable importance for the generation of the grouped variable sequences that underlie these questionnaires. For the tree-based variable importance measures, we noted the option of using either split counts or loss function gains, as well as their computation via either a one-way ranking or a backwards selection-style recursive feature elimination. From the resulting four possible implementations, we show one-off gain as the best-performing one, which leaves recursive split-based elimination as a complement. Figure 5 shows the four methods' relative performance.

11 We use the Group Lasso python library and the LightGBM framework [Ke et al., 2017] to implement gradient boosting.

Figure 5: Truncated questionnaires: number of questions by EER comparison for several variable sequences

Figure 5 reveals stepwise selection to be the most effective approach in this setting, followed by the group lasso and then the variable importance measures. Whereas the performance penalty of the group lasso is moderate, the variable importance measures underperform significantly and cannot be recommended for this use case. Whereas there is a steep rise in the required number of questions to achieve the ultimate, small reductions in EER (top left), the sequencing of variables by predictive power results in a strong accuracy impact for initial questions (bottom right). The second key result is that fewer than 5 questions raise the EER strongly, and beyond 15 questions the accuracy gain becomes imperceptible for the best-performing method of stepwise selection. An intermediate range provides a moderate elasticity between EER and questionnaire length. It appears likely that policymakers with all but extreme preferences will choose a point in this range in settings where each question has a meaningful budget impact.

4.2 Early stopping

This subsection assesses whether early stopping, a basic adaptive method that processes household information during enumeration, can improve the cost-accuracy trade-off of a PMT-style approach. We compare outcomes among the same variable sequences as above, and then evaluate whether early stopping or truncated questionnaires yield better results in each case.
Recall that early stopping, which generates the length-accuracy trade-off via the width of prediction intervals, results in questionnaire length varia- tion across households. The end points of the variable sequences are very similar, at 34.8 (variable importance/ gain), 34.5 (recursive feature elimina- tion/ split), 34.5 (group lasso), and 34.2 (stepwise) questions, all resulting in an EER of 25.99%. The number of questions is below the full-length 37 as early stopping is triggered in some instances even when the interval width is set to a minimum of 0.01, as even basic information is sufficient to identify households with an extreme income level. The identical EER with that of a full questionnaire shows that there is no loss in accuracy from early stopping for that subset of households. To constrain the range of outcomes to a statistically credible range, we have limited the maximum θ quantile to 0.3, resulting in prediction intervals limited by the 30th and the 70th percentiles at its most narrow. Note that the prediction bound collapses to the median as the quantile reaches 0.5 in our symmetric interval scheme. At that point, early stopping would be triggered at the core questionnaire for all households, yielding the same result as the truncation at zero questionnaire with EER 37.78% as above. 21 Accordingly, the end point of early stopping is the same across all approaches including truncated questionnaires, and the reader may picture the lines converging at this point. The first main result, shown in fig. 6, mirrors that of the previous subsec- tion: early stopping based on stepwise selection provides the best trade-off, followed the group lasso sequence and then the variable importance mea- sures. The differences between variable selection approaches are notably smaller than for truncated questionnaires, suggesting that early stopping is less sensitive to the particular variable sequence used. The likely rea- son for this effect is that stopping happens at different points for different households, and also more often at more effective models, which smooths the outcomes. Figure 6: Early stopping: number of questions by EER comparison for several variable sequences The shape of the stepwise selection curve shows that the trade-off be- tween length and accuracy is remarkably similar. We observe that the elas- ticity becomes greater at around 20 questions. When plotted on the same graph, a direct comparison underlines the validity of this impression. Fig- ure 7 compares the results of the truncation and early stopping approaches separately for each of the variable selection methods. Interestingly, the truncation vs early stopping results are nearly identical for both the step- wise selection and the group lasso plots, despite the substantial difference in the algorithms. The similarity suggests that both methods are able to exploit their respective variable sequences with similar efficacy. In the case of variable importance, the early stopping approach provides better results than truncated questionnaires. We again interpret this rela- tive outperformance as being related to a less precise ranking of variables by predictive power, and the smoothing across models that occurs in early stopping. Overall, the stepwise selection variable sequence retains the best length-accuracy result by a small margin over the group lasso. 
For this sequence, truncated questionnaires require slightly fewer questions for a given accuracy when considering shorter questionnaires, whereas for long questionnaires the early stopping solution is more effective by what amounts to a negligible margin.

Figure 7: EER comparison of early stopping vs stepwise selection by number of questions

4.3 Truncated Early Stopping (TrESt)

Early stopping and truncated questionnaires can potentially be combined as they exploit the same variable sequence in distinct ways. The results shown in fig. 8 confirm the effectiveness of a joint approach in an empirical application.12 The orange dots show the early stopping outcomes for various prediction interval widths, and the green line represents the range of truncated questionnaires. Grey lines emerging from each dot represent the truncation paths for the early stopping models, and their lower bound (in red) constitutes the TrESt solution.

12 Stepwise selection and a gradient boosting model are used to generate the TrESt algorithm results below.

Figure 8: Overview plot

The initial near-vertical decline of the grey lines confirms that the exclusion of the least predictive variables from questionnaires is similarly useful in an early stopping algorithm to reduce average questionnaire length. The zoomed-in section isolates the lower bound, and plots it against the current policy baseline to illustrate the performance difference. TrESt achieves a PMT-equivalent EER with an average of only 8.9 questions per household, compared with 22.4 for the PMT policy baseline. We review a key aspect of algorithm functioning before considering a full set of results across the different methods.

To appreciate the mechanics of the TrESt approach, consider the algorithm's behaviour for different pairs of questionnaire length and prediction interval width. Figure 9 shows the proportion of households for which early stopping is triggered along these dimensions. As expected, a larger quantile, which results in a narrower prediction interval, increases the proportion of early stopping for a given questionnaire length. At the first percentile, there is practically no early stopping on completion of the core questionnaire, whereas at the 30th percentile this basic information is deemed sufficient to classify nearly half (45.3%) of households. At the other end, by the penultimate question, the prediction interval excludes the eligibility threshold for only 9.0% of households at the 1st percentile, rising to 89.9% at the 30th percentile.

The information about a household's economic conditions in the first five optional questions has the strongest impact on stopping, as it shifts and narrows the prediction interval more than subsequent information. The share of households for which early stopping is triggered flattens thereafter. Even at the most sensitive end of the considered intervals (30th percentile), early stopping only increases by 1.5 percentage points over the last twenty questions. From truncated questionnaires, we already know that the consumption estimate's accuracy gain is very limited at this part of the questionnaire. Excluding the last variables from consideration thus has only a small impact on the ultimate consumption estimate, but it reduces questionnaire length by one-third.

4.4 Comparison table

Table 2 compares the main results of the different approaches in terms of the EER that corresponds to a particular questionnaire length.
4.4 Comparison table

Table 2 compares the main results of the different approaches in terms of the EER that corresponds to a particular questionnaire length. It demonstrates that a trade-off between average questionnaire length and targeting accuracy can be generated with each of the three proposed methods. In our Indonesia simulation, both truncated questionnaires and early stopping yielded remarkably similar outcomes. TrESt is able to leverage the respective mechanisms for a superior length-accuracy trade-off in the case of shorter questionnaires. Above circa 18 questions, early stopping becomes the most effective approach, albeit by a small margin13. For questionnaires with fewer than 20 items, at which point a notable trade-off between length and accuracy emerges, TrESt provides the best set of solutions. The resulting range of options raises the question of which specification should be chosen, and we explore this issue for a specific setting in the following section.

13 The small underperformance of TrESt vs early stopping for long questionnaires can be attributed to the selection of hyperparameter pairs on the validation set.

Table 2: Comparison of EER% across models by questionnaire length (approx. no. of questions in brackets)

Approx. questions   PMT            Truncated   Early stop     TrESt
37                  –              25.99       –              –
30                  –              26.00       25.99 (29.5)   –
23                  26.44 (22.8)   26.09       26.01 (23.2)   26.08 (23.3)
20                  –              26.09       26.05 (20.3)   26.10 (19.8)
15                  –              26.28       26.44 (14.7)   26.18 (15.1)
10                  –              27.02       27.07 (10.3)   26.34 (9.9)
5                   –              28.55       28.53 (5.6)    27.32 (5.0)
0                   –              37.79       37.79          37.79

5 Case study: Urban Setting

The benefit of a method that allows much shorter questionnaires while maintaining accuracy becomes more tangible when placed in a particular policy context. This section calibrates a TrESt model to a particular urban setting, allowing us to weigh enumeration costs against estimation accuracy and thereby maximize a policy objective. The following rudimentary policy simulation suggests scope for substantially improved targeting outcomes.

Three basic solutions to the cost-accuracy trade-off are apparent. One is maximum accuracy regardless of cost. Implicitly, the standard PMT falls into this category, as it optimizes the model's predictive power regardless of questionnaire length. The other extreme would be the minimum-cost solution of a core questionnaire. In many cases, the optimal solution may lie at a point where cost and accuracy are balanced to achieve a certain policy objective. Programme incidence, defined as the share of intended beneficiaries in the covered population, is a realistic objective, and in this case study we optimize for it on the assumption of a fixed budget. Accordingly, we select the TrESt model specification that yields the highest feasible incidence by weighing the extent of survey coverage against overall accuracy.

The setting we simulate is DKI Jakarta, Indonesia's capital city region, which had about 10.5 million inhabitants in a densely populated urban environment. Information shared by the local government, which is charged with updating the social registry, suggests that enumerators who administer the current targeting questionnaire survey collect information from 10 to 20 households per day, and that each questionnaire takes between 15 and 30 minutes per household (personal communication, September 2021). Taking the midpoints of these ranges implies that the average travel time between households is 10 minutes. Based on the number of items that make up the core questionnaire and the implied travel time, the fixed and maximum variable times for each interview both amount to circa 16 minutes14.

14 Based on 26 core items out of a total of 112 ungrouped variables, and rounding up to the full minute.
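One way to reconstruct these figures is as follows (our reading; the eight-hour field day is an assumption that the source does not state). With the midpoints of 15 households per day and 22.5 minutes per interview,

\[
t_{\mathrm{travel}} \approx \frac{8 \times 60}{15} - 22.5 = 9.5 \approx 10 \ \text{minutes}, \qquad
t_{\mathrm{fixed}} \approx 10 + \left\lceil \tfrac{26}{112} \times 22.5 \right\rceil = 16 \ \text{minutes},
\]

which leaves roughly 16 of the 22.5 interview minutes as the maximum variable time for the optional questions.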
Assuming that each of the optional questions takes the same amount of time, variable interview time is scaled down in proportion to the share of grouped questions asked out of the full questionnaire total.

The programme we simulate is the PBI-JKN subsidized health insurance programme, intended to cover 40% of the population nationally, but which the DKI government has extended to 51% of households. Rounding to 50% to align with our simulations, and adjusting household figures accordingly, SUSENAS data indicate an incidence of 62.8% in 2019. A simplifying assumption at baseline is that all households which are included in the social registry receive the programme, as full coverage of all surveyed households mirrors actual practice at the national level, where the DTKS social registry contained the same 40% share of households that the programme aimed to cover [Pahlevi, 2019].

We simulate the impact of a shorter questionnaire by separating households into consumption deciles and splitting each decile into a surveyed and an unsurveyed group according to empirically observed percentages in the SUSENAS data. The PMT is distinct from the policy baseline in that it would only require an average of 24.4 questions to be asked for the Jakarta-area local models, whereas the status quo is based on the actual policy practice of querying a full questionnaire. A shorter questionnaire then allows additional households to be sampled at random from the unsurveyed population15. The additional households are surveyed with either a PMT or a TrESt questionnaire based on the stepwise selection sequence that proved most effective. To determine eligibility, households are ranked according to the PMT predictions, or the TrESt predictions that arise from the lower-bound length-interval width pairs that yield its solution. Each hyperparameter pair results in an average number of questions, which in turn determines the total number of households that can be surveyed at a given accuracy. A corresponding incidence value is the main outcome for each pair.

15 Random selection is a conservative assumption, as households could be prioritized by likely eligibility, e.g. through poverty maps.
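The following sketch re-creates the core of this simulation in stylized form: a fixed enumeration-time budget, a questionnaire length that determines how many households can be surveyed, prediction noise that grows as the questionnaire shrinks, and incidence computed over the households actually targeted. All parameter values and the noise model are placeholders; the decile-wise survey split and the Jakarta-specific local models used in the paper are not reproduced here.

# Stylized sketch of the incidence simulation (a simplification under assumed
# parameters; not the paper's exact procedure).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
N = 10_000                                                   # households in the simulated area
df = pd.DataFrame({"consumption": rng.lognormal(0.0, 0.5, N)})
df["intended"] = df["consumption"] <= df["consumption"].quantile(0.5)   # 50% coverage

FIXED_MIN, FULL_VARIABLE_MIN, FULL_QUESTIONS = 16.0, 16.0, 37
# Status-quo budget: full questionnaire administered to half of all households.
BUDGET_MIN = 0.5 * N * (FIXED_MIN + FULL_VARIABLE_MIN)

def simulate_incidence(avg_questions, pred_noise_sd):
    """Shorter questionnaires free up budget for more interviews but give noisier
    consumption predictions; incidence is the share of intended beneficiaries
    among the households actually targeted."""
    minutes_per_hh = FIXED_MIN + FULL_VARIABLE_MIN * avg_questions / FULL_QUESTIONS
    n_surveyed = min(N, int(BUDGET_MIN / minutes_per_hh))
    surveyed = df.sample(n_surveyed, random_state=2).copy()
    surveyed["pred"] = surveyed["consumption"] * np.exp(rng.normal(0.0, pred_noise_sd, n_surveyed))
    targeted = surveyed.nsmallest(int(0.5 * N), "pred")      # rank and fill the programme slots
    return targeted["intended"].mean()

# Illustrative question-length / prediction-noise pairs (values are made up).
for q, sd in [(37, 0.20), (22, 0.22), (9, 0.28)]:
    print(f"avg questions {q:>2}: incidence {simulate_incidence(q, sd):.3f}")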
Figure 10 shows the simulation results. The incidence curve is less smooth than for the national results due to the more limited number of observations for Jakarta. The chosen hyperparameter pair, marked by the red triangle, is a maximum questionnaire length of only 5 grouped items, alongside a medium level of early stopping sensitivity generated by θ set to 0.11. Note that this specification lies slightly below the best test set outcome, as it was chosen according to its performance on a validation set.

Figure 10: Simulated PBI-JKN incidence in DKI Jakarta at 50% coverage

The simulations indicate that a PMT that only queries the questions required for the model would allow an additional 16.7% of the population to be surveyed. However, the linear model's modest predictive power in the DKI setting only raises the incidence to 65.6%. In contrast, TrESt achieves a significantly higher incidence at any model specification. At an average of 21.8 questions, even the model with the longest average questionnaire requires fewer questions than the PMT and still achieves a considerably higher incidence of 74.1%. The chosen solution only uses an average of 3.9 questions per household, allowing 94.6% of the population to be sampled to achieve an incidence of 78.3%. Further reductions in average length yield little benefit in terms of respondent numbers, and decrease accuracy so much that the overall effect on incidence is negative.

This case study suggests that TrESt has significant potential for raising incidence in a suitable setting. It relies on limited information and should be interpreted as a stylized example rather than as a fully calibrated use case. Nevertheless, it illustrates the relationship between targeting survey coverage and household-level accuracy, and shows that prioritizing the former over the latter can be beneficial for desired policy outcomes. Another caveat is that the low travel time between households in the urban setting considered here is a key environmental factor that promotes a close link between questionnaire length and incidence. Whether similar benefits would accrue where population density is lower could be assessed with enumeration meta-data for rural and peri-urban localities. Even when the number of questions is similar, the simulation suggests that the TrESt approach has a considerable advantage over a standard PMT, likely due to a more efficient orientation of survey resources towards households with more uncertain eligibility status, and due to the greater predictive power of the non-linear, national-level machine learning model.

6 Discussion

We proposed a selection of practically feasible methods that expose a trade-off between accuracy and survey length, all based on variable sequences that rank questions by predictive power. One simply truncates questionnaires to a certain length for all surveyed households; another deploys prediction intervals to generate a household-level early stopping criterion. For the most effective variable sequence – generated by stepwise selection – both approaches produce remarkably similar results in a simulation of a programme with 40% population coverage using data from Indonesia, despite relying on different mechanisms. Compared with a policy baseline PMT that requires 22.8 questions to achieve an EER of 26.44%, truncation only needs 14 questions for a similar EER, and the early stopping method requires 18. A combined method that leverages their respective advantages achieves a superior length-accuracy trade-off at any point, and requires a mere 8.9 questions for an equivalent EER.

The methods presented here offer policymakers a choice on cost versus accuracy to suit their context and constraints. When selecting a point that maximizes incidence in a rudimentary simulation of an urban setting, a considerable improvement (to 78.3% incidence) appears feasible over both current practice (62.8%) and a more streamlined approach based on a PMT model (65.6%). Whether similar gains can be achieved in a less densely populated setting is an open question. Similarly, the assumptions of a uniform time per question and the ability to expand survey coverage to nearly the whole area population might prove overoptimistic in a real-world setting. Nevertheless, the urban case study highlights the potential benefit of expanding targeting survey coverage in some settings, even if it entails a moderate accuracy penalty.

A census sweep of the whole population with an extensive questionnaire would provide the most comprehensive data for accurate targeting.
But if budgets are constrained, surveying more households with a shorter questionnaire may yield the best results. Beyond maximizing incidence with limited resources, policymakers would also enjoy greater flexibility through the methods presented here. The resources needed to conduct regular re-assessment mean that targeting surveys are run with multi-year gaps when conducted as survey waves, and potentially also lengthen re-assessment intervals in on-demand systems. A reduction in survey cost could contribute to a more dynamically supportive social protection system by raising the frequency of assessment and re-assessment. It is also worth noting that TrESt's efficient use of survey resources via household-level early stopping, a suitably truncated questionnaire, and a more effective predictive model appears to offer a clear accuracy improvement over the standard PMT for a given average questionnaire length.

To generate transparent results, our simulations consider a single programme with a fixed population coverage rate. Countries with social registries, of which Indonesia's DTKS is a well-documented example, usually target multiple programmes with differing population coverage rates that result in different eligibility thresholds. Prediction interval-based methods can be adjusted to this common scenario by using an eligibility interval instead of a threshold. The appropriate interval for each household can be identified according to characteristics queried in the core questionnaire. Ceteris paribus, wider eligibility intervals will result in longer questionnaires on average. A related caveat is that the targeting accuracy of new programmes that leverage existing consumption estimates but have different eligibility thresholds – widely adopted during the recent COVID-19 pandemic – will be lower than that of estimates based on full questionnaires or newly collected data. Policymakers considering the introduction of a TrESt-style method are advised to simulate such contingencies in a fully calibrated model.

Implementation of the TrESt algorithm would require changes to enumeration practice. Software is one aspect, as the method requires on-the-fly inference of prediction intervals for the stopping criterion. For Indonesia, an existing Android app used for the targeting survey could be adjusted for this purpose, but survey bodies elsewhere that currently rely on paper or off-the-shelf digital questionnaires would require new software. Deployment on sufficiently powerful hardware, such as a mid-range tablet or smartphone that can conduct the computations on the fly, or a reliable internet connection for processing in the cloud, would be required. Enumerator training is another implementation aspect that would require adjustment, as the targeting survey questions need to be ordered by their stepwise selection sequence rather than the standard thematic grouping. A final IT-related issue is that a social registry populated with a TrESt or early stopping approach would hold jagged optional-question data, with different items answered by different households, so survey designers would need to include all variables required for analytical or monitoring purposes in the core questionnaire.

Although the simulations presented here suggest scope for reducing survey costs while potentially reducing exclusion errors, this paper is only a desk-based proof of concept.
Indonesia has already achieved good targeting outcomes through years of investment in improving targeting processes and implementation. Continued improvements in design could further strengthen targeting outcomes at no additional cost, but piloting would be advisable to verify that the method is practical, that the results hold in the field, and also to collect additional data. For Indonesia, a simple PMT trial that collects the survey metadata needed to estimate realistic survey times would provide key budgetary inputs. Similarly, information on travel times between households would be important to calibrate time savings, particularly in rural districts and island locations where shorter questionnaires may only yield negligible savings. Detailed cost estimation along the lines of Fujii and van der Weide [2020] on the cost-effectiveness of double sampling would be advisable if implementation were to be considered.

One way to improve accuracy may be to draw on alternative data sources, such as the exploration of internet and phone expenditure for Indonesia's PMT by Pinxten [2021]. Such data can potentially support better classification, shorter questionnaires, or both. Similarly, alternative data preparation tailored to the early stopping approach may offer significant benefits in terms of accuracy or question numbers. On the other hand, the shortened questionnaire raises the risk of misreporting, as it concentrates both predictive power and enumerator attention on a few high-impact variables. A restriction to verifiable variables may mitigate this risk, but monitoring of response patterns would remain important to identify emerging subterfuge.

Beyond a more detailed assessment of financial and logistical aspects, additional applications with data from other countries would help in assessing whether the early stopping algorithm can be a useful tool in other settings. The distributional impact in terms of unequal outcomes for different groups is another important aspect that warrants further exploration. Appendix B outlines an illustrative group-level analysis which suggests that outcomes for pensioner households are similar for early stopping when compared with a standard PMT. Further reassurance would be gained by verification of estimation consistency for a full range of relevant vulnerable groups, as well as by a more extensive exploration of targeting fairness along the lines of Noriega-Campero et al. [2020]. The distributional analysis could be combined gainfully with a consideration of multidimensional poverty measures that assess impacts beyond the standard consumption-based perspective taken here.

In terms of methodological options, alternative variable selection methods that are less prone to suboptimal choices, different predictive models, or non-symmetric prediction intervals that align with the skew of consumption distributions may yield further improvements in the cost-accuracy trade-off. A promising research direction is to move beyond the stopping criterion to a survey design with a question sequence tailored on-the-fly to each household. The adaptation of the tree-based method in Bakker et al. [2021] from classification to regression would be one step in this direction. While computational constraints currently preclude deployment of a fully adaptive approach in the field, the development of a semi-adaptive method that deploys a limited set of variable sequences may be a promising research avenue.
A related direction would be to tailor data collection to the promotion of more equitable targeting outcomes for economically disadvantaged groups.

References

Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.

Vivi Alatas, Abhijit Banerjee, Rema Hanna, Benjamin A Olken, and Julia Tobias. Targeting the poor: evidence from a field experiment in Indonesia. American Economic Review, 102(4):1206–40, 2012.

Ana Areias and Matthew Wai-Poi. Machine learning and prediction of beneficiary eligibility for social protection programs. In Revisiting Targeting in Social Assistance: A New Look at Old Dilemmas, chapter 8. World Bank, Washington DC, 2022.

Michiel A Bakker, Duy Patrick Tu, Humberto Riverón Valdés, Krishna P Gummadi, Kush R Varshney, Adrian Weller, and Alex Pentland. DADI: Dynamic Discovery of Fair Information with Adversarial Reinforcement Learning. arXiv preprint arXiv:1910.13983, 2019.

Michiel A Bakker, Duy Patrick Tu, Krishna P Gummadi, Alex Sandy Pentland, Kush R Varshney, and Adrian Weller. Beyond reasonable doubt: Improving fairness in budget-constrained decision making using confidence thresholds. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 346–356, 2021.

Abhijit Banerjee, Rema Hanna, Benjamin A Olken, and Sudarno Sumarto. The (lack of) distortionary effects of proxy-means tests: Results from a nationwide experiment in Indonesia. Journal of Public Economics Plus, 1:100001, 2020.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. Self-published manuscript, 2021.

Caitlin Brown, Martin Ravallion, and Dominique Van de Walle. A Poor Means Test? Econometric targeting in Africa. Journal of Development Economics, 134:109–124, 2018.

Adriana Camacho and Emily Conover. Manipulation of social program eligibility. American Economic Journal: Economic Policy, 3(2):41–65, 2011.

Gavin C Cawley and Nicola LC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 11:2079–2107, 2010.

Luc Christiaensen, Ethan Ligon, and Thomas Pave Sohnesen. Consumption subaggregates should not be used to measure poverty. The World Bank Economic Review, 2021.

David Coady, Margaret Grosh, and John Hoddinott. Targeting outcomes redux. The World Bank Research Observer, 19(1):61–85, 2004.

Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The Elements of Statistical Learning. Number 10. Springer Series in Statistics, 2001.

Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Tomoki Fujii and Roy van der Weide. Is predicted data a viable alternative to real data? The World Bank Economic Review, 34(2):485–508, 2020.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Margaret Grosh and Judy L Baker. Proxy means tests for targeting social programs: simulations and speculation. Living Standards Measurement Study Working Paper, 118:1–49, 1995.

Margaret Grosh, Phillippe Leite, Matthew Wai-Poi, and Emil Tesliuc. Revisiting Targeting in Social Assistance: A New Look at Old Dilemmas. World Bank, Washington DC, 2022.

Robert M Groves and Steven G Heeringa. Responsive design for household surveys: tools for actively controlling survey errors and costs.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(3):439–457, 2006.

Camilla Holmemo, Pablo Acosta, Tina George, Robert J Palacios, Juul Pinxten, Shonali Sen, and Sailesh Tiwari. Investing in People: Social Protection for Indonesia's 2045 Vision, 2020.

ILO. World Social Protection Report 2020–22: Social Protection at the Crossroads – in Pursuit of a Better Future. International Labour Organization, 2021.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30:3146–3154, 2017.

S. Klasen, S. Lange, G. Hadiwijaja, J. Pinxten, and B.A. Wirapati. Evaluation of PMT-based targeting in Indonesia. Unpublished technical report, World Bank, 2016.

Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica, pages 33–50, 1978.

Varun Kshirsagar, Jerzy Wieczorek, Sharada Ramanathan, and Rachel Wells. Household poverty classification in data-scarce environments: a machine learning approach. arXiv preprint arXiv:1711.06813, 2017.

Phillippe Leite, Tina George, Changqing Sun, Theresa Jones, and Kathy Lindert. Social registries for social assistance and beyond. 2017.

Kathy Lindert, Tina George Karippacheril, Inés Rodríguez Caillava, and Kenichi Nishikawa Chávez. Sourcebook on the Foundations of Social Protection Delivery Systems. World Bank, Washington DC, 2020.

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.

Linden McBride and Austin Nichols. Retooling poverty targeting using out-of-sample validation and machine learning. The World Bank Economic Review, 32(3):531–550, 2018.

Paul Niehaus, Antonia Atanassova, Marianne Bertrand, and Sendhil Mullainathan. Targeting with agents. American Economic Journal: Economic Policy, 5(1):206–38, 2013.

Alejandro Noriega-Campero, Bernardo Garcia-Bulle, Luis Fernando Cantu, Michiel A Bakker, Luis Tejerina, and Alex Pentland. Algorithmic targeting of social policies: fairness, accuracy, and distributed governance. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 241–251, 2020.

Tim Ohlenburg. Machine learning for proxy means testing in Indonesia. Unpublished technical report, World Bank, 2020.

Franziska Ohnsorge and Shu Yu. The Long Shadow of Informality. 2021.

Said Mirza Pahlevi. Indonesia's Unified Database (UDB). ADB Social Protection Week conference presentation, November 2019.

Utz Johann Pape and Johan A Mistiaen. Household expenditure and poverty measures in 60 minutes: a new approach with results from Mogadishu. World Bank Policy Research Working Paper, (8430), 2018.

Juul Pinxten. Estimating the improvement of predictive performance through inclusion of mobile phone and internet data expenditure variables in Indonesian PMT modelling. Unpublished technical report, 2021.

Maytal Saar-Tsechansky, Prem Melville, and Foster Provost. Active feature-value acquisition. Management Science, 55(4):664–684, 2009.

Galit Shmueli. To explain or to predict? Statistical Science, 25(3):289–310, 2010.

Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.

Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Achmad Tohari, Christopher Parsons, and Anu Rammohan. Targeting poverty under complementarities: Evidence from Indonesia's unified targeting system. Journal of Development Economics, 140:127–144, 2019.

World Bank. Targeting Poor and Vulnerable Households in Indonesia. World Bank, Washington DC, 2012.

World Bank. Social Assistance Public Expenditure Review. World Bank, Washington DC, 2017.

World Bank. ASPIRE: Atlas of Social Protection Indicators of Resilience and Equity, 2022. URL https://www.worldbank.org/en/data/datatopics/aspire.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

A Additional targeting metrics and coverage rates

To show that the proposed methods perform well for a variety of programme coverage levels, this annex provides simulations for programmes that are targeted at 10%, 20%, 30%, 40%, and 50% of the population. Figure A1 shows the number of questions by EER across coverage levels. It highlights that the performance ranking of truncation and early stopping varies with the coverage level, which points to the need for context-specific selection if one of these is to be chosen. At low coverage rates, early stopping performs relatively better than truncation, as the eligibility threshold moves towards the lower tail of the consumption distribution and facilitates exclusion of non-eligible households with high consumption levels. TrESt not only maintains its superior performance, but outperforms truncation and early stopping by a wider margin than at the relatively elevated 40% coverage level considered in the main text.

Whereas the previous focus was on the EER as the key targeting metric, this annex expands the results to a wider set of metrics. Table A1 outlines these, adding the commonly used inclusion error rate (IER), the mean squared error (MSE), and the coefficient of determination (R2). The results are shown in tables A2 to A5, for each of which we provide brief commentary below.

Table A1: Major targeting metrics

Mean Squared Error (MSE): the average squared difference between observed and predicted household consumption. MSE ranges over positive values, denominated in the units of the consumption measure, and a smaller value signifies a more accurate predicted consumption level.

Coefficient of determination (R2): the proportion of variation in household consumption that is captured by the predictive model. A value of zero implies the lack of a linear relationship, whereas a value of one implies perfect correlation between predictor and predictand.

Exclusion Error Rate (EER): the proportion of households with consumption below the eligibility threshold who are incorrectly predicted as being above it. EER ranges between zero and one, and a smaller value signifies a more accurate predicted eligibility status.

Inclusion Error Rate (IER): the proportion of households with consumption above the eligibility threshold who are incorrectly predicted as being below it. IER also ranges between zero and one, and a smaller value signifies a more accurate predicted eligibility status.
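A minimal illustration of how the four metrics can be computed for a rank-based eligibility assignment at a fixed coverage rate is given below (the function and variable names and the toy data are ours, not the paper's code).

# Toy computation of the targeting metrics in Table A1 at a fixed coverage rate.
import numpy as np

def targeting_metrics(y_true, y_pred, coverage):
    """EER and IER from rank-based eligibility at a fixed coverage rate, plus MSE and R2."""
    cutoff_true = np.quantile(y_true, coverage)
    cutoff_pred = np.quantile(y_pred, coverage)
    eligible_true = y_true <= cutoff_true          # intended beneficiaries
    eligible_pred = y_pred <= cutoff_pred          # households assigned to the programme
    eer = np.mean(eligible_true & ~eligible_pred) / np.mean(eligible_true)
    ier = np.mean(~eligible_true & eligible_pred) / np.mean(~eligible_true)
    mse = np.mean((y_true - y_pred) ** 2)
    r2 = 1.0 - mse / np.var(y_true)
    return {"EER": eer, "IER": ier, "MSE": mse, "R2": r2}

rng = np.random.default_rng(0)
y = rng.normal(size=5000)                          # (log) consumption
y_hat = y + rng.normal(scale=0.6, size=5000)       # imperfect consumption predictions
print(targeting_metrics(y, y_hat, coverage=0.4))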
The EER for each coverage level, shown in Table A2 in the same format as the main results in Table 2, highlights that lower coverage rates result in a higher proportion of exclusion among intended beneficiaries.

Figure A1: Questionnaire length vs exclusion error rate by method for various programme coverage rates (panels (a)–(e): coverage of 10%, 20%, 30%, 40% and 50%)

At first sight, an EER of 50% may suggest that the eligibility assignment method is arbitrary, but random assignment would only capture 10% of beneficiaries and result in an EER of ca. 90%. Where household-level targeting is the policy preference for small coverage programmes, one can therefore argue that PMT, TrESt and other methods provide considerable advantages even if they only capture half of the intended beneficiaries. In terms of the number of questions needed to match PMT accuracy, around ten optional questionnaire items – and thus less than half the PMT outcome of around 23 – are needed for TrESt across coverage levels. For the Indonesia context, a questionnaire of this length would thus appear to provide a versatile method for most programmes.

Table A3 shows IER rates, a metric often considered important for political economy reasons. When the IER is defined with the number of beneficiaries in the denominator, e.g. as in Brown et al. [2018], then the fixed coverage rate used here would result in IER = EER. By using the number of non-beneficiaries instead, to align with local policy practice, the inclusion error rate becomes IER = EER * coverage rate / (1 - coverage rate); EER and IER are then only equal in the case of 50% coverage. Given the close relationship between EER and IER in the fixed coverage regime considered here, the outcomes are qualitatively similar. Because the denominator grows as the coverage level declines, the IER varies in the opposite direction to the EER.

The coefficient of determination, or R2, is a continuous targeting metric that is independent of the coverage level for methods such as the standard PMT and truncation. For early stopping and TrESt, the coverage level affects the algorithm's stopping criterion, and thus also feeds into estimation results. Table A4 shows that the point estimates of early stopping and TrESt are less precise than those of the PMT and truncation, which optimize for these point estimates. Although the classification-based EER and IER outcomes match the PMT at around ten questions, the R2 at this point is considerably lower. However, this relative underperformance is of no concern if accurate eligibility is the ultimate objective. In fact, the lower point accuracy is a design feature, as the algorithms stop trying to pinpoint the consumption level as soon as they identify a sufficient estimated difference to the eligibility threshold. As such, inferior continuous targeting metrics are unproblematic as long as there is no separate need for accurate consumption estimates.

The same insights hold for the MSE results shown in Table A5 as for the previous table (not least because, like IER and EER, the two metrics are mathematically related). The PMT and truncation display the same MSE across coverage levels, and truncation achieves a similar result with ca. 10 versus 22.8 questions. Early stopping and TrESt drop off due to their mechanics; they can only match the PMT's MSE level at a similar number of questions to the PMT itself, not at the shorter lengths that truncation achieves. As such, a truncation approach may be preferable in settings where survey costs should be minimized, but accurate consumption estimates are required for uses outside the programme's eligibility determination mechanism.
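The IER formula quoted above follows from a simple counting argument (our notation): because the number of programme slots is fixed by the coverage rate, every falsely excluded beneficiary is offset by exactly one falsely included non-beneficiary, so the two error counts coincide and only the denominators differ:

\[
\mathrm{EER} = \frac{F}{cN}, \qquad \mathrm{IER} = \frac{F}{(1-c)N} \quad\Longrightarrow\quad \mathrm{IER} = \mathrm{EER}\cdot\frac{c}{1-c},
\]

where N is the number of households, c the coverage rate, and F the common number of misclassified households on either side of the threshold.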
Table A2: Exclusion error % by model and number of questions for various coverage rates (approx. no. of questions in brackets)

Questions   PMT            Truncated      Early stop     TrESt

10% coverage
37.0        –              50.01 (37.0)   –              –
30.0        –              50.08 (30.0)   –              –
22.8        50.16 (22.8)   50.23 (23.0)   –              –
20.0        –              50.30 (20.0)   50.01 (20.5)   –
15.0        –              50.60 (15.0)   –              50.09 (12.1)
10.0        –              51.70 (10.0)   50.15 (10.5)   50.21 (10.3)
5.0         –              54.09 (5.0)    51.85 (5.1)    50.62 (4.9)
0.0         –              63.82 (0.0)    63.08 (0.8)    63.34 (0.1)

20% coverage
37.0        –              39.56 (37.0)   –              –
30.0        –              39.76 (30.0)   –              –
22.8        40.05 (22.8)   39.76 (23.0)   39.56 (23.0)   –
20.0        –              39.87 (20.0)   39.56 (19.8)   39.82 (19.1)
15.0        –              40.16 (15.0)   39.75 (15.3)   39.70 (14.8)
10.0        –              41.07 (10.0)   40.86 (9.8)    39.89 (10.0)
5.0         –              43.35 (5.0)    45.88 (5.2)    40.86 (5.0)
0.0         –              53.99 (0.0)    –              53.79 (0.1)

30% coverage
37.0        –              32.03 (37.0)   –              –
30.0        –              32.03 (30.0)   32.03 (29.7)   –
22.8        32.34 (22.8)   32.12 (23.0)   32.06 (23.2)   –
20.0        –              32.17 (20.0)   32.07 (20.2)   –
15.0        –              32.31 (15.0)   32.37 (14.9)   32.14 (15.0)
10.0        –              33.24 (10.0)   33.36 (10.2)   32.34 (9.8)
5.0         –              35.11 (5.0)    37.33 (4.7)    33.59 (5.0)
0.0         –              45.10 (0.0)    –              44.97 (0.1)

40% coverage
37.0        –              25.99 (37.0)   –              –
30.0        –              26.00 (30.0)   25.99 (29.5)   26.09 (27.2)
22.8        26.44 (22.8)   26.09 (23.0)   26.01 (23.2)   26.08 (22.6)
20.0        –              26.09 (20.0)   26.05 (20.3)   26.10 (20.1)
15.0        –              26.28 (15.0)   26.44 (14.7)   26.19 (15.0)
10.0        –              27.02 (10.0)   27.07 (10.3)   26.35 (10.1)
5.0         –              28.55 (5.0)    28.53 (5.6)    27.41 (5.0)
0.0         –              37.79 (0.0)    –              35.19 (0.5)

50% coverage
37.0        –              20.85 (37.0)   –              –
30.0        –              20.86 (30.0)   20.85 (30.5)   20.91 (30.2)
22.8        21.17 (22.8)   20.91 (23.0)   20.90 (22.5)   20.96 (22.7)
20.0        –              20.95 (20.0)   20.95 (20.3)   20.98 (19.8)
15.0        –              21.05 (15.0)   21.21 (15.4)   21.00 (15.0)
10.0        –              21.69 (10.0)   22.09 (9.8)    21.17 (9.9)
5.0         –              22.89 (5.0)    23.77 (5.2)    22.16 (5.0)
0.0         –              31.19 (0.0)    –              28.99 (0.6)

Table A3: Inclusion error % by model and number of questions for various coverage rates (approx. no. of questions in brackets)
Questions   PMT            Truncated      Early stop     TrESt

10% coverage
37.0        –              5.56 (37.0)    –              –
30.0        –              5.56 (30.0)    –              –
22.8        5.57 (22.8)    5.58 (23.0)    –              –
20.0        –              5.59 (20.0)    5.56 (20.5)    –
15.0        –              5.62 (15.0)    –              5.57 (12.1)
10.0        –              5.74 (10.0)    5.57 (10.5)    5.58 (10.3)
5.0         –              6.01 (5.0)     5.76 (5.1)     5.62 (4.9)
0.0         –              7.09 (0.0)     7.01 (0.8)     7.04 (0.1)

20% coverage
37.0        –              9.89 (37.0)    –              –
30.0        –              9.94 (30.0)    –              –
22.8        10.01 (22.8)   9.94 (23.0)    9.89 (23.0)    –
20.0        –              9.97 (20.0)    9.89 (19.8)    9.96 (19.1)
15.0        –              10.04 (15.0)   9.94 (15.3)    9.92 (14.8)
10.0        –              10.27 (10.0)   10.21 (9.8)    9.97 (10.0)
5.0         –              10.84 (5.0)    11.47 (5.2)    10.21 (5.0)
0.0         –              13.50 (0.0)    –              13.45 (0.1)

30% coverage
37.0        –              13.73 (37.0)   –              –
30.0        –              13.73 (30.0)   13.73 (29.7)   –
22.8        13.86 (22.8)   13.77 (23.0)   13.74 (23.2)   –
20.0        –              13.79 (20.0)   13.74 (20.2)   –
15.0        –              13.85 (15.0)   13.87 (14.9)   13.77 (15.0)
10.0        –              14.24 (10.0)   14.30 (10.2)   13.86 (9.8)
5.0         –              15.05 (5.0)    16.00 (4.7)    14.40 (5.0)
0.0         –              19.33 (0.0)    –              19.27 (0.1)

40% coverage
37.0        –              17.33 (37.0)   –              –
30.0        –              17.33 (30.0)   17.33 (29.5)   17.39 (27.2)
22.8        17.63 (22.8)   17.39 (23.0)   17.34 (23.2)   17.39 (22.6)
20.0        –              17.39 (20.0)   17.37 (20.3)   17.40 (20.1)
15.0        –              17.52 (15.0)   17.63 (14.7)   17.46 (15.0)
10.0        –              18.01 (10.0)   18.04 (10.3)   17.57 (10.1)
5.0         –              19.03 (5.0)    19.02 (5.6)    18.28 (5.0)
0.0         –              25.19 (0.0)    –              23.46 (0.5)

50% coverage
37.0        –              20.85 (37.0)   –              –
30.0        –              20.86 (30.0)   20.85 (30.5)   20.91 (30.2)
22.8        21.17 (22.8)   20.91 (23.0)   20.90 (22.5)   20.96 (22.7)
20.0        –              20.95 (20.0)   20.95 (20.3)   20.98 (19.8)
15.0        –              21.05 (15.0)   21.21 (15.4)   21.00 (15.0)
10.0        –              21.69 (10.0)   22.09 (9.8)    21.17 (9.9)
5.0         –              22.89 (5.0)    23.77 (5.2)    22.16 (5.0)
0.0         –              31.19 (0.0)    –              28.99 (0.6)

Table A4: Coefficient of determination (R2) by model and number of questions for various coverage rates (approx. no. of questions in brackets)

Questions   PMT            Truncated      Early stop     TrESt

10% coverage
37.0        –              0.63 (37.0)    –              –
30.0        –              0.63 (30.0)    –              –
22.8        0.62 (22.8)    0.63 (23.0)    –              –
20.0        –              0.63 (20.0)    0.55 (20.5)    –
15.0        –              0.63 (15.0)    –              0.46 (12.1)
10.0        –              0.61 (10.0)    0.44 (10.5)    0.46 (10.3)
5.0         –              0.57 (5.0)     0.37 (5.1)     0.43 (4.9)
0.0         –              0.31 (0.0)     0.32 (0.8)     0.31 (0.1)

20% coverage
37.0        –              0.63 (37.0)    –              –
30.0        –              0.63 (30.0)    –              –
22.8        0.62 (22.8)    0.63 (23.0)    0.56 (23.0)    –
20.0        –              0.63 (20.0)    0.53 (19.8)    0.55 (19.1)
15.0        –              0.63 (15.0)    0.49 (15.3)    0.51 (14.8)
10.0        –              0.61 (10.0)    0.43 (9.8)     0.51 (10.0)
5.0         –              0.57 (5.0)     0.38 (5.2)     0.46 (5.0)
0.0         –              0.31 (0.0)     –              0.31 (0.1)

30% coverage
37.0        –              0.63 (37.0)    –              –
30.0        –              0.63 (30.0)    0.61 (29.7)    –
22.8        0.62 (22.8)    0.63 (23.0)    0.56 (23.2)    –
20.0        –              0.63 (20.0)    0.54 (20.2)    –
15.0        –              0.63 (15.0)    0.49 (14.9)    0.55 (15.0)
10.0        –              0.61 (10.0)    0.45 (10.2)    0.55 (9.8)
5.0         –              0.57 (5.0)     0.40 (4.7)     0.50 (5.0)
0.0         –              0.31 (0.0)     –              0.31 (0.1)

40% coverage
37.0        –              0.63 (37.0)    –              –
30.0        –              0.63 (30.0)    0.61 (29.5)    0.61 (27.2)
22.8        0.62 (22.8)    0.63 (23.0)    0.57 (23.2)    0.60 (22.6)
20.0        –              0.63 (20.0)    0.55 (20.3)    0.58 (20.1)
15.0        –              0.63 (15.0)    0.51 (14.7)    0.56 (15.0)
10.0        –              0.61 (10.0)    0.48 (10.3)    0.55 (10.1)
5.0         –              0.57 (5.0)     0.44 (5.6)     0.51 (5.0)
0.0         –              0.31 (0.0)     –              0.38 (0.5)

50% coverage
37.0        –              0.63 (37.0)    –              –
30.0        –              0.63 (30.0)    0.61 (30.5)    0.62 (30.2)
22.8        0.62 (22.8)    0.63 (23.0)    0.58 (22.5)    0.60 (22.7)
20.0        –              0.63 (20.0)    0.57 (20.3)    0.60 (19.8)
15.0        –              0.63 (15.0)    0.54 (15.4)    0.60 (15.0)
10.0        –              0.61 (10.0)    0.49 (9.8)     0.56 (9.9)
5.0         –              0.57 (5.0)     0.45 (5.2)     0.51 (5.0)
0.0         –              0.31 (0.0)     –              0.39 (0.6)
Table A5: Mean squared error by model and number of questions for various coverage rates (approx. no. of questions in brackets)

Questions   PMT            Truncated      Early stop     TrESt

10% coverage
37.0        –              0.16 (37.0)    –              –
30.0        –              0.16 (30.0)    –              –
22.8        0.17 (22.8)    0.16 (23.0)    –              –
20.0        –              0.16 (20.0)    0.19 (20.5)    –
15.0        –              0.16 (15.0)    –              0.23 (12.1)
10.0        –              0.17 (10.0)    0.24 (10.5)    0.23 (10.3)
5.0         –              0.18 (5.0)     0.27 (5.1)     0.25 (4.9)
0.0         –              0.30 (0.0)     0.30 (0.8)     0.30 (0.1)

20% coverage
37.0        –              0.16 (37.0)    –              –
30.0        –              0.16 (30.0)    –              –
22.8        0.17 (22.8)    0.16 (23.0)    0.19 (23.0)    –
20.0        –              0.16 (20.0)    0.20 (19.8)    0.19 (19.1)
15.0        –              0.16 (15.0)    0.22 (15.3)    0.21 (14.8)
10.0        –              0.17 (10.0)    0.25 (9.8)     0.21 (10.0)
5.0         –              0.18 (5.0)     0.27 (5.2)     0.24 (5.0)
0.0         –              0.30 (0.0)     –              0.30 (0.1)

30% coverage
37.0        –              0.16 (37.0)    –              –
30.0        –              0.16 (30.0)    0.17 (29.7)    –
22.8        0.17 (22.8)    0.16 (23.0)    0.19 (23.2)    –
20.0        –              0.16 (20.0)    0.20 (20.2)    –
15.0        –              0.16 (15.0)    0.22 (14.9)    0.19 (15.0)
10.0        –              0.17 (10.0)    0.24 (10.2)    0.20 (9.8)
5.0         –              0.18 (5.0)     0.26 (4.7)     0.22 (5.0)
0.0         –              0.30 (0.0)     –              0.30 (0.1)

40% coverage
37.0        –              0.16 (37.0)    –              –
30.0        –              0.16 (30.0)    0.17 (29.5)    0.17 (27.2)
22.8        0.17 (22.8)    0.16 (23.0)    0.19 (23.2)    0.17 (22.6)
20.0        –              0.16 (20.0)    0.20 (20.3)    0.18 (20.1)
15.0        –              0.16 (15.0)    0.21 (14.7)    0.19 (15.0)
10.0        –              0.17 (10.0)    0.23 (10.3)    0.19 (10.1)
5.0         –              0.18 (5.0)     0.24 (5.6)     0.21 (5.0)
0.0         –              0.30 (0.0)     –              0.27 (0.5)

50% coverage
37.0        –              0.16 (37.0)    –              –
30.0        –              0.16 (30.0)    0.17 (30.5)    0.17 (30.2)
22.8        0.17 (22.8)    0.16 (23.0)    0.18 (22.5)    0.17 (22.7)
20.0        –              0.16 (20.0)    0.19 (20.3)    0.18 (19.8)
15.0        –              0.16 (15.0)    0.20 (15.4)    0.17 (15.0)
10.0        –              0.17 (10.0)    0.22 (9.8)     0.19 (9.9)
5.0         –              0.18 (5.0)     0.24 (5.2)     0.21 (5.0)
0.0         –              0.30 (0.0)     –              0.27 (0.6)

B Example analysis of group-level outcome differences

Changes in policy implementation, such as the use of a new algorithm, may have distributional effects in the sense of systematic outcome differences across beneficiary groups. Most commonly, groups are defined by demographic or ethnic characteristics, but any economically vulnerable minority would be a useful unit of analysis. The machine learning literature refers to group-based outcome issues as fairness (see Barocas et al. [2021] for a textbook-style overview), with various definitions of what constitutes a fair outcome at group level. While a comprehensive assessment of the fairness effects of PMT vs early stopping is beyond the scope of this paper, this annex provides a simple example of the kind of analysis that could be conducted to ensure equitable policy outcomes across the population.

We select two-person pensioner households as an example of an economically vulnerable group. The histogram in Figure B1 shows the EER distribution for this group across districts, both for the PMT (red) and the early stopping (blue) algorithm. The purple area is the overlap between the group's outcomes for the two estimators, while the dotted lines show the respective mean EER rates. The means lie close together, at 25.6% for the PMT and 25.8% for early stopping, and the predominance of the purple area also implies that there is only a marginal increase in misclassification risk for this group when the early stopping algorithm is used instead of the PMT baseline.
There is slightly more dispersion for early stopping, evident in the somewhat higher frequency of districts in which either all or none of the eligible pensioner households are classified correctly. Apart from this, the overall pattern is as similar as can be expected for a statistical process subject to randomness; we can conclude that the early stopping algorithm does not produce disparate outcomes for this group.

Figure B1: EER in a 40% coverage programme for two-person pension-age households, early stopping vs PMT models
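A minimal sketch of how such a district-level group comparison could be reproduced is shown below, using toy data; the group definition, column names and error rates are placeholders rather than the paper's data.

# Sketch of the group-level check in Appendix B: district-level EER for one
# vulnerable group under two estimators (toy data; placeholders throughout).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 20_000
df = pd.DataFrame({
    "district": rng.integers(0, 100, n),
    "pensioner_2p": rng.random(n) < 0.05,          # two-person pensioner households
    "eligible_true": rng.random(n) < 0.4,          # intended beneficiaries, 40% coverage
})
# Toy eligibility predictions from the two estimators, with similar accuracy.
for col in ("eligible_pmt", "eligible_early_stop"):
    correct = rng.random(n) < 0.74
    df[col] = np.where(correct, df["eligible_true"], ~df["eligible_true"])

def district_eer(frame, pred_col):
    """EER among the group's truly eligible households, computed per district."""
    eligible = frame[frame["eligible_true"]]
    return eligible.groupby("district")[pred_col].apply(lambda s: 1.0 - s.mean())

group = df[df["pensioner_2p"]]
eer_pmt = district_eer(group, "eligible_pmt")
eer_es = district_eer(group, "eligible_early_stop")
print(eer_pmt.mean(), eer_es.mean())   # compare the two distributions, e.g. as overlaid histograms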