What Can We (Machine) Learn About Welfare Dynamics from Cross-Sectional Data?

This paper implements a machine learning approach to estimate intra-generational economic mobility using cross-sectional data. A Least Absolute Shrinkage and Selection Operator (Lasso) procedure is applied to explore poverty dynamics and household-level welfare growth in the absence of panel data sets that follow individuals over time. The method is validated by sampling repeated cross-sections of actual panel data from Peru. In general, the approach performs well at estimating intra-generational poverty transitions; most of the mobility estimates fall within the 95 percent confidence intervals of poverty mobility from the actual panel data. The validation also confirms that the Lasso regularization procedure performs well at estimating household-level welfare growth between two years. Overall, the results are sufficiently encouraging to estimate economic mobility in settings where panel data are not available or, if they are, to improve panel data when they suffer from serious non-random attrition problems.


Policy Research Working Paper 8545
This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/research. The authors may be contacted at llucchetti@worldbank.org.

Introduction
In recent years, there has been a considerable increase in the number of countries that have developed the necessary tools to measure poverty. In addition, a large body of research has proposed standardized methods to compare poverty across countries, as well as to monitor poverty evolution at a regional and global level (Ravallion, Datt, and van de Walle 1991; Chen and Ravallion 2001; Ravallion, Chen, and Sangraula 2009; Jolliffe and Prydz 2016; Ferreira et al. 2016; Castaneda et al., forthcoming). The rapid expansion of household surveys that are collected at frequent intervals and are comparable over time and across countries has facilitated poverty monitoring in the developing world; coverage increased from 13 countries in the 1990s to over 60 countries in 2011 (Serajuddin et al. 2015). However, most of the available micro data are cross-sectional: they do not track individuals and households over time and therefore only provide aggregate poverty trends.
Panel datasets that follow individuals over several periods of time are rarely available, which limits the understanding of the underlying factors behind movements out of poverty, the dynamics into poverty, and the duration of poverty experienced by a group of individuals. This paper introduces a supervised machine learning method to estimate intra-generational economic mobility using cross-sectional data (footnote 1). The method estimates parameters in the first round of cross-sectional data by means of the Lasso regularization process (Tibshirani 1996). A cross-validation method is used to evaluate the out-of-sample predictive performance of the model in the first round of data. These estimated parameters are then used to predict a point estimate of the unobserved first-round income for all households surveyed in the second round and to estimate intra-generational poverty transitions in the absence of panel data. This approach is validated by comparing estimates from cross-sectional data with those from actual panel data from Peru.
A large body of research on the subject has emerged in recent years. "Synthetic Panels", developed by Dang et al. (2014), is the most recent one (footnote 2). The authors estimated a (log) income model in both the first and second rounds of cross-sectional data, including time-invariant covariates and retrospective regressors. Parameters estimated in the first round are then used to predict the unobserved income in the first round for all households interviewed in the second round. Depending on the assumptions introduced with respect to the correlation between the error terms in the underlying regressions in both rounds, this "non-parametric" approach generates upper and lower bounds of poverty mobility using cross-sectional data. The methodology was validated in Chile, Nicaragua, and Peru by Cruces et al. (2015), while Ferreira et al. (2012) predicted intra-generational poverty mobility in 18 countries in Latin America and the Caribbean (LAC) by implementing the lower bound estimates with harmonized cross-sectional micro data.
Footnote 1. Mullainathan and Spiess (2017) present a detailed description of the use of machine learning methods in economics. Supervised machine learning consists of producing good predictions of a variable y from the values of x, as opposed to the classical econometric problem of obtaining good estimates of parameters that describe the relation between both variables. Supervised machine learning refers to those situations where a value of y is observed for each value of x. Conversely, under unsupervised machine learning, a value of y is not observed for each value of x.
Footnote 2. The Synthetic panel method builds on the poverty mapping technique developed by Elbers, Lanjouw, and Lanjouw
By assuming normality of the error terms in the underlying regressions and by using the age-cohort correlation of residuals from cross-sections, Dang and Lanjouw (2013) produced a point estimate of intra-generational poverty mobility, as opposed to upper and lower bound estimates. This "parametric" method was validated by the authors using panel data from five countries. The method was applied by Dang and Lanjouw (forthcoming) to study poverty dynamics in India, by Dang and Dabalen (forthcoming) to analyze whether growth has been pro-poor in 21 countries in Africa, and by Vakis, Rigolini, and Lucchetti (2016) to analyze chronic poverty in 17 LAC countries for which harmonized cross-sectional micro data exist.
Lucchetti (2017) developed a "non-parametric" point estimate of the unobserved household income in the first round for all households surveyed in the second round of cross-sectional data. To this end, the author calculates a weighted average of the residuals obtained in the upper and lower bound estimates. This approach is validated using actual panel data from Chile, Nicaragua, and Peru, and applied in 17 LAC countries for which harmonized micro data are available. This non-parametric point estimate requires an unknown underlying weight γ when computing the weighted average of lower and upper bound residuals. The author introduces an ad hoc assumption by weighting lower and upper bound estimates equally (i.e., setting γ = 0.5) and performs a sensitivity test of results to changes in the value of γ.
The machine learning approach introduced in this paper presents several strengths and uses less restrictive assumptions than similar studies previously developed. First, the method does not use estimated residuals from regressions. Therefore, no normal distribution of error terms in the underlying income regressions needs to be assumed (footnote 4). Second, this approach does not introduce any arbitrary underlying weight as in Lucchetti (2017) and it does not require the estimation of the age-cohort correlation of residuals from cross-sections as in Dang and Lanjouw (2013). Third, unlike Dang et al. (2014), this machine learning approach also predicts point estimates of income mobility, as opposed to just predicting probabilities of poverty transitions.
This paper contributes to the growing empirical literature on the use of machine learning to predict economic well-being. Engstrom et al. (2017) use regularization processes together with satellite images to estimate poverty at a high level of geographical disaggregation in Sri Lanka. Babenko et al. (2017) train Convolutional Neural Networks and use satellite images to also estimate the spatial distribution of poverty in Mexico. Afzal et al. (2015) test the accuracy of poverty estimations using machine learning methods, also combined with satellite data, in Pakistan and Sri Lanka. Finally, McBride and Nichols (2016) focus on machine learning techniques to improve targeting tools to identify potential program beneficiaries.
Footnote 4. The assumption of normality of error terms is rejected in Vietnam and Indonesia by Dang et al. (2014).
Results in this paper reveal that the Lasso regularization process performs well at predicting intra-generational poverty transitions in the context of the Peruvian data. Most of the estimates fall within the 95 percent confidence intervals of the joint and conditional probabilities of poverty mobility from the true panel data. The paper also finds that the method does well at predicting household-level income growth, and not just poverty transitions, between the two rounds of cross-sectional data. The analysis reveals that these predictions can be further improved by randomly drawing observed incomes from the distribution in round 1 and allocating them to each household surveyed in round 2 based on its position in the distribution of predicted income that results from the Lasso regularization approach described in this paper.
The next section summarizes all the Synthetic panel approaches, as well as the machine learning method proposed in this paper. Section 3 presents the main data used. Section 4 discusses the validation results. Finally, Section 5 concludes.

Non-parametric Synthetic panels
Assume two rounds of cross-sectional data. Let $y_{it}$ denote the log per capita income of household i in moment t, $x_{it}$ a vector of household characteristics for household i in round t, and z the poverty line. Characteristics included in $x_{it}$ are variables whose first-round value can be inferred for all households surveyed in the second round of data. These characteristics include: (i) time-invariant variables such as gender of the head of the household if his/her identity remains constant between the rounds of data; (ii) deterministic variables such as age; and (iii) retrospective variables such as whether a household surveyed in the second round had an asset in the first round (Cruces et al. 2015; Dang and Lanjouw 2018). The relationship between income and this set of characteristics can be expressed as
$$y_{it} = \beta_t' x_{it} + \varepsilon_{it} \qquad (1)$$
where $\varepsilon_{it}$ is an error term and $x_{it}$ is a vector of K regressors whose first element is equal to one so that the first element of $\beta_t$ is the intercept of the model.
We introduce superscripts to refer to the round in which a household is surveyed. As such, the objective is to estimate, for a household i interviewed in round 2, the change in income between the two rounds of data: $\Delta y_i^2 = y_{i2}^2 - y_{i1}^2$, where $y_{i1}^2$ and $y_{i2}^2$ are the first and second round incomes of household i surveyed in round 2, respectively. Similarly, we can also estimate all poverty dynamics: the joint probability of a household i surveyed in round 2 of escaping poverty in round 2 ($\Pr(y_{i1}^2 < z \text{ and } y_{i2}^2 > z)$), remaining poor ($\Pr(y_{i1}^2 < z \text{ and } y_{i2}^2 < z)$), becoming poor ($\Pr(y_{i1}^2 > z \text{ and } y_{i2}^2 < z)$), and remaining non-poor ($\Pr(y_{i1}^2 > z \text{ and } y_{i2}^2 > z)$) (footnote 6). This can be easily done with panel data, since all households are interviewed in both rounds (i.e., $y_{i1}^2$ is known for every household i interviewed in round 2). However, these datasets are rarely available and costly to collect.
Alternatively, Synthetic panels allow predicting the first-round "unobserved" incomes of households surveyed in the second round by multiplying their time-invariant characteristics by the first-round Ordinary Least Squares (OLS) estimates of parameters $\hat{\beta}_1^{OLS}$ that solve the optimization problem
$$\hat{\beta}_1^{OLS} = \arg\min_{\beta_1} RSS = \arg\min_{\beta_1} \sum_{i=1}^{N_1} \left( y_{i1}^1 - \beta_1' x_{i1}^1 \right)^2 \qquad (2)$$
where $y_{i1}^1$ is the first-round log income of household i surveyed in round 1, $N_1$ is the number of observations in round 1, and RSS refers to the residual sum of squares. The three non-parametric approaches differ in the treatment given to the correlation between the error terms in the first and second rounds of cross-sectional data, which is likely to be non-negative according to Dang et al. (2014).
Footnote 6. For simplicity, I will only focus on the probability of escaping poverty in this section.
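The four joint probabilities and the income change defined above reduce to simple sample averages once both incomes are available for each household, whether observed in a panel or predicted. The following is a minimal sketch under stated assumptions: the function name `poverty_transitions` and its return layout are hypothetical, the poverty line `z` is assumed to be expressed on the same (log) scale as incomes and to be equal in both rounds, and results are unweighted.

```python
import numpy as np

def poverty_transitions(y1, y2, z):
    """Joint poverty transition probabilities and income changes.

    y1, y2 : first- and second-round log incomes of the same households
             (y1 may be observed or predicted).
    z      : poverty line on the same scale (assumed constant over rounds).
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    poor1, poor2 = y1 < z, y2 < z  # ties at z are treated as non-poor

    joint = {
        "escaped": float(np.mean(poor1 & ~poor2)),
        "remained_poor": float(np.mean(poor1 & poor2)),
        "became_poor": float(np.mean(~poor1 & poor2)),
        "remained_non_poor": float(np.mean(~poor1 & ~poor2)),
    }
    # Conditional probability of escaping poverty given poor in round 1
    share_poor1 = float(poor1.mean())
    cond_escape = joint["escaped"] / share_poor1 if share_poor1 > 0 else float("nan")
    return joint, cond_escape, y2 - y1
```

The same function can be applied to any of the predicted first-round incomes discussed in this section, so that estimates from cross-sections and from actual panel data are computed in exactly the same way.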
Upper bound estimates assume no correlation between the first and second round error terms. The authors propose to estimate first-round incomes of those households interviewed in the second round of data by drawing randomly with replacement from the empirical distribution of first-round estimated residuals (denoted as $\tilde{\varepsilon}_{i1}^2$). In this case, the upper bound prediction of the first-round incomes for households surveyed in the second round is
$$\hat{y}_{i1}^{2U} = \hat{y}_{i1}^2 + \tilde{\varepsilon}_{i1}^2 \qquad (3)$$
where $\hat{y}_{i1}^2$ is the product between time-invariant characteristics and the first-round OLS estimates of parameters: $\hat{y}_{i1}^2 = \hat{\beta}_1^{OLS\prime} x_{i1}^2$. Once incomes are predicted, we can then calculate the joint probability of a household i surveyed in round 2 of being poor in round 1 and escaping poverty in round 2, $\Pr(\hat{y}_{i1}^{2U} < z \text{ and } y_{i2}^2 > z)$, as well as the income change between both periods, $\Delta y_i^2 = y_{i2}^2 - \hat{y}_{i1}^{2U}$. Since predictions arise from a random draw from the empirical distribution of residuals, the method needs to be repeated R times and results averaged over these R replications.
Lower bound estimates, on the other hand, assume perfect positive correlation between the first and second round error terms. The authors propose to estimate first-round incomes of those households interviewed in the second round of data by using the estimated residuals from the second-round regression (denoted as $\hat{\varepsilon}_{i2}^2$), scaled by the ratio of the estimated standard errors. The lower bound predictions are
$$\hat{y}_{i1}^{2L} = \hat{y}_{i1}^2 + \frac{\hat{\sigma}_{\varepsilon_1}}{\hat{\sigma}_{\varepsilon_2}} \hat{\varepsilon}_{i2}^2 \qquad (4)$$
where $\hat{\sigma}_{\varepsilon_1}$ and $\hat{\sigma}_{\varepsilon_2}$ are estimated standard errors for the two error terms $\varepsilon_1$ and $\varepsilon_2$, respectively.
The joint probability of a household i surveyed in round 2 of being poor in round 1 and escaping poverty in round 2 is given by $\Pr(\hat{y}_{i1}^{2L} < z \text{ and } y_{i2}^2 > z)$, while the change in incomes between both periods is $\Delta y_i^2 = y_{i2}^2 - \hat{y}_{i1}^{2L}$. Since the method does not randomly draw from any empirical distribution of residuals, there is no need to repeat the procedure R times.
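The third non-parametric point estimate, proposed by Lucchetti (2017), is an adaptation of the lower and upper bound estimations. The author suggests computing a weighted average of the residuals to get a point estimate of mobility. First-round non-parametric predicted incomes are
$$\hat{y}_{i1}^{2NP} = \hat{y}_{i1}^2 + \gamma \frac{\hat{\sigma}_{\varepsilon_1}}{\hat{\sigma}_{\varepsilon_2}} \hat{\varepsilon}_{i2}^2 + (1 - \gamma)\, \tilde{\varepsilon}_{i1}^2 \qquad (5)$$
where $\tilde{\varepsilon}_{i1}^2$ are the randomly drawn first-round residuals used in the upper bound, $(\hat{\sigma}_{\varepsilon_1} / \hat{\sigma}_{\varepsilon_2})\, \hat{\varepsilon}_{i2}^2$ are the scaled second-round residuals used in the lower bound, and 0 ≤ γ ≤ 1. The joint probability of a household i surveyed in round 2 of being poor in round 1 and escaping poverty in round 2 is given by $\Pr(\hat{y}_{i1}^{2NP} < z \text{ and } y_{i2}^2 > z)$, while the change in incomes between both periods is $\Delta y_i^2 = y_{i2}^2 - \hat{y}_{i1}^{2NP}$. Since upper bound residuals are used, the method needs to be repeated R times (footnote 8). The lower bound estimates can be obtained by setting γ = 1, while the upper bound estimates emerge from setting γ = 0. Based on residual correlations estimated from panel data in the literature, the author sets γ = 0.5 and tests the sensitivity of results to changes in the value of γ.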
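The three non-parametric predictions differ only in the residual term that is added to the first-round OLS prediction, which can be illustrated in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `ols_fit` and `synthetic_panel_predictions` are hypothetical, the design matrices are assumed to include a leading column of ones, and `X2` is assumed to contain characteristics of round-2 households whose first-round values coincide with the values used in the second-round regression (for example, strictly time-invariant variables).

```python
import numpy as np

def ols_fit(X, y):
    """OLS coefficients and residuals (X includes a column of ones)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def synthetic_panel_predictions(X1, y1, X2, y2, gamma=0.5, seed=0):
    """One replication of the upper bound, lower bound, and weighted
    non-parametric predictions of first-round income for round-2 households."""
    rng = np.random.default_rng(seed)
    beta1, e1 = ols_fit(X1, y1)          # first-round model
    _, e2 = ols_fit(X2, y2)              # second-round model
    k = X1.shape[1]

    xb = X2 @ beta1                      # OLS prediction without residual
    # Upper bound: residuals drawn with replacement from round 1
    e_tilde = rng.choice(e1, size=len(y2), replace=True)
    # Lower bound: round-2 residuals scaled by the ratio of standard errors
    scaled_e2 = (e1.std(ddof=k) / e2.std(ddof=k)) * e2

    upper = xb + e_tilde
    lower = xb + scaled_e2
    nonparametric = xb + gamma * scaled_e2 + (1.0 - gamma) * e_tilde
    return upper, lower, nonparametric
```

Because the upper bound and the weighted estimate rely on random draws, in practice the function would be called R times with different seeds and the resulting mobility measures averaged over the replications.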

A parametric Synthetic panel
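Dang and Lanjouw (2013) propose a parametric point estimate of intra-generational poverty mobility. The authors assume a bivariate normal distribution for the error terms with a non-negative correlation coefficient ρ. Thus, a point estimate of the probability of moving out of poverty is
$$\Pr(y_{i1}^2 < z \text{ and } y_{i2}^2 > z) = \Phi_2\left( \frac{z - \hat{\beta}_1' x_{i1}^2}{\hat{\sigma}_{\varepsilon_1}},\; -\frac{z - \hat{\beta}_2' x_{i2}^2}{\hat{\sigma}_{\varepsilon_2}},\; -\hat{\rho} \right) \qquad (6)$$
where $\Phi_2(\cdot)$ is the bivariate standard normal cumulative distribution function and $\hat{\beta}_2$ are the second-round OLS parameter estimates. A parametric lower bound estimate can be obtained by setting ρ = 1, while the upper bound estimate emerges from setting ρ = 0.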
The authors suggest estimating an age-cohort correlation of residuals using cross-sectional data to obtain an estimation of the unknown parameter ρ.
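To make the role of ρ concrete, the probability of moving out of poverty for a single household can be approximated by simulating correlated standard normal errors. This is only a sketch: it replaces the closed-form bivariate normal probability with a Monte Carlo approximation, and the function name `prob_escape_poverty` and all argument names are hypothetical.

```python
import numpy as np

def prob_escape_poverty(xb1, xb2, z, sigma1, sigma2, rho,
                        n_draws=200_000, seed=0):
    """Monte Carlo approximation of Pr(y1 < z and y2 > z) for one household.

    xb1, xb2       : fitted values from the first- and second-round models
    sigma1, sigma2 : standard errors of the two error terms
    rho            : assumed correlation between the error terms (0 <= rho <= 1)
    """
    rng = np.random.default_rng(seed)
    u1 = rng.standard_normal(n_draws)
    v = rng.standard_normal(n_draws)
    # Build a second standard normal error with correlation rho to the first
    u2 = rho * u1 + np.sqrt(1.0 - rho ** 2) * v

    a = (z - xb1) / sigma1   # poor in round 1 when u1 < a
    b = (z - xb2) / sigma2   # non-poor in round 2 when u2 > b
    return float(np.mean((u1 < a) & (u2 > b)))
```

For a household whose fitted incomes sit exactly at the poverty line in both rounds, the approximation returns roughly 0.25 when ρ = 0 and exactly 0 when ρ = 1, which illustrates why ρ = 0 yields the upper bound of mobility and ρ = 1 the lower bound.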

A Machine Learning approach based on the Lasso regularization method
This paper applies a Lasso regularization method to estimate intra-generational poverty mobility and household-level income growth using cross-sectional data. The Lasso procedure is one of the most popular machine learning methods among economists and consists of minimizing a quadratic loss function plus the sum of the absolute values of the coefficients (Mullainathan and Spiess 2017). The paper proposes to estimate parameters in the first round of cross-sectional data by solving the optimization problem
$$\hat{\beta}_1^{LASSO} = \arg\min_{\beta_1} \left\{ \sum_{i=1}^{N_1} \left( y_{i1}^1 - \beta_1' x_{i1}^1 \right)^2 + \lambda \sum_{k=1}^{K} |\beta_{1k}| \right\} \qquad (7)$$
Footnote 8. The author shows that results are robust to the number of repetitions R.
The estimation depends on the value of the "shrinkage" factor λ. Whenever λ → 0, the objective function will become the OLS objective function in (2) and $\hat{\beta}_1^{LASSO} \to \hat{\beta}_1^{OLS}$. The Lasso estimate will deviate from the OLS estimate for positive values of λ. Finally, $\hat{\beta}_1^{LASSO}$ will be shrunk to zero as λ → ∞. Therefore, for values λ > 0, the Lasso is biased towards zero if compared with OLS.
The factor λ is introduced for two reasons. First, the shrinkage penalty $\lambda \sum_{k=1}^{K} |\beta_{1k}|$ in Lasso provides corner solutions, which implies that some coefficients are forced to be zero. Therefore, the Lasso works well for model selection when the number of candidate variables K is large. Second, for appropriate values of λ, the bias introduced is compensated by a reduction of variance.
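The limiting behavior of the Lasso estimate can be checked numerically with a small solver. The following is a minimal sketch, not the estimation routine used in the paper: it minimizes the residual sum of squares plus λ times the sum of absolute coefficients by cyclic coordinate descent, the function names `soft_threshold` and `lasso_cd` are hypothetical, and the intercept is assumed to be left unpenalized (handled by centering), a common convention that the text does not spell out. Columns of `X` are assumed not to be constant.

```python
import numpy as np

def soft_threshold(r, t):
    """Shrink r towards zero by t, returning exactly zero when |r| <= t."""
    return np.sign(r) * np.maximum(np.abs(r) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500, tol=1e-10):
    """Minimise RSS + lam * sum(|beta_k|) over the slope coefficients.

    X excludes the column of ones; the intercept is recovered after
    centering and is not penalised (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_ss = (Xc ** 2).sum(axis=0)           # sum of squares per column
    beta = np.zeros(X.shape[1])

    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(X.shape[1]):
            # Residual that excludes the contribution of regressor j
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            # The penalty enters the first-order condition as lam / 2
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break

    intercept = y_mean - x_mean @ beta
    return intercept, beta
```

With `lam=0` the solver reproduces the OLS coefficients, while a sufficiently large `lam` sets every slope to exactly zero, which is the corner-solution property that makes the Lasso useful for selecting covariates.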
In this paper, the shrinkage factor λ is selected with a 10-fold cross-validation algorithm, which is a method to test the out-of-sample fit of the income model. The algorithm randomly divides the first round of data into 10 equal-sized folds. By leaving one fold out (the test fold), the model is fit in the other 9 folds (the training folds). Once the income model is estimated, it is used to predict incomes in the withheld fold. This is repeated 10 times until all folds have been left out and all observations have a predicted value. The value of λ is selected so that it minimizes the mean squared error (MSE), defined as $\frac{1}{N_1} \sum_{i=1}^{N_1} \left( y_{i1}^1 - \hat{y}_{i1}^1 \right)^2$, where $\hat{y}_{i1}^1$ is the prediction for household i obtained when its fold was withheld.
The Lasso prediction of the first-round incomes for households surveyed in the second round is
$$\hat{y}_{i1}^{2LASSO} = \hat{\beta}_1^{LASSO\prime} x_{i1}^2 \qquad (8)$$
Once incomes in the first round are predicted for every observation in the second round, we can compute the joint probability of a household i surveyed in round 2 of being poor in round 1 and escaping poverty in round 2, $\Pr(\hat{y}_{i1}^{2LASSO} < z \text{ and } y_{i2}^2 > z)$, as well as its income change between both periods, $\Delta y_i^2 = y_{i2}^2 - \hat{y}_{i1}^{2LASSO}$.
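The selection of λ and the subsequent prediction step can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `lasso_fit` is a compact coordinate descent solver with an unpenalized intercept, `cv_select_lambda` is a hypothetical helper, and the grid of candidate values is chosen by the user.

```python
import numpy as np

def lasso_fit(X, y, lam, n_iter=300, tol=1e-8):
    """Compact coordinate descent for RSS + lam * sum(|beta_k|)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_ss = (Xc ** 2).sum(axis=0)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(X.shape[1]):
            rho = Xc[:, j] @ (yc - Xc @ beta) + col_ss[j] * beta[j]
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return y_mean - x_mean @ beta, beta

def cv_select_lambda(X, y, lambdas, n_folds=10, seed=0):
    """Return the candidate lambda with the lowest cross-validated MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mse = []
    for lam in lambdas:
        sq_err = 0.0
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False              # withhold the test fold
            b0, b = lasso_fit(X[train], y[train], lam)
            pred = b0 + X[test_idx] @ b          # out-of-fold prediction
            sq_err += np.sum((y[test_idx] - pred) ** 2)
        mse.append(sq_err / len(y))
    mse = np.array(mse)
    return lambdas[int(np.argmin(mse))], mse
```

After the shrinkage factor is chosen, the model is refit on the full first round, for example `b0, b = lasso_fit(X1, y1, best_lam)`, and first-round incomes of round-2 households are predicted as `b0 + X2 @ b`, where `X1`, `y1`, and `X2` are placeholder names for the first-round regressors, first-round log incomes, and round-2 household characteristics.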
It is important to note that this approach has several advantages with respect to previous methods. First, residuals are not used and therefore no assumption for the distribution of error terms is required. Second, and connected to the previous point, the approach described in this paper does not introduce any arbitrary underlying weight as in the non-parametric point estimate and it does not require the estimation of the age-cohort correlation of residuals from cross-sections as in the parametric approach. Third, unlike the parametric approach, the method obtains household-level income changes and not just probabilities of poverty mobility.

Data, empirical approach, and a second-stage cross-validation process
To validate the approach, this paper uses a panel subsample of the SEDLAC harmonized micro database for Peru (footnote 11). The SEDLAC project consists of more than 400 household surveys in more
Following Cruces et al. (2015), a second stage cross-validation is considered by randomly splitting the panel dataset into two subsamples and treating each subsample as a cross-section. Therefore, the coefficients are estimated in one of these subsamples in the first round of data and applied to the second subsample in the second round. By treating each subsample of the panel dataset as a cross-section, this second stage cross-validation avoids any bias that might arise from using the panel dataset to validate the method.
Footnote 11. See Bourguignon (2015) and Gasparini, Cicowiez, and Escudero (2013) for a description of the SEDLAC data.
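The second stage split can be sketched in a few lines. The helper below is hypothetical and assumes an even random split of households; the text does not state the relative size of the two subsamples.

```python
import numpy as np

def split_panel(n_households, seed=0):
    """Randomly split panel households into two disjoint subsamples.

    Returns two arrays of row indices: the first is treated as the round-1
    cross-section (model estimation), the second as the round-2
    cross-section (prediction and validation).
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_households)
    half = n_households // 2
    return np.sort(perm[:half]), np.sort(perm[half:])
```

The model is estimated with round-1 data on the first subsample only. For the second subsample, round-1 incomes are held back while predictions are produced and are then used solely to compare predicted and actual mobility.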
This paper follows the literature to estimate income mobility by including time-invariant, deterministic, and/or retrospective regressors in the underlying models. However, unlike most of the previous analyses using Synthetic panels, the harmonized data used in this paper allow validating poverty transitions using the same underlying harmonized variables frequently used in many regional studies (e.g., Ferreira et al. 2012; Vakis et al. 2016

Lasso coefficients and poverty rate prediction in the first round
The Lasso approach has at least two advantages over the OLS regression. The first advantage is related to the bias-variance trade-off; the Lasso approach shrinks the coefficients towards zero, introducing a bias that is compensated with a reduction of variance for an optimal value of λ. Second, since the Lasso approach produces corner solutions, it selects a subset of covariates by potentially forcing some coefficients to be zero.
The selection of the optimal shrinkage factor is shown in Figure 1.  Based on the estimated Lasso coefficients, a first step of the intra-generational mobility analysis can be done by comparing actual poverty rates in round 1 with the estimated ones that emerge when applying the machine learning approach suggested in section 2. Table 1

Joint and conditional probabilities of poverty/non-poverty transitions
The main objective of the paper is to estimate the dynamics into and out of poverty experienced by a group of individuals between two periods of time. Table 2

Sub-group joint probabilities
How well does the approach perform in measuring poverty dynamics for subgroups of the total population? Figures 3 and 4

Sub-group income growth
Another relevant question is whether this approach works well at predicting income growth, $\Delta y_i^2 = y_{i2}^2 - \hat{y}_{i1}^{2LASSO}$, for different sub-groups of the population. Figure 5 validates the methodology for estimating household per capita income growth for two groupings of the population defined by: (i) the dynamic poverty transitions and (ii) the quintiles of the income distribution in the second round (i.e., the non-anonymous growth incidence curves, GIC). All estimates from the Lasso approach are compared with the actual income growth from panel data. The figure presents both the point estimate and the 95 percent confidence interval.
Estimates are generally good for both groupings: poverty dynamics and quintiles of the income distribution. With few exceptions, Lasso estimates are close to actual mobility and fall within its 95 percent confidence intervals. This is a relevant result; unlike the parametric Synthetic panel approach developed by Dang and Lanjouw (2013), this figure suggests that the Lasso approach performs well at predicting income growth and not just joint probabilities of transitions into and out of poverty.

A matching framework to improve Lasso predictions
Results in Figure 5 are sufficiently encouraging to predict income growth for different sub-groups of the population between two periods of time. However, some cases can be substantially improved, especially at the two ends of the income distribution. For instance, while incomes increased for those who remained poor between 2010 and 2011, the Lasso approach predicts negative income growth for this group of individuals between the two periods, and the 95 percent confidence intervals do not overlap.
To improve income predictions in round 1, this section introduces a variant of the initial Lasso approach in which first-round observed cross-sectional income data are matched with the first-round Lasso income predictions. To do so, a random draw from the observed empirical income distribution in round 1 is assigned to each household surveyed in round 2. These values are assigned based on the position of the household in the distribution of predicted income that results from the Lasso regularization approach described in this paper. The following 4 steps describe the approach:
[1] From the households surveyed in round 1, take a random draw with replacement of size $N_2$ (the number of observations in round 2) from the empirical distribution of actual log incomes and denote it by $\tilde{y}_1^1$.
[2] Sort the two vectors of log incomes $\tilde{y}_1^1$ and $\hat{y}_1^2$ from the lowest to the highest value:
$$\tilde{y}_{1,1}^1 \le \tilde{y}_{2,1}^1 \le \cdots \le \tilde{y}_{N_2,1}^1 \qquad (9)$$
and
$$\hat{y}_{1,1}^2 \le \hat{y}_{2,1}^2 \le \cdots \le \hat{y}_{N_2,1}^2 \qquad (10)$$
[3] Assign to each household surveyed in round 2 the drawn income that occupies the same position in the sorted vector of draws as the household occupies in the sorted vector of Lasso predictions, and denote the assigned value by $\tilde{y}_{i1}^2$.
[4] The joint probability of a household i surveyed in round 2 of being poor in round 1 and escaping poverty in round 2 is given by $\Pr(\tilde{y}_{i1}^2 < z \text{ and } y_{i2}^2 > z)$, where $\tilde{y}_{i1}^2$ is the first-round log income of household i surveyed in round 2 that results from implementing step [3]. Similarly, the change in incomes between both periods is $\Delta \tilde{y}_i^2 = y_{i2}^2 - \tilde{y}_{i1}^2$.
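The matching steps amount to a rank-preserving assignment of drawn incomes to Lasso predictions. The sketch below illustrates this under stated assumptions: the function name `rank_match` and its arguments are hypothetical, and ties in predicted income are broken by the original order of households.

```python
import numpy as np

def rank_match(y1_observed, y1_pred_round2, seed=0):
    """Assign observed first-round incomes to round-2 households by rank.

    y1_observed    : actual log incomes of households surveyed in round 1
    y1_pred_round2 : Lasso-predicted first-round log incomes of the
                     households surveyed in round 2
    """
    rng = np.random.default_rng(seed)
    y1_observed = np.asarray(y1_observed, dtype=float)
    y1_pred_round2 = np.asarray(y1_pred_round2, dtype=float)
    n2 = len(y1_pred_round2)

    # Draw N2 incomes with replacement and sort them in ascending order
    draw = np.sort(rng.choice(y1_observed, size=n2, replace=True))
    # Ascending ordering of households by predicted income
    order = np.argsort(y1_pred_round2, kind="stable")

    # The household with the k-th lowest prediction gets the k-th lowest draw
    matched = np.empty(n2)
    matched[order] = draw
    return matched
```

The returned vector keeps the ranking of households implied by the Lasso predictions while its values come from the observed first-round income distribution, and it can be used in place of the Lasso prediction when computing transition probabilities and income changes.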
Since ̃1 2 constitutes a random sample from the empirical distribution of first-round actual incomes, this matching framework is expected to outperform the Lasso predictions described in previous sections. Table 4 Table 4 fall within the 95 percent confidence interval of actual mobility from panel data.
However, results improve substantially when comparing changes in household incomes $\Delta \tilde{y}_i^2$. Figure 6 validates this matching framework by estimating household per capita income growth for the same two groupings of the population defined in Figure 5. Results show a marked improvement; except for the fifth quintile, all estimates are close to actual mobility and fall within its 95 percent confidence intervals.

Conclusion
This is the first paper, to the best of my knowledge, that uses a supervised machine learning approach to estimate welfare dynamics in the absence of panel datasets. It proposes to estimate parameters of a log income model in the first round of cross-sectional data using a Lasso process and use those parameters to predict incomes in the first round for all households surveyed in the second round of data. The proposed approach is validated by comparing income dynamics estimated from cross-sectional data with those derived from panel data from Peru. A validation process is implemented in two stages. In a first stage, a 10-fold cross-validation algorithm is used to evaluate the out-of-sample performance of the underlying income models in the first round of data. In a second stage, a cross-validation is implemented by randomly splitting the panel dataset into two subsamples to treat each subsample as a cross-section, which avoids any bias from using actual panel data to validate the method proposed in this paper.
A critical reason for using the approach suggested in this paper is that most of the data used to monitor poverty trends are not longitudinal in the sense that they do not follow individuals or households over time. There has been a rapid expansion in the number of household surveys in recent years, although most of these datasets are cross-sectional in nature. Panel datasets, when available, typically cover short periods of time, which poses serious concerns regarding the validity of policy recommendations that arise from their use in the analysis of long-term poverty dynamics (Ferreira et al. 2012). The proposed approach allows the analysis of poverty dynamics by describing the gross flow of household movements over time, as opposed to the net changes in poverty. This analysis helps to understand, for example, how much income mobility there has been, who has benefited from that mobility, and what have been the factors behind this mobility.
Results in this paper suggest that the method performs well in predicting the joint and conditional probabilities of entering and exiting poverty; most poverty transition estimates using cross sections fall within the 95 percent confidence intervals of mobility from panel data. The method also allows estimating household-level income growth between two periods of time in the absence of longitudinal data.
The machine learning approach introduced in this paper presents several strengths and uses less restrictive assumptions than previously developed Synthetic panel methods. As such, it serves as a promising contribution to guide future research on intra-generational income mobility. For instance, future research could expand the approach to more than two periods and/or two or more poverty lines; and consider other dependent variables (e.g., labor or health as suggested by Dang and Lanjouw 2013). Additional research could also focus on the application of this method to general situations in which two moments in time are considered, for instance, to estimate vulnerability lines based on the population at risk of falling into poverty (Dang and Lanjouw 2016).
Estimates in this paper are computed based on harmonized micro data that allow validations of poverty dynamics using the same variables frequently included in regional and global poverty analysis. The models used in the study include variables that are easy to find in all countries, which ensures the comparability of estimates between countries and over time.
However, if the objective is to study income dynamics in one country-as opposed to many countries or a region as a whole-more predictive power may be achieved by including variables available in that country, but not necessarily in other countries, such as parent's education, place of birth, etc.
This paper suggests using this machine learning approach in the absence of longitudinal data that follow individuals or households over two or more moments in time. However, the approach is not intended to be a substitute for, but rather a complement to, panel data. For instance, the method can be used to combine a small panel data set with mobility estimates using this method on a larger cross-sectional data set (Dang et al. 2014) or to correct for serious non-random attrition in actual panel data sets (Dang and Lanjouw 2013).
Data source: SEDLAC data (CEDLAS and the World Bank). Note: The figure presents results that arise from randomly drawing actual income from round 1 and allocating that income to each household surveyed in round 2 according to their position in the distribution of predicted income that results from the Lasso approach described in this paper (presented as "LASSO" in the figure), as well as estimates using "actual" data. Results are constrained to the panel sample of households whose heads are between 25 and 65 years old. Poor are those individuals with a per capita income lower than $4. Poverty lines and incomes are expressed in 2005 $PPP/day. All results are unweighted.