Policy Research Working Paper 10013

New Algorithm to Estimate Inequality Measures in Cross-Survey Imputation
An Attempt to Correct the Underestimation of Extreme Values

Gianni Betti
Vasco Molini
Lorenzo Mori

Development Economics
Development Data Group
April 2022

Abstract

This paper contributes to the debate on ways to improve the calculation of inequality measures in developing countries experiencing severe budget constraints. Linear regression-based survey-to-survey imputation techniques are most frequently discussed in the literature. These are effective at estimating predictions of poverty indicators but are much less accurate with inequality indicators. To demonstrate this limited accuracy, the first part of the paper discusses several simulations using Moroccan Household Budget Surveys and Labor Force Surveys. The paper proposes a method for overcoming these limitations based on an algorithm that minimizes the sum of the squared difference between a certain number of direct estimates of an index and its empirical version obtained from the predicted values. Indeed, when comparing the estimated results with those directly estimated from the original sample, the bias is negligible. Furthermore, the inequality indices for the years for which there are only model estimates, rather than direct information on expenditures, seem to be consistent with Moroccan economic trends.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at vmolini@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues.
An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

A New Algorithm to Estimate Inequality Measures in Cross-Survey Imputation: An Attempt to Correct the Underestimation of Extreme Values

Gianni Betti* Vasco Molini† Lorenzo Mori*‡

Keywords: Inequality indicators, Bias reduction, Survey-to-Survey Imputation, Moroccan HBS, Moroccan LFS
JEL Classification: I32, C13 and E24

* University of Siena. Department of economics and statistics. Email: gianni.betti@unisi.it
† 50x2030 Initiative. Email: vmolini@worldbank.org
‡ University of Bologna. Department of Statistics P. Fortunati. Email: lorenzo.mori7@unibo.it

1 Introduction

Collecting data to produce reliable and timely estimations of poverty requires elaborate and expensive surveys called Household Budget Surveys (HBSs). Only a few countries are able to collect data annually on income or expenditure to facilitate the estimation of poverty and inequality. Most National Statistical Offices (NSOs) worldwide conduct these surveys only every four to five years. Therefore, producing reliable indices annually for monitoring poverty results remains quite challenging for many countries.

To overcome this challenge, scholars have focused on developing methods to compare welfare indicators over time from surveys that are not readily comparable.
These techniques, broadly known as Survey-to-Survey Imputation Techniques (SSITs), proved successful at predicting comparable poverty indicators, but as this paper argues, were less effective at predicting comparable inequality indicators.

SSITs are a derivative of the poverty map literature (Elbers et al., 2003; Tarozzi and Deaton, 2009; Christiaensen et al., 2012; Mathiassen, 2013). By imputing income onto censuses, poverty maps have been applied to many developing countries’ data to provide geographically disaggregated estimates of poverty. In addition to survey-to-census imputation, there has been a recent trend of survey-to-survey imputation, mapping from surveys with consumption data to those with other outcomes of interest, but lacking standard welfare aggregates (Elbers et al., 2004; Dabalen et al., 2014; Dang et al., 2019).

While this approach is well established for poverty estimates, a cursory overview of the literature shows that little attention is devoted to potential problems in obtaining accurate inequality measures with this method. Using actual data from a census of households in a set of rural Mexican communities, Demombynes and Hoogeveen (2007) show that correlations between the estimated and true welfare at the local level are highest for mean expenditure and poverty measures and lower for inequality measures. Douidich et al. (2016) obtain accurate estimates for quarterly poverty rates using a classic survey-to-survey imputation method that combines HBSs and Labor Force Surveys (LFSs). Using a log-linear regression to impute a completely missing variable (total expenditure) for certain years, they obtain reliable estimates for Head Count Ratio (HCR) with an exogenous poverty line. Krafft et al. (2019) impute consumption from HBSs onto LFSs and obtain information on poverty and inequality.
Resulting measures of consumption, poverty, and inequality are similar across survey pairs, particularly for the data in Jordan and the Arab Republic of Egypt. These estimates are obtained based on the hypothesis that the model is time invariant, that is, the coefficients used to impute from one survey to another are stable over time (e.g., that $\beta_t = \beta$), and that the error term is homoscedastic and normally distributed. However, the concern is that incorrectly assuming normal errors or ignoring heteroscedasticity has the potential to introduce bias when estimating poverty or inequality. Specifically, in the prediction of inequality, the bias can be significant. Dang et al. (2019) provide an alternative method of simulating the final consumption imputation, which also seems extremely accurate when estimating poverty but does not address the estimation of inequality. To analyze poverty trends in Tunisia after 2010, Cuesta and Ibarra (2017) compare cross-survey micro imputation with macro projection techniques based on sectoral GDP, unemployment, and inflation. The SSIT used is an OLS regression, where the dependent variable of the model is the logarithm of annual household consumption per capita. Newhouse et al. (2014) provide evidence, on the other hand, illustrating that Survey-to-Survey Imputation can fail. They demonstrate that minor differences in the sampling scheme, sampling design, or structure of the answers/questions can produce inaccurate SSITs. Corral et al. (2021) show that there are more accurate methods (see also Rao and Molina, 2015) than poverty mapping to obtain estimates for small areas. Given that the SSITs are a derivative of this method, we argue their conclusions can also apply to SSITs.1

Overall, the literature seems to indicate that the SSITs that are valid for predicting poverty can be equally valid for predicting inequality.
By contrast, we argue that SSITs, by construction, tend to underestimate the predicted inequality and that, to obtain accurate estimates, a different methodological approach should be chosen. The assumption of normally distributed residuals and the fact that standard SSITs are based on regression make them more accurate at predicting the central moments of a distribution (or transformations such as the poverty headcount) than the shape of the tails. The latter is crucial in predicting inequality. These models, however, tend to predict distributions that are compressed around the mean, with thin tails, and therefore underestimate inequality. This point is not critical if the target parameters to be estimated are related to poverty; however, as shown by Schluter (2012), it is a crucial aspect of inequality measures. In this paper, we show that indicators of inequality obtained from fitted values can be biased if the variables common to the two surveys, as often happens, are not strongly correlated with the dependent one, that is, when the $R^2_{adj}$ of the regression is not high enough.

To overcome these limitations, our approach is to estimate a semi-parametric model that combines the incorporation of the same covariates typically used by SSITs via the parametric component with a flexible fit to the data at hand via the non-parametric one. In general, these new techniques would seem particularly well suited to find a compromise between flexibility in terms of adaptation to the indicator to be predicted and consistency with standard survey-to-survey models. We tested this approach on a data set recently used in Morocco (Douidich et al., 2016).
The paper is organized as follows: section 2 describes the data; section 3 frames the problem and includes simulations based on classic SSITs; section 4 presents the methodology used to produce results with the classical techniques; section 5 presents the proposed algorithm; section 6 describes the results of the new method; and section 7 concludes.

1 In cases where the true population total or mean of an independent variable is known, we suggest using Small Area Estimation methods, in particular the M-quantile method developed by Chambers and Tzavidis (2006).

2 Survey Procedure

Two sets of surveys were used, the HBS and the LFS. The HBS is carried out about every seven years, while the LFS is conducted every year. The HBS is used to determine total family expenditure, which we convert into expenditure per capita expressed in USD. The Moroccan HBS includes two different surveys: the 2000–2001 National Survey on Consumption and Expenditure (NSCE) and the 2006–2007 National Living Standards Survey (NLSS), which both measure household expenditure and are representative of the national and regional context and of urban and rural areas.

The NSCE covers 15,000 households and was administered between November 2000 and October 2001. The 2006–2007 NLSS, administered between December 2006 and November 2007, was smaller and covered only 7,200 households. The two questionnaires share many common modules, of which the most relevant address sociodemographic characteristics, habitat, expenditures, durable goods, education, health, and employment. The NSCE also has a module on transfers, subjective indicators of well-being, nutrition, and a module administered to the community to measure access to services. Both surveys have the same structure. For urban areas, strata include the region, province, city size (large, medium, and small) and five types of housing. For rural areas, the strata are regions and provinces.
The common points between the two surveys, however, end here. All the surveys conducted up to 2005, including the NSCE, have a master sample frame that is derived from the 1994 population census. The surveys conducted after 2005 are, instead, based on the 2004 census.

The 2001 NSCE follows a two-stage sampling process, while the 2007 NLSS follows a three-stage sampling process. In the first stage of the NSCE, 1,250 Primary Sampling Units (PSUs) of approximately 300 households each were extracted from the 1994 population census. In the second stage, a dozen households per PSU were extracted randomly to constitute the final sample. In the first stage of the NLSS, 1,848 PSUs of approximately 600 households each were selected from the 2004 population census. The second stage subdivided each PSU into twelve Secondary Sample Units (SSUs), representing about 50 households each, and then randomly selected six of the twelve SSUs from each PSU. In the third stage, a constant number of households was selected randomly from each SSU.

The LFS, which was first launched in 1976, follows a sampling process that is similar to that of the 2007 NLSS and shares a number of questions with the two censuses on which the population frame is based. The LFS questionnaires share only a few questions with the other two surveys, relating to sociodemographic characteristics, habitat, education, health and employment, but, significantly, do not share any information about expenditure.

3 A Critical Overview of Survey-to-Survey Regression Techniques

In general, SSITs use a log-linear regression to estimate values for the parameters at year t, then use those values to predict values for the previous year t − n if only the independent variables are available. This process is done under the hypothesis that the parameters are stable over the years.
Often, this hypothesis is verified by estimating parameters at two non-consecutive years to illustrate that they are similar irrespective of the selected years. The regression usually uses individual or household expenditures as the dependent variable. Because this variable is right-skewed, and in order to avoid non-normality, it is transformed into a logarithm and then transformed back to obtain the predicted values.

Regression is the best way to obtain predicted values only if all the assumptions are fulfilled and the selected independent variables have high explanatory power for the dependent one. This last point is summarized in the $R^2_{adj}$. Often, articles do not report the $R^2_{adj}$, although multiple regressions are always used. Interactions or special transformations of the independent variables are also common. Unfortunately, failure to report the $R^2_{adj}$ hinders the ability to check the accuracy of the regression. Given the connection between $R^2_{adj}$ and $R^2$, we can argue that the former is at most equal to, and more likely lower than, the values reported (which are often around 0.5 or less). A low $R^2$ can result from a selection that prefers a common set of variables from the two surveys, which is a key requirement for SSITs, over a high explanatory power.

Demographic variables have been identified in both surveys, such as age, sex, and household size, along with variables related to assets, such as the number of rooms per capita and measures of physical assets such as the asset index, among other variables. However, their choice might present two problems. First, it will be difficult to capture changes in household expenditure using only demographic variables (which change less rapidly). Second, there is often a weak correlation between dependent and independent variables.

As highlighted by Ketkar and Ketkar (1987), household sociodemographic characteristics are as important determinants of expenditure patterns as prices and income.
The standard consumer behavior model states that the economic agent maximizes her utility subject only to relative prices and to income constraints. The two theories may both be valid, but if prices and income are excluded from the standard SSIT models, the sociodemographic variables alone may not adequately explain the individual or household consumption patterns. However, these conclusions must be verified for every data set and are impossible to determine a priori.

A low value for $R^2$ arising from insufficiently correlated independent variables, combined with a logarithmic transformation, could also generate problems with establishing predictions. The logarithmic transformation compresses the tail of skewed and kurtotic variables, which effectively generates symmetric PDFs and, therefore, Gaussian-like errors. But the values to be predicted could lie in the tails and, given the low $R^2_{adj}$, the predictions could all be near the mean, with a sharp decrease in data variability. In addition, logarithmic transformations are not immune to a re-transformation bias. There are various ways to solve this issue, but the most frequently used ones are based on scale correction of the predicted values. Both the relative Gini index and the relative Theil index are scale invariant,2 which can render those corrections ineffective at reducing bias.

To summarize the discussion thus far, the regression to obtain a predictor is the right choice only if the $R^2_{adj}$ is high, all the assumptions are respected, and the bias caused by the log-transformation is negligible. In addition, it is important to bear in mind other potential sources of bias (Newhouse et al., 2014) arising from the differences between the survey design and the questionnaires.
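To make the scale-invariance point concrete, the following minimal sketch (not taken from the paper) computes unweighted Gini and Theil indices for a handful of made-up back-transformed predictions, before and after multiplying them by a single correction factor. The function names, the sample values, and the factor exp(σ²/2) with σ = 0.45 are illustrative assumptions, and survey weights are ignored.

```python
import math

def gini(values):
    """Unweighted relative Gini index of non-negative values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Rank formula: G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * ranked / (n * total) - (n + 1.0) / n

def theil(values):
    """Unweighted Theil T index of strictly positive values."""
    n = len(values)
    mean = sum(values) / n
    return sum((x / mean) * math.log(x / mean) for x in values) / n

# Made-up back-transformed predictions, i.e. exp(fitted log-expenditure)
predicted = [820.0, 1100.0, 1350.0, 1600.0, 2400.0]

# A scale correction multiplies every prediction by the same constant,
# for example exp(sigma^2 / 2) under normal errors (sigma is hypothetical)
correction = math.exp(0.45 ** 2 / 2)
corrected = [correction * y for y in predicted]

# Both pairs agree up to floating-point rounding
print(gini(predicted), gini(corrected))
print(theil(predicted), theil(corrected))
```

Because the correction rescales the mean and every observation by the same factor, it can repair the level of the predictions but leaves both relative indices where they were.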
2 The property of scale invariance states that inequality remains unchanged when all incomes increase by the same proportion. See Clementi et al. (2022) for a discussion of differences between (relative) scale invariant and non-scale invariant (absolute) measures of inequality.

4 Estimates of Biases Based on Moroccan Data

In this section, using data from Morocco, we will show that the most common regression techniques can fail to accurately estimate three parameters: the Gini index, the Theil index, and the Head Count Ratio (HCR) with an endogenous poverty line equal to 60% of the median value of the regional expenditure per capita. We will compare estimates obtained with a direct estimator from the HBS at the rural/urban level with those from four different regressions. In all cases, the reported results will have a large negative bias. The following regression techniques are used:

• Log-linear regression over two (split) data sets. We split the data set using a dummy (0 = urban, 1 = rural) variable, ran the regressions, and then computed the indices;
• Log-linear regression over the whole data set. The fitted values obtained are then split using the dummy (0 = urban, 1 = rural) variable and then the indices are estimated, both with ordinary least squares and feasible generalized least squares;
• Random forest regression over the split data set, with a log-transformation of the dependent variable;
• Quantile regression over the split data set with a log-transformation.

The fundamental problem with these regression techniques, even the robust ones, is that all fitted values will be near the mean value; therefore, variability will be reduced compared to the real data. To attenuate this problem, we use two methods previously used in the literature (inter alia, Douidich et al., 2016 and Newhouse et al., 2014) that produce reliable results when data from the census can be used or when the focus is on a poverty index with an exogenous poverty line.
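The effect of fitted values clustering around the mean can be illustrated with a small sketch. It is not the authors' code: `hcr_endogenous` and `shrink_toward_mean` are hypothetical helper names, the expenditure values are made up, and the shrink factor is only a crude stand-in for a regression with limited explanatory power. Survey weights are ignored, and the poverty line is 60% of the sample median rather than of a regional median.

```python
import statistics

def hcr_endogenous(values, share=0.6):
    """Head Count Ratio with a poverty line equal to share * median."""
    line = share * statistics.median(values)
    poor = sum(1 for v in values if v < line)
    return poor / len(values)

def shrink_toward_mean(values, factor):
    """Pull every value toward the mean; factor = 1 leaves data unchanged."""
    mean = sum(values) / len(values)
    return [mean + factor * (v - mean) for v in values]

# Made-up right-skewed expenditure per capita
expenditure = [100.0, 200.0, 300.0, 400.0, 500.0,
               600.0, 700.0, 800.0, 900.0, 5000.0]

# "Fitted values" that keep the mean but only half of the spread
compressed = shrink_toward_mean(expenditure, 0.5)

print(statistics.pstdev(expenditure), statistics.pstdev(compressed))
print(hcr_endogenous(expenditure), hcr_endogenous(compressed))
```

In this toy example the mean is unchanged, the standard deviation is halved, and the headcount ratio falls from 0.3 to 0, because the lower tail moves above the (also shifted) relative poverty line.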
Table 1 reports the direct estimates of three inequality indices calculated from the 2000 and 2007 HBSs. The standard error reported in brackets is computed with a bootstrap estimator based on 500 iterations.

Table 1: Estimated Indicators by Sample Unit

Index    2000 Rural       2000 Urban       2007 Rural       2007 Urban
Gini     0.349 (0.006)    0.408 (0.005)    0.353 (0.006)    0.430 (0.006)
Theil    0.240 (0.017)    0.325 (0.016)    0.228 (0.012)    0.361 (0.015)
HCR      0.161 (0.005)    0.209 (0.004)    0.177 (0.007)    0.194 (0.006)

Source: Authors’ calculations based on Moroccan HBS.

Before moving to regression techniques, it is necessary to identify the variables that could be used as regressors. To do this, we must bear in mind that a bias could also arise from differences between questions and answers in the surveys.3 To better understand the SSIT results, the trends of dependent and independent variables should be examined to predict trends for our indicators. The sample mean expenditure per capita in 2000 was $1,752.96 for rural areas and $3,514.9 for urban areas, which increased to $2,265.9 (+29%) and to $4,053 (+15%), respectively, in 2007. The bulk of the independent variables seem to have constant values over the years, with the notable exception of electricity. In rural areas, the proportion of families with electricity increased from 0.35 to 0.70.

The first model tested was the log-linear regression over the split data set. The results (see table [A.1]) are consistent with those presented by Douidich et al. (2016). Illiteracy and a low level of education have a negative impact on expenditure, while a higher number of rooms per capita has a positive impact, all other things being equal. Furthermore, compared to those working in agriculture, those working in manufacturing or services are better off. A comparison of the results from these models (table [A.2]) with those in table [1] clearly shows that the predicted inequality indicators are significantly biased downward.

3 Once the structure of questions and answers has been checked, we suggest also performing a graphical analysis of the PDF and checking position and variability parameters; see Betti et al. (2018). We do not report this analysis for all the variables since the comparability has already been checked by Douidich et al. (2016) and we use the same set of variables. It is, however, important to underline that since 2015, Morocco has officially administered 12 regions. Previously, there were 16 regions. From that moment, the new regions and the old ones are not directly comparable. In the interest of future comparability, we decided to use 14 regions, a choice also made by the Moroccan statistics department, which unified the regions that had been merged. For more on this point, we recommend the official survey site. The only other variable that needs an explanation is the “low level of education,” with a value of 0 for “up to the 1st cycle” and 1 for “over the 2nd cycle.”

In a second stage, to increase the $R^2_{adj}$, we regress the model over the whole data set, expecting an increase in the coefficient of determination. We obtain an $R^2_{adj}$ of 0.5649 for 2000 and an $R^2_{adj}$ of 0.4695 for 2007, an improvement that is significant only for urban areas. This moderate increase in the $R^2_{adj}$ for urban areas is caused by the introduction of one new regressor, the variable “milieu = urban (0) and rural (1).” However, the increase in $R^2_{adj}$ does not significantly reduce the bias. Following Newhouse et al. (2014), all parameters for the regression coefficients and the distributions of the error terms are estimated by feasible generalized least squares (FGLS4). We do not consider the cluster-specific error terms since the clusters have changed between the two surveys and are not comparable.5

Since the standard linear regression falls short of estimating our distributional indicators, we first use a random forest regression model followed by a quantile regression.
We perform the first with p/3 = 5 knots and 500 trees. The pseudo-$R^2$ is equal to 0.561 for 2000 and 0.489 for 2007. The quantile regression is performed for each decile. Table [A.2] reports first the random forest results and then those obtained for the quantile regression for q = 0.5. Here, the goodness of fit is measured by $R^1(q = 0.5)$ and is equal to 0.349 for 2000 and 0.289 for 2007. Similar results were obtained for the other deciles. These results highlight another problem encountered with the regression techniques. Usually, robust techniques perform very well even in the presence of outliers. Here, however, we are aiming to predict indices rather than a central tendency, so the “extreme values” in the tails of the distributions are as important as those near the mean.

To sum up, the standard linear regression and the other two methods we tested drastically reduced the variability of the fitted values, which triggered a downward bias in the distributional indicators.6 Hence, the proposed models failed to reproduce accurate inequality indicators because of the reduced data variability when the fitted values are used. In section 5, we propose an iterative algorithm that takes its cue from regression techniques but allows flexibility in the hypothesis of normality, changes the objective function, and does not use a transformation of the dependent variable.

4 This method is also used because the studentized Breusch-Pagan test (BP = 307.78, df = 29, p-value < 2.2e−16) confirms the presence of heteroscedasticity in the OLS model.
5 Nonetheless, we think that ignoring the cluster-specific error term is not the correct approach to obtain sound estimates for a single year. However, it is the only method we can use to avoid introducing another bias to the predicted values. There is persistent underestimation of the three parameters that is not improved by this method.
6 The Ordinary Least Squares (OLS) method is based on the minimization of a function based on individual values. As is known, the confidence bands for the fitted line are narrowest at the mean, so the error increases for the extreme values. This is acceptable if the goal is to predict a central tendency or something correlated with it but could be problematic if the indicators of interest are based on values in the tail of a distribution.

5 The Proposed Algorithm

The previous section indicates that SSITs based on regression techniques do not accurately predict inequality indicators. From this point on, we propose an algorithm that, using regression techniques as a point of departure, improves the results by removing hypotheses that can be too stringent and by changing the function to minimize. Our algorithm is based on the Constrained Optimization by Linear Approximation (COBYLA) method, a numerical optimization method developed by Powell (2007) and Conn et al. (1997) for constrained problems where the derivative of the objective function is not known.

Given a dependent variable, $y_i$, observed on $i = 1, \dots, n$ individuals, and a set of $m$ covariates $X_{n \times m}$, we predict $y_i$ as a linear combination of $X_{n \times m}$, as follows:

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_m x_{mi} \qquad (1)$$

We then define the objective function to minimize as the sum of the squared differences between the direct estimates of an index, $In_d$, and its empirical version obtained from the fitted values, $In_e$, that is:

$$\sum_{j=1}^{J} (In_d - In_e)_j^2 \qquad (2)$$

where $J$ is the number of indices used. In this specific case, we used three indicators: the Gini index (G), the Theil index (T), and the Head Count Ratio (HCR) with an endogenous poverty line.
Equation [2] becomes:

$$(G_d - G_e)^2 + (T_d - T_e)^2 + (HCR_d - HCR_e)^2 \qquad (3)$$

As the dependent variable is, by definition, non-negative, we introduce the constraint that the mean of the fitted values must be equal to the mean of the absolute value of the fitted values. That is:

$$\frac{1}{n}\sum_{i=1}^{n} y_i^{est} - \frac{1}{n}\sum_{i=1}^{n} \left| y_i^{est} \right| = 0 \qquad (4)$$

where $y_i^{est}$ is obtained as follows:

$$y_i^{est} = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_m x_{mi} \qquad (5)$$

The other constraint used is that the absolute difference between the mean of the sampled values and the mean of the fitted ones must not exceed 1, that is:

$$\left| \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n}\sum_{i=1}^{n} y_i^{est} \right| \le 1 \qquad (6)$$

We also add a lower bound and an upper bound to our parameters. Those bounds will be equal to $(-\infty, 0)$ for those variables that result in a negative value from the regression and to $(0, +\infty)$ for all others. We perform a different regression for each area by checking where the signs agree between them. If there are differences (in some regressions some parameters could be either positive or negative), we opt for signs consistent with economic theory. The COBYLA7 also requires an initial value from which to start. We use a constant, k, such that the value of k multiplied by the sum of mean values of covariates will be equal to the mean of the dependent variable. The regression parameters are not used as initial values because they differ in scale, with some denominated in tenths and others in thousandths, which makes convergence particularly complex.

To estimate the coefficients’ variability, we use a bootstrap resampling method. First, we resample the data with replacement, with the size of the resample equal to the size of the original data set. Then the parameters are computed from that resample. We repeat this routine 100 times. Once the parameters and their variance are estimated and the results are verified, we rebuild estimates for the years in between the two surveys.
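The objective function, the two constraints, the sign bounds, and the starting value described above can be sketched as follows. This is an illustrative reading of equations (3) to (6), not the authors' implementation: all function names are hypothetical, the indices are unweighted, the penalty returned for non-positive fitted values is an added safeguard that is not in the paper, and the use of SciPy's `scipy.optimize.minimize` with `method="COBYLA"` is an assumption, since the paper does not say which COBYLA implementation was used. Because COBYLA has traditionally accepted inequality constraints only, equation (4) is written as an inequality; its left-hand side can never be positive, so requiring it to be non-negative is equivalent to the equality.

```python
import math
import statistics

def gini(y):
    ys = sorted(y)
    n, total = len(ys), sum(ys)
    return 2.0 * sum(r * v for r, v in enumerate(ys, 1)) / (n * total) - (n + 1.0) / n

def theil(y):
    mu = sum(y) / len(y)
    return sum((v / mu) * math.log(v / mu) for v in y) / len(y)

def hcr(y, share=0.6):
    line = share * statistics.median(y)  # endogenous poverty line
    return sum(1 for v in y if v < line) / len(y)

def predict(beta, X):
    """Eq. (5): linear predictor, no transformation of the dependent variable."""
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

def objective(beta, X, direct):
    """Eq. (3): squared gaps between direct and fitted-value indices."""
    y_est = predict(beta, X)
    if min(y_est) <= 0:
        return 1e6  # hypothetical penalty: Theil needs positive values
    return ((direct["gini"] - gini(y_est)) ** 2
            + (direct["theil"] - theil(y_est)) ** 2
            + (direct["hcr"] - hcr(y_est)) ** 2)

def constraints(X, y, signs):
    """Eqs. (4) and (6) plus sign bounds, all in the form g(beta) >= 0."""
    n = len(y)
    mean_y = sum(y) / n

    def non_negative(beta):  # Eq. (4): mean(y_est) - mean(|y_est|) >= 0
        y_est = predict(beta, X)
        return sum(y_est) / n - sum(abs(v) for v in y_est) / n

    def mean_match(beta):  # Eq. (6): 1 - |mean(y) - mean(y_est)| >= 0
        return 1.0 - abs(mean_y - sum(predict(beta, X)) / n)

    cons = [{"type": "ineq", "fun": non_negative},
            {"type": "ineq", "fun": mean_match}]
    for k, s in enumerate(signs):  # s = +1 or -1, taken from the regression
        cons.append({"type": "ineq", "fun": lambda beta, k=k, s=s: s * beta[k]})
    return cons

def start_value(X, y):
    """Constant k with k * (sum of covariate means) = mean of y."""
    n, m = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(m)]
    k = (sum(y) / n) / sum(col_means)
    return [k] * m

def fit(X, y, direct, signs):
    # SciPy is imported lazily; relying on it is an assumption of this sketch
    from scipy.optimize import minimize
    return minimize(objective, start_value(X, y), args=(X, direct),
                    method="COBYLA", constraints=constraints(X, y, signs),
                    tol=1e-8, options={"maxiter": 100000})
```

A call such as `fit(X, y, direct, signs)` would take `X` as a list of covariate rows, `y` as observed expenditure, `direct` as a dictionary of the three direct estimates, and `signs` as the list of +1/−1 values obtained from the preliminary regressions. The bootstrap described above would wrap this call in a loop over resampled rows.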
Estimation of temporally linked values assumes that “everything is related to everything else, but near things are more related than distant things.”8 We estimate parameters for the initial year (2000) and the final year (2007) rather than for the years in between, for which we use a weighted mean of $\beta_{2000}$ and $\beta_{2007}$. If $\omega_{t_y}$ represents the weights and $\beta_{t_y}$ the parameters, where $t_y$ indicates a generic year between $min$ and $MAX$, and $min$ and $MAX$ are, respectively, the first and last year for which we have estimates, the parameters are obtained by:

$$\beta_{t_y} = \omega_{t_y}\,\beta_{min} + (1 - \omega_{t_y})\,\beta_{MAX} \qquad (7)$$

where:

$$\omega_{t_y} = \frac{MAX - t_y}{MAX - min} \qquad (8)$$

This type of weighting allows us to remove the hypothesis that the regression parameters are stable and completely invariant over time. Finally, we obtain:

$$y_{i,t_y}^{est} = \beta_{1,t_y}\, x_{1i,t_y} + \beta_{2,t_y}\, x_{2i,t_y} + \dots + \beta_{m,t_y}\, x_{mi,t_y} \qquad (9)$$

From these values, we decided not to use an expansion estimator because even a small difference in the sampling scheme, and thus in the way the weights are obtained, could produce a bias. As shown by Newhouse et al. (2014), a change in the sampling scheme could be the reason that SSITs sometimes underperform. We estimate the variable distribution using a maximum likelihood estimator and then draw, n times, a certain number of values from the estimated distribution to compute the parameter of interest. Since the Gini, Theil, and HCR indices are closely associated with the distribution of the variable, if we can estimate the exact distribution from the sample data, we can derive an accurate estimate of the indicators. To clarify, the classic log-normal distribution is used most often for consumption data. Assuming that the distribution and estimated parameters are correct, we will obtain an accurate estimate of the parameters of interest.

7 The stopping criteria used for this algorithm are a tolerance of 1e−8 or a maximum of 100,000 function evaluations.
8 First law of geography, Tobler (1970).
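The two steps just described, time-weighting the coefficients and then sampling from a fitted log-normal distribution, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the coefficient values, and the parameters passed to the simulation are made up, the indices are unweighted, and the sample size and number of draws are arbitrary choices rather than the paper's settings.

```python
import math
import random

def interpolate_betas(beta_min, beta_max, year, year_min, year_max):
    """Eqs. (7)-(8): weight the two coefficient sets by distance in time."""
    w = (year_max - year) / (year_max - year_min)
    return [w * b_lo + (1.0 - w) * b_hi for b_lo, b_hi in zip(beta_min, beta_max)]

def fit_lognormal(values):
    """Maximum likelihood estimates (meanlog, sdlog) for positive values."""
    logs = [math.log(v) for v in values]
    meanlog = sum(logs) / len(logs)
    sdlog = math.sqrt(sum((x - meanlog) ** 2 for x in logs) / len(logs))
    return meanlog, sdlog

def gini(values):
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    return 2.0 * sum(r * x for r, x in enumerate(xs, 1)) / (n * total) - (n + 1.0) / n

def simulated_gini(meanlog, sdlog, size=2000, draws=20, seed=0):
    """Average Gini over repeated samples from the fitted log-normal."""
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        sample = [rng.lognormvariate(meanlog, sdlog) for _ in range(size)]
        results.append(gini(sample))
    return sum(results) / draws

# Made-up coefficients for the two survey years
beta_2000 = [120.0, -35.0]
beta_2007 = [150.0, -50.0]

# Coefficients for an intermediate year: closer in time to 2000 than to 2007
beta_2003 = interpolate_betas(beta_2000, beta_2007, 2003, 2000, 2007)
print(beta_2003)

# Gini implied by an illustrative fitted log-normal (values are hypothetical)
print(simulated_gini(7.6, 0.7))
```

At the end points the weights reduce to the survey-year coefficients themselves, so the interpolation only affects the years without expenditure data.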
For a log-normal distribution, for example, the Gini coefficient is known to be $G = 2\Phi(\sigma/\sqrt{2}) - 1$, where $\Phi$ is the standard normal distribution function and $\sigma$ is the standard deviation of the logarithm of the variable.

It is not possible to obtain a perfect estimate of the distribution as it is almost impossible to select the exact distribution function. Nonetheless, we argue that the bias will be minimal. The more closely the selected distribution fits the data, the smaller the margin of error. However, it is not strictly necessary for the sampling schemes to be identical, only that both represent the same area for which we need estimates.

6 Results

As mentioned in section 5, we estimate the coefficients of the covariates through a linear regression and, in most cases, coefficient signs correspond to those previously estimated in table [A.1]. We prefer to use a linear regression and not a log-linear one because our algorithm does not use a transformation of the dependent variable. The coefficient signs are key in determining the upper and the lower bounds of our algorithm. Thus, we will have positive values for all measures, except sex of household head, literacy, schooling level, and household size. These signs are consistent with findings in related literature.

Before discussing the results, it is important to further explore one aspect of eq. [3]. As the Gini and HCR indices are between 0 and 1, the Theil index is between 0 and log(N), and the value of each direct estimate is known, the maximum of [3] can be easily computed. The reason for this is that, for every index, the maximum is attained at the largest difference between its direct value and its upper bound; the lower bound is used instead when it yields the larger difference. This is a key step, since, once the maximum is computed, it is also easy to determine a relative error (RE) for each algorithm. The value of the estimated parameters, the number of iterations computed by each algorithm, the value of eq. [3] and its corresponding relative error are summarized in table [A.3].
The algorithm performs similarly in the first three cases, and the RE is generally the same. The last case (urban areas for 2007) has the highest RE, about three times that for rural areas in the same year. It is difficult to say why this last case has such a high RE. One plausible explanation is that mean expenditure per capita increases significantly from 2000 to 2007 in urban areas but, because of the time gap, the covariates do not follow this pattern. Since the covariates vary little in urban areas between 2000 and 2007, the algorithm cannot easily minimize the objective function within the constraints.

The estimated parameters presented in table [A.3] show the expected signs (imposed by the model) and coefficient sizes. It is also worth noting that their impact decreases in 2007; those increasing (in absolute terms) are mostly negative, suggesting that being illiterate or having a low level of education in urban areas becomes more disadvantageous in economic terms in 2007.

Using eq. [10] we compute the fitted value, from which we obtain the results reported in table 2. With $\hat{\theta}$ as an estimated parameter and $\theta_{dir}$ the corresponding direct estimate, it is possible to define the Absolute Relative Bias (ARB):

$$ARB\% = \frac{\hat{\theta} - \theta_{dir}}{\theta_{dir}} \times 100 \tag{10}$$

This index facilitates comparison of all models in terms of bias (figure 1). The plot demonstrates that the only method that produces a low ARB is the one using the algorithm, with consistent bias reduction when compared with the other methods.
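As a worked illustration of the ARB, the short Python sketch below applies the definition to the rural Gini index for 2000. The function name is hypothetical, and taking the absolute value of the difference is an assumption suggested by the name of the index. The inputs are values reported in the paper's tables: 0.349 for the direct estimate (table 5), 0.354 for the algorithm (table 2), and 0.222 for the separate urban/rural regression model (table A2).

```python
def absolute_relative_bias(estimate, direct):
    """ARB in percent: |estimate - direct| / |direct| * 100."""
    if direct == 0:
        raise ValueError("direct estimate must be non-zero")
    return abs(estimate - direct) / abs(direct) * 100.0

direct_gini = 0.349      # direct estimate, rural 2000
algorithm_gini = 0.354   # fitted values from the algorithm
regression_gini = 0.222  # fitted values from the urban/rural regression model

arb_algorithm = absolute_relative_bias(algorithm_gini, direct_gini)
arb_regression = absolute_relative_bias(regression_gini, direct_gini)
print(f"algorithm: {arb_algorithm:.1f}%  regression: {arb_regression:.1f}%")
```

Under these inputs the regression-based Gini is off by roughly a third of the direct value, while the algorithm stays within a few percent, which matches the pattern described for figure 1.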
Table 2: Estimated Indices with Fitted Values, Algorithm Method

| Index | 2000 Rural | 2000 Urban | 2007 Rural | 2007 Urban |
|---|---|---|---|---|
| Gini | 0.354 (0.004) | 0.425 (0.003) | 0.356 (0.006) | 0.442 (0.005) |
| Theil | 0.233 (0.009) | 0.323 (0.007) | 0.230 (0.011) | 0.343 (0.009) |
| HCR | 0.175 (0.004) | 0.207 (0.004) | 0.167 (0.013) | 0.268 (0.006) |

Source: Authors' calculations based on Moroccan HBS.

Figure 1: Absolute Relative Bias (%)
Note: R1 = Regression model for Urban and Rural area, R2 = Regression model over the whole data set, R3 = Survey-to-survey imputation FGLS, R4 = Random Forest regression model, R5 = Quantile regression model q=0.5. 1 = Rural 2000, 2 = Urban 2000, 3 = Rural 2007, 4 = Urban 2007.

The final step (see section 4) is to estimate the log-normal distribution by maximum likelihood for both direct and empirical values. Starting with the national distribution obtained by merging the previous estimates, differences between direct and empirical estimates are small, especially in the meanlog (table 3). An F-test for variance confirms that direct and empirical estimates are equal in all cases except the last. For urban areas in 2007, where the difference is largest, the test rejects the null hypothesis that the two variances are equal. The good news is that this bias, according to eq. [10], will be mitigated by the results obtained for urban areas in 2000.

Table 3: National Estimated Log-Normal Distribution

| Parameter | 2000 Direct | 2000 Empirical | 2007 Direct | 2007 Empirical |
|---|---|---|---|---|
| meanlog | 7.611 (0.006) | 7.577 (0.006) | 7.810 (0.008) | 7.747 (0.007) |
| sdlog | 0.722 (0.005) | 0.798 (0.004) | 0.718 (0.007) | 0.808 (0.005) |

Source: Authors' calculations based on Moroccan HBS.

Table 4: Estimation of Rural and Urban Log-Normal Parameters

| Parameter | 2000 Rural Dir. | 2000 Rural Emp. | 2000 Urban Dir. | 2000 Urban Emp. | 2007 Rural Dir. | 2007 Rural Emp. | 2007 Urban Dir. | 2007 Urban Emp. |
|---|---|---|---|---|---|---|---|---|
| meanlog | 7.265 (0.007) | 7.265 (0.007) | 7.880 (0.008) | 7.827 (0.010) | 7.518 (0.114) | 7.521 (0.011) | 8.001 (0.010) | 7.907 (0.015) |
| sdlog | 0.595 (0.006) | 0.604 (0.006) | 0.696 (0.007) | 0.753 (0.006) | 0.615 (0.009) | 0.608 (0.008) | 0.716 (0.009) | 0.928 (0.013) |

Source: Authors' calculations based on Moroccan HBS.

Applying eq. [10], we obtain the figures shown in table [5], which reports the entire time series of inequality indicators between 2000 and 2007; the values for 2000 and 2007 are calculated from the sample data, while the rest are imputed.

During the period considered, inequality slightly increases in urban areas while remaining relatively stable in rural ones. Growth in the early 2000s is less volatile than in the previous decade and posts substantially higher annual rates (Clementi et al., 2022). This growth, however, has not been accompanied by rapid economic modernization or significant labor market outcomes. Indeed, the Moroccan economy created an average of 115,000 jobs annually between 2000 and 2010, which was not enough to absorb the annual increase in the working-age population (Lopez-Acevedo et al., 2021). Over the same period, the slight increase in inequality resulted from two counterbalancing trends: convergence of development across regions and increased intra-region inequality in some regions. Indeed, inequality increased mainly in the coastal, urbanized regions while decreasing in the mainly rural inner and eastern parts of the country (HCP, 2018).

As additional verification of the validity of this trend, we compared inequality measures in rural areas with average annual precipitation data.⁹ As the Moroccan economy depends heavily on agriculture, we expect a significant correlation between rainfall and inequality.
In periods of drought, we expect inequality to increase, since most agriculture is rainfed and only a few very developed areas (for example, the Settat area) are equipped with modern irrigation systems that enable farmers to withstand shocks, which exacerbates rural inequality. In periods of abundant rainfall, we anticipate better overall performance of the agricultural economy and thus a decline in inequality. As expected, the correlation coefficients between the three indices and rainfall are negative (G = −0.75, T = −0.67, HCR = −0.15). From these, we can deduce that the trend of the indices is consistent with rainfall patterns, confirming the validity of our hypothesis and the economic significance of our estimated results.

⁹ https://climateknowledgeportal.worldbank.org/country/morocco/climate-data-historical

Table 5: Predicted Values (2000-2007)

| Year | Area | Gini | Theil | HCR | meanlog | sdlog |
|---|---|---|---|---|---|---|
| 2000∗ | Urban | 0.408 (0.005) | 0.325 (0.016) | 0.209 (0.004) | 7.880 (0.010) | 0.696 (0.008) |
| 2000∗ | Rural | 0.349 (0.006) | 0.240 (0.017) | 0.161 (0.005) | 7.265 (0.007) | 0.595 (0.006) |
| 2001 | Urban | 0.464 (0.003) | 0.383 (0.009) | 0.279 (0.003) | 7.659 (0.057) | 0.876 (0.004) |
| 2001 | Rural | 0.328 (0.002) | 0.180 (0.003) | 0.197 (0.004) | 7.242 (0.021) | 0.600 (0.007) |
| 2002 | Urban | 0.416 (0.003) | 0.300 (0.005) | 0.254 (0.003) | 7.736 (0.002) | 0.775 (0.003) |
| 2002 | Rural | 0.293 (0.002) | 0.142 (0.002) | 0.168 (0.003) | 7.419 (0.011) | 0.533 (0.010) |
| 2003 | Urban | 0.427 (0.003) | 0.318 (0.006) | 0.261 (0.003) | 7.682 (0.004) | 0.798 (0.003) |
| 2003 | Rural | 0.305 (0.002) | 0.154 (0.002) | 0.178 (0.003) | 7.445 (0.007) | 0.554 (0.009) |
| 2004 | Urban | 0.453 (0.003) | 0.365 (0.008) | 0.275 (0.003) | 7.594 (0.014) | 0.855 (0.002) |
| 2004 | Rural | 0.329 (0.002) | 0.181 (0.003) | 0.198 (0.003) | 7.400 (0.005) | 0.602 (0.003) |
| 2005 | Urban | 0.424 (0.001) | 0.332 (0.002) | 0.265 (0.001) | 7.619 (0.002) | 0.815 (0.112) |
| 2005 | Rural | 0.330 (0.002) | 0.181 (0.003) | 0.198 (0.003) | 7.448 (0.004) | 0.602 (0.006) |
| 2006 | Urban | 0.415 (0.009) | 0.300 (0.006) | 0.254 (0.003) | 7.544 (0.012) | 0.777 (0.011) |
| 2006 | Rural | 0.323 (0.002) | 0.173 (0.002) | 0.194 (0.004) | 7.406 (0.007) | 0.589 (0.003) |
| 2007∗ | Urban | 0.430 (0.006) | 0.361 (0.015) | 0.194 (0.006) | 8.001 (0.001) | 0.716 (0.009) |
| 2007∗ | Rural | 0.353 (0.006) | 0.228 (0.012) | 0.177 (0.007) | 7.518 (0.114) | 0.615 (0.009) |

Source: Authors' calculations based on Moroccan HBS and LFS.
Note: Gini, Theil, and HCR are reported under the heading "Weighted weights"; meanlog and sdlog under "Log-normal dist." year∗ = index obtained from sample data.

7 Conclusion

This paper aims to contribute to the ongoing debate on ways to improve the accuracy and timeliness of welfare statistics in developing countries experiencing severe budget constraints. Only a few developing countries have the capacity to collect annual data on income or expenditure. Indicators such as inequality or poverty rates can therefore be computed in these countries only when Household Budget Surveys are available, about every four to five years and sometimes only every seven. To overcome this problem, methods have been developed to compare these indicators over time from surveys that are only partially comparable. These techniques (SSITs) have proven effective at predicting poverty indicators, but are much less accurate when used for inequality indicators.

To illustrate this limitation, we conducted several simulations based on data from Moroccan Household Budget Surveys and Labor Force Surveys. Our results indicate that the predicted inequality measures are, at best, one-third smaller than those directly estimated from the data. Other regression methods were implemented, but none seems to significantly reduce the negative bias in the Gini, Theil, and HCR indices. Our theoretical explanation for this points to two main limitations of the standard SSIT models based on linear regressions: the overly stringent assumption that the residuals are normally distributed, and the tendency of regression-based models to predict distributions that are compressed around the mean and have thin tails. Unfortunately, the shape of the tails is crucial to correctly estimate inequality.
Thus, almost by design, these models tend to produce estimates that are far below the correct values.

The method we propose is based on an algorithm that minimizes the sum of the squared differences between a certain number of direct estimates of an index and its empirical version obtained from the predicted values. With this algorithm, we reduce the bias and obtain results that are not systematically biased. When the estimated results are compared with those directly estimated on the original sample, the bias is negligible. Furthermore, the estimates of inequality indices for the years in which only labor force data are available seem to be consistent with Moroccan economic trends.

In the future, it would be interesting to explore the implementation and estimation of HCR for two or more areas where the poverty line stands at 60% of the national median. An alternative approach would be to identify the starting points to estimate the parameter and its variance. Other distributions of consumption should be tested in order to reduce the margin of error. The authors are working on the log-student-t distribution. For this paper, it is evident that bias reduction, realistic rebuilding of the values, and the economic interpretation of results represent a good starting point.

References

Betti, G., Bici, R., Neri, L., Sohnesen, T. P., & Thomo, L. (2018). Local poverty and inequality in Albania. Eastern European Economics, 56(3), 223–245.
Chambers, R., & Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika, 93(2), 255–268.
Christiaensen, L., Lanjouw, P., Luoto, J., & Stifel, D. (2012). Small area estimation-based prediction methods to track poverty: Validation and applications. The Journal of Economic Inequality, 10(2), 267–297.
Clementi, F., Fabiani, M., Molini, V., Schettino, F., & Kahn, H. (2022). Polarization and its discontents: Morocco before and after the Arab Spring. Journal of Economic Inequality, forthcoming.
Conn, A. R., Scheinberg, K., & Toint, P. L. (1997). On the convergence of derivative-free methods for unconstrained optimization. Approximation theory and optimization: Tributes to MJD Powell, 83–108.
Corral, P., Himelein, K., McGee, K., & Molina, I. (2021). A map of the poor or a poor map? Mathematics, 9(21), 2780.
Cuesta, J., & Ibarra, L. G. (2017). Comparing cross-survey micro imputation and macro projection techniques: Poverty in post revolution Tunisia. Journal of Income Distribution, 1–30.
Dabalen, A., Graham, E. G., Himelein, K., & Mungai, R. (2014). Estimating poverty in the absence of consumption data: The case of Liberia. World Bank Policy Research Working Paper, (7024).
Dang, H.-A., Jolliffe, D., & Carletto, C. (2019). Data gaps, data incomparability and data imputation: A review of poverty measurement methods for data-scarce environments. Journal of Economic Surveys, 33(3), 757–797.
Demombynes, G., & Hoogeveen, J. G. (2007). Growth, inequality and simulated poverty paths for Tanzania, 1992–2002. Journal of African Economies, 16(4), 596–628.
Douidich, M., Ezzrari, A., Van der Weide, R., & Verme, P. (2016). Estimating quarterly poverty rates using labor force surveys: A primer. The World Bank Economic Review, 30(3), 475–500.
Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71(1), 355–364.
Elbers, C., Lanjouw, P. F., Mistiaen, J. A., Özler, B., & Simler, K. (2004). On the unequal inequality of poor communities. The World Bank Economic Review, 18(3), 401–421.
Ketkar, K. W., & Ketkar, S. L. (1987). Socio-demographic dynamics and household demand. Eastern Economic Journal, 13(1), 55–62.
Krafft, C., Assaad, R., Nazier, H., Ramadan, R., Vahidmanesh, A., & Zouari, S. (2019). Estimating poverty and inequality in the absence of consumption data: An application to the Middle East and North Africa. Middle East Development Journal, 11(1), 1–29.
Lopez-Acevedo, G., Betcherman, G., Khellaf, A., & Molini, V. (2021). Morocco's jobs landscape: Identifying constraints to an inclusive labor market. World Bank Publications.
Mathiassen, A. (2013). Testing prediction performance of poverty models: Empirical evidence from Uganda. Review of Income and Wealth, 59(1), 91–112.
Newhouse, D. L., Shivakumaran, S., Takamatsu, S., & Yoshida, N. (2014). How survey-to-survey imputation can fail. World Bank Policy Research Working Paper, (6961).
Powell, M. J. (2007). A view of algorithms for optimization without derivatives. Mathematics Today-Bulletin of the Institute of Mathematics and its Applications, 43(5), 170–174.
Rao, J. N. K., & Molina, I. (2015). Small area estimation. John Wiley & Sons, Inc.
Schluter, C. (2012). On the problem of inference for inequality measures for heavy-tailed distributions. The Econometrics Journal, 15(1), 125–153.
Tarozzi, A., & Deaton, A. (2009). Using census and survey data to estimate poverty and inequality for small areas. Review of Economics and Statistics, 91(4), 773–792.
Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234–240.

Appendix

Table A1: Regression Parameters in Urban and Rural Areas (2000 and 2007)

| Variable | Estimate 2000 Rural | Estimate 2000 Urban | Estimate 2007 Rural | Estimate 2007 Urban |
|---|---|---|---|---|
| Intercept | 6.848 ∗∗∗ | 6.939 ∗∗∗ | 7.316 ∗∗∗ | 7.20 ∗∗∗ |
| Log(age) | 0.124 ∗∗∗ | 0.135 ∗∗∗ | 0.050 | 0.057 |
| Sex | -0.055 ∗ | -0.044 ∗ | -0.035 | -0.006 |
| Reg2 | -0.027 | 0.014 | -0.23 ∗∗ | -0.05 |
| Reg3 | 0.114 . | -0.139 ∗∗∗ | -0.264 ∗∗ | -0.214 ∗∗∗ |
| Reg4 | 0.258 ∗∗∗ | -0.061 | 0.117 | -0.103 . |
| Reg5 | 0.053 | -0.165 ∗∗∗ | -0.127 | 0.022 |
| Reg6 | 0.057 | -0.158 ∗∗∗ | -0.01 | -0.167 ∗∗∗ |
| Reg7 | 0.630 ∗∗∗ | 0.290 ∗∗∗ | 0.130 | 0.080 . |
| Reg8 | 0.330 ∗∗∗ | 0.058 | -0.136 | -0.016 |
| Reg9 | 0.232 ∗∗∗ | -0.055 | -0.206 ∗ | -0.173 ∗∗ |
| Reg10 | 0.208 ∗∗∗ | -0.059 | -0.05 | -0.165 ∗∗ |
| Reg11 | -0.151 ∗ | -0.175 ∗∗∗ | -0.122 | -0.22 ∗∗∗ |
| Reg12 | 0.180 ∗∗ | -0.124 ∗∗ | -0.013 | -0.074 |
| Reg13 | 0.330 ∗∗∗ | 0.076 . | -0.037 | -0.279 ∗∗∗ |
| Reg14 | 0.264 ∗∗∗ | 0.035 | -0.034 | 0.093 . |
| Unliter.∗ | -0.115 ∗∗∗ | -0.161 ∗∗∗ | -0.137 ∗∗∗ | -0.232 ∗∗∗ |
| Electricity∗ | 0.277 ∗∗∗ | 0.380 ∗∗∗ | 0.259 ∗∗∗ | 0.296 ∗∗∗ |
| Low school-level∗ | -0.3 ∗∗∗ | -0.302 ∗∗∗ | 0.020 | -0.197 ∗∗∗ |
| Roompc2 | -0.079 ∗∗∗ | -0.076 ∗∗∗ | -0.068 ∗∗∗ | -0.1 ∗∗∗ |
| Hhldsize2 | 0.002 ∗∗∗ | 0.001 ∗ | 0.005 ∗∗∗ | 0.002 ∗∗ |
| Employment∗ | 0.155 ∗∗ | 0.182 ∗∗∗ | 0.055 | 0.148 ∗∗ |
| Household Size | -0.09 ∗∗∗ | -0.063 ∗∗∗ | -0.127 ∗∗∗ | -0.087 ∗∗∗ |
| Inactive∗ | 0.130 ∗ | 0.258 ∗∗∗ | 0.047 | 0.334 ∗∗∗ |
| Room per cap. | 0.623 ∗∗∗ | 0.801 ∗∗∗ | 0.567 ∗∗∗ | 0.873 ∗∗∗ |
| Ter. Sector∗ | 0.015 | 0.238 ∗∗∗ | 0.093 ∗∗ | 0.110 ∗∗∗ |
| Sec. Sector∗ | 0.136 ∗∗∗ | 0.104 ∗∗∗ | 0.265 ∗∗∗ | 0.416 ∗∗∗ |
| Pubb. Empl.∗ | 0.014 | 0.004 | -0.104 ∗∗ | 0.043 ∗ |
| Married | 0.019 | -0.005 | 0.091 ∗ | 0.075 ∗ |

Source: Authors' calculations based on Moroccan HBS.
Note: Variable: ∗∗ = Sex of the householder (0 = man, 1 = woman); ∗ = Dummy variable referring to the householder or the house (0 = no, 1 = yes).
Table A2: Empirical Gini, Theil, and HCR Indices from Fitted Values

| Model | Index | 2000 Rural | 2000 Urban | 2007 Rural | 2007 Urban |
|---|---|---|---|---|---|
| Regression model for Urban and Rural area | Gini | 0.222 (0.003) | 0.300 (0.003) | 0.219 (0.003) | 0.299 (0.004) |
| | Theil | 0.088 (0.002) | 0.161 (0.004) | 0.081 (0.003) | 0.159 (0.005) |
| | HCR | 0.031 (0.002) | 0.106 (0.004) | 0.046 (0.004) | 0.104 (0.005) |
| Regression model over the whole data set | Gini | 0.247 (0.003) | 0.278 (0.002) | 0.256 (0.005) | 0.253 (0.004) |
| | Theil | 0.115 (0.004) | 0.134 (0.003) | 0.118 (0.006) | 0.115 (0.008) |
| | HCR | 0.031 (0.002) | 0.090 (0.003) | 0.087 (0.007) | 0.084 (0.005) |
| Survey-to-survey imputation FGLS | Gini | 0.224 (0.001) | 0.252 (0.001) | 0.230 (0.002) | 0.220 (0.002) |
| | Theil | 0.087 (0.001) | 0.105 (0.001) | 0.088 (0.002) | 0.079 (0.002) |
| | HCR | 0.268 (0.004) | 0.012 (0.002) | 0.316 (0.001) | 0.021 (0.011) |
| Random forest regression model | Gini | 0.222 (0.003) | 0.299 (0.003) | 0.215 (0.003) | 0.299 (0.004) |
| | Theil | 0.086 (0.002) | 0.157 (0.003) | 0.079 (0.002) | 0.155 (0.005) |
| | HCR | 0.049 (0.003) | 0.120 (0.004) | 0.047 (0.005) | 0.071 (0.005) |
| Quantile regression model q=0.5 | Gini | 0.246 (0.003) | 0.278 (0.002) | 0.256 (0.004) | 0.261 (0.003) |
| | Theil | 0.111 (0.004) | 0.132 (0.003) | 0.119 (0.005) | 0.118 (0.004) |
| | HCR | 0.034 (0.002) | 0.091 (0.004) | 0.076 (0.006) | 0.081 (0.004) |

Source: Authors' calculations based on Moroccan HBS.
Table A3: Estimation of the Algorithm by Parameter

| Parameter | 2000 Rural | 2000 Urban | 2007 Rural | 2007 Urban |
|---|---|---|---|---|
| log(age) | 33.329 (8.858) | 38.762 (12.692) | 76.161 (6.647) | 10.22 (12.16) |
| Sex∗∗ | -9.689 (6.121) | -40.827 (9.513) | -17.915 (9.804) | -41.608 (21.143) |
| Unliterary∗ | -8.625 (5.517) | -39.2 (10.733) | -10.232 (8.01) | -33.304 (22.914) |
| Electricity∗ | 79.099 (7.312) | 29.932 (9.705) | 78.351 (8.801) | 16.032 (12.067) |
| School-level∗ | -7.341 (5.576) | -39.089 (11.039) | -10.931 (7.137) | -56.25 (22.227) |
| Roompc2 | 81.917 (5.988) | 33.189 (15.183) | 64.259 (16.176) | 32.888 (22.49) |
| Hhldsize2 | 27.102 (1.552) | 94.116 (4.037) | 45.403 (2.745) | 106.98 (5.925) |
| Employee∗ | 79.51 (6.877) | 40.955 (11.953) | 71.25 (13.76) | 13.088 (11.634) |
| Household Size | -13.403 (8.952) | -53.406 (14.347) | -14.251 (1.412) | -75.127 (21.315) |
| Inactive∗ | 76.823 (6.087) | 49.091 (13.572) | 54.147 (17.247) | 24.974 (14.862) |
| Room per cap. | 78.579 (6.954) | 36.9 (11.225) | 50.111 (15.159) | 8.349 (1.269) |
| Ter. Sector∗ | 75.436 (6.893) | 39.092 (10.191) | 59.612 (19.61) | 21.122 (16.823) |
| Sec. Sector∗ | 63.27 (17.695) | 37.287 (7.439) | 53.303 (16.51) | 22.181 (17.134) |
| Pubb. employee∗ | 69.907 (14.13) | 39.669 (7.472) | 52.742 (16.686) | 20.146 (15.656) |
| Married | 74.462 (7.104) | 34.356 (9.581) | 50.417 (13.522) | 1.768 (6.688) |
| Number of it. | 2272.5 | 886.46 | 1471.26 | 5923.3 |
| Min. function value | 0.000516 | 0.000417 | 0.0002 | 0.00624 |
| Relative error (RE) | 0.006‰ | 0.005‰ | 0.003‰ | 0.009‰ |

Source: Authors' calculations based on Moroccan HBS.
Note: Variable: ∗∗ = Sex of the householder (0 = man, 1 = woman); ∗ = Dummy variable referring to the householder or the house (0 = no, 1 = yes).

Figure A1: Gini, Theil, and HCR Indices (2000-2008)