Policy Research Working Paper 10013
New Algorithm to Estimate Inequality
Measures in Cross-Survey Imputation
An Attempt to Correct the Underestimation
of Extreme Values
Gianni Betti
Vasco Molini
Lorenzo Mori
Development Economics
Development Data Group
April 2022
Abstract
This paper contributes to the debate on ways to improve the calculation of inequality
measures in developing countries experiencing severe budget constraints. Linear
regression-based survey-to-survey imputation techniques are most frequently discussed
in the literature. These are effective at estimating predictions of poverty indicators
but are much less accurate with inequality indicators. To demonstrate this limited
accuracy, the first part of the paper discusses several simulations using Moroccan
Household Budget Surveys and Labor Force Surveys. The paper proposes a method for
overcoming these limitations based on an algorithm that minimizes the sum of the
squared difference between a certain number of direct estimates of an index and its
empirical version obtained from the predicted values. Indeed, when comparing the
estimated results with those directly estimated from the original sample, the bias is
negligible. Furthermore, the inequality indices for the years for which there are only
model estimates, rather than direct information on expenditures, seem to be consistent
with Moroccan economic trends.
This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the
World Bank to provide open access to its research and make a contribution to development policy discussions around the
world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may
be contacted at vmolini@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Produced by the Research Support Team
A New Algorithm to Estimate Inequality
Measures in Cross-Survey Imputation: An
Attempt to Correct the Underestimation of
Extreme Values
Gianni Betti* Vasco Molini† Lorenzo Mori*‡
Keywords: Inequality indicators, Bias reduction, Survey-to-Survey Imputation, Mo-
roccan HBS, Moroccan LFS
JEL Classiﬁcation: I32, C13 and E24
* University of Siena. Department of economics and statistics. Email: gianni.betti@unisi.it
† 50x2030 Initiative. Email: vmolini@worldbank.org
‡ University of Bologna. Department of Statistics P. Fortunati. Email: lorenzo.mori7@unibo.it
1 Introduction
Collecting data to produce reliable and timely estimations of poverty requires elaborate
and expensive surveys called Household Budget Surveys (HBSs). Only a few countries
are able to collect data annually on income or expenditure to facilitate the estimation
of poverty and inequality. Most National Statistical Ofﬁces (NSOs) worldwide col-
lect these surveys only every four to ﬁve years. Therefore, producing reliable indices
annually for monitoring poverty results remains quite challenging for many countries.
To overcome this challenge, scholars have focused on developing methods to compare
welfare indicators over time from surveys that are not directly comparable. These
techniques, broadly known as Survey-to-Survey Imputation Techniques (SSITs), have proved
successful at predicting comparable poverty indicators but, as this paper argues, are
less effective at predicting comparable inequality indicators.
SSITs are a derivative of the poverty map literature (Elbers et al., 2003; Tarozzi and
Deaton, 2009; Christiaensen et al., 2012; Mathiassen, 2013). By imputing income onto
censuses, poverty maps have been applied to many developing countries’ data to pro-
vide geographically disaggregated estimates of poverty. In addition to survey-to-census
imputation, there has been a recent trend of survey-to-survey imputation, mapping from
surveys with consumption data to those with other outcomes of interest, but lacking
standard welfare aggregates (Elbers et al., 2004; Dabalen et al., 2014; Dang et al., 2019).
While this approach is well established for poverty estimates, a cursory overview
of the literature shows that little attention is devoted to potential problems in obtain-
ing accurate inequality measures with this method. Using actual data from a census
of households in a set of rural Mexican communities, Demombynes and Hoogeveen
(2007) show that correlations between the estimated and true welfare at the local level
are highest for mean expenditure and poverty measures and lower for inequality mea-
sures. Douidich et al. (2016) obtain accurate estimates for quarterly poverty rates using
a classic survey-to-survey imputation method that combines HBSs and Labor Force
Surveys (LFSs). Using a log-linear regression to impute a completely missing variable
(total expenditure) for certain years, they obtain reliable estimates for Head Count Ratio
(HCR) with an exogenous poverty line. Krafft et al. (2019) impute consumption from
HBSs onto LFSs and obtain information on poverty and inequality. Resulting measures
of consumption, poverty, and inequality are similar across survey pairs, particularly for
the data in Jordan and the Arab Republic of Egypt. These estimates are obtained based
on the hypothesis that the model is time invariant, that is, the coefﬁcients used to impute
from one survey to another are stable over time (e.g., that βt = β ), and that the error term
is homoscedastic and normally distributed. However, the concern is that incorrectly as-
suming normal errors or ignoring heteroscedasticity has the potential to introduce bias
when estimating poverty or inequality. Speciﬁcally, in the prediction of inequality, the
bias can be signiﬁcant. Dang et al. (2019) provide an alternative method of simulating
the ﬁnal consumption imputation, which also seems extremely accurate when estimat-
ing poverty but does not address the estimation of inequality. To analyze poverty trends
in Tunisia after 2010, Cuesta and Ibarra (2017) compare cross-survey micro imputation
with macro projection techniques based on sectoral GDP, unemployment, and inﬂation.
The SSIT used is an OLS regression, where the dependent variable of the model is the
logarithm of annual household consumption per capita. Newhouse et al. (2014) pro-
vide evidence, on the other hand, illustrating that Survey-to-Survey Imputation can fail.
They demonstrate that minor differences in the sampling scheme, sampling design, or
structure of the answers/questions can produce inaccurate SSITs. Corral et al. (2021)
show that there are more accurate methods (see also Rao and Molina, 2015) than poverty
mapping to obtain estimates for small areas. Given that the SSITs are a derivative of
this method, we argue their conclusions can also apply to SSITs.1
Overall, the literature seems to indicate that the SSITs that are valid for predicting
poverty can be equally valid for predicting inequality. By contrast, we argue that SSITs,
by construction, tend to underestimate the predicted inequality, and to obtain accurate
estimates, a different methodological approach should be chosen. Because standard SSITs
are regression based and assume normally distributed residuals, they are more accurate
at predicting the central moments of a distribution (or transformations such as the
poverty headcount) than the shape of its tails, and the tails are crucial for predicting
inequality. These models tend to predict a distribution that is compressed around the
mean and has thin tails, and they therefore underestimate inequality. This point is not
critical if the target parameters relate to poverty; however, as shown by Schluter
(2012), it is a crucial aspect of inequality measures. In this paper, we show that
indicators of inequality obtained from fitted values can be biased if the variables
common to the two surveys, as often happens, are not strongly correlated with the
dependent one, that is, when the $R^2_{adj}$ of the regression is not high enough.
To overcome these limitations, our approach is to estimate a semi-parametric model
that combines the incorporation of the same covariates typically used by SSITs via
the parametric component with a ﬂexible ﬁt to the data at hand via the non-parametric
one. In general, these new techniques would seem particularly well suited to ﬁnd a
compromise between ﬂexibility in terms of adaptation to the indicator to be predicted
and consistency with standard survey-to-survey models. We tested this approach on a
data-set recently used in Morocco (Douidich et al., 2016).
The paper is organized as follows: section 2 describes the data; section 3 frames
the problem and includes simulations based on classic SSITs; section 4 presents the
methodology used to produce results with the classical techniques; section 5
presents the proposed algorithm; section 6 describes the results of the new method; and
section 7 concludes.
1 In cases where the true population total or mean of an independent variable is known, we suggest
using Small Area Estimation methods, in particular the M-quantile method developed by Chambers and
Tzavidis (2006).
2 Survey Procedure
Two sets of surveys were used, the HBS and the LFS. The HBS is carried out about every
seven years, while the LFS is conducted every year. The HBS is used to determine total
family expenditure, which we convert into expenditure per capita expressed in USD.
The Moroccan HBS includes two different surveys: the 2000–2001 National Survey on
Consumption and Expenditure (NSCE) and the 2006–2007 National Living Standards
Survey (NLSS), which both measure household expenditure and are representative of
the national and regional context and of urban and rural areas.
The NSCE covers 15,000 households and was administered between November
2000 and October 2001. The 2006–2007 NLSS, administered between December 2006
and November 2007, was smaller and covered only 7,200 households. The two ques-
tionnaires share many common modules, of which the most relevant address socio-
demographic characteristics, habitat, expenditures, durable goods, education, health,
and employment. The NSCE also has a module on transfers, subjective indicators of
well-being, nutrition, and a module administered to the community to measure access
to services.
Both surveys have the same structure. For urban areas, strata include the region,
province, city size (large, medium, and small) and ﬁve types of housing. For rural areas,
the strata are regions and provinces. The common points between the two surveys,
however, end here. All the surveys conducted up to 2005, including the NSCE, have
a master sample frame that is derived from the 1994 population census. The surveys
conducted after 2005 are, instead, based on the 2004 census.
The 2001 NSCE follows a two-stage sampling process, while the 2007 NLSS follows
a three-stage sampling process. In the first stage of NSCE, 1,250 Primary Sampling
Units (PSUs) of approximately 300 households each were extracted from the 1994
population census. In the second stage, a dozen households per PSU were extracted
randomly to constitute the ﬁnal sample. In the ﬁrst stage of NLSS, 1,848 PSUs of ap-
proximately 600 households each were selected from the 2004 population census. The
second stage subdivided each PSU into twelve Secondary Sample Units (SSUs), repre-
senting about 50 households each, and then randomly selected six of the twelve SSUs
from each PSU. In the third stage, a constant number of households was selected ran-
domly from each SSU.
The LFS, which was first launched in 1976, follows a sampling process that is similar
to that of the 2007 NLSS and shares a number of questions with the two censuses on
which the population frame is based. The LFS questionnaires share only a few questions
with the two HBSs, relating to sociodemographic characteristics, habitat, education,
health, and employment, but, significantly, do not share any information about
expenditure.
3 A Critical Overview of Survey-to-Survey Regression
Techniques
In general, SSITs use a log-linear regression to estimate values for the parameter at
year t, then use those values to predict values for the previous year t − n if only the
independent variables are available. This process is done under the hypothesis that the
parameters are stable over the years. Often, this hypothesis is veriﬁed by estimating
parameters at two non-consecutive years to illustrate that they are similar irrespective of
the selected years. The regression usually uses individual or household expenditures as
the dependent variable. Because expenditure is right-skewed, the variable is
log-transformed to bring it closer to normality, and the predictions are then
transformed back to the original scale.
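In symbols, the procedure just described can be sketched as follows. The notation is ours, not the paper's: $x_{i,t}$ collects the covariates common to both surveys, and the back-transformation is shown before any re-transformation correction.

```latex
\begin{align*}
\ln y_{i,t} &= x_{i,t}'\beta_t + \varepsilon_{i,t}, \\
\hat{y}_{i,t-n} &= \exp\!\left(x_{i,t-n}'\hat{\beta}_t\right),
\qquad \text{assuming } \beta_{t-n} = \beta_t .
\end{align*}
```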
Regression is the best way to obtain predicted values only if all the assumptions are
fulfilled and the selected independent variables have high explanatory power for the
dependent one. This last point is summarized by the $R^2_{adj}$. Often, articles do not
report the $R^2_{adj}$, although multiple regressions are always used. Interactions or
special transformations of the independent variables are also common. Unfortunately,
failure to report the $R^2_{adj}$ hinders the ability to check the accuracy of the
regression. Given the connection between $R^2_{adj}$ and $R^2$, we can argue that the
first is at most equal to, and more likely lower than, the values reported (which are
often around 0.5 or less). A low $R^2$ can result from a selection that prefers a common
set of variables from the two surveys, which is a key requirement for SSITs, over high
explanatory power.
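For reference, the connection invoked here is the standard textbook definition of the adjusted coefficient of determination. The symbols $n$ (number of observations) and $p$ (number of regressors, excluding the intercept) are our notation:

```latex
R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Since $(n-1)/(n-p-1) \geq 1$ whenever $p \geq 1$, the adjusted value can never exceed the unadjusted one, and the gap widens as interactions and transformed regressors are added.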
Demographic variables have been identiﬁed in both surveys, such as age, sex, house-
hold size, or variables related to assets, such as the number of rooms per capita and
measures of physical assets such as the asset index, among other variables. However,
their choice might present two problems. First, it will be difﬁcult to capture changes
in household expenditure using only demographic variables (that change less rapidly).
Second, there is often a weak correlation between dependent and independent variables.
As highlighted by Ketkar and Ketkar (1987), household sociodemographic characteristics
are as important as prices and income as determinants of expenditure patterns.
The standard consumer behavior model states that the economic agent maximizes her
utility subject only to relative prices and to income constraints. The two theories may
both be valid, but if prices and income are excluded from the standard SSIT models, the
sociodemographic variables alone may not adequately explain individual or household
consumption patterns. However, these conclusions must be verified for every data
set and are impossible to determine a priori.
A low value for $R^2$ arising from insufficiently correlated independent variables,
combined with a logarithmic transformation, could also generate problems with
predictions. The logarithmic transformation compresses the tail of skewed and kurtotic
variables, which effectively generates symmetric PDFs and, therefore, Gaussian-like
errors. But the values to be predicted could lie in the tails and, given the low
$R^2_{adj}$, the predictions could all be near the mean, with a sharp decrease in data
variability. In addition, logarithmic transformations are not immune to a
re-transformation bias. There are various ways to solve this issue, but the most
frequently used ones are based on scale correction of the predicted values. Both the
relative Gini index and the relative Theil index are scale invariant,2 which can render
those corrections ineffective at reducing bias.
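The scale-invariance argument can be checked numerically. The sketch below is only an illustration: the fitted values and the correction factor of 1.25 are made up, and the Gini index is computed without survey weights. Multiplying every predicted value by a constant leaves the index unchanged, which is why a pure scale correction cannot repair a downward-biased Gini.

```python
def gini(values):
    """Relative Gini index of non-negative values (sorted-rank formula)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # G = 2 * sum_i(i * x_(i)) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


predicted = [120.0, 340.0, 560.0, 980.0, 2400.0]  # made-up fitted expenditures
rescaled = [1.25 * y for y in predicted]          # hypothetical scale correction

print(round(gini(predicted), 6), round(gini(rescaled), 6))
```

Both printed values coincide, while the mean of the rescaled series is 25% higher.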
To summarize the discussion thus far, using regression to obtain a predictor is the
right choice only if the $R^2_{adj}$ is high, all the assumptions are respected, and the
bias caused by the log-transformation is negligible. In addition, it is important to
bear in mind other potential sources of bias (Newhouse et al., 2014) arising from the
differences between the survey designs and the questionnaires.
4 Estimates of Biases Based on Moroccan Data
In this section, using data from Morocco, we show that the most common regression
techniques can fail to accurately estimate three parameters: the Gini index, the Theil
index, and the Head Count Ratio (HCR) with an endogenous poverty line equal to 60% of
the median value of the regional expenditure per capita. We compare estimates
obtained with a direct estimator from the HBS at the rural/urban level with those
obtained from four different regressions. In all cases, the reported results show a
large downward bias. The following regression techniques are used:
• Log-linear regression over two (split) data sets. We split the data set using a
dummy (0= urban, 1= rural) variable, ran the regressions, and then computed the
indices;
• Log-linear regression over the whole data set. The ﬁtted values obtained are then
split using the dummy (0= urban, 1= rural) variable and then the indices are esti-
mated, both with ordinary least squares and feasible generalized least squares;
• Random forest regression over the split data set, with a log-transformation of the
dependent variable;
• Quantile regression over the split data set with a log-transformation.
The fundamental problem with these regression techniques, even the robust ones, is
that all fitted values will be near the mean value; therefore, variability will be
reduced compared with the observed data. To attenuate this problem, we use two methods
previously used in the literature (inter alia, Douidich et al., 2016 and Newhouse et al., 2014),
2 The property of scale invariance states that inequality remains unchanged when all incomes increase
by the same proportion. See Clementi et al. (2022) for a discussion of differences between (relative)
scale invariant and non-scale invariant (absolute) measures of inequality.
that produce reliable results when data from the census can be used or when the focus is
on a poverty index with an exogenous poverty line. Table 1 reports the direct estimates
of three inequality indices calculated from the 2000 and 2007 HBSs. The standard error
reported in brackets is computed with a bootstrap estimator based on 500 iterations.
Table 1: Estimated Indicators by Sample Unit

Index    2000 Rural      2000 Urban      2007 Rural      2007 Urban
Gini     0.349 (0.006)   0.408 (0.005)   0.353 (0.006)   0.430 (0.006)
Theil    0.240 (0.017)   0.325 (0.016)   0.228 (0.012)   0.361 (0.015)
HCR      0.161 (0.005)   0.209 (0.004)   0.177 (0.007)   0.194 (0.006)
Source: Authors’ calculations based on Moroccan HBS.
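As a rough illustration of how direct estimates and bootstrap standard errors of this kind can be produced, the sketch below computes the three indices on synthetic data. Several simplifications are ours and not the paper's: the data are simulated log-normal draws rather than HBS records, no survey weights are applied, a single median stands in for the regional medians, and all function names are hypothetical.

```python
import math
import random
import statistics


def gini(values):
    """Relative Gini index (sorted-rank formula)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def theil(values):
    """Theil T index: mean of (y / mu) * ln(y / mu); needs positive values."""
    mu = sum(values) / len(values)
    return sum((y / mu) * math.log(y / mu) for y in values) / len(values)


def head_count_ratio(values, share=0.6):
    """Share of units below an endogenous line set at `share` of the median."""
    line = share * statistics.median(values)
    return sum(1 for y in values if y < line) / len(values)


def bootstrap_se(values, index, iterations=500, seed=0):
    """Standard error from resampling with replacement and recomputing."""
    rng = random.Random(seed)
    n = len(values)
    draws = [index(rng.choices(values, k=n)) for _ in range(iterations)]
    return statistics.stdev(draws)


# Synthetic, unweighted stand-in for expenditure per capita (not HBS data).
rng = random.Random(42)
expenditure = [rng.lognormvariate(7.5, 0.7) for _ in range(1000)]

for name, index in [("Gini", gini), ("Theil", theil), ("HCR", head_count_ratio)]:
    estimate = index(expenditure)
    se = bootstrap_se(expenditure, index)
    print(f"{name}: {estimate:.3f} ({se:.3f})")
```

With real survey data, the resampling would normally respect the sampling design (strata and PSUs), which this sketch ignores.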
Before moving to regression techniques, it is necessary to identify the variables that
could be used as regressors. To do this, we must bear in mind that a bias could also
arise from differences between questions and answers in the surveys.3 To better
understand the SSIT results, the trends of dependent and independent variables should be
examined to anticipate trends in our indicators. The sample mean expenditure per capita
in 2000 was USD 1,752.96 for rural areas and USD 3,514.9 for urban areas, which
increased to USD 2,265.9 (+29%) and USD 4,053 (+15%), respectively, in 2007. The bulk of
the independent variables seem to have constant values over the years, with the notable
exception of electricity. In rural areas, the proportion of families with electricity
increased from 0.35 to 0.70.
The ﬁrst model tested was the log-linear regression over the split data set. The
results (see table [A.1]) are consistent with those presented by Douidich et al. (2016).
Illiteracy and a low level of education have a negative impact on expenditure, while a
higher number of rooms per capita has a positive impact, all other things being equal.
3 Once the structure of questions and answers has been checked, we suggest also performing a graphical
analysis of the PDFs and checking position and variability parameters; see Betti et al. (2018). We do not
report this analysis for all the variables since the comparability has already been checked by Douidich
et al. (2016) and we use the same set of variables. It is, however, important to underline that since 2015,
Morocco has officially administered 12 regions. Previously, there were 16 regions. Since that change,
the new regions and the old ones are not directly comparable. In the interest of future comparability,
we decided to use 14 regions, as did the Moroccan statistics department, which unified
the regions that had been merged. For more on this point, we recommend the official survey site. The
only other variable that needs an explanation is the “low level of education,” with a value of 0 for “up to
the 1st cycle” and 1 for “over the 2nd cycle.”
Furthermore, compared to those working in agriculture, those working in manufacturing
or services are better off.
A comparison of the results from these models (table [A.2]) with those in table [1]
clearly shows that the predicted inequality indicators are significantly biased downward.
In a second stage, to increase the $R^2_{adj}$, we regress the model over the whole data
set, expecting an increase in the coefficient of determination. We obtain an $R^2_{adj}$
of 0.5649 for 2000 and of 0.4695 for 2007, an improvement that is observed only in urban
areas. This moderate increase in the $R^2_{adj}$ for urban areas is caused by the
introduction of one new regressor, the variable “milieu = urban (0) and rural (1).”
However, the increase in $R^2_{adj}$ does not significantly reduce the bias. Following
Newhouse et al. (2014), all parameters for the regression coefficients and the
distributions of the error terms are estimated by feasible generalized least squares
(FGLS4). We do not consider the cluster-specific error terms since the clusters have
changed between the two surveys and are not comparable.5
Since the standard linear regression falls short of estimating our distributional
indicators, we next use a random forest regression model and then a quantile regression.
We fit the random forest with p/3 = 5 candidate variables at each split and 500 trees.
The pseudo-$R^2$ is equal to 0.561 for 2000 and 0.489 for 2007. The quantile regression
is performed for each decile. Table [A.2] reports first the random forest results and
then those obtained for the quantile regression for q = 0.5. Here, the goodness of fit
is measured by $R^1(q = 0.5)$ and is equal to 0.349 for 2000 and 0.289 for 2007. Similar
results were obtained for the other deciles.
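The $R^1$ statistic is not defined in the text. Assuming it is the goodness-of-fit measure usually attributed to Koenker and Machado (1999), which is our reading rather than something the paper states, it compares the minimized check-function loss of the fitted model with that of an intercept-only model at the same quantile $q$:

```latex
\begin{align*}
\rho_q(u) &= u\,\bigl(q - \mathbf{1}\{u < 0\}\bigr), \\
\hat{V}(q) &= \min_{b} \sum_{i=1}^{n} \rho_q\!\left(y_i - x_i' b\right), \qquad
\tilde{V}(q) = \min_{b_0} \sum_{i=1}^{n} \rho_q\!\left(y_i - b_0\right), \\
R^1(q) &= 1 - \frac{\hat{V}(q)}{\tilde{V}(q)} .
\end{align*}
```

Under this reading, $R^1(q)$ lies between 0 and 1 like $R^2$, but it is a local measure at quantile $q$ and is not directly comparable with the $R^2_{adj}$ values reported for the linear models.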
These results highlight another problem encountered with the regression techniques.
Usually, robust techniques perform very well even in the presence of outliers. Here,
however, we aim to predict distributional indices rather than a central tendency, so
the extreme values in the tails of the distribution are as important as those near the mean.
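The compression of fitted values discussed in this section can be reproduced on simulated data. The sketch below is a toy example under our own assumptions, not a replication of the Moroccan results: a single hypothetical covariate, log expenditure generated so that the regression explains about half of its variance, and an unweighted Gini index.

```python
import math
import random


def gini(values):
    """Relative Gini index (sorted-rank formula)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


rng = random.Random(1)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Signal and noise have equal variance (0.25 each), so R^2 is close to 0.5.
log_y = [7.0 + 0.5 * xi + rng.gauss(0.0, 0.5) for xi in x]

# Closed-form OLS for one covariate on the log scale.
mean_x = sum(x) / n
mean_log_y = sum(log_y) / n
slope = (sum((a - mean_x) * (b - mean_log_y) for a, b in zip(x, log_y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_log_y - slope * mean_x

observed = [math.exp(v) for v in log_y]
fitted = [math.exp(intercept + slope * xi) for xi in x]  # back-transformed

print("Gini of observed values:", round(gini(observed), 3))
print("Gini of fitted values:  ", round(gini(fitted), 3))
```

The fitted values discard the residual variation, so their distribution is narrower than the observed one and the Gini index computed from them is lower.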
To sum up, the standard linear regression and the other two methods we tested
drastically reduced the variability of the fitted values, which triggered a downward
bias in the distributional indicators.6 Hence, the tested models failed to reproduce
accurate inequality indicators because of the reduced data variability when the fitted
values are used. In section 5, we propose an iterative algorithm that takes its cue from regression
4 This method is also used because the studentized Breusch-Pagan test (BP = 307.78, df = 29, p-value
< 2.2e-16) confirms the presence of heteroscedasticity in the OLS model.
5 We recognize that ignoring the cluster-specific error term is not the ideal approach to
obtain sound estimates for a single year. However, it is the only method we can use to avoid introducing
another bias to the predicted values. There is persistent underestimation of the three parameters that is
not improved by this method.
6 The Ordinary Least Squares (OLS) method is based on the minimization of a function based on
individual values. As is known, the confidence bands for the fitted line are narrowest at the mean and
widen toward the extreme values, where the error increases. This is acceptable if the goal is to predict a
central tendency or something correlated with it but could be problematic if the indicators of interest are
based on values in the tail of a distribution.
techniques but allows ﬂexibility in the hypothesis of normality, changes the objective
function, and does not use a transformation of the dependent variable.
5 The Proposed Algorithm
The previous section indicates that SSITs based on regression techniques do not accu-
rately predict inequality indicators. From this point on, we propose an algorithm that,
using regression techniques as a point of departure, improves the results by removing
hypotheses that can be too stringent and by changing the function to minimize. Our al-
gorithm is based on the Constrained Optimization by Linear Approximation (COBYLA)
method, a numerical optimization method developed by Powell (2007) and Conn et al.
(1997) for constrained problems where the derivative of the objective function is not
known.
Given a dependent variable, $y_i$, observed on $i = 1, \dots, n$ individuals, and a set of $m$
covariates $X_{n \times m}$, we predict $y_i$ as a linear combination of $X_{n \times m}$, as follows:
$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_m x_{mi} \qquad (1)$$
We then define the objective function to minimize as the sum of the squared differences
between the direct estimates of an index, $In_d$, and its empirical version obtained from the
fitted values, $In_e$, that is:
$$\sum_{j=1}^{J} (In_d - In_e)_j^2 \qquad (2)$$
where $J$ is the number of indices used. In this specific case, we used three indicators:
the Gini index (G), Theil index (T), and Head Count Ratio (HCR) with an endogenous
poverty line. Equation [2] becomes:
$$(G_d - G_e)^2 + (T_d - T_e)^2 + (HCR_d - HCR_e)^2 \qquad (3)$$
As the dependent variable is, by definition, non-negative, we introduce the constraint
that the mean of the fitted values must be equal to the mean of the absolute value of the
fitted values. That is:
$$\frac{1}{n}\sum_{i=1}^{n} y_i^{est} - \frac{1}{n}\sum_{i=1}^{n} |y_i^{est}| = 0 \qquad (4)$$
where $y_i^{est}$ is obtained as follows:
$$y_i^{est} = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_m x_{mi} \qquad (5)$$
The other constraint used is that the absolute difference between the mean of the
sampled values and the mean of the fitted ones must not exceed 1, that is:
$$\left| \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n}\sum_{i=1}^{n} y_i^{est} \right| \le 1 \qquad (6)$$
In addition, we impose a lower bound and an upper bound on our parameters. Those bounds
are equal to $(-\infty, 0)$ for those variables that have a negative coefficient in the
regression and to $(0, +\infty)$ for all others. We run a separate regression for each
area and check where the signs agree. If there are differences (in some regressions a
parameter could be either positive or negative), we opt for signs consistent with
economic theory. The COBYLA7 also requires an initial value from which to start. We use
a constant, k, such that k multiplied by the sum of the mean values of the covariates is
equal to the mean of the dependent variable. The regression parameters are not used as
initial values because they differ in scale, with some denominated in tenths and others
in thousandths, which makes convergence particularly difficult.
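The pieces described so far (the objective in eq. (3), the constraints in eqs. (4) and (6), and the starting value) can be sketched as plain functions. This is a minimal illustration under our own assumptions, not the authors' implementation: the data are made up, the indices are unweighted, the sign bounds are omitted, all names are hypothetical, and the constraints are written in the "g(beta) >= 0" form that COBYLA-style solvers commonly expect.

```python
import math
import statistics


def gini(v):
    xs = sorted(v)
    n, total = len(xs), sum(xs)
    return 2.0 * sum(r * x for r, x in enumerate(xs, 1)) / (n * total) - (n + 1.0) / n


def theil(v):
    mu = sum(v) / len(v)
    return sum((y / mu) * math.log(y / mu) for y in v) / len(v)  # needs y > 0


def hcr(v, share=0.6):
    line = share * statistics.median(v)
    return sum(1 for y in v if y < line) / len(v)


def fitted_values(beta, rows):
    """Eq. (5): linear combination of the covariates, no transformation."""
    return [sum(b * x for b, x in zip(beta, row)) for row in rows]


def objective(beta, rows, direct):
    """Eq. (3): squared gaps between direct and fitted-value indices."""
    y_est = fitted_values(beta, rows)
    fitted = (gini(y_est), theil(y_est), hcr(y_est))
    return sum((d - e) ** 2 for d, e in zip(direct, fitted))


def nonnegativity_constraint(beta, rows):
    """Eq. (4) as g >= 0: the expression is never positive, so g >= 0 forces 0."""
    y_est = fitted_values(beta, rows)
    return (sum(y_est) - sum(abs(v) for v in y_est)) / len(y_est)


def mean_constraint(beta, rows, y):
    """Eq. (6) as g >= 0: 1 - |mean(y) - mean(y_est)|."""
    y_est = fitted_values(beta, rows)
    return 1.0 - abs(sum(y) / len(y) - sum(y_est) / len(y_est))


def starting_value(rows, y):
    """Constant k with k * (sum of covariate means) = mean(y), one per covariate."""
    m = len(rows[0])
    col_means = [sum(row[j] for row in rows) / len(rows) for j in range(m)]
    return [(sum(y) / len(y)) / sum(col_means)] * m


# Made-up data: a constant plus two hypothetical covariates, five households.
rows = [[1.0, 0.5, 2.0], [1.0, 1.0, 3.5], [1.0, 1.5, 1.0],
        [1.0, 2.0, 4.0], [1.0, 0.8, 2.5]]
y = [900.0, 1500.0, 1100.0, 2600.0, 1300.0]
direct = (gini(y), theil(y), hcr(y))

beta0 = starting_value(rows, y)
print(objective(beta0, rows, direct),
      nonnegativity_constraint(beta0, rows),
      mean_constraint(beta0, rows, y))
```

Functions of this shape could then be handed to a COBYLA routine, for example SciPy's `scipy.optimize.minimize` with `method="COBYLA"` and constraints of type `"ineq"`, which are interpreted as `fun(x) >= 0`. Which implementation the authors used is not stated in the text.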
To estimate the coefficients’ variability, we use a bootstrap resampling method.
First, we resample the data with replacement, with the resample equal in size to the
original data set. Then we compute the parameters from this resample. We repeat this
routine n = 100 times. Once the parameters and their variance are estimated and the
results are verified, we rebuild estimates for the years in between the two surveys.
Estimation of temporally linked values assumes that “everything is related to everything
else, but near things are more related than distant things.”8 We estimate parameters
only for the initial year (2000) and the final year (2007) and, for the years in
between, use a weighted mean of $\beta_{2000}$ and $\beta_{2007}$. If $\omega_{t_y}$
represents the weights and $\beta_{t_y}$ the parameters, where $t_y$ indicates a generic
year between $min$ and $MAX$, and $min$ and $MAX$ are, respectively, the first and last
year for which we have estimates, estimates are obtained by:
$$\beta_{t_y} = \omega_{t_y}\beta_{min} + (1 - \omega_{t_y})\beta_{MAX} \qquad (7)$$
where:
$$\omega_{t_y} = \frac{MAX - t_y}{MAX - min} \qquad (8)$$
This type of weighting allows us to remove the hypothesis that the regression
coefficients are stable and completely invariant over time. Finally, we obtain:
$$y_{i,t_y}^{est} = \beta_{1,t_y} x_{1i,t_y} + \beta_{2,t_y} x_{2i,t_y} + \dots + \beta_{m,t_y} x_{mi,t_y} \qquad (9)$$
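The weighting in eqs. (7) and (8) amounts to a linear interpolation between the two sets of coefficients. A short sketch follows; the coefficient values are invented for illustration and the function names are ours.

```python
def interpolation_weight(year, first=2000, last=2007):
    """Eq. (8): weight placed on the first-year coefficients."""
    return (last - year) / (last - first)


def interpolate_coefficients(beta_first, beta_last, year, first=2000, last=2007):
    """Eq. (7): element-wise weighted mean of the two coefficient vectors."""
    w = interpolation_weight(year, first, last)
    return [w * b0 + (1.0 - w) * b1 for b0, b1 in zip(beta_first, beta_last)]


beta_2000 = [310.0, 450.0, -120.0]  # made-up coefficients, illustration only
beta_2007 = [380.0, 520.0, -150.0]

for year in range(2000, 2008):
    betas = interpolate_coefficients(beta_2000, beta_2007, year)
    print(year, [round(b, 1) for b in betas])
```

The weight equals 1 in the first year and 0 in the last, so the endpoints reproduce the estimated coefficients exactly and intermediate years lean toward the nearer survey.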
From these values, we decided not to use an expansion estimator because even a
small difference in the sampling scheme, and thus in the way the weights are obtained,
could produce a bias. As shown by Newhouse et al. (2014), a change in the sam-
pling scheme could be the reason that SSITs sometimes underperform. We estimate
7 The stopping criteria used for this algorithm are a tolerance of 1e-8 or a maximum of 100,000
function evaluations.
8 First law of geography, Tobler (1970)
the variable distribution using a maximum likelihood estimator and then draw a certain
number of values from the estimated distribution n times to compute the parameter
of interest. Since the Gini, Theil, and HCR indices are closely associated
with the distribution of the variable, if we can estimate the exact distribution from the
sample data, we can derive an accurate estimate of the indicators. To clarify, the
classic log-normal distribution is used most often for consumption data. Assuming that the
distribution and estimated parameters are correct, we will obtain an accurate estimate
of the indicators. For example, we know that for a given log-normal distribution the
related Gini coefficient is $G = 2\Phi(\sigma/\sqrt{2}) - 1$.
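The log-normal case can be checked numerically: for a log-normal distribution with log-scale standard deviation $\sigma$, the Gini coefficient is $2\Phi(\sigma/\sqrt{2}) - 1$, where $\Phi$ is the standard normal CDF. The sketch below compares this closed form with the Gini index of simulated draws. The parameter values are arbitrary choices of ours, not estimates from the Moroccan data.

```python
import math
import random


def lognormal_gini(sigma):
    """Closed form G = 2 * Phi(sigma / sqrt(2)) - 1 for a log-normal variable."""
    z = sigma / math.sqrt(2.0)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z
    return 2.0 * phi - 1.0


def gini(values):
    """Relative Gini index (sorted-rank formula)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


sigma = 0.7  # assumed log-scale standard deviation
rng = random.Random(7)
draws = [rng.lognormvariate(7.5, sigma) for _ in range(20000)]

print("closed form:", round(lognormal_gini(sigma), 4))
print("simulated:  ", round(gini(draws), 4))
```

The location parameter (7.5 here) does not enter the closed form, which is another expression of the scale invariance of the Gini index.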
It is not possible to obtain a perfect estimate of the distribution as it is almost impos-
sible to select the exact distribution function. Nonetheless, we argue that the bias will
be minimal. The more closely the selected distributions ﬁt the data, the greater the de-
crease in the margin of error. However, it is not strictly necessary for the sample scheme
to be identical, only that both represent the same area for which we need estimates.
6 Results
As mentioned in section 5, we estimate the coefficients of the covariates through
a linear regression and, in most cases, coefficient signs correspond to those previously
estimated in table [A.1]. We prefer a linear regression to a log-linear one because
our algorithm does not transform the dependent variable. The coefficient signs are
key in determining the upper and the lower bounds of our algorithm. Thus, we have
positive values for all coefficients except those on sex of household head, literacy,
schooling level, and household size. These signs are consistent with findings in the
related literature.
Before discussing the results, it is important to further explore one aspect of eq.
[4]. The Gini and HCR indices lie between 0 and 1, the Theil index lies between
0 and log(N), and the value of each direct estimate is known, so the maximum of
[4] can be easily computed. For every index, the largest possible deviation is the
distance between its direct value and whichever of its two bounds, the minimum or the
maximum, is farther away. This is a key step, since,
once the maximum is computed, it is also easy to determine a relative error (RE) for each
algorithm. The value of the estimated parameters, the number of iterations computed by
each algorithm, the value of eq. [4] and its corresponding relative error are summarized
in table [A.3].
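The bounds and the resulting maximum can be sketched as follows. This is one reading of the description above, not the authors' code: `max_objective` and `relative_error` are hypothetical helper names, the objective is assumed to be a plain sum of squared gaps, and the direct estimates in the example are made-up values rather than the paper's.

```python
from math import log


def theil(values):
    """Theil T index; a zero value contributes 0 (the limit of x*log x)."""
    n = len(values)
    mean = sum(values) / n
    total = 0.0
    for v in values:
        if v > 0:
            ratio = v / mean
            total += ratio * log(ratio)
    return total / n


def max_objective(direct, lower, upper):
    """Largest possible sum of squared gaps between each direct estimate
    and any admissible value of the same index."""
    return sum(max(d - lo, hi - d) ** 2 for d, lo, hi in zip(direct, lower, upper))


def relative_error(objective_value, direct, lower, upper):
    """Objective value expressed as a share of its maximum."""
    return objective_value / max_objective(direct, lower, upper)


# One unit holds all expenditure: the Theil index reaches its bound log(N).
n = 1000
extreme = [0.0] * (n - 1) + [1.0]
print(theil(extreme), log(n))

# Hypothetical direct estimates for (Gini, Theil, HCR); not the paper's values.
direct = [0.35, 0.24, 0.16]
lower = [0.0, 0.0, 0.0]
upper = [1.0, log(n), 1.0]
print(relative_error(0.0005, direct, lower, upper))
```

Because the Theil bound grows with log(N), the maximum of the objective, and hence the relative error, depends on the sample size as well as on the direct estimates.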
The first three cases perform similarly, with broadly similar REs. The
last (urban areas for 2007) has the highest RE, about three times that for rural areas
in the same year. It is difficult to say why this last case has such a high RE. One plausible
explanation is that mean expenditure per capita signiﬁcantly increases from 2000 to
2007 in urban areas, but, because of the time-gap, covariates do not follow this pattern.
Since the covariates vary little in urban areas between 2000 and 2007, the algorithm
cannot easily minimize the objective function within the constraints.
The estimated parameters presented in table [A.3] show the expected signs (imposed
by the model) and coefﬁcient sizes. It is also worth noting that their impact decreases
in 2007; those increasing (in absolute terms) are mostly negative, suggesting that being
illiterate or having a low level of education in urban areas becomes more disadvanta-
geous in 2007 in economic terms. Using eq. [10] we compute the ﬁtted value, from
which we obtain the results reported in table 2. With $\hat{\theta}$ as an estimated parameter and
$\theta_{dir}$ the corresponding direct estimate, it is possible to define the Absolute Relative
Bias (ARB):
$$\mathrm{ARB}\% = \left| \frac{\hat{\theta} - \theta_{dir}}{\theta_{dir}} \right| \times 100 \qquad (10)$$
This index facilitates comparison of all models in terms of bias (ﬁgure 1). This
plot demonstrates that the only method that produces a low ARB is the one using the
algorithm, with consistent bias reduction when compared with the other methods.
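The ARB can be illustrated with figures already reported in the paper. The sketch below assumes that the starred 2000 rural row of table 5 (Gini = 0.349, obtained from sample data) plays the role of the direct estimate; the fitted values come from table A2 (regression model by area) and table 2 (algorithm method). The function name `arb_percent` is illustrative.

```python
def arb_percent(estimate: float, direct: float) -> float:
    """Absolute relative bias, in percent, of an estimate against its direct value."""
    if direct == 0:
        raise ValueError("direct estimate must be non-zero")
    return abs(estimate - direct) / abs(direct) * 100.0


# Rural Gini, 2000. Direct value assumed to be the starred row of table 5.
direct_gini = 0.349
arb_regression = arb_percent(0.222, direct_gini)  # regression by area (table A2)
arb_algorithm = arb_percent(0.354, direct_gini)   # algorithm method (table 2)
print(round(arb_regression, 2), round(arb_algorithm, 2))
```

Under this pairing, the regression-based value is off by roughly a third of the direct estimate, while the algorithm-based value differs by less than two percent.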
Table 2: Estimated Indices with Fitted Values, Algorithm Method
                  2000                  2007
Index       Rural     Urban       Rural     Urban
Gini        0.354     0.425       0.356     0.442
           (0.004)   (0.003)     (0.006)   (0.005)
Theil       0.233     0.323       0.230     0.343
           (0.009)   (0.007)     (0.011)   (0.009)
HCR         0.175     0.207       0.167     0.268
           (0.004)   (0.004)     (0.013)   (0.006)
Source: Authors’ calculations based on Moroccan HBS
Figure 1: Absolute Relative Bias (%)
Note: R1 = Regression model for Urban and Rural area, R2 = Regression model over the whole data
set, R3 = Survey-to-survey imputation FGLS, R4 = Random Forest regression model, R5 = Quantile
regression model q=0.5. 1 = Rural 2000, 2 = Urban 2000, 3 = Rural 2007, 4 = Urban 2007.
The final step (see section 4) is to estimate the log-normal distribution by
maximum likelihood (MLE) for both direct and empirical values. Starting with the national distribution
obtained by merging the previous estimates, differences between direct estimates and
empirical ones are small, especially in the meanlog (table 3). An F-test for equality of variances
does not reject the hypothesis that the direct and empirical variances are equal in any case except the
last. For urban areas in 2007, where the difference is largest, the test rejects
the null hypothesis that the two variances are equal. The good news is that this bias,
according to eq. [10], will be mitigated by the results obtained for urban areas in 2000.
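A minimal sketch of this step, using simulated stand-ins rather than survey data: the log-normal MLE has a closed form (mean and standard deviation of the logs, with divisor n), and the F statistic is taken here as the ratio of the sample variances of the logs. Whether the authors applied the test to log values is an assumption, and the function names are illustrative. The standard library has no F distribution, so the sketch stops at the statistic; a p-value would need an external package.

```python
import random
from math import log, sqrt


def lognormal_mle(values):
    """Maximum likelihood estimates (meanlog, sdlog) for strictly positive data."""
    logs = [log(v) for v in values]
    n = len(logs)
    meanlog = sum(logs) / n
    # The MLE of the variance divides by n, not n - 1.
    sdlog = sqrt(sum((x - meanlog) ** 2 for x in logs) / n)
    return meanlog, sdlog


def variance_ratio(values_a, values_b):
    """F statistic: ratio of the unbiased sample variances of the logs."""
    def log_var(values):
        logs = [log(v) for v in values]
        m = sum(logs) / len(logs)
        return sum((x - m) ** 2 for x in logs) / (len(logs) - 1)
    return log_var(values_a) / log_var(values_b)


rng = random.Random(0)
# Simulated stand-ins for direct and fitted expenditures (not survey data).
direct = [rng.lognormvariate(7.6, 0.72) for _ in range(5000)]
fitted = [rng.lognormvariate(7.6, 0.80) for _ in range(5000)]
print(lognormal_mle(direct))
print(lognormal_mle(fitted))
print(variance_ratio(fitted, direct))
```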
Table 3: National Estimated Log-Normal Distribution
                  2000                    2007
            Direct   Empirical      Direct   Empirical
meanlog     7.611     7.577         7.810     7.747
           (0.006)   (0.006)       (0.008)   (0.007)
sdlog       0.722     0.798         0.718     0.808
           (0.005)   (0.004)       (0.007)   (0.005)
Source: Authors’ calculations based on Moroccan HBS.
Table 4: Estimation of Rural and Urban Log-Normal Parameters
                          2000                                    2007
                 Rural              Urban               Rural              Urban
             Dir.     Emp.      Dir.     Emp.       Dir.     Emp.      Dir.     Emp.
meanlog     7.265    7.265     7.880    7.827      7.518    7.521     8.001    7.907
           (0.007)  (0.007)   (0.008)  (0.010)    (0.114)  (0.011)   (0.010)  (0.015)
sdlog       0.595    0.604     0.696    0.753      0.615    0.608     0.716    0.928
           (0.006)  (0.006)   (0.007)  (0.006)    (0.009)  (0.008)   (0.009)  (0.013)
Source: Authors’ calculations based on Moroccan HBS.
Applying eq. [10], we obtain the ﬁgures shown in table [5], which reports the en-
tire time series of inequality indicators between 2000 and 2007; 2000 and 2007 are
calculated based on the sample data while the rest is imputed.
During the period considered, inequality slightly increases in urban areas while re-
maining relatively stable in rural ones. Growth in the early 2000s is less volatile than in
the previous decade and posts substantially higher annual rates (Clementi et al., 2022).
This growth, however, has not been accompanied by rapid economic modernization or
signiﬁcant labor market outcomes. Indeed, the Moroccan economy created an aver-
age of 115,000 jobs annually between 2000 and 2010. However, this could not absorb
the annual increase in the working-age population (Lopez-Acevedo et al., 2021). Over
the same period, the slight increase in inequality resulted from two counter-balancing
trends: convergence of development across regions and increased intra-region inequality
in some regions. Indeed, inequality increased mainly in the coastal, urbanized regions
while decreasing in the — mainly rural — inner and eastern parts of the country (HCP,
2018).
As additional veriﬁcation of the validity of this trend, we compared inequality mea-
sures in rural areas with average annual precipitation data.9 As the Moroccan economy
highly depends on agriculture, we expect a signiﬁcant correlation between rainfall and
inequality. In periods of drought, we expect inequality to increase since most of the
agriculture is rainfed and only a few very developed areas (for example the Settat area)
are equipped with modern irrigation systems that enable farmers to withstand shocks,
which exacerbate rural inequality. In periods of abundant rainfall, we anticipate bet-
ter performance overall of the agricultural economy and thus a decline in inequality. As
expected, the correlation coefficients between the three indices and these data are
negative: G = -0.75, T = -0.67, and HCR = -0.15. From these, we can deduce that the trend of the
indices is consistent with rainfall patterns, supporting our hypothesis and
the economic significance of our estimated results.
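The check itself is a plain Pearson correlation between an index series and annual rainfall. In the sketch below the rural Gini series is taken from table 5, but the rainfall series is a made-up placeholder used only to show the computation; it is not the Climate Knowledge Portal data and will not reproduce the coefficients reported above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Rural Gini, 2000-2007, from table 5.
rural_gini = [0.349, 0.328, 0.293, 0.305, 0.329, 0.330, 0.323, 0.353]
# Placeholder rainfall series (mm/year): made-up numbers for illustration only.
rainfall_mm = [250.0, 300.0, 380.0, 360.0, 310.0, 290.0, 320.0, 260.0]
print(pearson(rural_gini, rainfall_mm))
```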
9 https://climateknowledgeportal.worldbank.org/country/morocco/climate-data-historical
Table 5: Predicted Values (2000-2007)
                        Weighted weights             Log-normal dist.
Year    Area      Gini      Theil     HCR          meanlog   sdlog
2000*   Urban     0.408     0.325     0.209        7.880     0.696
                 (0.005)   (0.016)   (0.004)      (0.010)   (0.008)
        Rural     0.349     0.240     0.161        7.265     0.595
                 (0.006)   (0.017)   (0.005)      (0.007)   (0.006)
2001    Urban     0.464     0.383     0.279        7.659     0.876
                 (0.003)   (0.009)   (0.003)      (0.057)   (0.004)
        Rural     0.328     0.180     0.197        7.242     0.600
                 (0.002)   (0.003)   (0.004)      (0.021)   (0.007)
2002    Urban     0.416     0.300     0.254        7.736     0.775
                 (0.003)   (0.005)   (0.003)      (0.002)   (0.003)
        Rural     0.293     0.142     0.168        7.419     0.533
                 (0.002)   (0.002)   (0.003)      (0.011)   (0.010)
2003    Urban     0.427     0.318     0.261        7.682     0.798
                 (0.003)   (0.006)   (0.003)      (0.004)   (0.003)
        Rural     0.305     0.154     0.178        7.445     0.554
                 (0.002)   (0.002)   (0.003)      (0.007)   (0.009)
2004    Urban     0.453     0.365     0.275        7.594     0.855
                 (0.003)   (0.008)   (0.003)      (0.014)   (0.002)
        Rural     0.329     0.181     0.198        7.400     0.602
                 (0.002)   (0.003)   (0.003)      (0.005)   (0.003)
2005    Urban     0.424     0.332     0.265        7.619     0.815
                 (0.001)   (0.002)   (0.001)      (0.002)   (0.112)
        Rural     0.330     0.181     0.198        7.448     0.602
                 (0.002)   (0.003)   (0.003)      (0.004)   (0.006)
2006    Urban     0.415     0.300     0.254        7.544     0.777
                 (0.009)   (0.006)   (0.003)      (0.012)   (0.011)
        Rural     0.323     0.173     0.194        7.406     0.589
                 (0.002)   (0.002)   (0.004)      (0.007)   (0.003)
2007*   Urban     0.430     0.361     0.194        8.001     0.716
                 (0.006)   (0.015)   (0.006)      (0.001)   (0.009)
        Rural     0.353     0.228     0.177        7.518     0.615
                 (0.006)   (0.012)   (0.007)      (0.114)   (0.009)
Source: Authors’ calculations based on Moroccan HBS and LFS.
Note: * = year for which the indices are obtained from sample data.
7 Conclusion
This paper aims to contribute to the ongoing debate on ways to improve the accuracy
and timeliness of welfare statistics in developing countries experiencing severe budget
constraints. Only a few developing countries have the capacity to collect annual data on
income or expenditure. Indicators such as inequality or poverty rates can therefore be
computed in these countries only when Household Budget Surveys are available, about
every four to five years and sometimes only every seven. To overcome this problem,
methods have been developed to compare these indicators over time using surveys that
are not directly comparable. These survey-to-survey imputation techniques (SSITs) have
proven effective at predicting
poverty indicators but are much less accurate when used for inequality indicators.
To illustrate this limitation, we conducted several simulations based on data from
Moroccan Household Budget Surveys and Labor Force Surveys. Our results indicate
that the predicted inequality measures are, at best, one-third smaller than those directly
estimated from the data. Other regression methods were implemented, but none seems
to significantly reduce the negative bias in the Gini, Theil, and HCR indices. Our theoretical
explanation points to two main limitations of standard SSIT models
based on linear regressions: the overly stringent assumption that residuals are normally
distributed, and the fact that regression-based models predict a distribution that is
compressed around the mean and has thin tails. Unfortunately, the shape of the tails is
crucial for estimating inequality correctly. Thus, almost by design, these models tend to
produce estimates that are far below the correct values.
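The compression effect can be reproduced in a few lines on simulated data. The sketch below is an illustration under stated assumptions, not a replication of the paper's simulations: it uses a log-linear model with a single covariate, normal noise, and arbitrary parameter values. It compares the Gini coefficient of the simulated expenditures with that of the regression's fitted values, which discard the residual variation.

```python
import random
from math import exp


def gini(values):
    """Gini coefficient from the sorted-rank formula (non-negative data)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


rng = random.Random(1)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Simulated log expenditure: linear signal plus unexplained noise.
log_y = [7.5 + 0.4 * xi + rng.gauss(0.0, 0.6) for xi in x]

# Ordinary least squares of log_y on x (single covariate, closed form).
mx, my = sum(x) / n, sum(log_y) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(x, log_y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx

actual = [exp(v) for v in log_y]
fitted = [exp(intercept + slope * xi) for xi in x]  # residual variation is dropped

print(gini(actual), gini(fitted))
```

The fitted values carry only the part of the dispersion explained by the covariate, so their Gini coefficient falls well below that of the simulated expenditures, whatever the quality of the coefficient estimates.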
The method we propose is based on an algorithm that minimizes the sum of the
squared differences between a certain number of direct estimates of an index and its
empirical version obtained from the predicted values. With this algorithm, we reduce
the bias and obtain results that are not systematically biased. When the estimated results
are compared with those directly estimated on the original sample, the bias is negligible.
Furthermore, the estimates of inequality indices for the years in which only labor force
data are available seem to be consistent with Moroccan economic trends.
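The idea can be sketched on simulated data as follows. This is not the authors' implementation: the optimizer is a simple bounded random search standing in for the derivative-free methods the paper cites, the data are simulated, the poverty line is a hypothetical 60% of the simulated median, and all function names are illustrative. The bounds encode the sign constraints on the coefficients (here, both non-negative).

```python
import random
from math import log


def gini(values):
    """Gini coefficient via the sorted-rank formula."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    return 2.0 * sum((i + 1) * x for i, x in enumerate(xs)) / (n * total) - (n + 1) / n


def theil(values):
    """Theil T index for strictly positive values."""
    mean = sum(values) / len(values)
    return sum((v / mean) * log(v / mean) for v in values) / len(values)


def hcr(values, poverty_line):
    """Share of values below the poverty line."""
    return sum(v < poverty_line for v in values) / len(values)


def objective(beta, rows, direct, poverty_line):
    """Sum of squared gaps between direct indices and indices of fitted values."""
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    if min(fitted) <= 0:  # the indices above need positive expenditures
        return float("inf")
    empirical = (gini(fitted), theil(fitted), hcr(fitted, poverty_line))
    return sum((d - e) ** 2 for d, e in zip(direct, empirical))


def bounded_random_search(f, start, lower, upper, steps=2000, scale=0.1, seed=0):
    """Derivative-free search: keep a clipped random step only if it lowers f."""
    rng = random.Random(seed)
    best, best_val = list(start), f(start)
    for _ in range(steps):
        cand = [min(hi, max(lo, b + rng.gauss(0.0, scale * (hi - lo))))
                for b, lo, hi in zip(best, lower, upper)]
        val = f(cand)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val


rng = random.Random(42)
rows = [(1.0, rng.uniform(0.0, 1.0)) for _ in range(300)]  # intercept + one covariate
truth = [rng.lognormvariate(7.5, 0.6) for _ in rows]       # simulated expenditures
line = 0.6 * sorted(truth)[len(truth) // 2]                # hypothetical poverty line
direct = (gini(truth), theil(truth), hcr(truth, line))

start = [1500.0, 500.0]
lower, upper = [0.0, 0.0], [5000.0, 10000.0]  # sign constraint: coefficients >= 0


def f(beta):
    return objective(beta, rows, direct, line)


beta_hat, value = bounded_random_search(f, start, lower, upper)
print("objective at start:", f(start))
print("objective after search:", value)
```

The coefficients found this way are chosen to reproduce the direct indices, not to predict individual expenditures, so they should not be read as regression estimates.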
In the future, it would be interesting to explore the implementation and estimation
of HCR for two or more areas where the poverty line stands at 60% of the national
median. An alternative approach would be to identify the starting points to estimate the
parameter and its variance. Other distributions of consumption should be tested in order
to reduce the margin of error. The authors are working on the log-student-t distribution.
For this paper, it is evident that bias reduction, realistic rebuilding of the values, and the
economic interpretation of results represent a good starting point.
References
Betti, G., Bici, R., Neri, L., Sohnesen, T. P., & Thomo, L. (2018). Local poverty and
inequality in Albania. Eastern European Economics, 56(3), 223–245.
Chambers, R., & Tzavidis, N. (2006). M-quantile models for small area estimation.
Biometrika, 93(2), 255–268.
Christiaensen, L., Lanjouw, P., Luoto, J., & Stifel, D. (2012). Small area estimation-
based prediction methods to track poverty: Validation and applications. The
Journal of Economic Inequality, 10(2), 267–297.
Clementi, F., Fabiani, M., Molini, V., Schettino, F., & Kahn, H. (2022). Polarization and
its discontents: Morocco before and after the Arab Spring. Journal of Economic
Inequality, forthcoming.
Conn, A. R., Scheinberg, K., & Toint, P. L. (1997). On the convergence of derivative-
free methods for unconstrained optimization. Approximation theory and opti-
mization: tributes to MJD Powell, 83–108.
Corral, P., Himelein, K., McGee, K., & Molina, I. (2021). A map of the poor or a poor
map? Mathematics, 9(21), 2780.
Cuesta, J., & Ibarra, L. G. (2017). Comparing cross-survey micro imputation and macro
projection techniques: Poverty in post revolution Tunisia. Journal of Income
Distribution, 1–30.
Dabalen, A., Graham, E. G., Himelein, K., & Mungai, R. (2014). Estimating poverty
in the absence of consumption data: The case of Liberia. World Bank Policy
Research Working Paper, (7024).
Dang, H.-A., Jolliffe, D., & Carletto, C. (2019). Data gaps, data incomparability and
data imputation: A review of poverty measurement methods for data-scarce
environments. Journal of Economic Surveys, 33(3), 757–797.
Demombynes, G., & Hoogeveen, J. G. (2007). Growth, inequality and simulated poverty
paths for Tanzania, 1992–2002. Journal of African Economies, 16(4), 596–628.
Douidich, M., Ezzrari, A., Van der Weide, R., & Verme, P. (2016). Estimating quarterly
poverty rates using labor force surveys: A primer. The World Bank Economic
Review, 30(3), 475–500.
Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003). Micro-level estimation of poverty and
inequality. Econometrica, 71(1), 355–364.
Elbers, C., Lanjouw, P. F., Mistiaen, J. A., Özler, B., & Simler, K. (2004). On the unequal
inequality of poor communities. The World Bank Economic Review, 18(3), 401–
421.
Ketkar, K. W., & Ketkar, S. L. (1987). Socio-demographic dynamics and household
demand. Eastern Economic Journal, 13(1), 55–62.
Krafft, C., Assaad, R., Nazier, H., Ramadan, R., Vahidmanesh, A., & Zouari, S. (2019).
Estimating poverty and inequality in the absence of consumption data: An
application to the Middle East and North Africa. Middle East Development Journal,
11(1), 1–29.
Lopez-Acevedo, G., Betcherman, G., Khellaf, A., & Molini, V. (2021). Morocco’s jobs
landscape: Identifying constraints to an inclusive labor market. World Bank
Publications.
Mathiassen, A. (2013). Testing prediction performance of poverty models: Empirical
evidence from Uganda. Review of Income and Wealth, 59(1), 91–112.
Newhouse, D. L., Shivakumaran, S., Takamatsu, S., & Yoshida, N. (2014). How survey-
to-survey imputation can fail. World Bank Policy Research Working Paper, (6961).
Powell, M. J. (2007). A view of algorithms for optimization without derivatives.
Mathematics Today-Bulletin of the Institute of Mathematics
and its Applications, 43(5), 170–174.
Rao, J. N. K., & Molina, I. (2015). Small area estimation. John Wiley & Sons, Inc.
Schluter, C. (2012). On the problem of inference for inequality measures for heavy-tailed
distributions. The Econometrics Journal, 15(1), 125–153.
Tarozzi, A., & Deaton, A. (2009). Using census and survey data to estimate poverty and
inequality for small areas. Review of Economics and Statistics, 91(4), 773–792.
Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46, 234–240.
Appendix
Table A1: Regression Parameters in Urban and Rural Areas (2000 and 2007)
                         Estimate 2000                 Estimate 2007
                      Rural         Urban           Rural         Urban
Intercept             6.848 ***     6.939 ***       7.316 ***     7.20 ***
Log(age)              0.124 ***     0.135 ***       0.050         0.057
Sex                  -0.055 *      -0.044 *        -0.035        -0.006
Reg2                 -0.027         0.014          -0.23 **      -0.05
Reg3                  0.114 .      -0.139 ***      -0.264 **     -0.214 ***
Reg4                  0.258 ***    -0.061           0.117        -0.103 .
Reg5                  0.053        -0.165 ***      -0.127         0.022
Reg6                  0.057        -0.158 ***      -0.01         -0.167 ***
Reg7                  0.630 ***     0.290 ***       0.130         0.080 .
Reg8                  0.330 ***     0.058          -0.136        -0.016
Reg9                  0.232 ***    -0.055          -0.206 *      -0.173 **
Reg10                 0.208 ***    -0.059          -0.05         -0.165 **
Reg11                -0.151 *      -0.175 ***      -0.122        -0.22 ***
Reg12                 0.180 **     -0.124 **       -0.013        -0.074
Reg13                 0.330 ***     0.076 .        -0.037        -0.279 ***
Reg14                 0.264 ***     0.035          -0.034         0.093 .
Unliter.*            -0.115 ***    -0.161 ***      -0.137 ***    -0.232 ***
Electricity*          0.277 ***     0.380 ***       0.259 ***     0.296 ***
Low school-level*    -0.3 ***      -0.302 ***       0.020        -0.197 ***
Roompc2              -0.079 ***    -0.076 ***      -0.068 ***    -0.1 ***
Hhldsize2             0.002 ***     0.001 *         0.005 ***     0.002 **
Employment*           0.155 **      0.182 ***       0.055         0.148 **
Household Size       -0.09 ***     -0.063 ***      -0.127 ***    -0.087 ***
Inactive*             0.130 *       0.258 ***       0.047         0.334 ***
Room per cap.         0.623 ***     0.801 ***       0.567 ***     0.873 ***
Ter. Sector*          0.015         0.238 ***       0.093 **      0.110 ***
Sec. Sector*          0.136 ***     0.104 ***       0.265 ***     0.416 ***
Pubb. Empl.*          0.014         0.004          -0.104 **      0.043 *
Married               0.019        -0.005           0.091 *       0.075 *
Source: Authors’ calculations based on Moroccan HBS.
Note: Variable markers: ** = sex of the householder (0 = man, 1 = woman); * = dummy variable
referring to the householder or the house (0 = no, 1 = yes).
Table A2: Empirical Gini, Theil, and HCR Indices from Fitted Values
                  2000                  2007
Index       Rural     Urban       Rural     Urban
Regression model for Urban and Rural area
Gini        0.222     0.300       0.219     0.299
           (0.003)   (0.003)     (0.003)   (0.004)
Theil       0.088     0.161       0.081     0.159
           (0.002)   (0.004)     (0.003)   (0.005)
HCR         0.031     0.106       0.046     0.104
           (0.002)   (0.004)     (0.004)   (0.005)
Regression model over the whole data set
Gini        0.247     0.278       0.256     0.253
           (0.003)   (0.002)     (0.005)   (0.004)
Theil       0.115     0.134       0.118     0.115
           (0.004)   (0.003)     (0.006)   (0.008)
HCR         0.031     0.090       0.087     0.084
           (0.002)   (0.003)     (0.007)   (0.005)
Survey-to-survey imputation FGLS
Gini        0.224     0.252       0.230     0.220
           (0.001)   (0.001)     (0.002)   (0.002)
Theil       0.087     0.105       0.088     0.079
           (0.001)   (0.001)     (0.002)   (0.002)
HCR         0.268     0.012       0.316     0.021
           (0.004)   (0.002)     (0.001)   (0.011)
Random forest regression model
Gini        0.222     0.299       0.215     0.299
           (0.003)   (0.003)     (0.003)   (0.004)
Theil       0.086     0.157       0.079     0.155
           (0.002)   (0.003)     (0.002)   (0.005)
HCR         0.049     0.120       0.047     0.071
           (0.003)   (0.004)     (0.005)   (0.005)
Quantile regression model q=0.5
Gini        0.246     0.278       0.256     0.261
           (0.003)   (0.002)     (0.004)   (0.003)
Theil       0.111     0.132       0.119     0.118
           (0.004)   (0.003)     (0.005)   (0.004)
HCR         0.034     0.091       0.076     0.081
           (0.002)   (0.004)     (0.006)   (0.004)
Source: Authors’ calculations based on Moroccan HBS.
Table A3: Estimation of the Algorithm by Parameter
                              2000                      2007
Parameter               Rural      Urban          Rural      Urban
log(age)                33.329     38.762         76.161     10.22
                        (8.858)   (12.692)        (6.647)   (12.16)
Sex**                   -9.689    -40.827        -17.915    -41.608
                        (6.121)    (9.513)        (9.804)   (21.143)
Unliterary*             -8.625    -39.2          -10.232    -33.304
                        (5.517)   (10.733)        (8.01)    (22.914)
Electricity*            79.099     29.932         78.351     16.032
                        (7.312)    (9.705)        (8.801)   (12.067)
School-level*           -7.341    -39.089        -10.931    -56.25
                        (5.576)   (11.039)        (7.137)   (22.227)
Roompc2                 81.917     33.189         64.259     32.888
                        (5.988)   (15.183)       (16.176)   (22.49)
Hhldsize2               27.102     94.116         45.403    106.98
                        (1.552)    (4.037)        (2.745)    (5.925)
Employee*               79.51      40.955         71.25      13.088
                        (6.877)   (11.953)       (13.76)    (11.634)
Household Size         -13.403    -53.406        -14.251    -75.127
                        (8.952)   (14.347)        (1.412)   (21.315)
Inactive*               76.823     49.091         54.147     24.974
                        (6.087)   (13.572)       (17.247)   (14.862)
Room per cap.           78.579     36.9           50.111      8.349
                        (6.954)   (11.225)       (15.159)    (1.269)
Ter. Sector*            75.436     39.092         59.612     21.122
                        (6.893)   (10.191)       (19.61)    (16.823)
Sec. Sector*            63.27      37.287         53.303     22.181
                       (17.695)    (7.439)       (16.51)    (17.134)
Pubb. employee*         69.907     39.669         52.742     20.146
                       (14.13)     (7.472)       (16.686)   (15.656)
Married                 74.462     34.356         50.417      1.768
                        (7.104)    (9.581)       (13.522)    (6.688)
Number of iterations  2272.5      886.46        1471.26    5923.3
Min. function value      0.000516   0.000417       0.0002     0.00624
Relative error (RE)      0.006‰     0.005‰         0.003‰     0.009‰
Source: Authors’ calculations based on Moroccan HBS.
Note: Variable markers: ** = sex of the householder (0 = man, 1 = woman); * = dummy variable
referring to the householder or the house (0 = no, 1 = yes).
Figure A1: Gini, Theil, and HCR Indices (2000-2008)