WPS8630
Policy Research Working Paper 8630
sae
A Stata Package for Unit Level Small Area Estimation
Minh Cong Nguyen
Paul Corral
João Pedro Azevedo
Qinghua Zhao
Poverty and Equity Global Practice
October 2018
Abstract
This paper presents a new family of Stata functions devoted relevant variables such as welfare or caloric intake. The
to small area estimation. Small area methods attempt to family of functions introduced follow a modular design to
solve low representativeness of surveys within areas, or have the flexibility with which these can be expanded in
the lack of data for specific areas/sub-populations. This is the future. This can be accomplished by the authors and/
accomplished by incorporating information from outside or other collaborators from the Stata community. Thus far,
sources. Such target data sets are becoming increasingly a major limitation of such analysis in Stata has been the
available and can take the form of a traditional population large size of target data sets. The package introduces new
census, but also large scale administrative records from tax mata functions and a plugin used to circumvent memory
administrations, or geospatial information produced using limitations that inevitably arise when working with big data.
remote sensing. The strength of these target data sets is From an estimation perspective, the paper starts by imple-
their granularity on the subpopulations of interest, however, menting a methodology that has been widely used for the
in many cases they lack the ability to collect analytically production of several poverty maps.
This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide
open access to its research and make a contribution to development policy discussions around the world. Policy Research
Working Papers are also posted on the Web at http://www.worldbank.org/research. The authors may be contacted at
mnguyen3@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Produced by the Research Support Team
sae: A Stata Package for Unit Level Small Area Estimation
Minh Cong Nguyen,∗ Paul Corral,∗ João Pedro Azevedo,∗ and Qinghua Zhao†‡

October 2018
Key words: Small area estimation, ELL, Poverty mapping, Poverty map, Big Data, Geospatial
JEL classiﬁcation: C55, C87, C15
∗ Minh Cong Nguyen, Paul Corral, and João Pedro Azevedo: The World Bank Group - Poverty and Inequality Global
Practice
† Qinghua Zhao: The World Bank Group - Development Economics Research Group
‡ The authors acknowledge ﬁnancial support from the World Bank. We also thank Ken Simler, and Roy Van der Weide
for comments on an earlier draft and several discussions. We thank Alexandru Cojocaru, William Seitz, Kristen Himelein,
Fernando Morales, Pallavi Vyas, and David Newhouse for beta testing earlier versions of the command. We also thank Carolina
Sanchez, Francisco Ferreira, and Luis Felipe Lopez-Calva for providing support and space to work on this. Finally we thank the
Global Solutions Group on Welfare Measurement and Statistical Capacity, as well as all attendants of the ECAPOV Seminar
Series and Summer University courses on Small Area Estimation.
Any error or omission is the authors’ responsibility alone.
1 Introduction
Household surveys that are designed for national or sub-national (i.e. regions or states) level parameter
estimates often lack a suﬃciently large sample size to generate accurate direct estimates for smaller domains
or sub-populations.1 Any area for which the sample is not suﬃciently large to produce an estimate of
adequate precision is referred to as a small area. Consequently, small area methods attempt to address the low
representativeness of surveys within areas, or the lack of data for specific areas/sub-populations. This is
accomplished by incorporating information from supplementary sources. In the methodology presented by
Elbers, Lanjouw, and Lanjouw (2003), the desired estimate is produced by a model estimated using survey data
which is then linked to outside information such as a population census, administrative records, and/or
geospatial information obtained through remote sensing. As Rao and Molina (2015) state, the availability
of these auxiliary data as well as a valid model are essential for obtaining successful small area estimates
(SAE).
In the case of poverty measures, which are nonlinear functions, small area estimation methods coupled
with Monte Carlo simulations are a useful statistical technique for monitoring poverty along with its spatial
distribution and evolution.2 Poverty estimates are often produced by statistical agencies, and commonly are
the product of a household survey. Household surveys usually are the main source of welfare (expenditure
or income) used to produce poverty estimates, yet most are only reliable up to a certain geographical level.
Therefore, a commonly adopted solution has been to borrow strength from a national census, administrative
records, and/or geospatial information which allows for representativeness for small areas. Nevertheless,
these outside data sources often lack detailed expenditure or income information required for producing
estimates of poverty. Small area methods attempt to exploit each data source's attributes to obtain estimators that
can be used at disaggregated levels.
Poverty at lower geographical levels can be used to identify areas that are in need of attention, or that may
be lagging behind the rest of the country. For example, after an earthquake that occurred on April 16, 2016,
the Government of Ecuador relied on small area estimates of poverty to decide where help was needed
most. The small area estimates of poverty for Ecuador, which had been released not long before the earthquake,
proved to be an invaluable resource for the rebuilding effort in the country.
One of the most common small area methods used for poverty estimates is the one proposed by Elbers,
Lanjouw, and Lanjouw (2003, henceforth ELL).3 This methodology has been widely adopted by the World
Bank and has been applied in numerous poverty maps4 conducted by the institution. In its eﬀorts to
make the implementation of the ELL methodology as straightforward as possible, the World Bank created
a software package that could be easily used by anyone. The software, PovMap (Zhao, 2006),5 has proven
to be an invaluable resource for the World Bank as well as for many statistical agencies, line ministries, and
other international organizations seeking to create their own small area estimates of poverty. The software
is freely available and has a graphical user interface which simpliﬁes its use. Nevertheless, more advanced
1 For example districts, municipalities, migrant populations, or populations with disabilities.
2 Poverty is a nonlinear function of welfare, consequently small area estimation methods of linear characteristics are invalid
(Molina and Rao, 2010). A proposed solution to this problem is to use Monte Carlo simulation to obtain multiple vectors of
the measure of interest (Elbers, Lanjouw, and Lanjouw, 2003).
3 For a detailed discussion of diﬀerent applications of small area poverty estimates by the World Bank, readers should refer
to Bedi et al. (2007); for other applications readers should refer to Rao and Molina (2015).
4 Poverty map is the common name within the World Bank for the methodology where the obtained estimates are mapped
for illustrative purposes.
5 downloadable from: http://iresearch.worldbank.org/PovMap/PovMap2/setup.zip
practitioners who may wish for more functionality and options may have to program it themselves. In an
eﬀort to simplify the process, we have created a family of Stata commands which implement small area
estimation methods, and provide users with a valid, modular, and ﬂexible alternative to PovMap. All of the
results produced with sae have been bench-marked with the well established PovMap software, and are an
exact match.6
In the following section we discuss the linking model and focus on the first stage of the estimation. We then
discuss how the variance of the location effect present in the linking model is accounted for, beginning with
the ELL methodology for decomposing the variance parameters of the first-stage residuals. This is followed
by Henderson's Method III decomposition of the residuals' variance parameters and the adaptation of Huang
and Hidiroglou's (2003) GLS implemented by Van der Weide (2014) in an update to the PovMap software.
We then present Empirical Bayes prediction of the location effects, and finally the command's syntax along
with examples of its use.
2 First Stage: Model
When survey data is not suﬃciently precise for reliable estimates at low geographical levels, small area
estimation can be implemented. This commonly relies on borrowing strength from outside sources to generate
indirect estimates (Rao and Molina, 2015). For example, when attempting to obtain welfare estimates at low
geographical levels, it is possible to borrow strength from census data. Censuses in most cases do not collect
suﬃcient information on incomes and/or expenditures. On the other hand household surveys tend to collect
detailed information on incomes and/or expenditures which allows the generation of welfare statistics such
as poverty and inequality measures. Although welfare statistics can be obtained using household surveys,
these are rarely sufficiently large to allow for statistics corresponding to small areas (Tarozzi and Deaton,
2009). Small area estimation methods attempt to exploit the large sample size of census data and combine
this with the information from household surveys in an attempt to obtain reliable statistics for small areas.
In its essence the small area estimation presented here relies on using household survey data to estimate
a joint distribution of the variable of interest and observed correlates, and use the parameters to simulate
welfare using census data (Demombynes et al., 2008). Throughout the text we focus on welfare imputation,
since this is what the Elbers, Lanjouw, and Lanjouw (2003) methodology focused on, nevertheless it can be
applied for other continuous measures aside from welfare.
The first step towards small area welfare simulation is a per-capita7 welfare model, estimated via ordinary
least squares (OLS) or weighted least squares (WLS):

$$\ln Y = X\beta_{OLS} + u \tag{1}$$

where $\ln Y$ is a $N \times 1$ vector of the households' welfare measure (usually in logarithms, but not
necessarily),8 $X$ is a $N \times K$ matrix of household characteristics, and $u$ is a $N \times 1$ vector of disturbances.
To achieve valid small area estimates it is necessary that the set of explanatory variables, X, used in the
6 For the 1st stage of the analysis. The second stage relies on Monte-Carlo simulations, and thus diﬀerences are likely to
arise.
7 Other household equivalization methods are also feasible, for example adult equivalent household welfare.
8 Logarithmic transformation is usually preferred since it makes the data more symmetrical (Haslett et al., 2010).
ﬁrst stage model can also be found in the census data. It is important that the variables are compared
beforehand to verify that not only their deﬁnitions are in agreement, but also their distributions.9
In the design of household surveys, clusters are commonly the primary sampling unit. Households within
a cluster are usually not independent from one another. To allow for the clustering of households and their
interrelatedness, it is possible to specify equation (1) for a household as:

$$\ln y_{ch} = x_{ch}\beta + u_{ch}$$

where the indexes $c$ and $h$ stand for cluster and observation, respectively, and the disturbances ($u_{ch}$) are
assumed to have the following specification (Haslett et al., 2010):

$$u_{ch} = \eta_c + e_{ch} \tag{2}$$

where $\eta_c$ and $e_{ch}$ are assumed to be independent from each other with different data generating processes
(Haslett et al., 2010).10 Therefore the resulting model we wish to estimate is a linear mixed model (Van der
Weide, 2014).11
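The data generating process in equations (1) and (2) is straightforward to simulate, which is useful for testing any implementation of the methods that follow. A minimal sketch in Python (all names and parameter values are illustrative, not part of the sae package):

```python
import numpy as np

rng = np.random.default_rng(0)
C, n_h = 50, 40                      # clusters, households per cluster
N = C * n_h
cluster = np.repeat(np.arange(C), n_h)

x = rng.normal(size=N)               # one household characteristic
eta = rng.normal(0, 0.3, C)          # cluster (location) effect, eq. (2)
e = rng.normal(0, 0.6, N)            # idiosyncratic error
u = eta[cluster] + e                 # composite disturbance u_ch
lny = 1.0 + 0.5 * x + u              # welfare model, eq. (1)

# First-stage OLS of ln y on X = [1, x]
X = np.column_stack([np.ones(N), x])
beta_ols = np.linalg.lstsq(X, lny, rcond=None)[0]
u_hat = lny - X @ beta_ols           # residuals used in the decompositions below
```

OLS remains consistent for $\beta$ here, but because $\eta_c$ induces within-cluster correlation it is not efficient, which is what motivates the GLS estimators discussed later.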
The literature has suggested diﬀerent methods for estimating the unconditional variances of these param-
eters;12 for the purposes of this paper focus is given to the methods presented by Elbers, Lanjouw, and
Lanjouw (2003), and the adaptation of Henderson’s Method III by Huang and Hidiroglou (2003) detailed
and expanded upon by Van der Weide (2014). The next section describes in detail these two approaches.
3 Estimating the unconditional variance of the residual parameters
3.1 The ELL Methodology
The methodology for estimating the location’s unconditional variance detailed by ELL (2002 and 2003) is
presented in the discussion below. The method consists of two steps. The initial step relies on estimating
a welfare model (Eq:1) using household survey data, and then obtaining generalized least square (GLS )
estimates for the model. Given the interrelatedness between households in a cluster, ordinary least squares
is not the most eﬃcient estimator. The second stage consists in utilizing the parameter estimates from the
ﬁrst stage and applying these to census data to obtain small area poverty and inequality measures.
Because the motivation is to implement the methodology into a Stata command, where possible, we also
follow the methods implemented by PovMap (Zhao, 2006). It must be noted, however, that the methodology
utilized by ELL is not necessarily the one followed by the PovMap software. Places where methodologies
diﬀer will be indicated in footnotes.
9 This point is crucial, and the quality of a good poverty map is highly dependent on these two criteria being met. For further
discussion and motivation refer to: Tarozzi and Deaton (2009).
10 Note that the interrelatedness of $\eta_c$ across observations is already a violation of OLS assumptions.
11 Molina and Rao (2010) follow the naming convention of a nested error linear regression model. Regardless of its name,
standard OLS is invalid under this structure.
12 Rao and Molina (2015).
From the estimation of equation 1 we obtain the residuals $\hat{u}_{ch}$, and by defining $\hat{u}_{c.}$ as the weighted average
of $\hat{u}_{ch}$ for a specific cluster we can obtain $\hat{e}_{ch}$:13

$$\hat{u}_{ch} = \hat{u}_{c.} + (\hat{u}_{ch} - \hat{u}_{c.}) = \hat{\eta}_c + \hat{e}_{ch} \tag{3}$$

where $\hat{u}_{c.}$ is equal to $\hat{\eta}_c$.14

The unconditional variance of the location effect $\eta_c$ given by ELL (2002) is:

$$\hat{\sigma}^2_{\eta} = \max\left\{\frac{\sum_c w_c\left(u_{c.} - u_{..}\right)^2 - \sum_c w_c\left(1 - w_c\right)\hat{\tau}^2_c}{\sum_c w_c\left(1 - w_c\right)}\,;\;0\right\} \tag{4}$$

where $u_{..} = \sum_c w_c u_{c.}$ (with $w_c$ representing the cluster's weight, i.e. $w_c = \sum_h w_{hc}$) and

$$\hat{\tau}^2_c = \frac{\sum_h\left(e_{ch} - \bar{e}_{c.}\right)^2}{n_c\left(n_c - 1\right)} \tag{5}$$

where $\bar{e}_{c.} = \frac{\sum_h e_{ch}}{n_c}$.15

13 $\hat{e}_{ch}$ denotes the household specific error component.
14 Although ELL (2003) notes that weights are to be used, the PovMap software does not utilize weights when obtaining $\hat{u}_{c.}$
15 $n_c$ is the number of observations in cluster $c$.
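Equations (4) and (5) amount to a method-of-moments calculation on the first-stage residuals. The Python sketch below is one possible reading of the estimator (the names are ours; cluster weights are normalized to sum to one so that $1 - w_c$ is meaningful, and, unlike PovMap, $\hat{u}_{c.}$ is computed with weights):

```python
import numpy as np

def ell_location_variance(u_hat, cluster, w):
    """Illustrative moment estimator of the location-effect variance, eqs. (4)-(5)."""
    clusters = np.unique(cluster)
    w_c, u_c, tau2_c = [], [], []
    for c in clusters:
        m = cluster == c
        n_c = m.sum()
        u_cdot = np.average(u_hat[m], weights=w[m])   # weighted cluster mean
        e = u_hat[m] - u_cdot                         # e_ch from eq. (3)
        tau2_c.append(((e - e.mean()) ** 2).sum() / (n_c * (n_c - 1)))  # eq. (5)
        w_c.append(w[m].sum())
        u_c.append(u_cdot)
    w_c = np.array(w_c) / np.sum(w_c)   # cluster weights as shares
    u_c, tau2_c = np.array(u_c), np.array(tau2_c)
    u_dd = np.sum(w_c * u_c)            # u_.. = sum_c w_c u_c.
    num = np.sum(w_c * (u_c - u_dd) ** 2) - np.sum(w_c * (1 - w_c) * tau2_c)
    den = np.sum(w_c * (1 - w_c))
    return max(num / den, 0.0)          # eq. (4), truncated at zero
```

On simulated data with many clusters the estimate lands near the true variance of the location effect; the truncation at zero matters when the between-cluster variation is weak.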
3.1.1 Estimating the Alpha Model (heteroskedasticity)16

The variance of the idiosyncratic error ($e_{ch}$) is specified by ELL (2003) through a parametric form of
heteroskedasticity:17

$$E\left[e^2_{ch}\right] = \sigma^2_{e,ch} = \frac{A\exp\left(Z_{ch}\alpha\right) + B}{1 + \exp\left(Z_{ch}\alpha\right)} \tag{6}$$

In ELL (2003) this is simplified by setting $B = 0$ and $A = 1.05\max\left(\hat{e}^2_{ch}\right)$, and thus the simpler form is
estimated via OLS:18

$$\ln\left[\frac{e^2_{ch}}{A - e^2_{ch}}\right] = Z_{ch}\alpha + r_{ch} \tag{7}$$

This approach resembles that of Harvey (1976); nevertheless, the prediction is bounded.

By defining $\exp\left(Z_{ch}\hat{\alpha}\right) = D$ and using the delta method,19 the household specific conditional variance
estimator for $e_{ch}$ is equal to (Elbers, Lanjouw, and Lanjouw, 2002):

$$\hat{\sigma}^2_{e,ch} \approx \frac{AD}{1+D} + \frac{1}{2}\operatorname{Var}(r)\left[\frac{AD\left(1-D\right)}{\left(1+D\right)^3}\right] \tag{8}$$

where $\operatorname{Var}(r)$ is the estimated variance of the residuals of the model in equation (7).

16 The heteroskedastic model is relevant for the variance estimation under ELL's method, as well as the one under Henderson's
method III. In its current form, the methodology is from ELL (2003).
17 Users have the option to allow for homoskedasticity, in which case $\hat{\sigma}^2_u - \hat{\sigma}^2_e = \hat{\sigma}^2_\eta$, where $\hat{\sigma}^2_u$ is the estimated variance of the
residuals from the consumption model (equation 1). If the user chooses homoskedasticity, equations 6 through 8 are omitted.
18 This is the actual model used by PovMap, which we also implement.
19 The result is the outcome of a second order Taylor expansion for $E\left[\sigma^2_{e,ch}\right]$.
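The bounded logistic transformation of equations (6)-(8) can be sketched as follows (illustrative Python, not the package's Mata code; $B = 0$ and $A = 1.05\max\hat{e}^2_{ch}$ as in ELL, and all names are ours):

```python
import numpy as np

def alpha_model_variances(e_hat, Z):
    """Illustrative alpha model: bounded heteroskedasticity, eqs. (6)-(8).

    e_hat : household-level residuals, shape (N,)
    Z     : covariates of the variance model (incl. a constant), shape (N, K)
    """
    e2 = e_hat ** 2
    A = 1.05 * e2.max()                        # A = 1.05 max(e_hat^2), B = 0
    lhs = np.log(e2 / (A - e2))                # eq. (7) left-hand side
    alpha = np.linalg.lstsq(Z, lhs, rcond=None)[0]
    r = lhs - Z @ alpha                        # residuals of the alpha model
    var_r = r.var(ddof=Z.shape[1])             # Var(r)
    D = np.exp(Z @ alpha)
    # Delta-method (second-order Taylor) variance estimate, eq. (8)
    sig2 = A * D / (1 + D) + 0.5 * var_r * (A * D * (1 - D)) / (1 + D) ** 3
    return alpha, sig2
```

Because the prediction runs through $AD/(1+D)$, the fitted variances stay inside $(0, A)$ up to the delta-method correction, which is the bounding property noted in the text.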
3.1.2 Estimating the distribution of $\hat{\sigma}^2_\eta$

ELL (2002) proposes two methods to obtain the variance of $\hat{\sigma}^2_\eta$:

1. By simulation (ELL, 2002):

   (a) Estimate $\sigma^2_\eta$ from equation (4), which yields $\hat{\sigma}^2_\eta$
   (b) Estimate $\sigma^2_{e,ch}$ from equation (8), which yields $\hat{\sigma}^2_{e,ch}$
   (c) Assuming $\eta_c$ and $e_{ch}$ are independent and normally distributed with mean 0, new values for $u_{ch}$
   are generated using equation (2)
   (d) Compute a new estimate of $\sigma^2_\eta$ from equation (4)
   (e) Repeat (a) - (d) and keep all simulated values of $\hat{\sigma}^2_\eta$

   From the simulated values of $\sigma^2_\eta$, the sampling variance of $\hat{\sigma}^2_\eta$, $\operatorname{Var}\left(\hat{\sigma}^2_\eta\right)$, can be obtained directly.

2. The sampling variance of $\hat{\sigma}^2_\eta$ can also be obtained by the following formula (ELL, 2002):20

$$\operatorname{Var}\left(\hat{\sigma}^2_\eta\right) = \sum_c\left\{a^2_c\left[2\left(\hat{\sigma}^2_\eta + \hat{\tau}^2_c\right)^2 + 2\hat{\sigma}^2_\eta\hat{\tau}^2_c\right] + b^2_c\,\frac{2\hat{\tau}^4_c}{n_c - 1}\right\} \tag{9}$$

   where $a_c = w_c/\left[\sum_c w_c\left(1 - w_c\right)\right]$, and $b_c = w_c\left(1 - w_c\right)/\left[\sum_c w_c\left(1 - w_c\right)\right]$.

   (a) This is noted by ELL to be an approximation of the sampling variance of $\hat{\sigma}^2_\eta$. It assumes within
   cluster homoskedasticity of the household error component. For a more comprehensive discussion
   on this matter refer to ELL (2002).

The command, in its present form, only allows for estimation using the second method21 when estimating
the unconditional variance of $\sigma^2_\eta$ using the ELL methodology.

20 This is the only option available in the PovMap software.
21 This is also true for the PovMap software.
3.1.3 ELL's GLS Estimator

The GLS estimator offered in ELL's paper is presented in this section. Although not implemented,22 the
estimator is presented for completeness of the original methodology presented by ELL.

Once $\hat{\sigma}^2_{e,ch}$ (8) and $\hat{\sigma}^2_\eta$ (4) are estimated, these are used to construct a matrix $\hat{\Omega}$ of dimension $N \times N$. The
matrix $\hat{\Omega}_c$ is a square block matrix corresponding to cluster $c$, whose numbers of rows and columns equal the
number of observations in the cluster. The within-cluster covariance is given by $\hat{\sigma}^2_\eta$; on the diagonal, $\hat{\sigma}^2_{e,ch}$ is
added to obtain the observation specific disturbance variance. The resulting cluster block of this $N \times N$
matrix is equal to:

$$\hat{\Omega}_c = \begin{bmatrix}
\hat{\sigma}^2_\eta + \hat{\sigma}^2_{e,ch} & \hat{\sigma}^2_\eta & \cdots & \hat{\sigma}^2_\eta \\
\hat{\sigma}^2_\eta & \hat{\sigma}^2_\eta + \hat{\sigma}^2_{e,ch} & \cdots & \hat{\sigma}^2_\eta \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\sigma}^2_\eta & \hat{\sigma}^2_\eta & \cdots & \hat{\sigma}^2_\eta + \hat{\sigma}^2_{e,ch}
\end{bmatrix}
\quad\Rightarrow\quad
\hat{\Omega} = \begin{bmatrix}
\hat{\Omega}_1 & 0 & \cdots & 0 \\
0 & \hat{\Omega}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat{\Omega}_C
\end{bmatrix}$$

where $0$ is a block of zeroes whose numbers of columns and rows conform with the $\hat{\Omega}_c$ blocks next to it.

$\Omega$ is the variance-covariance matrix of the error vector necessary to estimate equation (1) via GLS and obtain
$\hat{\beta}_{GLS}$ and $\operatorname{Var}\big(\hat{\beta}_{GLS}\big)$. The estimates for the GLS detailed by ELL (2003) are:

$$\hat{\beta}_{GLS} = \left(X'W\Omega^{-1}X\right)^{-1}X'W\Omega^{-1}Y \tag{10}$$

and

$$\operatorname{Var}\left(\hat{\beta}_{GLS}\right) = \left(X'W\Omega^{-1}X\right)^{-1}X'W\Omega^{-1}WX\left(X'W\Omega^{-1}X\right)^{-1} \tag{11}$$

where $W$ is a $N \times N$ diagonal matrix of sampling weights. Because $W\Omega^{-1}$ is usually not symmetric,23
as noted by Haslett et al. (2010), the variance covariance matrix must be adjusted to obtain a symmetric
matrix. This is done by taking the average of the variance covariance matrix and its transpose (Haslett
et al., 2010).

Newer versions of the PovMap software no longer obtain GLS estimates using this approach. Adjustments
have been made in order to better incorporate the survey weights into $\Omega$. The newest version of PovMap
uses the GLS estimator presented by Huang and Hidiroglou (2003), which is discussed below.

22 This is due to the lack of symmetry of the resulting variance covariance matrix when weights differ across observations in
a cluster.
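For small problems, equations (10) and (11) and the symmetrization fix can be written out with dense matrices. An illustrative Python sketch (our names; the package works cluster by cluster, and inverting a dense $N \times N$ $\Omega$ is only viable for toy data):

```python
import numpy as np

def ell_gls(X, y, w, cluster, sig2_eta, sig2_e):
    """Illustrative ELL GLS, eqs. (10)-(11), with the symmetrization fix
    of Haslett et al. (2010).

    sig2_e : observation specific variances (e.g. from eq. 8), shape (N,)
    """
    W = np.diag(w)
    # Block-diagonal Omega: sig2_eta within clusters, sig2_e added on the diagonal
    same = cluster[:, None] == cluster[None, :]
    Omega = sig2_eta * same + np.diag(sig2_e)
    Oi = np.linalg.inv(Omega)
    A = np.linalg.inv(X.T @ W @ Oi @ X)
    beta = A @ X.T @ W @ Oi @ y                  # eq. (10)
    V = A @ (X.T @ W @ Oi @ W @ X) @ A           # eq. (11)
    V = (V + V.T) / 2                            # average with transpose
    return beta, V
```

With unequal weights, $X'W\Omega^{-1}X$ is not symmetric, so neither is the raw sandwich; averaging with the transpose restores symmetry, exactly as described in the text.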
3.2 Henderson's Method III decomposition

After a decade of use of PovMap, and numerous poverty maps completed by the World Bank using the original
ELL small area methodology, the software was updated with Henderson's Method III (H3) decomposition of
the variance components and an update of the GLS with a modification of Huang and Hidiroglou (2003).24
Obtaining the GLS estimates once again requires the estimation of the variance components, which are
estimated using a variation of Henderson's Method III (Henderson, 1953). The variation takes into account
the use of survey weights and was presented by Huang and Hidiroglou (2003). We follow the presentation
offered by Van der Weide (2014), which builds on Huang and Hidiroglou (2003).

23 The lack of symmetry is due to the difference in sampling weights between observations.
24 This is detailed in Van der Weide (2014).

In order to obtain the variance components of the $\beta_{GLS}$ it is necessary to first estimate a model which
subtracts each of the cluster's means from the variables. To achieve this we first need to define the left hand
side of the model. For each cluster in our estimation the left hand side is (omitting the natural logarithm
only for display purposes):

$$\tilde{y}_c = Y_c - \bar{Y}_c \otimes 1_T$$

where $Y_c$ is a $T \times 1$ vector (where $T$ is the number of surveyed observations in the cluster), corresponding
to cluster $c$, of our welfare variable used in equation 1; $\bar{Y}_c$ is a scalar which is the weighted mean value of $Y_c$
for cluster $c$; $1_T$ is a $T \times 1$ vector of 1s; and $\otimes$ is the Kronecker product. Finally, we define $\tilde{y}$ as the $N \times 1$
vector stacking all the cluster $\tilde{y}_c$s.

For the right hand side we follow the same de-meaning procedure and obtain a matrix $\tilde{x}$. Additionally,
define $\bar{x}$ of dimensions $C \times K$, where $C$ is the number of areas in the survey, with each row representing the
demeaned $\tilde{x}_c$ for a specific cluster. With these in hand it is possible to define the following:

$$SSE = \tilde{y}'W\tilde{y} - \tilde{y}'W\tilde{x}\left(\tilde{x}'W\tilde{x}\right)^{-1}\tilde{x}'W\tilde{y}$$

$$t_2 = \operatorname{tr}\left[\left(\tilde{x}'W\tilde{x}\right)^{-1}\tilde{x}'\left(W \circ W\right)\tilde{x}\right]$$

$$t_3 = \operatorname{tr}\left[\left(X'WX\right)^{-1}X'\left(W \circ W\right)X\right]$$

$$t_4 = \operatorname{tr}\left[\left(X'WX\right)^{-1}\bar{x}'\left(W_c \circ W_c\right)\bar{x}\right]$$

where $W_c$ is a $C \times C$ diagonal matrix of cluster weights, and $\circ$ represents the Hadamard product. Using
$SSE$, $t_2$, $t_3$, and $t_4$ it is possible to estimate the variances $\sigma^2_e$ and $\sigma^2_\eta$:

$$\hat{\sigma}^2_e = \frac{SSE}{\sum_{ch}w_{ch} - \sum_c\frac{\sum_h w^2_{ch}}{\sum_h w_{ch}} - t_2} \tag{12}$$

$$\hat{\sigma}^2_\eta = \frac{Y'WY - Y'WX\left(X'WX\right)^{-1}X'WY - \left(\sum_{ch}w_{ch} - t_3\right)\hat{\sigma}^2_e}{\sum_{ch}w_{ch} - t_4} \tag{13}$$

To obtain the estimates of $\hat{\eta}_c$, Van der Weide (2014) provides the following for each cluster $c$:

$$\gamma_{c,w} = \frac{\hat{\sigma}^2_\eta}{\hat{\sigma}^2_\eta + \hat{\sigma}^2_e\,\frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}}$$

consequently the $\hat{\eta}_c$ for a particular cluster is:

$$\hat{\eta}_c = \gamma_{c,w}\,\frac{\sum_h w_{ch}\hat{u}_{ch}}{\sum_h w_{ch}} - \frac{1}{C}\sum_c\gamma_{c,w}\,\frac{\sum_h w_{ch}\hat{u}_{ch}}{\sum_h w_{ch}} \tag{14}$$

With the updated estimate of $\hat{\eta}_c$ it is possible to update the estimate of $\hat{e}_{ch}$:

$$\hat{e}_{ch} = \hat{u}_{ch} - \hat{\eta}_c - \frac{1}{N}\sum_{ch}\left(\hat{u}_{ch} - \hat{\eta}_c\right) \tag{15}$$

Additionally, the distribution of $\hat{e}_{ch}$ is adjusted such that its variance is equal to $\hat{\sigma}^2_e$.
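The moment calculations of equations (12) and (13) can be sketched numerically. The illustrative Python below uses our own names, reads $\bar{x}$ as the matrix of weighted cluster means, and uses a pseudo-inverse because the demeaned intercept column of $\tilde{x}$ is identically zero:

```python
import numpy as np

def h3_variances(X, y, w, cluster):
    """Illustrative weighted Henderson Method III moments, eqs. (12)-(13)."""
    clusters = np.unique(cluster)
    W, WW = np.diag(w), np.diag(w ** 2)
    y_t, X_t = y.astype(float).copy(), X.astype(float).copy()
    xbar = np.empty((len(clusters), X.shape[1]))   # weighted cluster means of X
    wc = np.empty(len(clusters))                   # cluster weights w_c
    for j, c in enumerate(clusters):
        m = cluster == c
        y_t[m] -= np.average(y[m], weights=w[m])   # within-cluster demeaning
        mu = np.average(X[m], axis=0, weights=w[m])
        X_t[m] -= mu
        xbar[j], wc[j] = mu, w[m].sum()
    # Pseudo-inverse: the demeaned intercept column of X_t is identically zero
    A = np.linalg.pinv(X_t.T @ W @ X_t)
    SSE = y_t @ W @ y_t - y_t @ W @ X_t @ A @ X_t.T @ W @ y_t
    t2 = np.trace(A @ X_t.T @ WW @ X_t)
    B = np.linalg.inv(X.T @ W @ X)
    t3 = np.trace(B @ X.T @ WW @ X)
    t4 = np.trace(B @ xbar.T @ np.diag(wc ** 2) @ xbar)
    sum_w = w.sum()
    within = sum((w[cluster == c] ** 2).sum() / w[cluster == c].sum()
                 for c in clusters)
    sig2_e = SSE / (sum_w - within - t2)                              # eq. (12)
    num = y @ W @ y - y @ W @ X @ B @ X.T @ W @ y - (sum_w - t3) * sig2_e
    sig2_eta = num / (sum_w - t4)                                     # eq. (13)
    return sig2_e, sig2_eta
```

With unit weights the denominators collapse to the familiar Henderson III degrees of freedom ($N - C - K$ for the within regression), which is a convenient sanity check for any implementation.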
3.3 The GLS Estimator

With the idiosyncratic error terms in hand, the heteroskedasticity of the observation specific residual may
be specified.25 In this instance we follow the same steps detailed above for equation 7, from which we can
obtain the observation specific $\hat{\sigma}^2_{e,ch}$ by using equation 8.

The estimated variances are then used to construct a pair of matrices used to obtain the GLS estimates.
The GLS estimator for $\beta$ is:

$$\hat{\beta}_{GLS} = \left(X'\hat{\Omega}^{-1}X\right)^{-1}X'\hat{\Omega}^{-1}Y \tag{16}$$

and the variance-covariance matrix of the GLS estimator is:

$$\operatorname{Var}\left(\hat{\beta}_{GLS}\right) = \left(X'\hat{\Omega}^{-1}X\right)^{-1}X'\hat{\Omega}^{-1}\hat{V}\hat{\Omega}^{-1}X\left(X'\hat{\Omega}^{-1}X\right)^{-1} \tag{17}$$

where $\hat{\Omega}$, as opposed to the one detailed in the ELL method above, incorporates the survey weights into the
matrix. $\hat{\Omega}$ for each cluster is equal to:

$$\hat{\Omega}_c = \begin{bmatrix}
\frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta + \frac{\hat{\sigma}^2_{e,ch}}{w_{ch}} & \frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta & \cdots & \frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta \\
\frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta & \frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta + \frac{\hat{\sigma}^2_{e,ch}}{w_{ch}} & \cdots & \frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta & \frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta & \cdots & \frac{\sum_h w^2_{ch}}{\left(\sum_h w_{ch}\right)^2}\hat{\sigma}^2_\eta + \frac{\hat{\sigma}^2_{e,ch}}{w_{ch}}
\end{bmatrix}$$

$$\Rightarrow\quad \hat{\Omega} = \begin{bmatrix}
\hat{\Omega}_1 & 0 & \cdots & 0 \\
0 & \hat{\Omega}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat{\Omega}_C
\end{bmatrix}$$

where $0$ is a block of zeroes whose numbers of columns and rows conform with the $\hat{\Omega}_c$ blocks next to it. The
$\hat{V}$ for a particular cluster is equal to:

$$\hat{V}_c = \begin{bmatrix}
\hat{\sigma}^2_\eta + \hat{\sigma}^2_{e,ch} & \hat{\sigma}^2_\eta & \cdots & \hat{\sigma}^2_\eta \\
\hat{\sigma}^2_\eta & \hat{\sigma}^2_\eta + \hat{\sigma}^2_{e,ch} & \cdots & \hat{\sigma}^2_\eta \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\sigma}^2_\eta & \hat{\sigma}^2_\eta & \cdots & \hat{\sigma}^2_\eta + \hat{\sigma}^2_{e,ch}
\end{bmatrix}
\quad\Rightarrow\quad
\hat{V} = \begin{bmatrix}
\hat{V}_1 & 0 & \cdots & 0 \\
0 & \hat{V}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat{V}_C
\end{bmatrix}$$

where $0$ is a block of zeroes whose numbers of columns and rows conform with the $\hat{V}_c$ blocks next to it.

The $\hat{V}$ matrix is similar to the $\hat{\Omega}$ matrix from the ELL methodology. In the most current version of PovMap,
users no longer have the option to use the GLS estimator originally offered by ELL (2003); independently
of how users choose to model the location effect, the GLS estimators are obtained using equations 16 and 17.

25 The user may also choose to forego the modeling of household level heteroskedasticity, in which case $\hat{\sigma}^2_e$ is constant for all
observations.
4 The Second Stage: Simulation
The ﬁnal goal of the process described up to this point is to simulate values for the variable of interest. In
the Elbers, Lanjouw, and Lanjouw (2003) context, this entailed log welfare and poverty rates for speciﬁc
locations using the census data. Monte Carlo simulation is used to obtain the expected welfare measures
given the ﬁrst stage model (ELL, 2003). This is done by applying the parameter and error estimates from
the survey to the census data. The goal is to obtain a suﬃcient number of simulations in order to obtain
reliable levels of welfare. The section begins by introducing Empirical Bayes prediction as is done by Van der
Weide (2014).
4.1 Empirical Bayes Prediction Assuming Normality

Along with the GLS and Henderson's Method III additions to the ELL approach mentioned before, Empirical
Bayes (EB) prediction was also added. EB prediction makes use of the survey data in order to improve
predictions of the location effect. Since EB makes explicit use of data from the survey, its use only improves
predictions for areas that are included in the survey.

If we assume that $\eta_c$ and $e_{ch}$ (from equation 1) are normally distributed, then the distribution of $\eta_c$ conditional
on $e_c$ will also be normally distributed (Van der Weide, 2014). The distribution of the random location effect,
$\eta_c$, is obtained conditional on the observation specific residuals of the observations sampled in the location.
Van der Weide (2014) indicates that the assumption of normality for both components of the residual is
necessary to derive the distribution of the random location effects. In order to proceed, Van der Weide
(2014) defines the following:

$$\gamma_{c,w} = \frac{\sigma^2_\eta}{\sigma^2_\eta + \dfrac{\sum_h w^2_{c,h}/\sigma^2_{e_{ch}}}{\left(\sum_h w_{c,h}/\sigma^2_{e_{ch}}\right)^2}}$$

With this defined, the expected value of the location effect conditional on the residuals of the households
within the location can be obtained:

$$E\left[\eta_c \mid e_c\right] = \hat{\eta}_c = \gamma_{c,w}\,\frac{\sum_h \frac{w_{c,h}}{\sigma^2_{e_{ch}}}\,e_{ch}}{\sum_h \frac{w_{c,h}}{\sigma^2_{e_{ch}}}}$$

as well as the variance:

$$\operatorname{Var}\left[\hat{\eta}_c\right] = \sigma^2_\eta - \gamma^2_{c,w}\left(\sigma^2_\eta + \frac{\sum_h\left(\frac{w_{c,h}}{\sigma^2_{e_{ch}}}\right)^2\sigma^2_{e_{ch}}}{\left(\sum_h\frac{w_{c,h}}{\sigma^2_{e_{ch}}}\right)^2}\right)$$

EB prediction is expected to perform well in the presence of a large $\eta$, if many of the locations are covered by
the survey, and if the distributions of the error terms approximate a normal distribution (Van der Weide,
2014).
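The EB formulas reduce to precision-weighted cluster averages of the first-stage residuals, shrunk by $\gamma_{c,w}$. A sketch (illustrative Python; `u_hat` denotes the composite residuals $\hat{u}_{ch} = \hat{\eta}_c + \hat{e}_{ch}$ on which the prediction conditions, and all names are ours):

```python
import numpy as np

def eb_location_effects(u_hat, w, cluster, sig2_eta, sig2_e):
    """Illustrative EB prediction of eta_c for surveyed clusters
    (Van der Weide, 2014), using precision weights w_ch / sigma^2_e,ch."""
    eta_hat, gamma = {}, {}
    for c in np.unique(cluster):
        m = cluster == c
        p = w[m] / sig2_e[m]                       # precision weights
        # Variance of the precision-weighted residual mean
        shrink_var = (p ** 2 * sig2_e[m]).sum() / p.sum() ** 2
        g = sig2_eta / (sig2_eta + shrink_var)     # gamma_{c,w}
        gamma[c] = g
        eta_hat[c] = g * (p * u_hat[m]).sum() / p.sum()
    return eta_hat, gamma
```

The shrinkage factor lies strictly between 0 and 1 and approaches 1 as clusters grow, which is why EB mainly helps in areas actually covered by the survey.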
4.2 Monte Carlo Simulations

ELL (2003) use various specifications in their paper. The options and methods differ depending on how the
variance for the decomposed error terms is obtained. The simulation may be done via parametric drawing of
the parameters, or via bootstrap. The PovMap manual (Zhao, 2006) only details the parametric approach.
Nevertheless, the newest version of the software has incorporated a bootstrap approach to obtaining the
parameters. The steps for the simulation are:

1. Obtain GLS coefficients by drawing from $\tilde{\beta}_{GLS} \sim N\big(\hat{\beta}_{GLS}, \operatorname{Var}\big(\hat{\beta}_{GLS}\big)\big)$. For reasons that will
become apparent in the following steps, this approach is only possible when the user chooses ELL
variance estimation.

   (a) A different approach relies on obtaining bootstrapped samples of the survey data used in the first
   stage; this is repeated for every single simulation. The latter method yields a set of $\tilde{\beta}$ for every
   single simulation.

2. Under the ELL variance estimation approach, the cluster component of the error term, $\tilde{\eta}_c$, is obtained
by drawing from $\tilde{\eta}_c \sim N\big(0, \tilde{\sigma}^2_\eta\big)$, where we:

   (a) Draw $\tilde{\sigma}^2_\eta \sim \operatorname{Gamma}\big(\hat{\sigma}^2_\eta, \operatorname{Var}\big(\hat{\sigma}^2_\eta\big)\big)$26

      i. With each $\tilde{\sigma}^2_\eta$ in hand it is possible to draw the location effects from $\tilde{\eta}_c \sim N\big(0, \tilde{\sigma}^2_\eta\big)$.

      ii. Additionally, $\tilde{\eta}_c$ could be drawn from those estimated in the first stage (semi-parametric);
      this is possible to do in the parametric and bootstrapped simulation.

   Under Henderson's method III (H3) there is no defined distribution for the $\hat{\sigma}^2_\eta$ parameter. Consequently,
   bootstrapped samples of the survey data are used to obtain $\hat{\sigma}^2_\eta$ and all other parameters for every single
   simulation. This approach is also available under ELL methods. The bootstrap is also necessary for
   the instances when EB is chosen. When EB is chosen, the bootstrap produces a vector of $\hat{\eta}_c$s and of $\hat{\sigma}^2_{\eta_c}$s
   for each of the clusters included in the survey, and thus for every simulation we draw $\tilde{\eta}_c \sim N\big(\hat{\eta}_c, \hat{\sigma}^2_{\eta_c}\big)$.
   For clusters not included in the survey the drawing is: $\tilde{\eta}_c \sim N\big(0, \hat{\sigma}^2_\eta\big)$.

3. $\tilde{e}_{ch}$ can be drawn from a normal distribution with variance such as the one specified in equation 8.
This is done by:

   (a) Drawing $\tilde{\alpha} \sim N\big(\hat{\alpha}, \operatorname{Var}(\hat{\alpha})\big)$ (it may also be the product of the bootstrap process) and using it on
   the census data, we obtain a new $\tilde{D} = \exp\big(Z_{ch}\tilde{\alpha}\big)$; using the new $\tilde{D}$ in conjunction with the first
   stage $A = 1.05\max\big(\hat{e}^2_{ch}\big)$ and $\operatorname{Var}(r)$ we get:

   $$\tilde{\sigma}^2_{e,ch} \approx \frac{A\tilde{D}}{1+\tilde{D}} + \frac{1}{2}\operatorname{Var}(r)\left[\frac{A\tilde{D}\big(1-\tilde{D}\big)}{\big(1+\tilde{D}\big)^3}\right]$$

   From this it is possible to draw $\tilde{e}_{ch} \sim N\big(0, \tilde{\sigma}^2_{e,ch}\big)$.

   (b) As an alternative it is also possible to draw from the estimated $\hat{e}_{ch}$ from the first phase (semi-
   parametric). In the case of the ELL variance estimation with a parametric drawing of $\beta$s,
   this will be done from one $N \times 1$ vector; when performing bootstrap, the drawing is done from the
   $N \times 1$ vector corresponding to each simulation.

   (c) Under the assumption of homoskedasticity, the error terms are drawn from $\tilde{e}_{ch} \sim N\big(0, \tilde{\sigma}^2_e\big)$.

4. Once all simulated parameters have been obtained, it is possible to obtain the simulated vector of
target values:

$$\tilde{Y}_{ch} = X\tilde{\beta}_{GLS} + \tilde{\eta}_c + \tilde{e}_{ch} \tag{18}$$

   (a) This is repeated multiple times (usually 100) in order to obtain a full set of simulated household
   welfare vectors.

26 As described in Demombynes et al. (2008), $\operatorname{Var}\big(\hat{\sigma}^2_\eta\big)$ may be estimated by simulation as detailed above, or from equation
9. In its present form the only available option is that from equation 9.
If the interest is small area estimates of income and poverty, once all simulations are ﬁnalized a set of incomes
for all observations in the target data may be used to obtain R simulated poverty, or inequality rates for
each domain of interest. Making use of the R indicators produced, it is possible to obtain the mean indicator
for each domain of interest (i.e. municipality in the context of a poverty map), as well as the associated
standard errors for the indicator.
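Steps 1-4 can be sketched end to end for the parametric ELL case. The Python below is illustrative only: it assumes a homoskedastic $e$, a Gamma draw moment-matched to the mean and variance of $\hat{\sigma}^2_\eta$ (the Gamma in step 2(a) is parameterized by its mean and variance), and the headcount ratio FGT(0) as the poverty indicator; all names are ours:

```python
import numpy as np

def simulate_poverty(Xc, area, beta, var_beta, sig2_eta, var_sig2_eta,
                     sig2_e, z, R=100, seed=0):
    """Illustrative parametric second stage: draw parameters, build the
    simulated welfare of eq. (18), and average area headcounts over R sims.
    z is the poverty line on the same (ln-welfare) scale as Xc @ beta."""
    rng = np.random.default_rng(seed)
    areas = np.unique(area)
    idx = np.searchsorted(areas, area)            # map observation -> area index
    rates = np.zeros((R, len(areas)))
    # Gamma moment-matching: shape k and scale theta from mean and variance
    k = sig2_eta ** 2 / var_sig2_eta
    theta = var_sig2_eta / sig2_eta
    L = np.linalg.cholesky(var_beta)
    for r in range(R):
        b = beta + L @ rng.normal(size=len(beta))         # step 1
        s2 = rng.gamma(k, theta)                          # step 2(a)
        eta = rng.normal(0.0, np.sqrt(s2), len(areas))    # step 2(a)i
        e = rng.normal(0.0, np.sqrt(sig2_e), len(area))   # step 3(c)
        y = Xc @ b + eta[idx] + e                         # eq. (18)
        for j, a in enumerate(areas):
            rates[r, j] = (y[area == a] < z).mean()       # FGT(0) per area
    return rates.mean(axis=0), rates.std(axis=0)
```

The mean over the $R$ simulated rates is the point estimate for each area, and the standard deviation across simulations plays the role of the associated standard error, as described in the text.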
5 Computational Considerations
The ﬁrst stage of the analysis is standard procedure among the typical Stata commands, since the analysis
is done on survey data. There are no considerable computational requirements, and the analysis may likely
be performed using the majority of computers available to users. On the other hand, the Monte Carlo
simulation requires the use of the census data. Census data, depending on the country, can range from
anywhere between roughly 30,000 observations to almost half a billion. Depending on a user’s computer
setup, a couple million observations can really slow down the computer to a crawl. Assuming variables
in double type precision, a 5 million observation dataset with 30 variables will require roughly 1.1 GB of
memory. Therefore, in order to speed up the second stage operation and to be able to operate with larger
datasets, a couple of modiﬁcations are implemented.
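The memory figure above follows from simple arithmetic; a quick sketch, treating 1 GB as 2³⁰ bytes:

```python
# Back-of-the-envelope memory requirement quoted in the text:
# 5 million observations x 30 variables, each stored as an 8-byte double.
observations = 5_000_000
variables = 30
bytes_per_double = 8

total_bytes = observations * variables * bytes_per_double
total_gib = total_bytes / 2**30  # about 1.1 GB
print(round(total_gib, 2))
```

A census with hundreds of millions of observations scales this well beyond typical workstation memory, which motivates the vector-at-a-time design described next.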
The ﬁrst modiﬁcation concerns importing the census data and formatting it in a more memory-friendly way.
Along with the main command we supply a sub-command that allows the user to import the census data
for the second stage: sae data import. The data are imported one vector at a time and saved into a Mata format ﬁle, which is then processed one regressor at a time in each simulation to obtain the predicted welfare. Proceeding in this manner, the maximum number of observations that can be used in the
simulation stage is increased. As shown in the examples section, this setup requires preparing the target
dataset beforehand. Due to this setup, the Monte Carlo simulations are executed one vector at a time. For smaller datasets, performing all simulations in one go is likely to yield quicker execution times,27 but as we move to larger census data this method provides faster execution times and allows for more eﬃcient memory management.
The second modiﬁcation is a plugin for processing the simulations and producing the required indicators. It is used, and necessary, only when indicators are requested. The processing is also done one simulation at a time,
just like the Monte Carlo simulations. The plugin speeds up the process considerably, especially when
requesting Gini coeﬃcients.28
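To see why sorting dominates the cost of the Gini coefficient, here is a minimal unweighted computation based on the standard sorted-values formula; this is only a sketch, not the package's C implementation:

```python
def gini(values):
    """Gini coefficient via the sorted-values formula:
    G = 2 * sum_i(i * x_(i)) / (n * sum(x)) - (n + 1) / n, with i = 1..n.
    The sort over the welfare vector is the expensive step."""
    x = sorted(values)
    n = len(x)
    total = sum(x)
    weighted = sum(i * xi for i, xi in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

print(gini([1, 1, 1, 1]))  # perfect equality -> 0.0
```

Because this sort must be repeated for every simulation and every aggregation level, pushing it into compiled code yields the large speedups noted in the footnote.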
6 The sae Command and Sub-Commands
The sae command and sub-commands for modeling, simulating, and data manipulation are introduced
below. The common syntax for the command is:
sae [routine] [sub routine]
Currently available routines and subroutines are:
• sae data import||export : This is used to import the target dataset into a more manageable
format for the simulations. It is also used to export the resulting simulations to a dataset of the
user’s preference.
• sae model lmm/povmap : This routine is for obtaining the GLS estimates of the ﬁrst stage. The sub-routines lmm (linear mixed model) and povmap may be used interchangeably.29
• sae simulate lmm/povmap : This routine and sub-routine obtains the GLS estimates of the ﬁrst
stage, and goes on to perform the Monte Carlo simulations.
27 This is particularly true for the MP version of Stata, which makes use of more than one core for its operations.
28 Gini coeﬃcients require sorting at every level; this is done much faster using a C plugin. Stata's sorting speed (as of this writing), including Mata's, is much slower than that of many other software packages.
29 Future work aims to incorporate additional methodologies, such as models for discrete left-hand-side variables.
• sae proc stats||inds : The stats and inds sub-routines are useful for processing Mata formatted
simulation output and producing indicators with new thresholds or weights, as well as proﬁling.
The routines and sub-routines are described in the sections below.
6.1 Preparing the Target Dataset
Due to the potentially large size of the target data set, the sae command comes with an ancillary sub-command
(sae data import) useful for preparing the target dataset and converting it into a Mata format dataset to
minimize the overall burden put on the system. The command’s syntax is as follows:
sae data import, datain(string) varlist(string) area(string) uniqid(string)
dataout(string)
The options of the command are all required.
• The datain() option indicates the path and ﬁle name of the Stata format dataset to be converted
into a Mata format dataset.
• The varlist() option speciﬁes all the variables to be imported into the Mata format dataset. The
variables speciﬁed will be available for the simulation stage of the analysis. Variable names must match between the source and target datasets. Users should also include here any additional variables they wish to carry along, as well as the expansion factors for the target data.
• The uniqid() option speciﬁes the numeric variable that indicates the unique identiﬁers in the target
dataset. This is necessary to ensure replicability of the analysis, and the name should match that of the unique identiﬁer in the source dataset.
• The area() option is necessary and speciﬁes at which level the clustering is done; it indicates at
which level the ηc is obtained. The only constraint is that the variable must be numeric and should
match across datasets, although it is recommended it follows a hierarchical structure similar to the one
proposed by Zhao (2006).
– The hierarchical id should be of the same length for all observations. For example: AAMMEEE.30 This structure facilitates getting ﬁnal estimates at diﬀerent aggregation levels.
• The dataout() option indicates the path and ﬁlename of the Mata format dataset that will be used
when running the Monte Carlo simulations.
6.1.1 Example: Preparing the Target Data
The sae command requires users to import the target data into a Mata data format ﬁle. This is done to facilitate the simulation process in the second stage, given the potentially large size of the target data.
30 In the case this were done for a speciﬁc country, AA stands for the highest aggregation level, MM stands for the second
highest aggregation level, and EEE, stands for the lowest aggregation level.
. sae data import, varlist(`xvar´ p_hhsize_hh `zvar´ pline pline2 rdef rdef2 rentdef rdef_rentroom2 rdef_ren
> tall2) area(localityid_n) uniqid(hhid) datain(`censo´) dataout(`matadata´)
Saving data variables into mata matrix file (38)
1 2 3 4 5
......................................
6.2 Model
In this section the command for the modeling stage of the analysis is presented. The syntax of this is as
follows:
sae model lmm/povmap depvar indepvar [if] [in] [aw pw fw] , area(varname numeric)
varest(string) [zvar(varlist) yhat(varlist) yhat2(varlist) alfatest(string) vce(string)]
• area(): The area option is necessary and speciﬁes at which level the clustering is done; it indicates
at which level the ηc is obtained. The only constraint is that the variable must be numeric and should
match across datasets, although it is recommended it follows a hierarchical structure similar to the one
proposed by Zhao (2006).
– The hierarchical id should be of the same length for all observations. For example: AAMMEEE.31
• varest(): The varest option allows the user to select between H3 or ELL methods for obtaining the
variance of the decomposed ﬁrst stage residuals. The selection has repercussions on the options available
afterwards. For example, if the user selects H3, parameters must be obtained via bootstrapping.
The following are optional. In the case of homoskedasticity, the zvar, yhat, and yhat2 options should not be speciﬁed.
• zvar(): The zvar option is necessary for specifying the alpha model, the user must place the inde-
pendent variables of the alpha model under the option.
• yhat(): The yhat option is also a part of the alpha model. Variables listed here will be interacted with the predicted ŷ = Xβ̂ from the OLS model.
• yhat2(): The yhat2 option is also a part of the alpha model. Variables listed here will be interacted with the predicted ŷ² = (Xβ̂)² from the OLS model.
• alfatest(): The alfatest option may be speciﬁed at any stage, but is useful for selecting a proper ﬁrst stage. It requests that the command output the dependent variable of the alpha model so users can model the heteroskedasticity.
• vce(): The vce option allows users to replicate the variance covariance matrix from the OLS in the PovMap 2.5 software. The default option is the variance covariance matrix from the PovMap software, vce(ell); the user may specify a robust or clustered variance covariance matrix to replicate the results from the regress command.32
31 In the case this were done for a speciﬁc country, AA stands for the highest aggregation level, MM stands for the second
highest aggregation level, and EEE, stands for the lowest aggregation level.
32 The variance covariance matrix presented by the PovMap software is not standard in the literature; it is equal to σ²(X′WX)⁻¹X′W²X(X′WX)⁻¹ where σ² is an estimate of Var(w_ch u_ch). It is easy to see that the weights are included twice in the variance covariance estimator, which makes it non-standard.
6.2.1 Running the First Stage (welfare model)
The entire process of small area estimation may be run in two stages. It is possible to test the ﬁrst stage
of the analysis before moving on to the simulation, which makes use of the target data. To obtain the ﬁrst
stage of the analysis, the user must have the data ready, and it is recommended that the user has predeﬁned
as macros the variables to be used.
. sae model povmap `y´ `xvar´ [aw=hhw], area(localityid_n) zvar(`zvar´) ///
> varest(h3) vce(ell) alfatest(pp)
note: p_age omitted because of collinearity
WARNING: 76 observations removed due to less than 3 observations in the cluster.
OLS model:
lny Coef. Std. Err. z P>|z| [95% Conf. Interval]
dw_bath .1397767 .0200267 6.98 0.000 .100525 .1790283
dw_btype_d3 .0496587 .0211568 2.35 0.019 .0081921 .0911253
edu_postsec .0816588 .0883387 0.92 0.355 -.0914818 .2547994
edu_prim -.8848284 .1744707 -5.07 0.000 -1.226785 -.5428722
j_firm_access1 .0382874 .0129973 2.95 0.003 .0128132 .0637616
jud_erate .8928555 .2215139 4.03 0.000 .4586962 1.327015
jud_occup_el_sh -1.333666 .2481222 -5.38 0.000 -1.819977 -.8473555
p_activity_1_sh -.204576 .0306396 -6.68 0.000 -.2646285 -.1445235
p_age .0134131 .0023387 5.74 0.000 .0088293 .0179968
p_age2 -.0000788 .0000238 -3.31 0.001 -.0001255 -.0000321
p_ecstat_2_sh -.2178931 .0840416 -2.59 0.010 -.3826117 -.0531745
p_ecstat_4_sh .4529504 .0297431 15.23 0.000 .394655 .5112458
p_ecstat_hh_2 -.1803367 .0690036 -2.61 0.009 -.3155813 -.0450921
p_elder_sh .4871691 .0363326 13.41 0.000 .4159585 .5583797
p_hhsize_1 -.1331209 .0253237 -5.26 0.000 -.1827544 -.0834874
p_isced_max_0 -.5169389 .0540608 -9.56 0.000 -.6228961 -.4109817
p_isced_max_1 -.3758244 .0257253 -14.61 0.000 -.426245 -.3254037
p_isced_max_2 -.221132 .0191942 -11.52 0.000 -.2587519 -.1835122
p_isced_max_4 .138612 .0286283 4.84 0.000 .0825016 .1947225
p_isced_max_5 .2343928 .024233 9.67 0.000 .1868969 .2818887
p_isced_max_6 .237176 .0802218 2.96 0.003 .0799442 .3944078
p_male .1836228 .0218718 8.40 0.000 .140755 .2264906
p_marital_hh_1 -.052893 .030501 -1.73 0.083 -.1126738 .0068878
p_marital_hh_3 -.0809455 .0206275 -3.92 0.000 -.1213747 -.0405163
p_occup_pr_sh .3670144 .0363299 10.10 0.000 .2958092 .4382196
p_workstat_1_sh .5260943 .0270113 19.48 0.000 .4731532 .5790354
rooms_pc .0470549 .0109036 4.32 0.000 .0256843 .0684255
_cons 7.132082 .1597874 44.63 0.000 6.818904 7.445259
Alpha model:
Residual Coef. Std. Err. z P>|z| [95% Conf. Interval]
p_activity_1_sh -.3447938 .1425119 -2.42 0.016 -.624112 -.0654756
p_ecstat_4_sh -.8035096 .1585224 -5.07 0.000 -1.114208 -.4928114
p_ecstat_hh_4 -.3006224 .118973 -2.53 0.012 -.5338051 -.0674397
p_elder_sh -.8877215 .1511032 -5.87 0.000 -1.183878 -.5915647
p_isced_max_1 -.1701583 .10368 -1.64 0.101 -.3733673 .0330507
p_isced_max_5 .2211589 .1155722 1.91 0.056 -.0053584 .4476762
p_male .156548 .1078049 1.45 0.146 -.0547458 .3678417
p_marital_hh_1 .1685545 .1322579 1.27 0.203 -.0906662 .4277752
p_occup_pr_sh .5477886 .1793579 3.05 0.002 .1962536 .8993236
p_workstat_1_sh -1.289935 .1263831 -10.21 0.000 -1.537641 -1.042228
rooms_pc .124589 .0411818 3.03 0.002 .0438741 .2053038
_cons -6.317578 .13152 -48.04 0.000 -6.575353 -6.059804
GLS model:
lny Coef. Std. Err. z P>|z| [95% Conf. Interval]
dw_bath .1397681 .018897 7.40 0.000 .1027307 .1768054
dw_btype_d3 .043487 .0194832 2.23 0.026 .0053006 .0816734
edu_postsec .2515946 .1255387 2.00 0.045 .0055434 .4976459
edu_prim -.4613187 .2236142 -2.06 0.039 -.8995944 -.023043
j_firm_access1 .077631 .0206401 3.76 0.000 .0371772 .1180849
jud_erate .8129607 .3335183 2.44 0.015 .1592769 1.466644
jud_occup_el_sh -1.238829 .352069 -3.52 0.000 -1.928872 -.5487866
p_activity_1_sh -.1952054 .0310771 -6.28 0.000 -.2561153 -.1342954
p_age .0146863 .0021284 6.90 0.000 .0105146 .0188579
p_age2 -.0000884 .0000208 -4.25 0.000 -.0001292 -.0000477
p_ecstat_2_sh -.1790066 .090007 -1.99 0.047 -.3554171 -.0025961
p_ecstat_4_sh .3970182 .0272502 14.57 0.000 .3436088 .4504275
p_ecstat_hh_2 -.1760779 .0761019 -2.31 0.021 -.3252349 -.0269209
p_elder_sh .408984 .0333136 12.28 0.000 .3436906 .4742775
p_hhsize_1 -.1450279 .0228201 -6.36 0.000 -.1897545 -.1003014
p_isced_max_0 -.4641431 .0497277 -9.33 0.000 -.5616076 -.3666786
p_isced_max_1 -.3250672 .022455 -14.48 0.000 -.3690782 -.2810562
p_isced_max_2 -.1953666 .0172695 -11.31 0.000 -.2292142 -.1615191
p_isced_max_4 .1499224 .0233206 6.43 0.000 .1042149 .1956299
p_isced_max_5 .2418694 .0224473 10.77 0.000 .1978734 .2858654
p_isced_max_6 .2282902 .0750887 3.04 0.002 .0811191 .3754613
p_male .1945154 .0199999 9.73 0.000 .1553163 .2337145
p_marital_hh_1 -.046479 .0299398 -1.55 0.121 -.1051599 .0122019
p_marital_hh_3 -.0799847 .0177162 -4.51 0.000 -.1147078 -.0452616
p_occup_pr_sh .3263021 .0359602 9.07 0.000 .2558214 .3967828
p_workstat_1_sh .4861585 .0257385 18.89 0.000 .4357121 .536605
rooms_pc .0572602 .0106962 5.35 0.000 .036296 .0782244
_cons 6.668453 .2825154 23.60 0.000 6.114733 7.222173
Comparison between OLS and GLS models:
Variable bOLS bGLS
dw_bath .13977666 .13976806
dw_btype_d3 .04965872 .04348699
edu_postsec .0816588 .25159463
edu_prim -.88482838 -.46131871
j_firm_acc~1 .03828739 .07763102
jud_erate .89285555 .81296068
jud_occup_~h -1.333666 -1.2388292
p_acti~_1_sh -.20457601 -.19520536
p_age .01341306 .01468626
p_age2 -.00007879 -.00008844
p_ecstat_2~h -.21789309 -.17900661
p_ecstat_4~h .45295039 .39701819
p_ecstat_h~2 -.18033673 -.17607789
p_elder_sh .48716911 .40898405
p_hhsize_1 -.13312094 -.14502795
p_isced_ma~0 -.51693886 -.4641431
p_isced_ma~1 -.37582438 -.32506718
p_isced_ma~2 -.22113204 -.19536665
p_isced_ma~4 .13861204 .14992238
p_isced_ma~5 .23439281 .24186941
p_isced_ma~6 .23717604 .22829018
p_male .1836228 .19451539
p_marital_~1 -.05289303 -.04647901
p_marital_~3 -.0809455 -.0799847
p_occup_pr~h .36701439 .32630214
p_works~1_sh .52609429 .48615854
rooms_pc .04705493 .0572602
_cons 7.1320819 6.6684532
Model settings
Error decomposition H3
Beta model diagnostics
Number of observations = 7564
Adjusted R-squared = .55294896
R-squared = .55454493
Root MSE = .45080868
F-stat = 347.46412
Alpha model diagnostics
Number of observations = 7564
Adjusted R-squared = .03563669
R-squared = .03703931
Root MSE = 2.2858123
F-stat = 26.407274
Model parameters
Sigma ETA sq. = .02312296
Ratio of sigma eta sq over MSE = .11377818
Variance of epsilon = .18255558
6.3 Monte Carlo Simulation
The simulation part of the analysis requires more inputs from the user. Depending on the details given and
the purpose of the analysis, the user may obtain poverty rates for the diﬀerent locations speciﬁed or just
output the simulated vectors to a dataset of her choosing. The syntax for the simulation stage is:
sae simulate lmm/povmap depvar indepvar [if] [in] [aw pw fw], area(varname numeric)
varest(string) eta(string) epsilon(string) uniqid(varname numeric) [vce(string)
zvar(varlist numeric) yhat(varlist numeric) yhat2(varlist numeric) psu(varname numeric)
matin(string) pwcensus(string) rep(integer 1) seed(integer 123456789) bootstrap ebest
colprocess(integer 1) lny addvars(string) ydump(string) plinevar(varname numeric)
plines(numlist sort) aggids(numlist sort) indicators(string) results(string) allmata]
The possible options are:
• area(): The area option is necessary and speciﬁes at which level the clustering is done; it indicates
at which level the ηc is obtained. The only constraint is that the variable must be numeric and should
match across datasets, although it is recommended it follows a hierarchical structure similar to the one
proposed by Zhao (2006).
– The hierarchical id should be of the same length for all observations. For example: AAMMEEE.33
• varest(): The varest option allows the user to select between H3 or ELL methods for obtaining the
variance of the decomposed ﬁrst stage residuals. The selection has repercussions on the options available
afterwards. For example, if the user selects H3, parameters must be obtained via bootstrapping.
• eta(): The eta option allows users to specify how they would like to draw ηc for the diﬀerent clusters
in the second stage of the analysis. The available options are normal and non-normal. If non-normal
is chosen, empirical Bayes is not available to users.
• epsilon(): The epsilon option allows users to specify how they would like to draw εch for the
diﬀerent observations in the second stage of the analysis. The available options are normal and non-
normal. If non-normal is chosen, empirical Bayes is not available to users.
• uniqid(): The uniqid option speciﬁes the numeric variable that indicates the unique identiﬁers in
the source and target datasets. This is necessary to ensure replicability of the analysis.
• vce(): The vce option allows users to replicate the variance covariance matrix from the OLS in the PovMap 2.5 software. The default option is the variance covariance matrix from the PovMap software, vce(ell); the user may specify a robust or clustered variance covariance matrix to replicate the results from the regress command.34
• zvar(): The zvar option is necessary for specifying the alpha model, the user must place the inde-
pendent variables of the alpha model under the option.
• yhat(): The yhat option is also a part of the alpha model. Variables listed here will be interacted
ˆ = Xβ from the OLS model.
with the predicted y
• yhat2(): The yhat2 option is also a part of the alpha model. Variables listed here will be interacted
2
ˆ2 = (Xβ ) from the OLS model.
with the predicted y
33 In the case this were done for a speciﬁc country, AA stands for the highest aggregation level, MM stands for the second
highest aggregation level, and EEE, stands for the lowest aggregation level.
34 The variance covariance matrix presented by the PovMap software is not standard in the literature; it is equal to σ²(X′WX)⁻¹X′W²X(X′WX)⁻¹ where σ² is an estimate of Var(w_ch u_ch). It is easy to see that the weights are included twice in the variance covariance estimator, which makes it non-standard.
• psu(): The psu option indicates the numeric variable in the source data for the level at which boot-
strapped samples are to be obtained. This option is required for the cases when obtaining bootstrapped
parameters is necessary. If not speciﬁed, the level defaults to the cluster level, that is, the level speciﬁed in the area option.
• matin(): The matin option indicates the path and ﬁlename of the Mata format target dataset. The
dataset is created from the sae data import command; it is necessary for the second stage.
• pwcensus(): The pwcensus option indicates the variable which corresponds to the expansion factors
to be used in the target dataset; it must always be speciﬁed for the second stage. The user must have
added the variable to the imported data (sae data import), i.e., the target data.
• rep(): The rep option is necessary for the second stage, and indicates the number of Monte-Carlo
simulations to be done in the second stage of the procedure.
• seed(): The seed option is necessary for the second stage of the analysis and ensures replicability.
Users should be aware that Stata's default pseudo-random number generator in Stata 14 is diﬀerent from that of previous versions.
• bootstrap: The bootstrap option indicates that the parameters used for the second stage of the
analysis are to be obtained via bootstrap methods. If this option is not speciﬁed the default method
is parametric drawing of the parameters.
• ebest: The ebest option indicates that empirical Bayes methods are to be used for the second stage. If this option is used, the eta(normal), epsilon(normal), and bootstrap options must also be speciﬁed.
• colprocess(): The colprocess option is related to the processing of the second stage. Because of the potentially large size of the target data set, the default is one column at a time; this may, however, be increased with potential gains in speed.
• lny: The lny option indicates that the dependent variable in the welfare model is in log form. This
is relevant for the second stage of the analysis in order to get appropriate simulated values.
• addvars(): The addvars option allows users to add variables to the dataset created from the sim-
ulations. These variables must have been included in the target dataset created with the sae data
import command.
• ydump(): The ydump option is necessary for the second stage of the analysis. The user must provide
path and ﬁlename for a Mata format dataset to be created with the simulated dependent variables.
• plinevar(): The plinevar option allows users to indicate a variable in the target data set which is
to be used as the threshold for the Foster-Greer-Thorbecke indexes (Foster, Greer, and Thorbecke 1984)
to be predicted from the second stage simulations. The user must have added the variable in the sae
data import command when preparing the target dataset. Only one variable may be speciﬁed.
• plines(): The plines option allows users to explicitly indicate the threshold to be used; this option is preferred when the threshold is constant across all observations. Additionally, it is possible to specify multiple lines, separated by spaces.
• indicators(): The indicators option is used to request the indicators to be estimated from the
simulated vectors of welfare. The list of possible indicators is:
– The set of Foster-Greer-Thorbecke indexes (Foster, Greer, and Thorbecke 1984): FGT0, FGT1, and FGT2, also known as the poverty headcount, poverty gap, and poverty severity, respectively.
– The set of inequality indexes: Gini, and Generalized Entropy Index with α = 0, 1, 2
– Set of Atkinson indexes
• aggids(): The aggids option indicates the diﬀerent aggregation levels for which the indicators are to be obtained; the values placed here tell the command how many digits to the left to move to get the indicators at that level. Using the same hierarchical id speciﬁed in the area option, AAMMEEE, specifying 0, 3, 5, and 7 would lead to aggregates at the E, M, and A levels and at the national level.
• results(): The results option speciﬁes the path and ﬁlename under which to save, as a txt ﬁle, the results of the analysis speciﬁed in the indicators option.
• allmata: The allmata option skips use of the plugin and does all poverty calculations in Mata.
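As a rough sketch of how the indicators and aggids options interact, the snippet below computes a weighted FGT index and derives aggregate ids by dropping trailing digits of the hierarchical id. The welfare values, weights, and ids are hypothetical, and this is an illustration of the formulas, not the package's internal code:

```python
def fgt(welfare, weights, line, alpha):
    """Weighted FGT index: sum of w * ((z - y)/z)^alpha over the poor,
    divided by the total weight. alpha = 0, 1, 2 gives headcount,
    poverty gap, and poverty severity."""
    num = sum(w * ((line - y) / line) ** alpha
              for y, w in zip(welfare, weights) if y < line)
    return num / sum(weights)

def aggregate_id(hier_id, digits_to_drop):
    """aggids logic: drop trailing digits of a hierarchical id AAMMEEE,
    so 3 drops EEE (M level), 5 drops MMEEE (A level)."""
    return hier_id // 10 ** digits_to_drop

welfare = [3000.0, 6000.0]   # hypothetical per-capita welfare
weights = [1.0, 1.0]         # hypothetical expansion factors
print(fgt(welfare, weights, 5291.52, 0))  # headcount: 1 of 2 poor -> 0.5
print(aggregate_id(1203007, 5))           # AAMMEEE -> AA level: 12
```

In the package these computations run over every simulated welfare vector, with the results then averaged across simulations.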
6.3.1 Running the Second Stage (Simulation)
The second stage of the command takes considerably longer to run, depending on the size of the target
dataset and the number of simulations requested. The command for the second stage would look like the
following:
. sae sim povmap `y´ `xvar´ [aw=hhw], area(localityid_n) uniqid(hhid) psu(thepsu) zvar(`zvar´) ///
> eta(normal) epsilon(normal) varest(h3) lny pwcensus(p_hhsize_hh) ///
> vce(ell) rep(100) seed(89546) ydump(`sfiles´) res(`result´) addvars(pline ///
> pline2 rdef rdef2 rentdef rdef_rentroom2 rdef_rentall2) ///
> matin(`matadata´) col(10) boot ebest stage(second) aggids(0) indicators(fgt0) ///
> plines(5291.52)
note: p_age omitted because of collinearity
Note: Dependent variable in logarithmic form
WARNING: 76 observations removed due to less than 3 observations in the cluster.
OLS model:
lny Coef. Std. Err. z P>|z| [95% Conf. Interval]
dw_bath .1397767 .0200267 6.98 0.000 .100525 .1790283
dw_btype_d3 .0496587 .0211568 2.35 0.019 .0081921 .0911253
edu_postsec .0816588 .0883387 0.92 0.355 -.0914818 .2547994
edu_prim -.8848284 .1744707 -5.07 0.000 -1.226785 -.5428722
j_firm_access1 .0382874 .0129973 2.95 0.003 .0128132 .0637616
jud_erate .8928555 .2215139 4.03 0.000 .4586962 1.327015
jud_occup_el_sh -1.333666 .2481222 -5.38 0.000 -1.819977 -.8473555
p_activity_1_sh -.204576 .0306396 -6.68 0.000 -.2646285 -.1445235
p_age .0134131 .0023387 5.74 0.000 .0088293 .0179968
p_age2 -.0000788 .0000238 -3.31 0.001 -.0001255 -.0000321
p_ecstat_2_sh -.2178931 .0840416 -2.59 0.010 -.3826117 -.0531745
p_ecstat_4_sh .4529504 .0297431 15.23 0.000 .394655 .5112458
p_ecstat_hh_2 -.1803367 .0690036 -2.61 0.009 -.3155813 -.0450921
p_elder_sh .4871691 .0363326 13.41 0.000 .4159585 .5583797
p_hhsize_1 -.1331209 .0253237 -5.26 0.000 -.1827544 -.0834874
p_isced_max_0 -.5169389 .0540608 -9.56 0.000 -.6228961 -.4109817
p_isced_max_1 -.3758244 .0257253 -14.61 0.000 -.426245 -.3254037
p_isced_max_2 -.221132 .0191942 -11.52 0.000 -.2587519 -.1835122
p_isced_max_4 .138612 .0286283 4.84 0.000 .0825016 .1947225
p_isced_max_5 .2343928 .024233 9.67 0.000 .1868969 .2818887
p_isced_max_6 .237176 .0802218 2.96 0.003 .0799442 .3944078
p_male .1836228 .0218718 8.40 0.000 .140755 .2264906
p_marital_hh_1 -.052893 .030501 -1.73 0.083 -.1126738 .0068878
p_marital_hh_3 -.0809455 .0206275 -3.92 0.000 -.1213747 -.0405163
p_occup_pr_sh .3670144 .0363299 10.10 0.000 .2958092 .4382196
p_workstat_1_sh .5260943 .0270113 19.48 0.000 .4731532 .5790354
rooms_pc .0470549 .0109036 4.32 0.000 .0256843 .0684255
_cons 7.132082 .1597874 44.63 0.000 6.818904 7.445259
Alpha model:
Residual Coef. Std. Err. z P>|z| [95% Conf. Interval]
p_activity_1_sh -.3447938 .1425119 -2.42 0.016 -.624112 -.0654756
p_ecstat_4_sh -.8035096 .1585224 -5.07 0.000 -1.114208 -.4928114
p_ecstat_hh_4 -.3006224 .118973 -2.53 0.012 -.5338051 -.0674397
p_elder_sh -.8877215 .1511032 -5.87 0.000 -1.183878 -.5915647
p_isced_max_1 -.1701583 .10368 -1.64 0.101 -.3733673 .0330507
p_isced_max_5 .2211589 .1155722 1.91 0.056 -.0053584 .4476762
p_male .156548 .1078049 1.45 0.146 -.0547458 .3678417
p_marital_hh_1 .1685545 .1322579 1.27 0.203 -.0906662 .4277752
p_occup_pr_sh .5477886 .1793579 3.05 0.002 .1962536 .8993236
p_workstat_1_sh -1.289935 .1263831 -10.21 0.000 -1.537641 -1.042228
rooms_pc .124589 .0411818 3.03 0.002 .0438741 .2053038
_cons -6.317578 .13152 -48.04 0.000 -6.575353 -6.059804
GLS model:
lny Coef. Std. Err. z P>|z| [95% Conf. Interval]
dw_bath .1397681 .018897 7.40 0.000 .1027307 .1768054
dw_btype_d3 .043487 .0194832 2.23 0.026 .0053006 .0816734
edu_postsec .2515946 .1255387 2.00 0.045 .0055434 .4976459
edu_prim -.4613187 .2236142 -2.06 0.039 -.8995944 -.023043
j_firm_access1 .077631 .0206401 3.76 0.000 .0371772 .1180849
jud_erate .8129607 .3335183 2.44 0.015 .1592769 1.466644
jud_occup_el_sh -1.238829 .352069 -3.52 0.000 -1.928872 -.5487866
p_activity_1_sh -.1952054 .0310771 -6.28 0.000 -.2561153 -.1342954
p_age .0146863 .0021284 6.90 0.000 .0105146 .0188579
p_age2 -.0000884 .0000208 -4.25 0.000 -.0001292 -.0000477
p_ecstat_2_sh -.1790066 .090007 -1.99 0.047 -.3554171 -.0025961
p_ecstat_4_sh .3970182 .0272502 14.57 0.000 .3436088 .4504275
p_ecstat_hh_2 -.1760779 .0761019 -2.31 0.021 -.3252349 -.0269209
p_elder_sh .408984 .0333136 12.28 0.000 .3436906 .4742775
p_hhsize_1 -.1450279 .0228201 -6.36 0.000 -.1897545 -.1003014
p_isced_max_0 -.4641431 .0497277 -9.33 0.000 -.5616076 -.3666786
p_isced_max_1 -.3250672 .022455 -14.48 0.000 -.3690782 -.2810562
p_isced_max_2 -.1953666 .0172695 -11.31 0.000 -.2292142 -.1615191
p_isced_max_4 .1499224 .0233206 6.43 0.000 .1042149 .1956299
p_isced_max_5 .2418694 .0224473 10.77 0.000 .1978734 .2858654
p_isced_max_6 .2282902 .0750887 3.04 0.002 .0811191 .3754613
p_male .1945154 .0199999 9.73 0.000 .1553163 .2337145
p_marital_hh_1 -.046479 .0299398 -1.55 0.121 -.1051599 .0122019
p_marital_hh_3 -.0799847 .0177162 -4.51 0.000 -.1147078 -.0452616
p_occup_pr_sh .3263021 .0359602 9.07 0.000 .2558214 .3967828
p_workstat_1_sh .4861585 .0257385 18.89 0.000 .4357121 .536605
rooms_pc .0572602 .0106962 5.35 0.000 .036296 .0782244
_cons 6.668453 .2825154 23.60 0.000 6.114733 7.222173
Comparison between OLS and GLS models:
Variable bOLS bGLS
dw_bath .13977666 .13976806
dw_btype_d3 .04965872 .04348699
edu_postsec .0816588 .25159463
edu_prim -.88482838 -.46131871
j_firm_acc~1 .03828739 .07763102
jud_erate .89285555 .81296068
jud_occup_~h -1.333666 -1.2388292
p_acti~_1_sh -.20457601 -.19520536
p_age .01341306 .01468626
p_age2 -.00007879 -.00008844
p_ecstat_2~h -.21789309 -.17900661
p_ecstat_4~h .45295039 .39701819
p_ecstat_h~2 -.18033673 -.17607789
p_elder_sh .48716911 .40898405
p_hhsize_1 -.13312094 -.14502795
p_isced_ma~0 -.51693886 -.4641431
p_isced_ma~1 -.37582438 -.32506718
p_isced_ma~2 -.22113204 -.19536665
p_isced_ma~4 .13861204 .14992238
p_isced_ma~5 .23439281 .24186941
p_isced_ma~6 .23717604 .22829018
p_male .1836228 .19451539
p_marital_~1 -.05289303 -.04647901
p_marital_~3 -.0809455 -.0799847
p_occup_pr~h .36701439 .32630214
p_works~1_sh .52609429 .48615854
rooms_pc .04705493 .0572602
_cons 7.1320819 6.6684532
Model settings
Error decomposition H3
Beta drawing Bootstrapped
Eta drawing method normal
Epsilon drawing method normal
Empirical best methods Yes
Beta model diagnostics
Number of observations = 7564
Adjusted R-squared = .55294896
R-squared = .55454493
Root MSE = .45080868
F-stat = 347.46412
Alpha model diagnostics
Number of observations = 7564
Adjusted R-squared = .03563669
R-squared = .03703931
Root MSE = 2.2858123
F-stat = 26.407274
Model parameters
Sigma ETA sq. = .02312296
Ratio of sigma eta sq over MSE = .11377818
Variance of epsilon = .18255558
Initializing the Second Stage, this may take a while...
Bootstrapped drawing of betas and parameters
Number of simulations: 100. Each dot (.) represents 10 simulation(s).
1 2 3 4 5
..........
Finished running the Second Stage
6.4 Exporting the Simulated Vectors into Stata
This subroutine is useful for bringing the simulated vectors into Stata. It provides the user the ability to
further manipulate these vectors. Due to the unique structure of the Mata data, this command will only
work with data created in the simulation stage.
The command’s structure is:
sae data export, matasource(string) [numfiles(integer 1) prefix(string) saveold
datasave(string)]
• matasource(): The matasource option allows users to specify the source ydump ﬁle created by the
sae simulate routine. Because the size of the ﬁle can be quite large, it is advisable to use this with
the numfiles option.
• numfiles(): The numfiles option is to be used in conjunction with the ydumpdta option; it spec-
iﬁes the number of datasets to be created from the simulations.
• prefix(): The prefix option may be used to give a preﬁx to the simulated vectors.
• saveold: The saveold option can be speciﬁed in conjunction with the ydumpdta option, and makes
the ﬁles readable by older versions of Stata.
• datasave(): The datasave option allows users to specify a path in which to save the exported data; this is recommended when using the numfiles option.
6.5 Processing the Simulated Vectors
A set of commands that facilitate the processing of the outputs from the simulation are provided. This
subroutine is useful for post-estimation calculations based on the output of Mata data created in the sim-
ulation stage (6.3). It provides the user the ability to further calculate poverty and inequality indicators.
Those indicators may be based on new aggregated levels, new poverty lines, or diﬀerent weights that were
not implemented in the simulation stage. In the case of new weights, it is necessary for the user to have
included these new weights in the addvars option.
6.5.1 Process Indicators
This sub-routine allows for processing the simulated vectors to obtain poverty and inequality indicators.
The command’s structure is:
sae proc indicator, matasource(string) aggids(numlist sort) [INDicators(string) plinevar(string)
plines(numlist sort) area(string) weight(string)]
• matasource(): The matasource option specifies the source ydump file created by the
sae simulate routine.
• aggids(): The aggids option indicates the different aggregation levels for which the indicators are
to be obtained. Each value tells the command how many digits of the hierarchical id to drop from the
right in order to form the aggregate at that level. Using the hierarchical id specified in the area
option, AAMMEEE, specifying 0, 3, 5, and 7 yields aggregates at the E, M, and A levels and at the
national level, respectively.
• indicators(): The indicators option requests the indicators to be estimated from the
simulated vectors of welfare. The list of possible indicators is:
– The set of Foster-Greer-Thorbecke indexes (Foster, Greer, and Thorbecke 1984) FGT0, FGT1, and
FGT2, also known as the poverty headcount, poverty gap, and poverty severity, respectively.
– The set of inequality indexes: the Gini coefficient, and the Generalized Entropy index with α = 0, 1, 2.
– The set of Atkinson indexes.
• plinevar(): The plinevar option allows users to indicate a variable in the target data set to be
used as the threshold for the Foster-Greer-Thorbecke indexes (Foster, Greer, and Thorbecke 1984)
predicted from the second-stage simulations. The variable must have been added in the sae
data import command when preparing the target dataset. Only one variable may be specified.
• plines(): The plines option allows users to specify the threshold explicitly; this option is
preferred when the threshold is constant across all observations. It is possible to specify multiple
lines, separated by spaces.
• area(): The area option specifies the level at which the clustering is done; it indicates the level
at which ηc is obtained. The only constraint is that the variable must be numeric and must
match across datasets, although it is recommended that it follow a hierarchical structure similar to the
one proposed by Zhao (2006). Note that in this step the default is to use the areas defined in the
simulation step; this option gives the user the opportunity to change that grouping.
– The hierarchical id should be of the same length for all observations, for example: AAMMEEE.
• weight(): The weight option indicates a new variable containing the expansion factors to be used
in the target/ydump dataset. By default, the weight variable saved in the ydump file is used; if a
variable is specified here, all results are obtained with the new weighting. The variable must have
been added to the imported target data (sae data import).
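As an illustration, the options above might be combined as follows. The ydump file and poverty lines are hypothetical, and the keywords passed to indicators() are illustrative; consult the package's help file for the exact accepted names.

```stata
* Illustrative call (hypothetical ydump file and poverty lines):
* FGT and inequality indicators at the E, M, A, and national
* levels of the AAMMEEE hierarchical id, for two poverty lines
* expressed in the same units as the simulated welfare vectors.
sae proc indicator, matasource("sims/ydump.mta") aggids(0 3 5 7) ///
    indicators(fgt0 fgt1 fgt2 gini) plines(1.90 3.10)
```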
6.5.2 Profiling
This sub-routine aids researchers in creating a profile from the group classifications that result from
the simulated census vectors. For example, in the context of poverty it allows researchers to break down
the characteristics of the poor and the non-poor. In the context of anthropometric Z-scores, it allows
researchers to obtain the characteristics of individuals who fall below the indicator’s threshold. The
command syntax is as follows:
sae process stats, matasource(string) aggids(numlist sort) [contvar(string) catvar(string)
plinevar(string) plines(numlist sort) area(string) weight(string)]
• contvar(): The contvar option indicates the continuous variables for which the user wants to estimate
the mean/distribution for the poor and non-poor groups defined by the poverty lines in either plines
or plinevar. These statistics are aggregated at the levels indicated in the aggids option. The
variable must have been added in the sae data import command when preparing the target dataset.
• catvar(): The catvar option indicates the categorical variables for which the user wants to estimate
two-way frequencies/distributions for the poor and non-poor groups defined by the poverty lines in
either plines or plinevar. These statistics are aggregated at the levels indicated in the aggids
option. The variable must have been added in the sae data import command when preparing the
target dataset.
All other options resemble those detailed for other subroutines.
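As an illustration, profiling one continuous and one categorical characteristic at the M level of the AAMMEEE id could look like the sketch below. The variable names, poverty line, and ydump file are hypothetical; both variables would need to have been added via sae data import.

```stata
* Illustrative call (hypothetical variables and poverty line):
* mean years of schooling (continuous) and dwelling-type
* frequencies (categorical) for the poor and non-poor groups,
* aggregated at the M level (drop 3 digits of AAMMEEE).
sae process stats, matasource("sims/ydump.mta") aggids(3) ///
    contvar(schooling) catvar(dwelling) plines(1.90)
```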
7 Conclusions
As we approach the 2020 round of the population census, governments and their respective national
statistical systems have renewed their international commitments towards the Sustainable Development
Goals (SDGs). The SDGs will need to be operationalized at a sub-national level or reported for specific
subgroups of the population. Moreover, the availability and use of administrative, geospatial, and big
data sources for evidence-based policy making is on the rise, creating a number of important
opportunities for better and more up-to-date measures of social well-being. Against this backdrop,
ensuring the much-needed quality and rigor of the analysis produced requires the advancement and
availability of statistically valid methods with proper inference, such as small area estimation.
Direct estimates of poverty and inequality from household surveys are reliable only up to a certain
regional level. When estimates are needed at lower, more disaggregated levels, their reliability is
questionable. Under a set of specific assumptions, data from outside sources and household survey data
may be combined to provide policy makers with a more complete picture of poverty and inequality, along
with the spatial heterogeneity of these indicators.
In this paper we introduce a new family of Stata functions, sae, designed in a modular and flexible
way to manage the data, estimate models, conduct simulations, and compute indicators of interest. The
estimation functions have been benchmarked against the World Bank’s PovMap software for full validation.
We hope that the flexibility of this new family of functions will encourage producers of SAE to document
and report more systematically the robustness of their results to alternative methodological choices,
improving the replicability and increasing the transparency of this estimation process. The package’s
modular nature creates a platform for the introduction of new estimation techniques, such as count and
binary models, and encourages collaboration from the broader Stata community.
References
Bedi, T., A. Coudouel, and K. Simler (2007). More than a pretty picture: using poverty maps to design
better policies and interventions. World Bank Publications.
Demombynes, G., C. Elbers, J. O. Lanjouw, and P. Lanjouw (2008). How good is a map? Putting small
area estimation to the test. Rivista Internazionale di Scienze Sociali, 465–494.
Elbers, C., J. O. Lanjouw, and P. Lanjouw (2002). Micro-level estimation of welfare. World Bank Policy
Research Working Paper (2911).
Elbers, C., J. O. Lanjouw, and P. Lanjouw (2003). Micro–level estimation of poverty and inequality. Econo-
metrica 71 (1), 355–364.
Foster, J., J. Greer, and E. Thorbecke (1984). A class of decomposable poverty measures. Econometrica:
Journal of the Econometric Society, 761–766.
Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica:
Journal of the Econometric Society, 461–465.
Haslett, S., M. Isidro, and G. Jones (2010). Comparison of survey regression techniques in the context of
small area estimation of poverty. Survey Methodology 36 (2), 157–170.
Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics 9 (2), 226–252.
Huang, R. and M. Hidiroglou (2003). Design consistent estimators for a mixed linear model on survey data.
Proceedings of the Survey Research Methods Section, American Statistical Association (2003), 1897–1904.
Molina, I. and J. Rao (2010). Small area estimation of poverty indicators. Canadian Journal of Statis-
tics 38 (3), 369–385.
Rao, J. N. and I. Molina (2015). Small area estimation. John Wiley & Sons.
Tarozzi, A. and A. Deaton (2009). Using census and survey data to estimate poverty and inequality for
small areas. The review of economics and statistics 91 (4), 773–792.
Van der Weide, R. (2014). GLS estimation and empirical Bayes prediction for linear mixed models with
heteroskedasticity and sampling weights: A background study for the PovMap project. World Bank Policy
Research Working Paper (7028).
Zhao, Q. (2006). User manual for PovMap. World Bank.
http://siteresources.worldbank.org/INTPGI/Resources/342674-1092157888460/Zhao_ManualPovMap.pdf.