Guidelines to Small Area Estimation for Poverty Mapping
Paul Corral, Isabel Molina, Alexandru Cojocaru, and Sandra Segovia
June 16, 2022
Contents

1 Small Area Estimation for Poverty Mapping
  1.1 A Brief Background to Small Area Estimation
  1.2 About the Guidelines
2 Direct Estimates
  2.1 Design-Based Estimation
    2.1.1 Basic Direct Estimators
    2.1.2 GREG and Calibration Estimators
    2.1.3 Pros and Cons of Direct Estimators
    2.1.4 Example: Direct Estimates of Poverty Indicators
  2.2 Evaluation of Estimates
    2.2.1 Example: Bias vs MSE
  2.3 Appendix
    2.3.1 Definitions
3 Area-level Models for Small Area Estimation
  3.1 Overview
  3.2 Conducting Your First SAE Application with the Fay-Herriot Model
  3.3 Pros and Cons of Fay-Herriot Models
    3.3.1 Alternative Software Packages for Area-Level Models and Extensions
  3.4 Technical Annex
    3.4.1 Aggregation to Higher Geographic Areas
4 Unit-level Models for Small Area Estimation
  4.1 Preparing for Unit-Level Small Area Estimation
    4.1.1 The Assumed Model
  4.2 Conducting Your First SAE Application with a Unit-Level Model
    4.2.1 Model Choice Considerations
    4.2.2 Data Transformation
    4.2.3 The Alpha Model
    4.2.4 Producing Small Area Estimates
  4.3 Pros and Cons of Unit-Level Models
    4.3.1 ELL
    4.3.2 Empirical Best and CensusEB
    4.3.3 Alternative Software Packages for Unit-Level Models and Extensions
  4.4 Unit-Level Models – Technical Annex
    4.4.1 The Assumed Model
    4.4.2 Monte Carlo Simulation and Bootstrap Procedures Under CensusEB
      4.4.2.1 Molina and Rao’s (2010) Monte Carlo Simulation Procedure for Point Estimates
      4.4.2.2 Parametric Bootstrap
  4.5 Appendix
    4.5.1 Model Selection Do-File
    4.5.2 SAE Simulation Stage
5 Poverty Mapping in Off-Census Years
  5.1 Unit-Context Models
    5.1.1 Limitations of Unit-Context Models
  5.2 Gradient Boosting for Poverty Mapping
  5.3 Pros and Cons of Methods for Poverty Mapping in Off-Census Years
    5.3.1 Unit-Context Models
    5.3.2 Gradient Boosting
  5.4 Unit-Context Models – Technical Annex
    5.4.1 Producing Estimators Based on Unit-Context Models
  5.5 Appendix
    5.5.1 Simulation Experiment 1 for Unit-Context Models
      5.5.1.1 Unit-Context Models – Validation
      5.5.1.2 Unit-Context Models – Validation with Better Model Fit
    5.5.2 Simulation Experiment 2 for Unit-Context Models
      5.5.2.1 Unit-Context Models – Validation Across All Poverty Lines
6 Model Diagnostics
  6.1 The nested-error model
  6.2 Model and variable selection
  6.3 Regression diagnostics
    6.3.1 Multicollinearity
    6.3.2 Influence analysis
    6.3.3 Residual analysis
      6.3.3.1 Linearity
      6.3.3.2 Tests for normality of residuals and random effects
  6.4 Appendix
    6.4.1 Useful definitions and concepts
    6.4.2 Regression diagnostics do-file
7 Concluding Remarks
Acknowledgments
“If I have seen further it is by standing on the shoulders of giants.” – Sir Isaac Newton
These guidelines were prepared by Paul Corral, Isabel Molina, Alexandru Cojocaru, and Sandra Segovia.
The work has been carried out under the general direction of Carolina Sanchez, Benu Bidani, Carlos
Rodriguez Castelan and Kristen Himelein. These guidelines build upon the work of people within the
World Bank and academia like Chris Elbers, Jean Lanjouw and Pete Lanjouw, who developed the ELL
methodology, as well as the work of many other statisticians outside of the World Bank. Special thanks
to Qinghua Zhao for his advice and guidance during the creation of the Stata version of PovMap. Also
thanks to João Pedro de Azevedo, Minh Cong Nguyen, and Roy Van der Weide who all played crucial
roles in the advancement of the methods and tools used for poverty mapping within the World Bank. The
team also appreciates the many helpful comments on the guidelines and the background papers from
Minh Cong Nguyen, William Seitz, João Pedro de Azevedo, Moritz Meyer, Samuel Freije-Rodriguez,
Roy Van der Weide and David Newhouse. Also a special thanks to the reviewers of the final guidelines:
Domingo Morales, Harm Jan Boonstra, Roy Van der Weide and Partha Lahiri. Finally, thanks to Kristen
Himelein and C. Jason Smith for their excellent editing. The team is also grateful to our many colleagues
in national statistical agencies across the globe who have helped us to improve our approach to the work.
A special thanks is also given to Mexico’s Instituto Nacional de Estadística y Geografía (INEGI) for
providing us access to their Intercensal Survey. Without this survey, much of the work would not have
been possible.
Chapter 1
Small Area Estimation for Poverty
Mapping
The eradication of poverty, which was the first of the Millennium Development Goals (MDG) established by the United Nations and followed by the Sustainable Development Goals (SDG), requires knowing where the poor are located. Traditionally, household surveys are considered the best source of information on the living standards of a country’s population. Data from these surveys typically provide a sufficiently accurate direct estimate of household expenditures or income, and thus estimates of poverty, at the national level and for larger international regions. However, when one starts to disaggregate data by local areas or population subgroups, the quality of these direct estimates diminishes. Consequently, National Statistical Offices (NSOs) cannot provide reliable wellbeing statistics at a local level. For example, the “Module of Socioeconomic Conditions of the Mexican National Survey of Household Income and Expenditure” (ENIGH in Spanish) is designed to produce estimates of poverty and inequality every two years at the national level and for the 32 federal entities (31 states and Mexico City), with disaggregation by rural and urban zones. There is, however, a mandate to produce estimates by municipality every five years, and the ENIGH alone cannot provide estimates for all municipalities with adequate precision. This makes monitoring progress toward the Sustainable Development Goals more difficult.
Beyond the SDGs, poverty estimates at a local level are crucial for poverty reduction. Policymakers interested in poverty reduction and alleviation will often have a limited budget, and information on poverty at a local level can potentially improve targeting. The need for cost-effective disaggregated statistics makes it necessary to use indirect techniques for poverty estimation at a local level. To better understand the spatial distribution of poverty within countries, the World Bank has been producing poverty estimates at increasingly higher resolutions (poverty maps) since before the turn of the 21st century. A poverty map is the illustration of poverty rates on a map and is used to highlight the spatial distribution of poverty and inequality across a given country. Poverty mapping relies on Small Area Estimation (SAE) methods, which are statistical tools that can be deployed to produce statistics for small areas with much better quality than those obtained directly from a sample survey.1

Small area techniques combine auxiliary information from censuses, registers, or other larger surveys to produce disaggregated statistics of sufficient precision where survey data alone are insufficient. The techniques achieve this by specifying homogeneity relationships across areas; hence, they are based on the idea that “union is strength.” Although small area estimation is useful for many indicators, the focus here is on small area methods for poverty and related welfare-based indicators.

1 There is no universal threshold used to define a small area; every NSO establishes its own limit of what is an acceptable coefficient of variation (CV). Anything above the threshold is unlikely to be published by the NSO. Moreover, there is also no defined sample size that determines whether a group can be considered a small area, since this also depends on the indicator one wishes to estimate. For poverty, for example, the magnitude of the poverty rate is related to the sample size necessary for its accurate estimation, with higher proportions requiring smaller sample sizes.
Ending extreme poverty is one of the World Bank’s twin goals, and poverty reduction is how the institution’s efforts are often gauged. Accurate estimates of poverty at a granular level are an essential tool for monitoring poverty reduction and spatial inequalities. The World Bank’s efforts to develop small area estimates of poverty can be traced back to the late 1990s. An initial effort by the institution was implemented by Hentschel et al. (1998) in an application in Ecuador and is the bedrock of what came after. Much of the research effort on small area estimation within the World Bank came to a zenith in the early 2000s with the publication of the paper “Micro-level estimation of poverty and inequality” by Elbers, Lanjouw, and Lanjouw (2003). The method came to be known as ELL, and for over a decade and a half, it was the default method used for all poverty mapping conducted by the World Bank. The method’s popularity was spurred by the World Bank’s development of a software application called PovMap (Zhao 2006).2 The software provided a user-friendly interface for practitioners and did not require knowledge of specialized programming languages. PovMap was followed nearly a decade later by a Stata version called sae (Nguyen et al. 2018).3
The Guidelines to Small Area Estimation for Poverty Mapping (hereafter the Guidelines) come after more than two decades of poverty mapping by the World Bank. These guidelines on small area estimation are meant as a practical guide to enable practitioners working on small area estimation of poverty to make informed choices between the available methods and to make them aware of the strengths and limitations of the methodologies at hand. The Guidelines are also presented for practitioners who wish to learn about updates to the software and methods used within the World Bank for small area estimation. The Guidelines are not meant to be an exhaustive presentation of the small area estimation literature; instead, they focus almost exclusively on poverty and serve as a practical guide to assist practitioners in navigating the multiple different small area techniques available.4 The methods include, but are not limited to, those proposed by Elbers, Lanjouw, and Lanjouw (2003) that generated much of the early poverty mapping work at the World Bank, a number of subsequent refinements and improvements to the original ELL methodology, area-level models, and some ongoing research. The Guidelines advise readers on what may be the best approach for different poverty mapping projects and different sets of data constraints by reviewing the relevant literature and conducting simulation exercises that compare across available methodologies. It is expected that the Guidelines will become a valuable resource for practitioners working in the area of poverty mapping.
Before jumping into the Guidelines, a brief background to small area estimation is provided, as well as a decision tree (Figure 1.1) aimed at practitioners who seek what may be the best approach given their context. After the brief background, the Guidelines present direct estimates in Chapter 2. Area-level models are presented in Chapter 3, focusing on Fay-Herriot methods (Fay and Herriot 1979). Unit-level methods are presented in Chapter 4, giving special attention to the approaches of Elbers, Lanjouw, and Lanjouw (2003) and Molina and Rao (2010). Chapter 5 presents proposed alternatives for poverty mapping in off-census years beyond area-level models. Finally, Chapter 6 is devoted to model selection and other considerations. Throughout Chapters 2–5, the pros and cons of each chapter’s corresponding approaches are noted.
2 Demombynes (2002) presents earlier work of a module in SAS of the methodology.
3 Although sae was initially created to replicate many of the features and methods implemented in PovMap, the package
has undergone many updates. Corral, Molina, and Nguyen (2021) detail the main updates and the reasons behind these.
4 For a more thorough discussion on small area estimation, including applications for other indicators of interest, readers
should refer to Rao and Molina (2015).
1.1 A Brief Background to Small Area Estimation
Small area estimates are often used to create a poverty map, but the two terms, “Poverty Mapping” and “Small Area Estimation,” should not be used interchangeably. Small area estimation is a set of methods used by National Statistical Offices (NSOs) and other organizations to obtain estimates of acceptable quality for publication. As statistical information is disaggregated into smaller population subgroups, it becomes noisier and yields larger coefficients of variation (CV), which may exceed the threshold established by the NSO above which it will choose not to publish that specific estimate. Model-based small area estimation techniques incorporate auxiliary information from censuses, registers, geospatial data, or other large surveys to produce estimates of much better quality, even for areas with very small sample sizes.
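To make the CV criterion concrete, it can be sketched numerically. The snippet below is illustrative Python rather than the Guidelines’ Stata, and both the 20 percent publication threshold and the sample sizes are invented examples (actual thresholds vary by NSO); it shows how the CV of a direct estimate of a poverty rate grows as an area’s sample size shrinks.

```python
import math

def cv_of_poverty_rate(p, n):
    """CV of a direct estimate of a poverty rate p under simple random
    sampling: standard error sqrt(p*(1-p)/n) divided by the estimate p."""
    se = math.sqrt(p * (1.0 - p) / n)
    return se / p

# A 20% poverty rate estimated from areas of shrinking sample size,
# against a hypothetical 20% CV publication threshold.
for n in (1000, 250, 60, 15):
    cv = cv_of_poverty_rate(0.20, n)
    flag = "publishable" if cv <= 0.20 else "too noisy"
    print(f"n = {n:4d}  CV = {cv:.2f}  ({flag})")
```

The same formula also illustrates the footnote’s point that higher poverty rates need smaller samples: for fixed n, the CV is sqrt((1-p)/(p*n)), which decreases as p rises.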
Model-based techniques are popular because they yield good quality estimates even for areas with very small sample sizes. Of course, to achieve this, one needs to make model assumptions, and these assumptions need to be validated using the available data (Molina, Corral, and Nguyen 2021). Moreover, the quality of the model behind the estimates should be thoroughly evaluated, and small area estimates should be accompanied by valid measures of their precision (Rao and Molina 2015).
Among small area estimation methodologies, area-level models use only aggregate auxiliary information at the area level. Aggregated data are more readily available because they are typically not subject to confidentiality restrictions. These models have other advantages that come with aggregate data, such as lower sensitivity to unit-level outliers. Their disadvantage is that they use less detailed information than unit-level models, which work at the microdata level. Perhaps the most popular area-level method is the one proposed by Fay and Herriot (1979), which is discussed in more detail in Chapter 3 of these Guidelines.
Unit-level models take advantage of detailed unit-record information, when such information is available. These models rely on detailed income/expenditure information from household surveys and a set of household-level characteristics present in both a household survey and a population census to simulate household expenditures or incomes for each household in the population census data.5
The first unit-level model for small area estimation, discussed in more detail in Chapter 4 of these Guidelines, was proposed by Battese, Harter, and Fuller (1988) to estimate county crop areas of corn and soybeans. That model, hereafter called the BHF model, includes, apart from the usual individual errors, random effects for the areas representing the between-area heterogeneity, or idiosyncrasy, that the available auxiliary variables cannot explain. These models might also include area-level (or contextual) variables and thus may incorporate much more information in the estimation process than area-level models. As noted by Robinson (1991), models with random effects belong to the general class of mixed models and are commonly used in other fields such as biostatistics, engineering, econometrics, and other social sciences.6
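For readers who want the model in symbols now rather than in Chapter 4, the nested-error structure just described can be sketched as follows (the notation here is the one commonly used in the SAE literature; Chapter 4 gives the formal treatment):

```latex
% Nested-error (BHF) model for welfare y_{dj} of household j in area d:
y_{dj} = \mathbf{x}_{dj}^{\prime}\boldsymbol{\beta} + u_d + e_{dj},
\qquad u_d \sim N(0,\sigma_u^2), \qquad e_{dj} \sim N(0,\sigma_e^2),
```

where $\mathbf{x}_{dj}$ are the household’s covariates, $u_d$ is the area random effect capturing the between-area heterogeneity left unexplained by those covariates, and $e_{dj}$ is the household-level error, with $u_d$ and $e_{dj}$ independent.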
Model-based small area estimates rely on statistical models to establish a relationship between a dependent variable and a set of covariates. The covariates and the parameters obtained from the fitted model are then used to predict the dependent variable in the auxiliary data. In the context of small area estimates of poverty under unit-level models, the dependent variable is typically consumption/expenditure or income, and the methods rely on properly replicating the welfare distribution. The simulated welfare distribution makes it possible to obtain well-being indicators, such as the FGT class of indicators (Foster, Greer, and Thorbecke 1984). In the case of area-level models, the dependent variable is the direct estimator of the same indicator for the area.

5 Note that simulations are not required when using analytical formulas to obtain poverty estimates. When assuming normality, the expected value of the poverty headcount and gap can be calculated without resorting to Monte Carlo simulations (Molina 2019).
6 Early applications were in the area of animal and plant breeding to estimate genetic merits (Robinson 1991).
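As an illustration of the FGT class (a sketch in Python rather than the Guidelines’ Stata; the welfare values and poverty line below are made up), FGT(alpha) averages ((z - y)/z)^alpha over poor households: alpha = 0 gives the headcount ratio, alpha = 1 the poverty gap, and alpha = 2 the squared gap, which reflects severity.

```python
def fgt(welfare, z, alpha):
    """FGT poverty measure (Foster, Greer, and Thorbecke 1984): the mean
    of ((z - y)/z)**alpha over households with welfare y below line z,
    where non-poor households contribute zero."""
    n = len(welfare)
    return sum(((z - y) / z) ** alpha for y in welfare if y < z) / n

# Hypothetical welfare values and a poverty line of 100.
welfare = [40.0, 80.0, 120.0, 150.0, 60.0]
z = 100.0
print(fgt(welfare, z, 0))  # headcount ratio: 3 of 5 households poor -> 0.6
print(fgt(welfare, z, 1))  # poverty gap index
print(fgt(welfare, z, 2))  # squared poverty gap (severity)
```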
The ELL method falls under the class of methods based on unit-level models for small area estimation. The method relies on a BHF model specified at the household level and may also include area- or sub-area-level (contextual) variables. At the time of publication, it was one of the first methods to use microdata to simulate the welfare distribution, which allowed the calculation of numerous indicators beyond just area aggregates.7 The methodology quickly gained traction within the World Bank, as well as among National Statistical Offices around the world, and has been one of the most applied methods for obtaining small area estimates. Since ELL’s publication, however, there have been considerable advances in the literature, including Molina and Rao’s (2010) Empirical Best (EB) method.
When deciding which small area estimation method is preferable, there are many considerations the practitioner should take into account, but ultimately, the method chosen is often driven by the data at hand. When the practitioner has access to census data and a contemporaneous household survey, it is nearly always advisable to apply a unit-level model (Chapter 4), because such models often yield the most significant gains in precision. However, the data requirements for these models are considerable, and the data need to be checked to ensure the two sources are comparable beyond the temporal aspect. Potential variables need to be defined similarly in both data sources and must have similar distributions, as differences may lead to biased estimators. When choosing a unit-level model-based procedure, the next decision is often the aggregation level at which estimates will be reported. If there is a need for estimates at two different levels, then two-fold nested-error models may be used; otherwise, one-fold nested-error models are adequate.
When unit-level models are not possible, area-level models may be a viable alternative (Chapter 3). Area-level models are often easier to implement than unit-level models, since their data requirements are minimal, and they may be used when census and survey data are not contemporaneous or are otherwise incongruous. Also, area-level models rely on aggregated data, which are often not subject to confidentiality restrictions. In addition, if the most recent census is too old for a unit-level model, the recommended approach is to consider an area-level model. The most popular area-level model for small area estimation is the Fay-Herriot (FH) model, introduced by Fay and Herriot (1979) to estimate per capita income in small areas of the United States. The U.S. Census Bureau has used area-level models within the Small Area Income and Poverty Estimates (SAIPE) project to inform the allocation of federal funds to school districts. The Fay-Herriot method, discussed in the area-level model chapter, will often require a separate model for each indicator, implying that area-level covariates are linearly related to the indicator of interest. If data are available at different aggregation levels, it may be possible to implement the two-fold area-level model introduced by Torabi and Rao (2014), although no software implementation is currently available. Alternatively, it is possible to fit an FH model at the lowest aggregation level and aggregate these FH estimates to a higher level (see section 3.4.1). Although area-level models predate ELL and, in principle, should be simpler to execute, they have not been as commonly implemented by World Bank teams.
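The aggregation just mentioned is, at its core, a population-weighted average of the lower-level estimates. The sketch below is illustrative Python with invented municipality numbers, and its MSE formula assumes independence across areas; the procedure actually used in the Guidelines is the one detailed in section 3.4.1.

```python
def aggregate_fh(estimates, mses, pops):
    """Aggregate area-level estimates to a higher level as a
    population-weighted mean. Under independence across areas, the MSE of
    the aggregate is the sum of area MSEs times the squared weights."""
    total = sum(pops)
    weights = [p / total for p in pops]
    agg_est = sum(w * e for w, e in zip(weights, estimates))
    agg_mse = sum(w * w * m for w, m in zip(weights, mses))
    return agg_est, agg_mse

# Three hypothetical municipalities rolled up to one state.
est, mse = aggregate_fh(estimates=[0.30, 0.10, 0.20],
                        mses=[0.0004, 0.0001, 0.0002],
                        pops=[50_000, 30_000, 20_000])
print(est)  # 0.22
print(mse)
```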
In addition to unit- and area-level models, unit-context models have been proposed in the context of SAE for poverty during off-census years (Chapter 5). Unit-context models (section 5.1) attempt to model the population’s welfare distribution using only area-level covariates. This method may be attractive because it can rely on easily accessible aggregate data and may benefit from the larger sample size afforded by household-level welfare data, and thus appears to combine the best features of both area-level and unit-level methodologies. Nevertheless, simulated data and real-world data have shown that unit-context models yield biased poverty estimates and, therefore, based on the literature to date, are discouraged and not considered preferable to estimators based on FH area-level models. Additionally, unit-context models rely on a parametric bootstrap procedure to estimate the Mean Squared Error (MSE). Given the unit-context model’s assumption that the household-level welfare distribution does not depend on household-specific characteristics, it is unlikely that the MSE can be accurately estimated by a parametric bootstrap, making it difficult to properly evaluate the resulting small area estimates.

7 ELL follows and builds upon the work of Hentschel et al. (1998).
Figure 1.1: SAE Decision tree
Machine learning methods have also been proposed to obtain poverty maps in off-census years or when census data are unavailable. The methods often combine household survey data with ancillary geo-coded data (publicly available and proprietary) to produce local estimates of welfare. These methods have gained popularity in academic and policy circles since the publication of Jean et al. (2016). For example, during the COVID-19 pandemic, cash transfers in Nigeria were informed by poverty estimates obtained from a machine learning approach by Chi et al. (2021). However, these methods lack an accurate estimate of their noise and, to date, have not been submitted to rigorous statistical validation. The Guidelines present a brief look into how these machine learning methods compare to other available small area estimation approaches. Although much work remains to be done regarding machine learning methods, the Guidelines posit that these methods are promising (see section 5.2).
As the reader delves into the Guidelines and the cited literature, the complexity of choosing a method becomes apparent. Despite this complexity, a simple decision tree on method availability is presented in Figure 1.1. This decision tree can be used to assist practitioners in choosing which model is the best route for their context. It simplifies the process by directing readers to the specific chapters in the Guidelines relevant to their particular situation.
1.2 About the Guidelines
The Guidelines are meant as a practical guide for readers. Stata scripts are provided throughout the chapters. These scripts allow readers to replicate most sections and figures, which hopefully can assist new and more experienced practitioners in their SAE applications. Readers may download the latest Stata SAE packages used throughout the Guidelines from:
1. Unit-level models: https://github.com/pcorralrodas/SAE-Stata-Package
2. Area-level models: https://github.com/jpazvd/fhsae
Although R offers a wide variety of packages for poverty-related small area estimation, these are described in other sources: Rao and Molina (2015), Pratesi (2016), Molina (2019), and Morales et al. (2021), for example.
To assist readers throughout the text, simulated data (created following an assumed model and data generating process) as well as real-world data are used for method comparisons. Simulated data are used for model-based simulations, and scripts are provided so that readers can replicate many of the analyses in the text. These data help illustrate the application of the different methods discussed and make salient their advantages and disadvantages.8 In addition to simulated data, the Guidelines use the Mexican Intercensal survey for design-based validations, where methods are compared using real-world data and the true data generating process is unknown.9

8 See Tzavidis et al. (2018) for more details.
9 The Mexican Intercensal survey contains a welfare measure and is modified to mimic a census of 3.9 million households and 1,865 municipalities to allow for a design-based validation. From the created census, 500 survey samples were drawn for the validations shown throughout the document (see Corral et al. (2021) for more details on the modifications to the Mexican Intercensal survey and the sampling approach taken).
References
Battese, George E., Rachel M. Harter, and Wayne A. Fuller (1988). “An Error-Components Model for
Prediction of County Crop Areas Using Survey and Satellite Data”. In: Journal of the American
Statistical Association 83.401, pp. 28–36. issn: 01621459. url: http : / / www . jstor . org / stable /
2288915.
Chi, Guanghua, Han Fang, Sourav Chatterjee, and Joshua E Blumenstock (2021). “Micro-estimates of
Wealth for All Low-and Middle-income Countries”. In: arXiv preprint arXiv:2104.07761.
Corral, Paul, Kristen Himelein, Kevin McGee, and Isabel Molina (2021). “A Map of the Poor or a Poor
Map?” In: Mathematics 9.21. issn: 2227-7390. doi: 10.3390/math9212780. url: https://www.mdpi.
com/2227-7390/9/21/2780.
Corral, Paul, Isabel Molina, and Minh Cong Nguyen (2021). “Pull Your Small Area Estimates up by
the Bootstraps”. In: Journal of Statistical Computation and Simulation 91.16, pp. 3304–3357. doi:
10 . 1080 / 00949655 . 2021 . 1926460. url: https : / / www . tandfonline . com / doi / abs / 10 . 1080 /
00949655.2021.1926460.
Demombynes, Gabriel (2002). “A Manual for the Poverty and Inequality Mapper Module”. In: University
of California, Berkeley and Development Research Group, the World Bank.
Elbers, Chris, Jean O Lanjouw, and Peter Lanjouw (2003). “Micro–level Estimation of Poverty and
Inequality”. In: Econometrica 71.1, pp. 355–364.
Fay, Robert E and Roger A Herriot (1979). “Estimates of Income for Small Places: An Application of
James-Stein Procedures to Census Data”. In: Journal of the American Statistical Association 74.366a,
pp. 269–277.
Foster, James, Joel Greer, and Erik Thorbecke (1984). “A Class of Decomposable Poverty Measures”. In:
Econometrica: Journal of the Econometric Society 52, pp. 761–766.
Hentschel, Jesko, Jean Olson Lanjouw, Peter Lanjouw, and Poggi Javier (1998). “Combining Census and
Survey Data to Study Spatial Dimensions of Poverty”. In: World Bank Policy Research Working Paper
1928.
Jean, Neal, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon (2016).
“Combining Satellite Imagery and Machine Learning to Predict Poverty”. In: Science 353.6301, pp. 790–
794. url: https://www.science.org/doi/abs/10.1126/science.aaf7894.
Molina, Isabel (2019). Desagregación De Datos En Encuestas De Hogares: Metodologías De Estimación
En áreas Pequeñas. CEPAL. url: https://repositorio.cepal.org/handle/11362/44214.
Molina, Isabel, Paul Corral, and Minh Cong Nguyen (2021). “Model-based Methods for Poverty Mapping:
A Review”.
Molina, Isabel and JNK Rao (2010). “Small Area Estimation of Poverty Indicators”. In: Canadian Journal
of Statistics 38.3, pp. 369–385.
Morales, Domingo, María Dolores Esteban, Agustín Pérez, and Tomáš Hobza (2021). A Course on Small
Area Estimation and Mixed Models: Methods, Theory and Applications in R. Springer Nature.
Nguyen, Minh Cong, Paul Corral, João Pedro Azevedo, and Qinghua Zhao (2018). “sae: A Stata Package
for Unit Level Small Area Estimation”. In: World Bank Policy Research Working Paper 8630.
Pratesi, Monica (2016). Analysis of Poverty Data by Small Area Estimation. 1st. Wiley Series in Survey
Methodology. John Wiley and Sons Ltd. isbn: 978-1-118-81501-4. url: https://books.google.com.gh/books?id=TS29BgAAQBAJ.
Rao, JNK and Isabel Molina (2015). Small Area Estimation. 2nd. John Wiley & Sons.
Robinson, George K (1991). “That BLUP Is a Good Thing: The Estimation of Random Eﬀects”. In:
Statistical science 6.1, pp. 15–32.
Torabi, Mahmoud and JNK Rao (2014). “On Small Area Estimation under a Sub-area Level Model”. In:
Journal of Multivariate Analysis 127, pp. 36–55.
Tzavidis, Nikos, Li-Chun Zhang, Angela Luna, Timo Schmid, and Natalia Rojas-Perilla (2018). “From
Start to Finish: A Framework for the Production of Small Area Oﬃcial Statistics”. In: Journal of the
Royal Statistical Society: Series A (Statistics in Society) 181.4, pp. 927–979.
Zhao, Qinghua (2006). User Manual for Povmap. url: http://siteresources.worldbank.org/INTPGI/Resources/342674-1092157888460/Zhao_ManualPovMap.pdf.
Chapter 2
Direct Estimates
To fully benefit from the Guidelines, it is necessary first to understand the background material. The
goal of Chapter 2 is to present this background for the chapters that follow. The chapter begins by
describing the design-based setting, then presents the typical direct estimators together with the
advantages and disadvantages of their application. Finally, some tips for evaluating estimates are
provided.
Direct estimates are generated by taking a sample from the target universe of observations, surveying
that sample, then using statistical techniques to generalize the results to the entire population of interest.
Sample surveys provide reliable estimates of a target population without observing the entire population.
Sample survey data are used to produce direct estimates of totals, means, and other characteristics for
the whole population and for its domains. Here, “direct” means that only the sample units from the area
are used (see Lohr (2010) or Cochran (2007) for a thorough treatment of sampling theory and direct
estimation).
In areas with sufficiently large sample sizes, statistical institutes often report direct estimates by default
due to their desirable properties with respect to the sampling design (Molina 2019). Though there are
some global norms, defining “sufficiently large,” usually measured through the coefficient of variation, is
left to the data producers. In areas where the sample size is not sufficiently large, the results may still be
reported, but the lack of quality is usually flagged for users. In instances where the desired quality is
not met, small area estimates are necessary. Hence, small area estimates may be judged by how much
they improve on the quality of the direct estimates.
2.1 Design-Based Estimation
The traditional way to provide estimates of ﬁnite population parameters based on survey data is to
rely on design-based (or repeated sampling) estimation methods. However, using a survey to produce
direct estimates of adequate precision for particular domains of the population can be economically
prohibitive since it may require considerably large sample sizes (Cochran 2007). Even in the case of
limitless resources, huge samples may not be advisable because of a heightened risk of non-sampling or
implementation error. The quality of direct design-based estimates is evaluated with respect to their
sampling design, that is, with respect to all the possible samples taken from the population under the
proposed sampling design (Molina 2019). In this scenario the values of the variable of interest are
considered ﬁxed and they only vary due to the variation induced by the sampling design (ibid ).1
Consider the following notation for the next sections:
• U is the target population, e.g. the set of inhabitants in a country, of size N .
• y1 , . . . , yN are measurements of the study variable at the population units (assumed to be ﬁxed).
• Target finite population parameter or quantity of interest: $\delta = h(y_1, \ldots, y_N)$. For example, the
population mean $\bar{Y} = N^{-1} \sum_{j=1}^{N} y_j$.
• s is a random sample of size n drawn from the population U according to a given survey design.
• r = U − s is the set of non-sampled units (size N − n).
• There are D subpopulations/domains/areas, indexed by d = 1, . . . , D, of sizes N1 , . . . , ND .
• The mean of a variable Y in a subpopulation/domain/area d is given by $\bar{Y}_d = N_d^{-1} \sum_{i=1}^{N_d} y_{di}$, where
$y_{di}$ denotes the value of Y for individual i in area/domain d.
• $s_d$ is the subsample of size $n_d$ (which may be the empty set) from domain/area d, where $\sum_{d=1}^{D} n_d = n$.
• rd is the set of elements outside the sample of area d = 1, . . . , D.
• πdi is the inclusion probability of individual i in the sample from area/domain d.
• $\omega_{di} = \pi_{di}^{-1}$ is the sampling weight of individual i in area/domain d.
• πd,ij is the joint inclusion probability of individuals i and j in the sample from area/domain d.
With this notation in mind, the following subsection presents the standard direct estimators, including
basic direct estimators such as Horvitz-Thompson (HT), Hájek, and estimators requiring auxiliary data
like GREG and calibration estimators.
2.1.1 Basic Direct Estimators
Horvitz-Thompson (HT)
The Horvitz-Thompson (HT) estimator of the total of area/domain d, $Y_d = \sum_{i=1}^{N_d} y_{di}$, is
$\hat{Y}_d = \sum_{i \in s_d} \omega_{di} y_{di}$. This estimator is unbiased under the sampling design.
To estimate the mean of area/domain d, $\bar{Y}_d = N_d^{-1} \sum_{i=1}^{N_d} y_{di}$, the HT estimator is
$\hat{\bar{Y}}_d = N_d^{-1} \sum_{i \in s_d} \omega_{di} y_{di}$.
Note that it is assumed here that the true population size of the area, $N_d$, is known.
Assuming positive inclusion probabilities $\pi_{di}$ and $\pi_{d,ij}$ for every pair of units i, j in the area, an unbiased
estimate of the variance is given by:
$$\widehat{var}_\pi(\hat{\bar{Y}}_d) = N_d^{-2} \left[ \sum_{i \in s_d} \frac{y_{di}^2}{\pi_{di}^2} (1 - \pi_{di}) + 2 \sum_{i \in s_d} \sum_{j \in s_d,\, j > i} \frac{y_{di} y_{dj}}{\pi_{di} \pi_{dj}} \cdot \frac{\pi_{d,ij} - \pi_{di} \pi_{dj}}{\pi_{d,ij}} \right]$$
1 Rao and Molina (2015) note that it is possible to model population values as random instead of fixed to obtain
model-dependent direct estimates. For such estimates, inference is based on the probability distribution induced
by the model and will typically not depend on survey weights.
One typical challenge in applications is that there is not enough information about the sampling design.
Without the second-order inclusion probabilities $\pi_{d,ij}$, the formula above cannot be applied. Nevertheless,
for certain sampling designs where $\pi_{d,ij} \approx \pi_{di} \pi_{dj}$ for $j \neq i$, the second term of the formula is approximately
zero. Therefore, a simpler variance estimator, which does not depend on the second-order inclusion
probabilities $\pi_{d,ij}$, is: $\widehat{var}_\pi(\hat{\bar{Y}}_d) = N_d^{-2} \sum_{i \in s_d} \omega_{di}(\omega_{di} - 1) y_{di}^2$, where $\omega_{di} = \pi_{di}^{-1}$ is the sampling
weight of unit i in area/domain d.
Hájek
Despite the HT estimator being unbiased under the sampling design, its variance under the design may
be quite large. An alternative, although slightly biased, is Hájek’s estimator. This estimator is obtained
by using the sampling weights for the considered area to estimate $N_d$; consequently, for estimation of
$\bar{Y}_d$, it does not require knowing the population size of area d. The Hájek estimator,
$\hat{\bar{Y}}_d^{HA} = \hat{N}_d^{-1} \sum_{i \in s_d} \omega_{di} y_{di}$, uses $\hat{N}_d = \sum_{i \in s_d} \omega_{di}$
instead of $N_d$ to estimate the area mean $\bar{Y}_d$. An (approximately) unbiased estimator of its variance is obtained by
replacing $y_{di}$ with $e_{di} = y_{di} - \hat{\bar{Y}}_d^{HA}$ in the variance estimator of the HT estimator.
Both HT and Hájek use only area-speciﬁc sample data, that is to say, they only use the portion of the
survey data for a given area d.
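Although the Guidelines’ worked examples use Stata, the HT and Hájek formulas above are easy to sketch in a few lines of illustrative Python. The data below are hypothetical; `ht_mean_var_simple` implements the simplified variance estimator that ignores second-order inclusion probabilities.

```python
# Illustrative sketch (hypothetical data): Horvitz-Thompson and Hajek
# estimators of an area mean from area-specific sample data, plus the
# simplified variance estimator valid when pi_dij ~ pi_di * pi_dj.
def ht_mean(y, w, N_d):
    """HT estimator of the area mean; requires the true area size N_d."""
    return sum(wi * yi for wi, yi in zip(w, y)) / N_d

def hajek_mean(y, w):
    """Hajek estimator: estimates N_d by the sum of the sampling weights."""
    N_hat = sum(w)
    return sum(wi * yi for wi, yi in zip(w, y)) / N_hat

def ht_mean_var_simple(y, w, N_d):
    """Simplified variance estimator ignoring second-order probabilities."""
    return sum(wi * (wi - 1.0) * yi ** 2 for wi, yi in zip(w, y)) / N_d ** 2

# Toy area: 4 sampled households, each with weight 25, true size N_d = 100.
y = [10.0, 12.0, 8.0, 14.0]
w = [25.0] * 4
print(ht_mean(y, w, 100))   # 11.0: the weights sum exactly to N_d here,
print(hajek_mean(y, w))     # 11.0  so HT and Hajek coincide
```

When the weights happen to sum exactly to the true area size $N_d$, as in this toy sample, the HT and Hájek means coincide; in general they differ, and Hájek avoids the need to know $N_d$.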
Variance Estimation in Applications
Another challenge that may be encountered in applications is that, in many domains/areas d, there
is only a single primary sampling unit (PSU) present in the survey data; hence, the habitual variance
estimator cannot be calculated and some approximation of the variance is required.
Alternatives for approximating the variance of the estimator, depending on the sampling design and on
the information available to the practitioner, include linearization, balanced repeated replication (BRR)
with Fay’s method,2 as well as jackknife and bootstrap techniques. Rao (1988) and Rust and Rao
(1996) provide a deeper discussion on variance estimation in sample surveys and application examples.
Additionally, Bruch, Münnich, and Zins (2011) provide an overview of variance estimation in the presence
of complex survey designs.
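As a rough illustration of the replication idea (not the BRR or complex-design bootstrap procedures cited above), the sketch below bootstraps a Hájek-type weighted mean by resampling units with replacement. Data and function names are hypothetical; production bootstraps for complex designs resample PSUs within strata rather than individual units.

```python
# Illustrative sketch (hypothetical data): naive bootstrap variance of a
# Hajek-type weighted mean, resampling individual units with replacement.
import random

def hajek(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def bootstrap_var(y, w, B=2000, seed=42):
    """Variance of the replicate estimates across B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y)
    reps = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # resample n units
        reps.append(hajek([y[i] for i in idx], [w[i] for i in idx]))
    mean_rep = sum(reps) / B
    return sum((r - mean_rep) ** 2 for r in reps) / (B - 1)

y = [10.0, 12.0, 8.0, 14.0, 9.0, 11.0]
w = [15.0, 10.0, 20.0, 10.0, 25.0, 20.0]
print(hajek(y, w))          # point estimate
print(bootstrap_var(y, w))  # bootstrap variance approximation
```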
2.1.2 GREG and Calibration Estimators
Unlike the basic direct estimators presented above, the Generalized Regression (GREG) and, more
generally, calibration estimators require auxiliary information, such as the population totals or means of
auxiliary variables in d, $\bar{\mathbf{X}}_d = N_d^{-1} \sum_{i=1}^{N_d} \mathbf{x}_{di}$. If $\hat{\bar{\mathbf{X}}}_d$ is the HT estimator of $\bar{\mathbf{X}}_d$, then the GREG estimator
of $\bar{Y}_d$ is given by $\hat{\bar{Y}}_d^{GREG} = \hat{\bar{Y}}_d + (\bar{\mathbf{X}}_d - \hat{\bar{\mathbf{X}}}_d)' \hat{\mathbf{B}}_d$, where $\hat{\mathbf{B}}_d$ is the weighted least squares estimator of the coefficient
vector $\boldsymbol{\beta}_d$ in the linear regression of $y_{di}$ in terms of the vector $\mathbf{x}_{di}$ in area d. For an exhaustive presentation
of GREG estimators, see Rao and Molina (2015).
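A minimal numerical sketch of the GREG correction, under the simplifying assumptions of a self-weighting SRS and a single auxiliary variable with known population mean (all data hypothetical):

```python
# Illustrative sketch (hypothetical data): GREG estimator of an area mean.
# The HT mean of y is corrected using known population means of the
# auxiliaries and a survey-weighted least squares fit of y on x.
import numpy as np

def greg_mean(y, X, w, X_bar_pop, N_d):
    """y: (n,) outcomes; X: (n, p) auxiliaries incl. intercept;
    w: (n,) weights; X_bar_pop: (p,) known population means."""
    W = np.diag(w)
    beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted LS
    y_ht = w @ y / N_d                                    # HT mean of y
    x_ht = (w[:, None] * X).sum(axis=0) / N_d             # HT mean of x
    return y_ht + (X_bar_pop - x_ht) @ beta_hat

rng = np.random.default_rng(0)
n, N_d = 50, 1000
x = rng.uniform(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, n)   # y strongly explained by x
X = np.column_stack([np.ones(n), x])
w = np.full(n, N_d / n)                     # self-weighting SRS
X_bar_pop = np.array([1.0, 0.5])            # known population mean of x
est = greg_mean(y, X, w, X_bar_pop, N_d)
print(est)                                  # close to 2 + 3 * 0.5 = 3.5
```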
2.1.3 Pros and Cons of Direct Estimators
Data requirements:
• Sampling weights ωdi for survey units.
• True population size $N_d$ of area d for the HT estimator of $\bar{Y}_d$.
2 Section 4 of Dippo, Fay, and Morganstein (1984) details the application of the method at the U.S. Census Bureau.
Some advantages and disadvantages of direct estimates are the following:
Pros:
• They make no model assumptions (non-parametric).
• HT is exactly unbiased under the sampling design. Hájek and GREG estimators are approximately
unbiased under the sampling design for large areas.
• Unbiasedness: an estimator is unbiased under the sampling design if its expected value (mean)
across all the possible samples $s_d$ drawn from area d with the given design is equal to the
population parameter; in other words, if its design bias is equal to zero for all the values of
the parameter. The estimator $\hat{\theta}_d$ of parameter $\theta_d$ is unbiased if and only if $E_\pi(\hat{\theta}_d) - \theta_d = 0$.
• Consistency: as the sample size increases, the probability that the estimator $\hat{\theta}$ differs from the
true value $\theta$ by more than $\varepsilon$ approaches 0, for every $\varepsilon > 0$.
• Additivity (benchmarking property): $\sum_{d=1}^{D} \hat{Y}_d = \hat{Y}$, that is, the direct estimate of the total
for a larger area covering several areas coincides with the aggregation of the estimates of the totals
for the areas within the larger area.
Cons:
• Ineﬃcient for small areas. For an area d with small nd , traditional area-speciﬁc direct estimators
do not provide adequate precision.
• Direct estimates cannot be calculated for non-sampled domains.
2.1.4 Example: Direct Estimates of Poverty Indicators
Example based on Molina and Rao (2010) and Molina (2019).
• Target population: N households in a country.
• There are C diﬀerent areas/clusters c = 1, . . . , C of sizes N1 , . . . , NC respectively.
• Let ych be total income for each household h = 1, . . . , Nc in area/cluster c = 1, . . . , C .
• Let z be a ﬁxed poverty line.
• FGT indicators are defined as $F_{\alpha ch} = \left(\frac{z - y_{ch}}{z}\right)^{\alpha} I(y_{ch} < z)$, $h = 1, \ldots, N_c$, $\alpha = 0, 1, 2$, where the
indicator function $I(\cdot)$ is 1 if household h is under the poverty line z, and 0 otherwise (Foster,
Greer, and Thorbecke 1984).
• Then, for every cluster/area c, the FGT poverty indicator of order $\alpha$ for area c is $F_{\alpha c} = \frac{1}{N_c} \sum_{h=1}^{N_c} F_{\alpha ch}$.
• The goal is to estimate the poverty indicator Fαc for every cluster/area c; take a random sample
sc of size nc (which might be equal to zero) from cluster/area c.
• Horvitz-Thompson estimator: $\hat{F}_{\alpha c} = \frac{1}{N_c} \sum_{h \in s_c} \omega_{ch} F_{\alpha ch}$.
• Hájek estimator: $\hat{F}_{\alpha c} = \frac{1}{\hat{N}_c} \sum_{h \in s_c} \omega_{ch} F_{\alpha ch}$, where $\hat{N}_c = \sum_{h \in s_c} \omega_{ch}$.
• Here, $\omega_{ch} = \pi_{ch}^{-1}$, where $\pi_{ch}$ is the inclusion probability of household h in the sample of cluster/area c.
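The FGT computations in this example can be sketched in illustrative Python (hypothetical incomes and weights; the poverty line of 715 mirrors the one used in the Stata example of Chapter 3):

```python
# Illustrative sketch (hypothetical data): FGT indicators and their
# Horvitz-Thompson and Hajek direct estimates for one area/cluster c.
def fgt(y_ch, z, alpha):
    """FGT_alpha for one household: ((z - y)/z)^alpha * I(y < z)."""
    return ((z - y_ch) / z) ** alpha if y_ch < z else 0.0

def direct_fgt(y, w, z, alpha, N_c=None):
    """HT estimate if the true size N_c is given; Hajek otherwise."""
    denom = N_c if N_c is not None else sum(w)
    return sum(wi * fgt(yi, z, alpha) for wi, yi in zip(w, y)) / denom

z = 715.0                           # poverty line
y = [500.0, 800.0, 650.0, 900.0]    # hypothetical household incomes
w = [10.0, 10.0, 20.0, 10.0]        # sampling weights
print(direct_fgt(y, w, z, 0))       # Hajek FGT0 (headcount): 0.6
```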
2.2 Evaluation of Estimates
Comparison of estimates is based on statistical measures of precision and accuracy. It is stated that
estimate $\hat{\theta}_1$ has more precision than estimate $\hat{\theta}_2$ if the (replicable) values of the estimate $\hat{\theta}_1$ are closer to
each other than the ones of $\hat{\theta}_2$, that is to say, if the realizations of $\hat{\theta}_1$ have less dispersion. On the other
hand, accuracy measures the closeness of estimate $\hat{\theta}$, i.e. $E(\hat{\theta})$, to the true value of the population
parameter $\theta$.
For example, the estimated mean squared error (MSE) is a measure of both precision (measured by the
variance of the estimate $\hat{\theta}$) and accuracy (measured by the squared bias of the estimate $\hat{\theta}$).
The particular error measures and “acceptable” error levels depend on the National Statistical Office’s
guidelines or on practitioners/researchers. The most common error measures of an estimator are: standard
error, variance, coefficient of variation (CV), and mean squared error (MSE). The smaller the value of all
these measures, the better the estimator.
The sections above reviewed direct estimators for means and totals. Direct estimators are obtained simi-
larly for any other target parameter that is additive in the individual observations. Direct estimators are
recommended at the national level and for disaggregated levels where direct estimators have a coeﬃcient
of variation (CV) below an established threshold for every area (Molina 2019). For disaggregations where
direct estimators have a CV above the CV threshold or absolute relative bias (ARB) above a given ARB
threshold, direct estimates are not recommended (see Chapters 3 and 4 for the application of indirect
estimators). If there are areas where not even the indirect estimators satisfy these requirements, it
is recommended to flag those estimates clearly, indicating that they are of low quality.3
In order to obtain a clear assessment of the precision and accuracy gains of model-based SAE meth-
ods over direct estimation, it is helpful to present the following information when reporting poverty
estimates:4
• Area code d
• True population size of area d, $N_d$
• Sample size of area d, $n_d$
• Direct estimate of poverty for area d, $\hat{F}_{\alpha d}$
• Small area estimate of poverty for area d, for example the EB estimate $\hat{F}_{\alpha d}^{EB}$
• Estimated variance of $\hat{F}_{\alpha d}$ in area d
• Estimated mean squared error (MSE) of $\hat{F}_{\alpha d}^{EB}$ in area d, or
• Coefficient of variation (CV) of both $\hat{F}_{\alpha d}$ and $\hat{F}_{\alpha d}^{EB}$ in area d
Direct estimators have the advantage that they are (approximately) unbiased under the sampling design.
When comparing two unbiased estimators, choosing the one with the smallest variance is recommended.5
If one of them is not unbiased, it is recommended to select the one with the smallest MSE (Molina and
3 The CV threshold can vary from scenario to scenario; some practitioners and national statistical oﬃces consider that
estimates with a CV above 20% are not reliable.
4 See Molina and Rao (2010) for a clear example of poverty indicators report and assessment.
5 Since $MSE_\pi(\hat{F}_{\alpha d}) = Bias_\pi^2(\hat{F}_{\alpha d}) + Var_\pi(\hat{F}_{\alpha d})$ and direct estimators are (approximately) unbiased, both MSE and
variance are approximately equal. This holds for areas with large sample size $n_d$; for areas with small sample size, the
Hájek estimator is biased.
García-Portugues 2021), as it is preferable to tolerate small amounts of bias if there are large gains
in precision (the bias-variance tradeoﬀ). The model-based SAE methods presented in the subsequent
chapters introduce some bias under the design while decreasing the MSE. There is, however, a limit to
the amount of bias that can be tolerated before the estimators lose utility.
Why use the coefficient of variation (CV)? The CV of $\hat{\theta}$, $cv(\hat{\theta}) = \sqrt{var(\hat{\theta})}/\theta$, is a relative measure of
error, which makes it unitless and easier to interpret than the MSE. Nevertheless, for the CV to have a
correct interpretation, the value of the estimate must be greater than zero. Moreover, when estimating
a proportion P (poverty rates, for example), larger sample sizes are often necessary to keep the error
measure below a certain limit than are required when estimating totals (Molina 2019).6 Since
the CV increases as the proportion P decreases, the CV is no longer a measure of the relative error but
a measure of how small the proportion P is. Therefore, calculating the MSE (or its root) for proportions
P is a better option.
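A quick numeric illustration of this point, assuming simple random sampling so that $var(\hat{P}) \approx P(1-P)/n$ (hypothetical numbers):

```python
# Numeric illustration (assumed SRS): the CV of an estimated proportion,
# sqrt(P(1-P)/n)/P, grows as P shrinks even though the absolute error falls.
import math

def cv_of_proportion(p, n):
    return math.sqrt(p * (1 - p) / n) / p

n = 400
for p in (0.5, 0.2, 0.05):
    print(p, round(cv_of_proportion(p, n), 3))
# With n = 400 the CV rises from 0.05 at P = 0.5 to roughly 0.22 at P = 0.05,
# even though the standard error itself is smaller for the smaller proportion.
```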
2.2.1 Example: Bias vs MSE
Figure 2.1 shows the design bias (left) and design MSE (right) of direct estimates of FGT0 indicators
(poverty rates) compared to unit-level model-based estimators, both presented at the municipal level. It
is essential to notice that even though the direct estimate’s bias is approximately equal to 0, its variance
(or MSE) in many areas is much greater than the one from the model-based estimators.
Figure 2.1: Bias and MSE of direct estimates of the FGT0 indicators at the municipal level
Source: Based on 500 samples drawn from the Mexican Intracensal survey, which is used as a census of 3.9 million
households (see Corral et al. (2021)). The figure compares design bias and design MSE: Hájek direct estimates are
approximately design-unbiased but have a large MSE, whereas indirect methods such as unit-level Census EB
sacrifice design bias in pursuit of improved precision.
6 The CV will tend to be large when the poverty rate for a given area is low since even a small variance will yield a
relatively large CV. The only way to remedy such a result is by increasing the sample size.
2.3 Appendix
2.3.1 Deﬁnitions
Deﬁnitions are based on Rao and Molina (2015) and Cochran (2007).
• Unit of analysis: the level at which a measurement is taken. For example, persons, households,
farms, ﬁrms.
• Target population or universe: The population of interest from which the relevant information
is desired. For example, the country’s population in the case where national living standards are
the objective.
• Domain/area: domain d may be deﬁned by geographic areas (state, county, school district,
health service area) or socio-demographic groups or both (speciﬁc age-sex-race group within a
large geographic area) or other subpopulations (e.g., set of ﬁrms belonging to a census division by
industry group).
• Finite population parameter θ: quantity computed from the N measurements in the units of
analysis in the population. It is often a descriptive measure of a population, such as the mean,
variance, rate or proportion.
• Estimator $\hat{\theta}$: function of the sample data that takes values “close” to $\theta$.
• Direct estimator $\hat{\theta}_d$ of $\theta_d$: estimator based only on the domain-specific sample data. An
estimator of an indicator for area d, $\theta_d$, is direct if it is calculated using only data from that
domain/area d, without using data from any other area. These estimators have the advantage of
being (at least approximately) unbiased but tend to have low precision in areas with a small sample
size.
• Unbiased estimator under the sampling design: an estimator whose expected value (mean)
across all the possible samples is equal to the population parameter, i.e. whose design bias is equal
to zero for all the values of the parameter. The estimator $\hat{\theta}$ of parameter $\theta$ is design-unbiased if and
only if $E_\pi(\hat{\theta}) - \theta = 0$.
• Estimation error: $\hat{\theta} - \theta$; note that the estimation error for a given sample is typically different
from zero even if the estimator is unbiased. The design bias is the mean estimation error across all
the possible samples: $Bias_\pi(\hat{\theta}) = E_\pi(\hat{\theta}) - \theta = E_\pi(\hat{\theta} - \theta)$.
• Mean squared (estimation) error (MSE): $MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$; note that $MSE_\pi(\hat{\theta}) =
Bias_\pi^2(\hat{\theta}) + Var_\pi(\hat{\theta})$ in the design-based setup, where $\theta$ is not random. In the model-based setup,
it is $MSE(\hat{\theta}) = Bias^2(\hat{\theta}) + Var(\hat{\theta} - \theta)$.
• Standard error (SE) of $\hat{\theta}$: the standard deviation of the sampling distribution of the estimator
$\hat{\theta}$ of parameter $\theta$.
• Coefficient of variation (CV) of $\hat{\theta}$: $cv(\hat{\theta}) = \sqrt{var(\hat{\theta})}/E(\hat{\theta})$, the standard error of
the estimate over the expected value of the estimate ($E(\hat{\theta}) = \theta$ when the estimator is unbiased).
The CV is also known as the relative standard deviation (RSD).
• Absolute relative bias (ARB) of $\hat{\theta}$: $ARB(\hat{\theta}) = |(E(\hat{\theta}) - \theta)/E(\hat{\theta})|$.
• Consistency of $\hat{\theta}$: as the sample size increases, the probability that the estimator $\hat{\theta}$ differs from
the true value $\theta$ by more than $\varepsilon$ approaches 0, for every $\varepsilon > 0$.
• Simple random sampling (SRS): Sampling approach where a sample is chosen from a larger
population at random. First-order inclusion probabilities are equal across elements (Särndal,
Swensson, and Wretman 2003).
– With replacement: Elements within the population may be randomly sampled multiple times.
In an ordered design (which records the drawing order and the number of times each element
is drawn), every ordered sample has the same selection probability (ibid).
– Without replacement: Elements are selected from a population without replacement. Every
sample of a ﬁxed size has an equal probability of being selected (ibid ).
• Sampling proportional to size: Method of sampling from a ﬁnite population where size mea-
sures for each unit are available. The probability of selecting a unit is proportional to its size. It
is often used for multistage sampling (Skinner 2014).
• Informative sampling: sampling in which the sample selection probabilities are related to the
outcome values (Rao and Molina 2015). This results in a set of base weights reflecting the unequal
probability of being sampled. Under informative sampling, in the absence of weights, the values of
the outcome variable are not representative of the population (Pfeffermann and Sverchkov 2009).
• Benchmarking: A practice where the direct estimator is assumed to be reliable and adjustments
to the small area estimates are necessary to ensure agreement with the reliable estimator (Rao
and Molina 2015). For example, the direct estimator could be the national level poverty rate and
small area estimates for the areas within the country are adjusted to ensure agreement between
the aggregate national poverty rate and the direct estimate of national poverty.
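The design-based definitions above (fixed population values, design bias, and the decomposition $MSE_\pi = Bias_\pi^2 + Var_\pi$) can be verified with a small simulation over repeated SRS draws; the population and sample sizes below are arbitrary:

```python
# Simulation sketch (hypothetical population, SRS without replacement):
# the sample mean is design-unbiased for the population mean, and its MSE
# across repeated samples equals Bias^2 + Var exactly.
import random

random.seed(1)
population = [float(i % 7) for i in range(200)]   # fixed finite population
theta = sum(population) / len(population)          # true parameter

estimates = []
for _ in range(5000):                              # repeated samples
    s = random.sample(population, 20)              # SRS without replacement
    estimates.append(sum(s) / len(s))

mean_est = sum(estimates) / len(estimates)
bias = mean_est - theta                            # close to 0 by design
var = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)
print(abs(mse - (bias ** 2 + var)) < 1e-9)         # decomposition holds
```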
References
Bruch, Christian, Ralf Münnich, and Stefan Zins (2011). “Variance Estimation for Complex Surveys”.
In: Deliverable D3 1.
Cochran, William G (2007). Sampling Techniques. John Wiley & Sons.
Corral, Paul, Kristen Himelein, Kevin McGee, and Isabel Molina (2021). “A Map of the Poor or a Poor
Map?” In: Mathematics 9.21. issn: 2227-7390. doi: 10.3390/math9212780. url: https://www.mdpi.
com/2227-7390/9/21/2780.
Dippo, Cathryn S, Robert E Fay, and David H Morganstein (1984). “Computing variances from complex
samples with replicate weights”. In: Proceedings of the Survey Research Methods Section. American
Statistical Association Alexandria, VA, pp. 489–494.
Foster, James, Joel Greer, and Erik Thorbecke (1984). “A Class of Decomposable Poverty Measures”. In:
Econometrica: Journal of the Econometric Society 52, pp. 761–766.
Lohr, Sharon L (2010). Sampling: Design and Analysis. 2nd. Cengage Learning. isbn: 0495105279.
Molina, Isabel (2019). Desagregación De Datos En Encuestas De Hogares: Metodologías De Estimación
En áreas Pequeñas. CEPAL. url: https://repositorio.cepal.org/handle/11362/44214.
Molina, Isabel and Eduardo García-Portugues (2021). A First Course on Statistical Inference. Accessed:
2010-06-22. Bookdown.org. url: https://bookdown.org/egarpor/inference/.
Molina, Isabel and JNK Rao (2010). “Small Area Estimation of Poverty Indicators”. In: Canadian Journal
of Statistics 38.3, pp. 369–385.
Pfeﬀermann, Danny and Michail Sverchkov (2009). “Inference under Informative Sampling”. In: Handbook
of statistics. Vol. 29. Elsevier, pp. 455–487.
Rao, JNK (1988). “Variance Estimation in Sample Surveys”. In: Handbook of Statistics. Ed. by PK
Krishnaiah and CR Rao. Vol. 6. Elsevier B.V. Chap. 17, pp. 427–447. url: https://doi.org/10.
1016/S0169-7161(88)06019-5.
Rao, JNK and Isabel Molina (2015). Small Area Estimation. 2nd. John Wiley & Sons.
Rust, Keith F and JNK Rao (1996). “Variance Estimation for Complex Surveys Using Replication Tech-
niques”. In: Statistical methods in medical research 5.3, pp. 283–310.
Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman (2003). Model Assisted Survey Sampling. Springer
Science & Business Media. isbn: 0-387-40620-4. url: https://books.google.com.gh/books?id=ufdONK3E1TcC.
Skinner, Chris J (2014). “Probability Proportional to Size (PPS) Sampling”. In: Wiley StatsRef: Statistics
Reference Online, pp. 1–5.
Chapter 3
Area-level Models for Small Area
Estimation
This chapter focuses on the application of the basic area-level models for small area estimation, using
Fay and Herriot’s (1979) model as an example. Following a brief introduction, the next section provides
an example of an application of the Fay-Herriot model for small area estimation. Then the advantages
and disadvantages of the method are discussed. Finally, a technical annex on the methodology is
provided. This chapter draws heavily from Molina (2019) and Molina, Corral, and Nguyen (2021), and
readers are advised to consult these texts for further information.
3.1 Overview
In the context of poverty, methods based on area-level models typically rely on estimating the distribution
of the target area-level poverty indicator, given a set of area-level covariates based on two models – a
sampling model for direct area estimates of poverty and a linking model where the true indicator is
linearly related to a vector of auxiliary variables across all areas (see Technical appendix in this chapter).
In contrast to unit-level methods, these methods do not require unit record (most commonly household
level) census data since they use only auxiliary information aggregated to the area level. Because
aggregated data are not typically subject to confidentiality concerns, they are usually more readily
available. Such is often the case
with population census data, where aggregated estimates for localities may be readily available while
the detailed household-level data are restricted. Beyond censuses, data from other sources, such as
administrative records, remote sensing data, and previous census aggregates, may also be available at
the level of the areas or can be readily aggregated to that level and may be included as covariates in
an area-level model. Aggregated data are also less sensitive to unit-level outliers. However, if there is
high heterogeneity within the geographic units at which the area models are estimated, these beneﬁts
are balanced by a substantial loss of information when unit record level data are aggregated. Ultimately,
whether unit record-level models oﬀer a greater degree of precision depends on the explanatory power of
the covariates that can be included in the area-level and unit-level models.
The most popular area-level model is the Fay-Herriot (FH) model, introduced by Fay and Herriot (1979)
to estimate mean per capita income in small places in the United States of America (USA). The U.S.
Census Bureau has regularly used it within the Small Area Income and Poverty Estimates (SAIPE)
project to produce estimates for counties and states of the total number of people in poverty by age
groups, household median income, and mean per capita income (see Bell 1997). The U.S. Department of
Education uses these estimates to determine annual allocations of federal funds given to school districts.
Several other studies have employed area-level models to obtain poverty estimates for small areas around
the world. Molina and Morales (2009) used these models to obtain estimates for Spanish provinces, Corral
and Cojocaru (2019) in Moldova, and Casas-Cordero Valencia, Encina, and Lahiri (2016) in Chile, while
Seitz (2019) applies the FH model to obtain district-level estimates of poverty in Central Asian
countries.
Extensions of the basic FH model to account for temporal and spatial correlation have also been consid-
ered in the poverty mapping context, although not explored in these Guidelines. Interested readers may
refer to Esteban et al. (2012a; 2012b), who used the FH model with temporal correlation ﬁrst proposed
by Rao and Yu (1994) to estimate small area poverty indicators. Giusti, Masserini, and Pratesi (2017)
used a spatial FH model to estimate mean income and poverty rates for the 57 Labor Local Systems
of the Tuscany region in Italy for the year 2011. Marhuenda, Molina, and Morales (2013) considered
a spatio-temporal model to estimate poverty indicators for Spanish provinces in 2008, making use of
survey data from years 2004-2008.
3.2 Conducting Your First SAE Application with the Fay-Herriot
Model
The ﬁrst step towards obtaining small area estimates based on the FH model is to compute direct
estimates. For the purpose of this exercise, a 1 percent SRS sample within municipalities was drawn
from the Mexican intracensal survey of 2015 (Encuesta Intracensal). To obtain direct estimates with
correctly calculated standard errors from a complex sample, it is necessary to input the structure by
specifying the stratiﬁcation and clustering used. Complex survey designs usually include a minimum
of two clusters per stratum to permit the estimation of sampling variance using standard procedures,
but because SAE reports results for areas below the level for which the sample was originally designed, there may be
areas represented in the sample by just one PSU; in such cases, the usual variance estimates cannot be
computed and it may be necessary to obtain an estimate of the sampling variance via alternative methods.
This is a clear limitation of the FH method.1 Another limitation of the method is that, in locations with
a small sample size, everyone may be poor or no one is poor, which leads to an estimated variance of 0.
In this case, the FH model is not directly applicable in those locations unless another method is used
to predict those variances. The Stata script below describes how to obtain direct estimates of poverty
rates $\hat{\tau}_d^{DIR}$ and their sampling variances $\psi_d$ for each location d in the sample:
clear all
set more off
/*==============================================================================
Do-file prepared for SAE Guidelines
- Real world data application
- authors Paul Corral
*==============================================================================*/
global main "C:\Users\\`c(username)'\OneDrive\SAE Guidelines 2021\"
global section "$main\2_area_level\"
global mdata "$section\1_data\"
global survey "$mdata\Survey_1p_mun.dta"
global povline = 715
1 See Section 3.4.
version 15
set matsize 8000
set seed 648743
*===============================================================================
// End of preamble
*===============================================================================
//load in survey data
use "$survey", clear
//Population weights and survey setup
gen popw = Whh*hhsize
//FGT
forval a=0/2{
gen fgt`a' = (e_y<${povline})*(1-e_y/(${povline}))^`a'
}
//SVY set, SRS
svyset [pw=popw], strata(HID_mun)
//Get direct estimates
qui:svy: proportion fgt0, over(HID_mun)
//Extract poverty rates for the 1,865 municipalities - ordered from
//smallest HID_mun to largest HID_mun
mata: fgt0 = st_matrix("e(b)")
mata: fgt0 = fgt0[1866..2*1865]'
mata: fgt0_var = st_matrix("e(V)")
mata: fgt0_var = diagonal(fgt0_var)[1866..2*1865]
//Our survey already has the area level means of interest obtained from
// the census. Leave data at the municipality level.
gen num_obs = 1
groupfunction [aw=popw], rawsum(num_obs) mean(poor fgt0) first(mun_*) by(HID_mun)
sort HID_mun //ordered to match proportion output
//Pull proportion's results
getmata dir_fgt0 = fgt0 dir_fgt0_var = fgt0_var
replace dir_fgt0_var = . if dir_fgt0_var==0
replace dir_fgt0 = . if missing(dir_fgt0_var)
save "$mdata\direct.dta", replace
With the direct survey estimates $\hat{\tau}_d^{DIR}$ of the indicators $\tau_d$ in hand, it is possible to produce new estimates
of $\tau_d$ based on the fitted model. Apart from the direct estimates $\hat{\tau}_d^{DIR}$ for each location, this model takes as
inputs their corresponding (heteroskedastic) sampling variances $\psi_d$, which are assumed known, and the
aggregated values of the covariates $x_d$. This information is used to estimate the FH regression
coefficients $\beta$ and the variance $\sigma_u^2$ of the area effects $u_d$. There are multiple approaches to predict the
unknown random effects $u_d$ of the model in Eq. 3.1. Stata's fhsae package (Corral et al. 2018)2
obtains estimates similar to those provided from the eblupFH() function of Molina and Marhuenda’s
(2015) R sae package, while the fayherriot package in Stata by Halbmeier et al. (2019) is an excellent
alternative since it oﬀers transformations to the data to ensure the assumptions of the model are met.
The available model-ﬁtting methods in the fhsae package are maximum likelihood (ML), restricted
maximum likelihood (REML), and an approach presented by Fay and Herriot (1979) based on the
method of moments. As noted by Molina (2019), REML corrects the ML estimate of $\sigma_u^2$ for the degrees
of freedom used in estimating the regression coefficients, $\beta$, and yields a less biased estimator in finite
samples; thus REML is used in the example below.
2 Downloadable from https://github.com/pcorralrodas/fhsae.
The final estimate of the target indicator $\tau_d$ is the empirical best linear unbiased predictor (EBLUP) of $\tau_d$
under the FH model. This estimator uses the estimated value $\hat{\sigma}_u^2$ of $\sigma_u^2$ and can be expressed as a weighted
average of the direct estimator and the regression-synthetic estimator $x_d\hat{\beta}$, as noted by Rao and
Molina (2015). The final EBLUP, referred to as the FH estimator, is given by

$$\hat{\tau}_d^{FH} = \hat{\gamma}_d\,\hat{\tau}_d^{DIR} + \left(1-\hat{\gamma}_d\right)x_d\hat{\beta}.$$

In this estimator, the weight attached to the direct estimator $\hat{\tau}_d^{DIR}$ is given by
$\hat{\gamma}_d = \hat{\sigma}_u^2/(\hat{\sigma}_u^2+\psi_d)$, which
lies between 0 and 1 and decreases with the error variance, $\psi_d$. In locations $d$ with small sample size,
and consequently where the direct estimator has a large sampling variance $\psi_d$, or when the auxiliary
information in the model yields good explanatory power, a higher weight is assigned to the corresponding
synthetic estimator.
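The shrinkage mechanics can be made concrete with a small numeric sketch (Python, for illustration only; all values are hypothetical and this is not the fhsae implementation):

```python
# Illustrative sketch of the FH estimator as a shrinkage-weighted average.
# All numbers are hypothetical; this is not the fhsae implementation.
import numpy as np

def fh_estimate(tau_dir, psi, xb, sigma2_u):
    """FH estimate gamma_d * tau_dir_d + (1 - gamma_d) * x_d * beta_hat,
    with shrinkage weight gamma_d = sigma2_u / (sigma2_u + psi_d)."""
    gamma = sigma2_u / (sigma2_u + psi)
    return gamma * tau_dir + (1 - gamma) * xb, gamma

tau_dir  = np.array([0.40, 0.25])      # direct poverty rates for two areas
psi      = np.array([0.0100, 0.0004])  # sampling variances: noisy vs precise
xb       = np.array([0.30, 0.30])      # synthetic estimates x_d * beta_hat
sigma2_u = 0.001                       # estimated area-effect variance

fh, gamma = fh_estimate(tau_dir, psi, xb, sigma2_u)
# The noisier area (psi = 0.01) gets a small gamma, so its FH estimate is
# pulled strongly toward the synthetic value 0.30; the precise area keeps
# most of the weight on its direct estimate.
```

The design choice mirrors the text: the weight on the direct estimator falls as $\psi_d$ grows relative to $\hat{\sigma}_u^2$.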
The script illustrating model selection and small area estimation based on FH is provided below. Al-
though there are options such as stepwise or lasso regression, it is preferable to use methods that consider
the random eﬀects (Lahiri and Suntornchost 2015). The method for variable selection used in the script
considers the random eﬀects and is similar in spirit to a stepwise regression. Initially, the random ef-
fect model is ﬁt including all the covariates. Due to lower computational requirements, the process
employs the feasible generalized least squares (FGLS) method from Chandra, Sud, and Gupta (2013).
Non-signiﬁcant covariates are removed from the model sequentially, starting with those with the largest
p-values.3 Then, from the remaining covariates, those with a variance inﬂation factor (VIF) above ﬁve
are similarly removed. Finally, it is also essential to check that the model’s assumptions hold. In this
case, the model residuals are relatively well aligned to the theoretical quantiles of a normal distribution,
although some outliers are present (Figure 3.1).
Figure 3.1: Fay-Herriot model residuals (left) and predicted area eﬀects (right)
Source: Based on a 1% SRS sample by municipality of Mexican Intracensal survey of 2015. The ﬁgure on the left presents
a normal q-q plot where the empirical quantiles of the standardized residuals from the FH model are plotted against the
theoretical quantiles of the normal distribution. The ﬁgure on the right plots the empirical quantiles of the predicted area
eﬀects against the theoretical quantiles of the normal distribution.
clear all
set more off
/*==============================================================================
Do-file prepared for SAE Guidelines
- Real world data application
- authors Paul Corral
*==============================================================================*/
3 The removal of non-signiﬁcant covariates is done using FGLS to ensure that the method’s covariance matrix is con-
sidered.
global main "C:\Users\\`c(username)'\OneDrive\SAE Guidelines 2021\"
global section "$main\2_area_level\"
global mdata "$section\1_data\"
global survey "$mdata\Survey_1p_mun.dta"
global povline = 715
//Global with eligible variables
global thevar mun_hhsize mun_age_hh mun_male_hh mun_piped_water ///
mun_no_piped_water mun_no_sewage mun_sewage_pub mun_sewage_priv ///
mun_electricity mun_telephone mun_cellphone mun_internet ///
mun_computer mun_washmachine mun_fridge mun_television mun_share_under15 ///
mun_share_elderly mun_share_adult mun_max_tertiary mun_max_secondary ///
mun_share_female
version 15
set matsize 8000
set seed 648743
*===============================================================================
// End of preamble
*===============================================================================
use "$mdata\direct.dta", clear
//Fit full model
fhsae dir_fgt0 $thevar, revar(dir_fgt0_var) method(fh)
local hhvars $thevar
//Removal of non-significant variables
forval z = 0.8(-0.05)0.0001 {
    qui: fhsae dir_fgt0 `hhvars', revar(dir_fgt0_var) method(fh)
    mata: bb = st_matrix("e(b)")
    mata: se = sqrt(diagonal(st_matrix("e(V)")))
    mata: zvals = bb':/se
    mata: st_matrix("min", min(abs(zvals)))
    local zv = (-min[1,1])
    if (2*normal(`zv')<`z') exit
    foreach x of varlist `hhvars' {
        local hhvars1
        qui: fhsae dir_fgt0 `hhvars', revar(dir_fgt0_var) method(fh)
        qui: test `x'
        if (r(p)>`z') {
            local hhvars1
            foreach yy of local hhvars {
                if ("`yy'"=="`x'") dis ""
                else local hhvars1 `hhvars1' `yy'
            }
        }
        else local hhvars1 `hhvars'
        local hhvars `hhvars1'
    }
}
//Global with non-significant variables removed
global postsign `hhvars'
//Final model without non-significant variables
fhsae dir_fgt0 $postsign, revar(dir_fgt0_var) method(fh)
//Check VIF
reg dir_fgt0 $postsign, r
gen touse = e(sample)
gen weight = 1
mata: ds = _f_stepvif("$postsign","weight",5,"touse")
global postvif `vifvar'
local hhvars $postvif
//One final removal of non-significant covariates
forval z = 0.8(-0.05)0.0001 {
    qui: fhsae dir_fgt0 `hhvars', revar(dir_fgt0_var) method(fh)
    mata: bb = st_matrix("e(b)")
    mata: se = sqrt(diagonal(st_matrix("e(V)")))
    mata: zvals = bb':/se
    mata: st_matrix("min", min(abs(zvals)))
    local zv = (-min[1,1])
    if (2*normal(`zv')<`z') exit
    foreach x of varlist `hhvars' {
        local hhvars1
        qui: fhsae dir_fgt0 `hhvars', revar(dir_fgt0_var) method(fh)
        qui: test `x'
        if (r(p)>`z') {
            local hhvars1
            foreach yy of local hhvars {
                if ("`yy'"=="`x'") dis ""
                else local hhvars1 `hhvars1' `yy'
            }
        }
        else local hhvars1 `hhvars'
        local hhvars `hhvars1'
    }
}
global last `hhvars'
//Obtain SAE-FH-estimates
fhsae dir_fgt0 $last, revar(dir_fgt0_var) method(reml) fh(fh_fgt0) ///
fhse(fh_fgt0_se) fhcv(fh_fgt0_cv) gamma(fh_fgt0_gamma) out
//Check normal errors
predict xb
gen u_d = dir_fgt0 - xb
histogram u_d
gen e_d = dir_fgt0 - fh_fgt0
histogram e_d
save "$mdata\direct_and_fh.dta", replace
If no transformation is applied in the model, the FH estimates of the poverty rates may be negative for
some areas. For those areas, the recommendation is not to use those negative FH estimates.4
As can be seen from Figure 3.2, FH estimates are aligned to direct estimates. This ﬁgure also illustrates
the larger adjustment made to direct estimates in locations where the sampling variance is larger. The
resulting FH estimates also have a lower estimated root mean squared error (RMSE) than that of the
direct estimates, suggesting an improved eﬃciency of the model-based estimates (Figure 3.2, right).
4 This decision must be made with care and usually in consultation with colleagues from the relevant statistical agency.
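One way to see why these gains arise: the leading term of the FH MSE under the model is $g_1 = \gamma_d\psi_d$, which can never exceed the direct estimator's variance $\psi_d$. A back-of-envelope sketch (Python; the values of $\sigma_u^2$ and $\psi_d$ below are hypothetical):

```python
# Leading term of the FH MSE: g1 = gamma_d * psi_d <= psi_d, so the FH
# estimator's approximate MSE is at most the direct estimator's variance.
# Hypothetical values, for illustration only.
sigma2_u = 0.002
results = []
for psi in (0.0005, 0.005, 0.05):          # precise to very noisy areas
    gamma = sigma2_u / (sigma2_u + psi)    # shrinkage weight
    results.append((psi, gamma, gamma * psi))
# g1 = sigma2_u * psi / (sigma2_u + psi) is bounded above by sigma2_u,
# so the relative gain g1/psi grows with psi: the noisiest areas (the
# smallest samples) benefit the most, matching Figure 3.2.
```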
Figure 3.2: FH poverty estimates compared to direct estimates
Source: Based on a 1% SRS sample by municipality of Mexican Intracensal survey of 2015. The ﬁgure on the left presents
a scatter plot of direct estimates on the X-axis and FH estimates on the Y-axis. The FH approach will often yield estimates
that are highly aligned to the direct estimates.
The ﬁgure on the right illustrates the gains in terms of root mean squared error for the FH estimates (Y-axis) versus
the standard error of the direct estimates (X-axis). Gains are often larger for the areas with smaller sample sizes (larger
sampling errors of direct estimates), and though most if not all areas typically see some gains in precision.
3.3 Pros and Cons of Fay-Herriot Models
The information in this section borrows heavily from Molina (2019) while adding more information on
relevant aspects.
Model requirements:
1. Direct estimates of the indicators of interest and their sampling variances for the areas considered,
$\hat{\tau}_d^{DIR}$ and $\hat{\psi}_d$ respectively (from the survey).
2. Aggregate data at area level of all necessary covariates for the model for every area considered, xd ,
d = 1, . . . , D.
Pros:
1. Tend to improve the direct estimates in terms of eﬃciency (MSE).
2. The regression model incorporates unexplained heterogeneity between areas through the area ef-
fects.
3. The FH estimator is a weighted average of the direct estimator and the regression-synthetic es-
timator (i.e., the one obtained from a linear regression). For areas where the sample size is small or where
the direct estimator has considerable sampling variance, the synthetic estimator is given a higher
weight. Accordingly, the FH estimate will approximate the direct estimate in areas with a sufficiently
large sample.
4. Since the FH estimates make use of the direct estimates, which incorporate the sampling weights,
and since $\gamma_d$ tends to one as the area sample size increases, the FH estimator is consistent under
the sampling design. Hence, provided that the sampling weights are accurate, FH estimates will be
less affected by informative sampling (i.e., when the selection of the units for the sample depends
on the outcome values; Rao and Molina 2015).
5. Since area-level aggregates are used, the ﬁnal estimates are less aﬀected by outliers.
6. Can provide estimates for non-sampled areas based on the regression-synthetic component.
7. The Prasad-Rao MSE estimates (Prasad and Rao 1990) are approximately unbiased. This holds
as long as a suﬃcient number of areas are present and model assumptions are met, with normality
for random eﬀects and model errors.
8. It is easy and computationally fast to implement; considerably faster than unit-level models, because
the number of areas is typically much smaller than the sample size used in unit-level models, and
Monte Carlo or bootstrap methods are not necessary.
9. Does not require comparison of auxiliary variables between the diﬀerent data sources to consider
their use in the model.
10. Because aggregate data are used, conﬁdentiality restrictions do not hinder the data acquisition
process.
Cons:
1. Estimates are based on a model; thus, the model needs to be checked. For non-linear parameters
(like poverty), the linear assumption of the model can be problematic.
2. The sampling variances of the direct estimates are assumed to be known, and the error of estimating
these is not usually incorporated in the ﬁnal estimates’ estimated MSE.
3. The Prasad-Rao MSE estimator (Prasad and Rao 1990) is approximately unbiased under the model
with normality but is not design unbiased for a given area.
4. The model is fit only on sampled areas, which can be a very small fraction of the total number of
areas, using only one observation per area (the direct estimator). As a consequence, model parameter
estimators will be much less efficient than those obtained under unit-level models, and hence gains
in precision are expected to be lower than those of estimators based on unit-level models.
5. Every indicator requires its own model.
6. Once estimates are obtained, these cannot be further disaggregated unless a new model at that
level is constructed.
7. May require benchmarking adjustment to match direct estimates at higher aggregation levels.
Although sub-area-level models by Torabi and Rao (2014) could be used to obtain estimates that
naturally match at two diﬀerent aggregation levels (diﬀerent from the national level), user-friendly
software is still not available.
3.3.1 Alternative Software Packages for Area-Level Models and Extensions
In this section, the focus has been on Stata’s fhsae package. However, there is a wide variety of packages
available in R for small area estimation based on the basic FH area-level model and on some extensions
of this model proposed in the literature. Below is a list of some of these extensions of the FH model and
the corresponding software implementation:5
5 The discussion in this section borrows from Molina et al. (2021), as well as from Harmening et al. (2021).
1. The extension of the FH model proposed by Rao and Yu (1994) that incorporates temporal cor-
relation is implemented in the saery package in R by Leﬂer, Gonzalez, and Martin (2014). R’s
sae package by Molina and Marhuenda (2015) includes a FH model with temporal and spatial
correlation. Spatial correlation is also included in R’s emdi package (Kreutzmann et al. 2019).
(a) Robust alternatives to the basic FH model, as well as its extensions, including spatial and
temporal correlation, are detailed in Warnholz (2016) and implemented in R’s saeRobust
package (Warnholz 2018). The robust spatial correlation model is also available in the emdi
package in R (Kreutzmann et al. 2019).
2. In applications with several dependent target indicators, multivariate versions of the FH model may
help provide even more eﬃcient estimators. A multivariate version of the FH model was originally
proposed by Fay (1987), and has been used to improve the estimates of median income of four-
person families by using direct estimators of median income for three and ﬁve-person families by
Datta, Fay, and Ghosh (1991) and Datta et al. (1996). Multivariate FH models can be ﬁtted with
the R package msae (Permatasari and Ubaidillah 2020).
3. The basic FH model assumes that covariates are population aggregates measured without error.
For cases where auxiliary variables are measured with error, Ybarra and Lohr (2008) oﬀer an
extension to the basic FH model to account for the measurement error. Measurement error models
can be applied with the R package saeME (Mubarak and Ubaidillah 2020). This extension is also
available in the R package emdi (Kreutzmann et al. 2019).
4. Hierarchical Bayesian methods are implemented in the hbsae R package by Boonstra (2015).
The BayesSAE R package (Shi 2018) also includes Bayesian extensions.
5. Halbmeier et al. (2019) introduced the fayherriot Stata command, which allows for adjusting
non-positive random eﬀect variance estimates, and allows for transformation of the variable of
interest to deal with violations of the model’s assumptions. Transformation is also available in the
emdi package in R (Kreutzmann et al. 2019).
3.4 Technical Annex
The FH model is deﬁned in two stages.6 In the ﬁrst stage, the true values of the indicators τd for all
the areas d = 1, . . . , D, are linked by establishing a common linear regression model, called the linking
model, where the true indicator is linearly related to a vector of area-level covariates $x_d$:

$$\tau_d = x_d\beta + u_d. \qquad (3.1)$$
The random errors $u_d$ in this regression model are called area effects, because they represent the
unexplained between-area heterogeneity. They are assumed to satisfy the usual regression assumptions, such
as having zero mean and constant variance $\sigma_u^2$, which is typically unknown. Note that the linear regression
model (3.1) cannot be ﬁtted to the data, because true values of the indicators τd are not observed. Direct
estimators $\hat{\tau}_d^{DIR}$ of the indicators are available from the survey microdata but are, obviously, subject
to survey error, since they are calculated only with the area-speciﬁc survey data, which is of small size.
Hence, in the second stage, this survey error is modeled by assuming that the direct estimators are
centered around the true values of the indicators, as follows:
$$\hat{\tau}_d^{DIR} = \tau_d + e_d, \qquad (3.2)$$
with heteroscedastic errors, where the error variances are $\mathrm{var}(e_d \mid \tau_d) = \psi_d$, $d = 1, \ldots, D$; this is called the
sampling model. Note that the $D$ error variances $\psi_d$, $d = 1, \ldots, D$, need to be assumed known even
if they are not; otherwise, there would be more unknown parameters than available observations
$\hat{\tau}_d^{DIR}$, $d = 1, \ldots, D$. These error variances are customarily estimated based on the survey microdata
from the area of interest and perhaps smoothed afterward since these estimates are highly unstable due
to the small area sample sizes.
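To make the two stages concrete, the following sketch simulates data from the linking model (3.1) and the sampling model (3.2); all parameter values are hypothetical and chosen only for demonstration:

```python
# Hedged illustration: simulate the two-stage FH model, Eqs. (3.1)-(3.2).
# Parameter values are hypothetical and chosen only for demonstration.
import numpy as np

rng = np.random.default_rng(12345)
D = 200                                                  # number of areas
x = np.column_stack([np.ones(D), rng.uniform(0, 1, D)])  # area covariates
beta = np.array([0.10, 0.50])                            # regression coefficients
sigma2_u = 0.004                                         # area-effect variance
psi = rng.uniform(0.001, 0.01, D)                        # "known" sampling variances

u = rng.normal(0.0, np.sqrt(sigma2_u), D)   # area effects (linking model)
tau = x @ beta + u                          # true indicators, Eq. (3.1)
e = rng.normal(0.0, np.sqrt(psi))           # heteroscedastic survey errors
tau_dir = tau + e                           # direct estimates, Eq. (3.2)
```

Only `tau_dir`, `psi`, and `x` would be observed in practice; `tau`, `u`, and `beta` are the unknowns the FH model targets.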
The usual small area estimator obtained from this model is based on the best linear unbiased predictor
ˆd = α y of the data y = (ˆ
(BLUP), which is deﬁned as the linear function τ DIR
τ1 ,...,τ
ˆDDIR
) , where
τd − τd ) = 0, and is optimal (or best ) in
α = (α1 , . . . , αD ) , which is unbiased under the model, that is, E (ˆ
the sense of minimizing the mean squared error (MSE). The resulting BLUP τ ˆd of τd can be expressed as
a weighted average of the direct estimator τ
ˆDIR and the regression-synthetic estimator, x β ˆ, where the
d d
weight γd = σu
2
/(σu2
+ ψd ) attached to the direct estimator grows as the error variances ψd decrease (or
the area sample size increases), or the unexplained variation σu
2
grows. This means that, in areas with a
large sample size or when the auxiliary information is not very powerful, the FH estimates come close to
the direct estimators; in the opposite case, they come close to the regression-synthetic estimators. This
leads to the BLUP automatically borrowing strength only when it is actually needed and approximating
to the conventional survey direct estimators when the area sample size is large or the model is not useful.
The BLUP does not require normality assumptions on the model errors or area eﬀects, and it is unbiased
under the model. The empirical BLUP (EBLUP) is obtained replacing the unknown area eﬀects’ variance
2
σu by an estimator σ
ˆu2
with good properties (i.e. consistent). The EBLUP preserves unbiasedness under
symmetric distributions and under certain regularity assumptions on the considered estimator σ
ˆu2
of σu
2
(and a translation invariant function of the sample data y), which are satisﬁed by the usual estimators
obtained by ML, REML, or Fay-Herriot’s method of moments. However, their exact mean squared error
(MSE) does not have a closed form. For this reason, analytical approximations with good properties for
large number of areas D have been obtained under normality. Similarly, an analytical estimator of the
6 This section draws heavily from Molina, Corral and Nguyen (2021) as well as Rao and Molina (2015)
true MSE that is nearly unbiased for large D has been obtained under normality. This MSE estimator
depends on the actual estimator of $\sigma_u^2$ that is applied; for the exact formulas, see Rao and Molina (2015),
Section 6.2.1. For areas with moderate or large sample sizes and indicators (and corresponding direct
estimators) deﬁned in terms of sums over the area units, the normality assumption is minimally ensured
by the Central Limit Theorem. However, for areas with small sample sizes or when target indicators are
not deﬁned in terms of sums over the area units, the normality assumption should be validated if these
customary MSE estimates are considered; otherwise, they should be interpreted with care.
A problem that may occur when applying FH models is that the estimated model variance $\hat{\sigma}_u^2$ may be
non-positive, and then it is customary to set it to zero. In these instances, the weight $\gamma_d$ is equal to 0 for
all areas, implying a weight of zero to direct estimates regardless of the area’s sample size. Hence, the
EBLUP reduces to the synthetic estimator, which may be considerably diﬀerent to the direct estimator
(Rao and Molina 2015). Li and Lahiri (2010) and Yoshimori and Lahiri (2014) propose adjustments to the
maximum likelihood approach to obtain strictly positive estimates. These adjustments are incorporated
into the emdi package in R (Kreutzmann et al. 2019).
A clear drawback of the FH model is that the sampling variances of direct estimators, ψd , d = 1, . . . , D,
are assumed to be known even if they are actually estimated,7 and usual MSE estimators do not account
for the uncertainty due to estimation of these variances. Note also that, if one wishes to estimate several
indicators, a diﬀerent model needs to be constructed for each indicator, even if they are all based on the
same welfare variable. This means that one needs to ﬁnd area-level covariates that are linearly related
to each indicator. It seems more reasonable to ﬁt a model for the considered welfare variable at the
unit-level and then estimate all the target monetary indicators based on that same model, as is done in
the procedures based on unit-level models described in Chapter 4.
Moreover, the area-level auxiliary information commonly takes the form of area means or proportions.
For indicators with a more complex shape than the simple mean welfare, it might be difficult to find
area-level covariates that are linearly related to the complex function of welfare for the area; consequently,
the linearity assumption in the FH model might fail, and ﬁnding an adequate transformation for linearity
to hold might not be easy in some applications. In any case, even if the gains obtained using the FH
model depend on the explanatory power of the available covariates, Molina and Morales (2009) observed
very mild gains of the EBLUPs of poverty rates and gaps based on the FH model in comparison to those
obtained by Molina and Rao (2010) using the same data sources, but relying on unit-level models (and
using the microdata). When considering indicators, such as the FGT poverty indicators, an additional
drawback of the FH model is that due to the small area sample sizes, direct estimators that are equal to
zero or one are possible (when no one or everyone is found to be below the poverty line), even if the area
sample size is not zero. In those areas, the sampling variances of the direct estimators are also zero, and
unless a different approach is applied to obtain strictly positive variances, the FH model is not applicable
in those areas. A common approach to correct for this is to remove those areas in the modeling phase
and then use a synthetic estimator for those areas instead.
3.4.1 Aggregation to Higher Geographic Areas
Even if the area-level model may be applied to produce poverty estimates at a given level of aggregation,
one may also wish to aggregate these estimates to higher aggregation levels (let us call these larger
areas “regions”). While the estimate of the poverty rate for a region is simply the population weighted
average of the area-level poverty rates contained within the region, obtaining a correct MSE estimate for
7 See Chapter 2 for how these can be estimated
the estimator of that region requires aggregating the estimates of the MSEs and mean crossed product
errors (MCPEs) at the area level, $d$, up to the regional level, $r$. For a region $r$ containing $D_r$ areas,
the poverty rate is $\tau_r = \frac{1}{N_r}\sum_{d=1}^{D_r} N_d\,\tau_d$, where $\tau_d$ is the poverty rate in area $d$ of region $r$, $N_d$
is the population of area $d$ in region $r$, and $N_r = \sum_{d=1}^{D_r} N_d$ is the population of the region. Then a
model-based estimate of the poverty rate for region $r$ is $\hat{\tau}_r = \frac{1}{N_r}\sum_{d=1}^{D_r} N_d\,\hat{\tau}_d$, and the MSE of $\hat{\tau}_r$ is given
by:

$$MSE(\hat{\tau}_r) = \frac{1}{N_r^2}\sum_{d=1}^{D_r}\sum_{l=1}^{D_r} N_d N_l\, MCPE(\hat{\tau}_d, \hat{\tau}_l),$$

where $MCPE(\hat{\tau}_d, \hat{\tau}_l) = E\left[(\hat{\tau}_d - \tau_d)(\hat{\tau}_l - \tau_l)\right]$ (Rao and Molina 2015, p. 144). The fhsae package in Stata
has the option to apply such an approach via its aggarea() option. The method is also available in the
hbsae R package by Boonstra (2015).
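The aggregation formula can be sketched numerically (Python; the estimates, populations, and MCPE matrix below are hypothetical placeholders for the quantities an area-level fit would produce):

```python
# Illustrative sketch of regional aggregation: tau_r is the population-weighted
# average of area estimates, and MSE(tau_r) sums N_d * N_l * MCPE(d, l) over
# all pairs of areas, divided by N_r^2. All inputs are hypothetical.
import numpy as np

def aggregate_region(tau_hat, N, mcpe):
    """tau_hat: area estimates; N: area populations; mcpe: D_r x D_r matrix
    with area MSEs on the diagonal and MCPEs off the diagonal."""
    Nr = N.sum()
    tau_r = (N * tau_hat).sum() / Nr
    mse_r = N @ mcpe @ N / Nr**2
    return tau_r, mse_r

tau_hat = np.array([0.20, 0.40, 0.30])        # FH estimates for 3 areas
N = np.array([1000.0, 500.0, 1500.0])         # area populations
mcpe = np.array([[0.0010, 0.0001, 0.0000],    # symmetric MCPE matrix
                 [0.0001, 0.0020, 0.0001],
                 [0.0000, 0.0001, 0.0015]])

tau_r, mse_r = aggregate_region(tau_hat, N, mcpe)
```

Ignoring the off-diagonal MCPE terms would understate (or overstate) the regional MSE, which is why the cross products are needed.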
References
Bell, W (1997). “Models for County and State Poverty Estimates”. In: Preprint, Statistical Research
Division, US Census Bureau.
Boonstra, Harm Jan (2015). “Package ’hbsae’”. In: R Package Version 1.
Casas-Cordero Valencia, Carolina, Jenny Encina, and Partha Lahiri (2016). “Poverty Mapping for the
Chilean Comunas”. In: Analysis of Poverty Data by Small Area Estimation, pp. 379–404.
Chandra, Hukum, UC Sud, and VK Gupta (2013). “Small Area Estimation under Area Level Model
Using R Software”. In: New Delhi: Indian Agricultural Statistics Research Institute.
Corral, Paul and Alexandru Cojocaru (2019). “Moldova Poverty Map: An Application of Small Area
Estimation.”
Corral, Paul, William Seitz, João Pedro Azevedo, and Minh Cong Nguyen (2018). fhsae: Stata module
to ﬁt an area level Fay-Herriot model. Statistical Software Components. url: https://econpapers.
repec.org/software/bocbocode/s458495.htm.
Datta, Gauri S, RE Fay, and M Ghosh (1991). “Hierarchical and Empirical Multivariate Bayes Analysis
in Small Area Estimation”. In: Proceedings of bureau of the census 1991 annual research conference,
US Department of Commerce, Bureau of the Census. Vol. 7, pp. 63–79.
Datta, GS, Malay Ghosh, Narinder Nangia, and K Natarajan (1996). “Estimation of Median Income
of Four-person Families: A Bayesian Approach”. In: Bayesian analysis in statistics and econometrics:
essays in honor of Arnold Zellner. Ed. by Donald A Berry Kathryn Chaloner John and Geweke Arnold
Zellner. John Wiley & Sons, pp. 129–140. isbn: 0-471-11856-7. url: https://dialnet.unirioja.es/
servlet/articulo?codigo=3418625.
Esteban, María Dolores, Domingo Morales, A Pérez, and Laureano Santamaría (2012a). “Small Area
Estimation of Poverty Proportions under Area-level Time Models”. In: Computational Statistics &
Data Analysis 56.10, pp. 2840–2855. url: https://www.sciencedirect.com/science/article/
abs/pii/S0167947311003720.
Esteban, MD, D Morales, A Pérez, and L Santamaría (2012b). “Two Area-level Time Models for Es-
timating Small Area Poverty Indicators”. In: Journal of the Indian Society of Agricultural Statistics
66.1, pp. 75–89.
Fay, Robert E and Roger A Herriot (1979). “Estimates of Income for Small Places: An Application of
James-Stein Procedures to Census Data”. In: Journal of the American Statistical Association 74.366a,
pp. 269–277.
Fay, Robert E. (1987). “Application of Multivariate Regression to Small Domain Estimation”. In: Small
area statistics: An international symposium. Ed. by Platek, Richard and Rao, Jon N. K. and Singh,
M. P. and Särndal, Carl-Erik and others. Vol. 213. John Wiley & Sons Incorporated, pp. 91–102.
Giusti, Caterina, Lucio Masserini, and Monica Pratesi (2017). “Local Comparisons of Small Area Esti-
mates of Poverty: An Application within the Tuscany Region in Italy”. In: Social Indicators Research
131.1, pp. 235–254.
Halbmeier, Christoph, Ann-Kristin Kreutzmann, Timo Schmid, and Carsten Schröder (2019). “The fay-
herriot Command for Estimating Small-area Indicators”. In: The Stata Journal 19.3, pp. 626–644.
Harmening, Sylvia, Ann-Kristin Kreutzmann, Sören Pannier, Nicola Salvati, and Timo Schmid (2021).
A Framework for Producing Small Area Estimates Based on Area-Level Models in R. R Package emdi.
url: https://mirror.las.iastate.edu/CRAN/web/packages/emdi/vignettes/vignette_fh.pdf.
Kreutzmann, Ann-Kristin, Sören Pannier, Natalia Rojas-Perilla, Timo Schmid, Matthias Templ, and
Nikos Tzavidis (2019). “The R Package emdi for Estimating and Mapping Regionally Disaggregated
Indicators”. In: Journal of Statistical Software 91.7, pp. 1–33. doi: 10.18637/jss.v091.i07. url:
https://www.jstatsoft.org/index.php/jss/article/view/v091i07.
Lahiri, Partha and Jiraphan Suntornchost (2015). “Variable Selection for Linear Mixed Models with
Applications in Small Area Estimation”. In: Sankhya B 77.2, pp. 312–320.
Leﬂer, M, D Gonzalez, and A Martin (2014). “saery: Small Area Estimation for Rao and Yu Model”. In:
R package version 1.
Li, Huilin and Partha Lahiri (2010). “An Adjusted Maximum Likelihood Method for Solving Small Area
Estimation Problems”. In: Journal of multivariate analysis 101.4, pp. 882–892.
Marhuenda, Yolanda, Isabel Molina, and Domingo Morales (2013). “Small area estimation with spatio-
temporal Fay–Herriot models”. In: Computational Statistics & Data Analysis 58, pp. 308–325.
Molina, Isabel (2019). Desagregación De Datos En Encuestas De Hogares: Metodologías De Estimación
En áreas Pequeñas. CEPAL. url: https://repositorio.cepal.org/handle/11362/44214.
Molina, Isabel, Paul Corral, and Minh Cong Nguyen (2021). “Model-based Methods for Poverty Mapping:
A Review”.
Molina, Isabel and Yolanda Marhuenda (2015). “Sae: An R Package for Small Area Estimation”. In: The
R Journal 7.1, pp. 81–98.
Molina, Isabel and Domingo Morales (2009). “Small Area Estimation of Poverty Indicators”. In: Es-
tadistica e Investigacion Operativa (SEIO) 25.3, pp. 218–225. url: https://emis.dsd.sztaki.hu/
journals/BEIO/files/BEIOVol25Num3_IMolina+DMorales.pdf.
Molina, Isabel and JNK Rao (2010). “Small Area Estimation of Poverty Indicators”. In: Canadian Journal
of Statistics 38.3, pp. 369–385.
Mubarak, M and A Ubaidillah (2020). “saeME: Small Area Estimation with Measurement Error”. In: R
package version 1.2.4. url: https://CRAN.R-project.org/package=saeME.
Permatasari, N and A Ubaidillah (2020). “msae: Multivariate Fay–Herriot Models for Small Area Esti-
mation”. In: R package version 0.1 1.
Prasad, N. G. N. and J. N. K. Rao (1990). “The Estimation of the Mean Squared Error of Small-area
Estimators”. In: Journal of the American Statistical Association 85.409, pp. 163–171. issn: 01621459.
url: http://www.jstor.org/stable/2289539.
Rao, JNK and Isabel Molina (2015). Small Area Estimation. 2nd. John Wiley & Sons.
Rao, Jon NK and Mingyu Yu (1994). “Small-area Estimation by Combining Time-series and Cross-
sectional Data”. In: Canadian Journal of Statistics 22.4, pp. 511–528.
Seitz, William Hutchins (2019). “Where They Live: District-level Measures of Poverty, Average Con-
sumption, and the Middle Class in Central Asia”. In: World Bank Policy Research Working Paper
8940.
Shi, C (2018). “BayesSAE: Bayesian Analysis of Small Area Estimation”. In: R package version 1.0-2,
URL https://CRAN.R-project.org/package=BayesSAE.
Torabi, Mahmoud and JNK Rao (2014). “On Small Area Estimation under a Sub-area Level Model”. In:
Journal of Multivariate Analysis 127, pp. 36–55.
Warnholz, S (2018). saeRobust: Robust Small Area Estimation. R package version 0.2.0. url: https://CRAN.R-project.org/package=saeRobust.
Warnholz, Sebastian (2016). “Small Area Estimation Using Robust Extensions to Area Level Models:
Theory, Implementation and Simulation Studies”. PhD thesis. Freie Universität Berlin. url: http:
//dx.doi.org/10.17169/refubium-13904.
Ybarra, Lynn MR and Sharon L Lohr (2008). “Small Area Estimation When Auxiliary Information Is
Measured with Error”. In: Biometrika 95.4, pp. 919–931.
Yoshimori, Masayo and Partha Lahiri (2014). “A New Adjusted Maximum Likelihood Method for the
Fay–Herriot Small Area Model”. In: Journal of Multivariate Analysis 124, pp. 281–294.
Chapter 4
Unit-level Models for Small Area Estimation
This chapter focuses on unit-level models for small area estimation of poverty, which typically rely on
estimating the distribution of the household’s welfare, given a set of auxiliary variables or correlates.1
The methods presented in this chapter use model-based techniques that “borrow strength” from other
areas by using a larger data set with auxiliary information – usually the population census. The resulting
indirect estimators sacrifice design unbiasedness but often yield more precise estimates in terms of mean squared error (MSE). In software implementations of the methods described in this chapter, model parameters are
used to simulate multiple welfare vectors from the ﬁtted distribution for all households in the census. This
is done because the census often lacks a welfare measure for poverty measurement (Elbers et al. 2007).
From the simulated census vectors, it is possible to obtain poverty rates or any other welfare indicator,
for every area, including the non-sampled ones. This chapter serves as a guide for the production of
unit-level small area estimates of poverty indicators, focusing on two of the most popular approaches –
the one proposed by Elbers, Lanjouw, and Lanjouw (2003, ELL), and the Empirical Best (EB) approach
by Molina and Rao (2010, MR).2 It presents evidence on certain aspects of SAE, makes recommendations on what may be the preferred approach in some scenarios, and offers advice on aspects where more research is perhaps needed.
The chapter begins by presenting some of the prerequisites that should be satisﬁed before commencing
a SAE exercise. It provides insights that may help practitioners choose a unit-level SAE method that
is adequate for their needs. It then presents considerations that practitioners may keep in mind when
conducting SAE, such as data transformation and why ELL has fallen out of favor with many practi-
tioners. Additionally, it presents the usual steps towards selecting a model and the production of small
area estimates. The sections rely on real-world and simulated data and offer code that readers may use to replicate some of the analysis; in other instances, they pull evidence from recent research.3
1 This section draws from the background papers for these guidelines. See Corral, Molina, and Nguyen (2021) and
Corral et al. (2021).
2 A factor that likely contributed to the expansion of these methods is the availability of software that can easily be used to apply them: PovMap (Zhao 2006) for ELL and R’s sae package (Molina and Marhuenda 2015) for EB. While other methods exist, for the purposes of this section attention is given to the methods noted.
3 It is highly recommended that novice practitioners start by reading section 4.4.1.
CHAPTER 4. UNIT-LEVEL MODELS FOR SMALL AREA ESTIMATION 33
4.1 Preparing for Unit-Level Small Area Estimation
The basic ingredients for model-based unit-level small area estimation for poverty are a contemporaneous
survey and census – ideally from the same year. Small area estimates rely on household-level data to
obtain the conditional distribution of welfare given the observed correlates. This information is then used
to generate welfare for all the census units. Thus, if characteristics in the two datasets diﬀer considerably,
or if the structure of the conditional distribution has changed between the two data sources, then the
estimates could be biased and of little use.
Obtaining a pool of candidate variables for the analysis requires a thorough review of the survey and
census data to ensure comparability. In the census and survey questionnaires, questions should be asked similarly and, ideally, worded in the same manner. Checks should then be conducted on the survey weights
to understand how these are constructed. For example, in some countries where there is heavy migratory
work, household expansion factors (usually the number of household members) may be adjusted by the
number of days the individual is present in the dwelling. Such adjustments should also be possible in
the census data; otherwise, the poverty estimates obtained from SAE methods will not be comparable
to those obtained directly from the survey. Even if questions are asked in a similar fashion and weights
are replicable, differences may still arise. Teams are strongly recommended to compare the mean and distribution of the potential pool of covariates across the survey and the census – including whether it is
possible to reproduce the direct estimates of poverty produced by the National Statistical Oﬃce. Ideally,
the mean and distribution of the covariates should be comparable, at least at the aggregation level at
which the survey is designed to be representative.
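As an illustration of this comparability check, the sketch below (written in Python rather than the Stata scripts used elsewhere in these guidelines) draws a toy “survey” from a synthetic “census” and compares the weighted survey mean of a candidate covariate with its census counterpart; the variable, the weights, and the 10 percent tolerance are illustrative assumptions, not part of any package:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy census: an indicator covariate (e.g., a dwelling characteristic).
census_x = rng.binomial(1, 0.4, size=100_000).astype(float)

# Toy survey: a simple random sample with constant expansion factors.
idx = rng.choice(census_x.size, size=5_000, replace=False)
survey_x = census_x[idx]
survey_w = np.full(survey_x.size, census_x.size / survey_x.size)

census_mean = census_x.mean()
survey_mean = np.average(survey_x, weights=survey_w)

# Flag covariates whose weighted survey mean strays far from the census mean.
rel_diff = abs(survey_mean - census_mean) / census_mean
comparable = rel_diff < 0.10  # illustrative tolerance only
```

In practice, the same comparison would be repeated for every candidate covariate, with the survey's actual expansion factors, and complemented by inspecting the full distributions rather than only the means.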
One of the first decisions that must be made is the level at which estimates will be produced. As noted below, in unit-level models this determines the level at which location effects are specified. For reasons that will become apparent in the following sections, the locations in the survey
must be matched to the locations in the census. Matching locations will also ensure that when using
information from external sources, for example, administrative data or satellite imagery-derived data,
these correspond to the correct location. Additionally, the use of hierarchical identiﬁers for the location
variable is recommended (Zhao 2006). Under Stata’s sae package, these identiﬁers must be numeric
and no longer than 16 digits,4 and should have an equal number of digits for every area. For example,
in the hierarchical identiﬁer SSSMMMMPPP, the ﬁrst 3 digits represent states (S), the next 4 digits
represent municipalities (M), and the ﬁnal 3 digits represent PSUs (P). The use of hierarchical identiﬁers
is necessary for two-fold nested error models. In the code syntax, the user needs to indicate the number
of digits to remove (from the right) to arrive at the larger area’s identiﬁer. For example, to go from PSU
to municipality level, it is necessary to remove 3 digits, yielding SSSMMMM.
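The digit-removal arithmetic behind hierarchical identifiers can be sketched in a few lines (Python for concreteness; the sae package performs this internally, and the SSSMMMMPPP values below are illustrative):

```python
def parent_id(hid: int, digits_to_remove: int) -> int:
    """Drop trailing digits from a hierarchical ID to reach the parent level."""
    return hid // 10 ** digits_to_remove

# SSSMMMMPPP: state 123, municipality 4567, PSU 890 (illustrative values).
psu_id = 1234567890

municipality_id = parent_id(psu_id, 3)  # SSSMMMM -> 1234567
state_id = parent_id(psu_id, 7)         # SSS     -> 123
```

Because the arithmetic simply truncates trailing digits, it only works if every area's identifier has the same number of digits, which is why equal-width numeric IDs are required.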
4.1.1 The Assumed Model
The model-based small area estimation methods described in this chapter depend on an assumed model. The nested error model used for small area estimation by ELL (2003) and MR (2010) was
originally proposed by Battese, Harter and Fuller (1988) to produce county-level corn and soybean crop
area estimates for the American state of Iowa. For the estimation of poverty and welfare, the ELL and
MR methods assume that the transformed welfare ych for each household h within each location c in the
population is linearly related to a 1 × K vector of characteristics (or correlates) xch for that household,
4 Stata’s double-precision only goes up to 16 digits.
Figure 4.1: Empirical bias of two diﬀerent models for unit-level small area estimation
Source: Simulation based on 1,000 populations generated as described in section 5.5.1.1. Each line corresponds to one of the 100 areas of the population. The x-axis represents the percentile at which the poverty line falls, and the y-axis is the empirical bias. The do-file to replicate these results can be found in 5.5.2.
according to the nested error model:
ych = xch β + ηc + ech , h = 1, . . . , Nc , c = 1, . . . , C, (4.1)
where ηc and ech are respectively location and household-speciﬁc idiosyncratic errors, assumed to be
independent from each other, following:
ηc ∼ iid N(0, ση²),    ech ∼ iid N(0, σe²),
where the variances ση² and σe² are unknown. Here, C is the number of locations into which the population is divided and Nc is the number of households in location c, for c = 1, . . . , C. Finally, β is the K × 1 vector of regression coefficients.5
One of the main assumptions is that the errors are normally distributed. The assumption does not necessarily imply that the transformed welfare (the dependent variable) is normally distributed, but rather that, conditional on the observed characteristics, the residuals are normally distributed. Under
the assumed model, variation in welfare across the population is determined by three components: the
variation in household characteristics, the variation in household-speciﬁc unobservables, and the vari-
ation in location-speciﬁc eﬀects. Within any given area, the welfare distribution is determined by the
variation in household speciﬁc characteristics and household speciﬁc errors.
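The data-generating process in equation (4.1) and its variance components can be illustrated with a small simulation. The sketch below uses Python rather than the Stata scripts referenced elsewhere in these guidelines; the number of areas and households, the coefficients, the error variances, and the poverty line are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, Nc = 100, 200                 # number of areas and households per area (toy sizes)
beta = np.array([1.0, 0.5])      # regression coefficients (illustrative)
sigma_eta, sigma_e = 0.15, 0.5   # sd of location and household errors

x = rng.normal(1.0, 0.3, size=(C, Nc))         # one household-level covariate
eta = rng.normal(0.0, sigma_eta, size=(C, 1))  # location effect, constant within area
e = rng.normal(0.0, sigma_e, size=(C, Nc))     # household-specific error
y = beta[0] + beta[1] * x + eta + e            # transformed welfare, as in (4.1)

z = np.log(2.0)                # poverty line on the transformed scale (illustrative)
fgt0 = (y < z).mean(axis=1)    # headcount poverty rate in each area
```

Note that the headcount rate is computed from the simulated welfare vector rather than modeled directly, mirroring the point made above: the model targets the welfare distribution, and any indicator is derived from it afterward.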
It is important to note that while the method’s goal may be to produce headcount poverty rates or other
welfare indicators, these are not explicitly modeled in the sense that the dependent variable is welfare (or
transformed welfare) and not the desired indicator (e.g. headcount poverty). The methodology relies on
approximating as closely as possible the welfare distribution in each area. A model that approximates
the welfare distribution poorly may yield a decent estimate for a given poverty line, yet when judged at a different poverty line it may fail completely. This can be seen in Figure 4.1, where the headcount poverty estimates from the model on the left work relatively well if the poverty threshold is set at the 70th percentile, but are severely biased at most other thresholds. On the
5 Additional information is provided in the technical annex 4.4.
other hand, the model on the right approximates the welfare distribution well in all areas and yields
poverty estimates with a small bias across all poverty thresholds and areas.
4.2 Conducting Your First SAE Application with a Unit-Level
Model
The focus of this section is on the process of producing small area estimates of monetary indicators using
a census and a survey. For the exercise conducted in the following sections it is assumed that the data
has already been prepared and all of the checks noted in section 4.1 have been conducted. The section
provides step-by-step processes for producing small area estimates of poverty using census and survey
data. It also draws from existing evidence to make recommendations based on current best practices for
conducting SAE applications.
4.2.1 Model Choice Considerations
Ideally, the particular model to be used should be aligned to the speciﬁc SAE goals. One of the main
considerations regarding the model to be used is the level at which poverty estimates will be reported.
This will determine the level at which the random location eﬀect should be speciﬁed (see section 4.4 for
details). Specifying the random eﬀects at a lower level of aggregation than the level at which estimation
is desired (e.g., for clusters nested within the areas of interest) will lead to noisier estimates (Marhuenda
et al. 2017), albeit with little effect on bias (Corral et al. 2021). The magnitude by which the MSE of the estimates based on a model with cluster effects instead of area effects will increase depends on the ratio between the variances of the random effects associated with the different locations. If the variance of the random effects of the clusters within areas, σac², is larger than that of the area’s random effects, σa², then the MSE may not increase by much. On the other hand, when the variance of the area’s random effects is larger, the MSE of the estimates based on a model with only cluster effects worsens considerably. Thus, the random location effect should be at the same level for which estimates will be reported (see Figure 4.2).
Additionally, under the latest edition of the Stata sae package, practitioners can apply two-fold nested
error models (Marhuenda et al. 2017), besides the traditional one-fold nested error model. The drawback
of the implemented two-fold nested error model in the Stata sae package is that it does not consider
survey weights, nor does it consider heteroskedasticity. The beneﬁt from applying a two-fold nested error
model is that the resulting estimates are optimal at two levels of aggregation, because the MSEs of the
estimates at both levels tend to be smaller under the assumed model than when using one-fold nested
error models with only cluster or area eﬀects (see Figure 4.2).
Two model ﬁtting approaches are available under Stata’s sae package: restricted maximum likelihood
(REML) and Generalized Least Squares (GLS), where the variance parameters are estimated using
Henderson’s method III (Henderson 1953, H3). Both approaches yield similar estimated regression
coeﬃcients when heteroskedasticity and survey weights are not considered (Corral, Molina, and Nguyen
(2021) – Figure A4). Given the diﬀerent available models implemented in Stata’s sae package, it is
important to understand which options are available under each ﬁtting method.
1. Models fit by REML incorporate neither survey weights nor heteroskedasticity. The package uses Stata’s xtmixed command in the background to fit the model.6
6 Interested readers can refer to Stata’s documentation to learn more about mixed models.
Figure 4.2: Empirical MSE comparison considering only cluster eﬀects and when aggregating the cluster
level results to the area level versus specifying the random eﬀects at the reporting level, area.
Source: Simulation based on populations generated as described in Corral et al. (2021) section 3. Notice how the empirical
MSE of CensusEB one-fold nested error models, where η is speciﬁed at the area level, are closely aligned to those under
the two-fold nested error model with cluster and area eﬀects. On the other hand, CensusEB estimators based on one-fold
nested error models, where η is speciﬁed at the cluster level, have a considerably larger empirical MSE.
(a) When choosing this fitting method, the one-fold nested error model considered is the same one used in Molina and Marhuenda’s (2015) sae package in R, and Stata’s sae package implements a similar REML fitting approach.7
sae model reml depvar indepvars [if] [in], area(varname)
Beyond the model fitting method, it is also possible to approximate the original EB estimates of Molina and Marhuenda (2015) by Monte Carlo simulation (replace model by sim), adding the option appendsvy. However, the default approach obtains CensusEB8 estimates instead of the original EB ones.
(b) Stata’s sae package also implements the two-fold nested error model with cluster eﬀects nested
within area eﬀects considered by Marhuenda et al. (2017). This model is appropriate when
point estimates at two diﬀerent levels of aggregation are desired. Section 4.2.4 presents the
hierarchical identiﬁers needed for specifying two-fold nested error models.9
sae model reml2 depvar indepvars [if] [in], area(#) subarea(varname)
Here the subarea option requires a sub-area identifier with an equal number of digits for every area. Under the area option, the user specifies the number of digits of the hierarchical ID to remove from the right to arrive at an identifier for the areas.
These identiﬁers are commonly referred to as hierarchical identiﬁers and are explained in
Nguyen et al. (2018) and Zhao (2006) in greater detail. Through Monte Carlo simulation
(replace model by sim) the approach yields CensusEB estimates based on the two-fold nested
error model.
2. Models fit with GLS, with variance parameters estimated under Henderson’s method III (1953), allow for the inclusion of heteroskedasticity and survey weights according to Van der Weide (2014).10 However, this method has only been implemented for one-fold nested error models.11
7 To see the help ﬁle in Stata type: help sae_ebp for the one-fold nested error model.
8 CensusEB estimates are similar to EB estimates, but do not include survey observations when calculating small area
estimates. See Molina (2019) or Corral, Molina, and Nguyen (2021) for more details.
9 To see help ﬁle in Stata type: help sae_ebp2 for the two-fold nested error model.
10 Methods that do not consider the survey sampling weights may in turn include covariates that capture the sam-
pling/response mechanism, trying to correctly model their relationship to the target variable.
11 To see help ﬁle in Stata type: help sae_mc_bs for the one-fold nested error model ﬁt with Henderson’s method III
heteroskedasticity and survey weights (see Van der Weide 2014).
sae model h3 depvar indepvars [if] [in] [aw], area(varname) [zvar(varnames)
yhat(varnames) yhat2(varnames) alfatest(new varname)]
The method will obtain CensusEB estimates for areas present in the survey and the census. When
weights and heteroskedasticity are not speciﬁed, results will be very close to those from sae model
reml (Corral, Molina, and Nguyen 2021).
The alfatest option allows users to obtain the dependent variable for the alpha model, a model for
heteroskedasticity introduced in Elbers, Lanjouw, and Lanjouw (2002).12 The resulting dependent
variable for the alpha model can facilitate the selection of a suitable model. The covariates for the alpha model are specified under zvar and, when interactions with the main model’s linear fit X β̂ are desired, users may specify covariates under the yhat and yhat2 options for the interaction of covariates with X β̂ and (X β̂)², respectively.
3. An updated GLS method is available with an adaptation of Henderson’s method III (1953) that incorporates the error decomposition to estimate ση² and σe², as described in Elbers, Lanjouw, and Lanjouw (2002) and Nguyen et al. (2018).13 When users select this fitting method, it is not possible to obtain EB or CensusEB estimates; the method does not require linking areas between the survey and the census, and it is hence not recommended.14 It is still available in Stata’s sae package to allow for comparisons and replicability.15
sae model ell depvar indepvars [if] [in] [aw], area(varname) [zvar(varnames)
yhat(varnames) yhat2(varnames) alfatest(new varname)]
The alfatest, zvar, yhat, and yhat2 options are similar to those described under H3 ﬁtting
method.
The most crucial difference between the ELL and EB methods is that ELL estimators are obtained from the distribution of welfare without conditioning on the survey sample data, though other differences between the original ELL (2003) and MR (2010) approaches exist. For example, though the underlying assumed model is the same, different model-fitting approaches are applied in ELL and MR.
Noise estimates of the small area estimators are also obtained diﬀerently. The implementation of ELL
in PovMap draws from the multiple imputation literature, where minimizing the MSE is not the primary
goal, while MR’s method relies on the assumed model’s data generating process to estimate the noise of
the small area estimators with a parametric bootstrap introduced by González-Manteiga et al. (2008).16
Regardless of the model’s structure (one-fold or two-fold) and the model-ﬁtting approach used (REML or
H3 in Stata’s sae package), users are advised to obtain Empirical Best (EB) or CensusEB estimates rather
than ELL estimates, because EB will yield more accurate and eﬃcient estimates since EB conditions on
the survey sample data and makes more eﬃcient use of the information at hand - see Corral, Molina,
and Nguyen (2021) for a detailed comparison, also see section 4.4.1.17 The gains from EB can be quite
considerable when the assumed model holds (Figure 4.3, left), but simulations based on real-world data,
12 Readers may refer to Elbers, Lanjouw, and Lanjouw (2002) or Nguyen et al. (2018) for a full exposition of the alpha
model for heteroskedasticity.
13 The adaptation was made in light of criticism regarding the ﬁtting method described in ELL (2002). See Nguyen et al.
(2018) for more details.
14 Ensuring the use of same area identiﬁers is still recommended for the inclusion of area-level variables.
15 To see help ﬁle in Stata type: help sae_ell for the one-fold nested error model ﬁt with heteroskedasticity and survey
weights (see Van der Weide 2014).
16 Corral, Molina, and Nguyen (2021) discuss the diﬀerences between the methods used for noise estimation and how
the methods fare when applied to simulated data.
17 Section 4.4 gives a more detailed explanation on the diﬀerences.
where the validity of the assumed model is in doubt, also show gains (Figure 4.3, right).18 Thus, although ELL is one of the optional approaches in the Stata sae package, it is not recommended, because EB estimates are more accurate and efficient; see Corral, Molina, and Nguyen (2021) and section 4.4.1.19
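The reason conditioning on the sample improves precision can be seen in the predictor of the location effect under the nested error model: given an area’s sampled residuals, the best predictor of ηc shrinks the area’s mean residual toward zero by the factor γc = ση²/(ση² + σe²/nc), so well-sampled areas retain most of their observed signal, information that an unconditional approach discards. The sketch below is a stylized illustration of this single step (Python rather than Stata; the function name and all numeric values are illustrative, and this is not the full EB procedure):

```python
def eb_location_effect(mean_resid_c: float, n_c: int,
                       sigma_eta2: float, sigma_e2: float) -> float:
    """Conditional mean of the location effect eta_c given the area's sample:
    the area's mean residual, shrunk by gamma_c = s_eta2 / (s_eta2 + s_e2 / n_c)."""
    gamma_c = sigma_eta2 / (sigma_eta2 + sigma_e2 / n_c)
    return gamma_c * mean_resid_c

# A well-sampled area keeps most of its observed signal...
big = eb_location_effect(0.2, n_c=100, sigma_eta2=0.02, sigma_e2=0.25)
# ...while a thinly sampled area is shrunk heavily toward zero.
small = eb_location_effect(0.2, n_c=2, sigma_eta2=0.02, sigma_e2=0.25)
```

In non-sampled areas nc = 0, the shrinkage factor is zero, and nothing is gained by conditioning, which is consistent with EB and ELL estimates being similar in those areas.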
Figure 4.3: Empirical MSE comparison between ELL and CensusEB
Source: Simulation based on populations generated as described in Corral et al. (2021), sections 3 and 4. The simulations
illustrate how the empirical MSE of ELL estimates is considerably larger than that of CensusEB under simulated data
(left) and also under real world data (right).
The use of contextual variables in the unit-level model is also recommended as long as their regression
coeﬃcients are signiﬁcant. ELL (2002) noted the importance of explaining the variation in welfare due
to location and recommended the inclusion of location means of household-level characteristics, as well
as covariates derived from Geographic Information Systems (GIS). Contextual level variables were also
noted by Haslett (2016) as being essential in ELL to reduce area-specific bias and standard errors. Because contextual variables help explain between-area variation, they reduce its unexplained portion and complement household-level information to produce more precise estimates.
An example of the value that contextual variables may provide in SAE can be seen in Figure 4.4, obtained
in simulations using real world data. Poverty estimates obtained without contextual variables present
more bias and a considerably larger MSE, though speciﬁc results will depend on the chosen variables
and the data at hand.
As noted, the Stata sae package offers practitioners the option to obtain ELL, EB, or CensusEB estimates.
CensusEB estimates are similar to EB, except that households are not linked across Census and survey
data. In practice, the sample size of a given area in the survey is typically much smaller than the actual
population size. In these cases, the difference between CensusEB and EB is negligible. In the simulation experiments of Corral, Molina, and Nguyen (2021), where the sample in each area is 4 percent of the total population, the difference between the empirical MSEs of EB and CensusEB is already indiscernible; see Figure 4.5.
18 Empirical MSE are obtained from multiple samples. Under model-based simulations, the sample is kept ﬁxed, and
across populations the only diﬀerence is in the errors. Under design-based simulations, multiple samples are taken from a
single (ﬁxed) population.
19 In areas with no information (areas not sampled), EB and ELL estimates are similar.
Figure 4.4: Empirical bias and MSE of Census EB estimators based on a nested error model with and
without contextual variables
Source: Simulation based on populations generated as described in Corral et al. (2021), sections 3 and 4. These design-based
simulations provide an example of the possible gains in terms of bias and MSE, obtained from adding contextual variables
to the nested error model.
Figure 4.5: Empirical MSE diﬀerence between EB and CensusEB depending on sampling fraction
Source: Corral, Molina, and Nguyen (2021). The figure illustrates that the difference between the empirical MSEs of CensusEB and EB approaches zero as the sampled fraction of the population decreases.
4.2.2 Data Transformation
Since the EB method of MR (2010) assumes that the errors are normally distributed, data transformation
to get a distribution close to normal can lead to less biased and less noisy estimates (see Rao and Molina
2015; Rojas-Perilla et al. 2020; Corral et al. 2021; and Tzavidis et al. 2018).20 More generally, variable
transformation to make the assumptions underlying the model hold is one of the main aspects of model-
based small area estimation (see Technical Annex of this chapter, section 4.4.1). Speciﬁcally, since the
SAE models used often assume normality in the errors, the goal in data transformation is to approximate
20 The implementation of the ELL method in PovMap allowed practitioners to draw errors from the empirical distribution,
which in principle may beneﬁt the method when normality does not hold. Nevertheless, when comparing CensusEB method
(assuming normal errors) and ELL method (drawing errors from the empirical distribution) in a model-based simulation
experiment, where the model errors used to generate the populations were drawn from a Student’s t-distribution, CensusEB
outperformed ELL by a considerable margin (see Corral et al. 2021).
normally distributed residuals.21
Most software packages for small area estimation oﬀer the possibility of data transformation. Perhaps
the most popular transformation is the natural logarithm, although it is not always ideal as it can
produce left skewed distributions for small values of welfare (see Figure 4.7). Other options include
transformations from the Box-Cox family (Box and Cox 1964) and the log-shift transformation, which have the advantage that the transformation parameters may be driven by the specific data at hand (Tzavidis et al. 2018). The Box-Cox family’s transformation parameter, which is the power to be taken, is denoted by λ. When λ = 0, it yields the natural logarithm; otherwise, λ is typically data-driven and chosen to minimize skewness in the dependent variable.22 A log-shift transformation adds a positive shift to
the observed welfare values before taking the log to avoid the left skewness caused by taking the log
of very small welfare values. Other options, such as the ordered quantile normalization (Peterson and
Cavanaugh 2019), may generate reliable estimates of headcount poverty, but these transformations are
not reversible and therefore cannot be used to obtain other indicators such as mean welfare, the poverty gap, or poverty severity, among others (Masaki et al. 2020; Corral et al. 2021), limiting their applicability.
Regardless of the transformation chosen, it is always recommended to check the residuals to determine
if these follow the underlying model assumptions.
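The data-driven choice of a transformation parameter can be sketched as a grid search minimizing the absolute skewness of the transformed variable, which is effectively what Stata’s bcskew0 does. The sketch below (Python rather than Stata; the synthetic welfare data and the grid are illustrative assumptions) implements the Box-Cox transform by hand:

```python
import numpy as np

def skewness(v):
    v = np.asarray(v, dtype=float)
    return ((v - v.mean()) ** 3).mean() / v.std() ** 3

def boxcox(v, lam):
    # lam = 0 corresponds to the natural logarithm.
    return np.log(v) if abs(lam) < 1e-12 else (v ** lam - 1.0) / lam

rng = np.random.default_rng(2)
welfare = np.exp(rng.normal(0.0, 1.0, size=20_000)) + 5.0  # right-skewed toy welfare

raw_skew = skewness(welfare)
# Grid-search the Box-Cox parameter that minimizes |skewness|.
grid = np.linspace(-2.0, 2.0, 161)
lam_star = min(grid, key=lambda lam: abs(skewness(boxcox(welfare, lam))))
best_skew = skewness(boxcox(welfare, lam_star))
```

Whatever λ the search selects, the residual check remains essential: zero skewness of the dependent variable does not by itself guarantee normal residuals.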
When working with real-world data, assumptions are often only approximately met (Marhuenda et al. 2017), and effort must be made to find the transformation under which the residuals best approximate the normal distribution. In simulation studies that seek to validate small area estimation methods based on actual data, data-driven transformations may reduce the bias and noise due to departures from normality (see Corral et al. 2021 and Tzavidis et al. 2018, among others). Using the Mexican Intracensal survey of 2015 as a 3.9 million
household census data, Corral et al. (2021) note that the gains from a transformation of the dependent
variable that minimizes skewness can be quite considerable. In the validation exercises conducted in that paper, bias and MSE are considerably reduced when using a Box-Cox or a log-shift transformation (see Figure 4.6).
Implementation of a suitable data transformation in Stata is straightforward, since Stata oﬀers commands
for the Box-Cox and the zero-skewness log (log-shift) transformation.23 The Stata sae package (Nguyen
et al. 2018) also includes these transformations, in addition to the natural logarithm, as options and will
apply the transformation for the model. The package will also reverse the transformation to calculate the
indicators of interest in each Monte Carlo simulation. Prior to estimating the indicators, practitioners
may wish to do model selection (e.g., by lasso or stepwise regression) using an already transformed
dependent variable. Once the transformed dependent variable is obtained, practitioners can plot histograms and visually inspect the shape of its distribution (see Figure 4.7).
Once a satisfactory transformation is obtained, the process of model selection of the covariates can
occur. The model selection process consists of selecting the response variable and the most powerful set
of covariates such that the assumptions underlying the chosen SAE method hold – primarily linearity
and normality of the residuals. Deviations from the underlying model’s assumptions could lead to invalid inferences (Rao and Molina 2015) or to estimates that, while not overly noisy, are biased (Banerjee et al. 2006). The model selection procedure is further explored in chapter 6, section 6.2.
21 In cases where, even after transformation, the deviation is apparent only in isolated observations or isolated areas,
models based on mixtures of normal distributions may be used. To estimate poverty and inequality indicators, a mixture
may be speciﬁed as done by Elbers and Van der Weide (2014). Bikauskaite et al. (2020) extend the EB procedure
to a multivariate mixture model that incorporates heterogeneity in the regression coeﬃcients apart from the variance
components. The proposed model was used to estimate FGT0 and FGT1 by gender in West Bank and Gaza.
22 R packages sae (Molina and Marhuenda 2015) and emdi (Kreutzmann et al. 2019) also incorporate transformations.
23 Refer to the help ﬁles for bcskew0 and lnskew0 to learn more about the transformation.
Figure 4.6: Empirical Bias and MSE for CensusEB FGT0 estimates under diﬀerent transformations
Source: Based on samples generated as described in Corral et al. (2021) section 4. The ﬁgure illustrates the potential gains
from a correct data transformation. Taking the natural logarithm of the dependent variable in this case is not enough and
can lead to considerable deviations from normality and thus may yield larger empirical MSEs and biased predictors.
Figure 4.7: Natural logarithm of welfare
Source: Corral et al. (2021). The figure illustrates how, in this case, taking the natural logarithm of welfare leaves a left-skewed distribution, which may yield biased and noisier estimators.
4.2.3 The Alpha Model
The alpha model proposed by ELL (2002) is a preliminary model for the variance of the idiosyncratic errors, used to accommodate heteroskedasticity, that resembles the one presented by Harvey (1976).24 Only limited
research has been conducted on the impact of the alpha model for heteroskedasticity on the quality of
the ﬁnal small area estimates. In most applications, the adjusted R2 of the alpha model is relatively
small, rarely above 0.05.
Nevertheless, the alpha model may play a considerable role in the bias and noise of the estimates. A
simulation experiment based on the same data used for the validations conducted by Corral et al. (2021)
shows that applying an alpha model to obtain CensusEB estimates might yield considerably less noisy
and less biased estimates (see Figure 4.8).25 This result may not translate to other data sets, which
24 An alternative approach to modeling heteroskedasticity is to consider a random coeﬃcient model similar to the one
studied in Hobza and Morales (2013) and applied to the estimation of mean income. However, the method has not been
implemented in any software.
25 The example shows lower bias and a smaller MSE for a particular data set and could not translate equally to other
data sets.
might not suﬀer from heteroskedasticity, and thus, using the alpha model should not be considered a
universal recommendation. Practitioners should carefully inspect their model’s residuals for signs of
heteroskedasticity before deciding to use the alpha model.26
Figure 4.8: Empirical Bias and MSE for CensusEB FGT0 estimates with and without alpha model
Source: Based on samples of populations generated as described in Corral et al. (2021) section 4. The alpha model as
speciﬁed by ELL (2002) can be applied using the sae Package in Stata. Under the Mexican data used by Corral et al.
(2021) the alpha model appears to yield positive results in terms of bias and MSE. This should be interpreted with caution
as it may not be the case in other scenarios.
In practice, the implementation of the alpha model may limit modeling options for practitioners since, though available in the PovMap software, it is not included in many of the newer software packages for SAE. For example, heteroskedasticity is also considered by Molina and Rao (2010), as well as Marhuenda et al. (2017). However, they do not specify the alpha model, and no software package implementation of their methods allows users to specify heteroskedasticity. The Stata sae package allows users to model heteroskedasticity for SAE, but only when using specific model-fitting procedures, such as the Henderson method III extension proposed by Van der Weide (2014) or the traditional ELL fitting method.27 When applying the alpha model, the Stata sae package has an option that automatically constructs the dependent variable, which may be used to facilitate the selection of a suitable model for heteroskedasticity. A removal process of non-significant covariates may also be implemented for the alpha model, as shown in the script of section 4.5.1.
4.2.4 Producing Small Area Estimates
Once the model selection stage is completed, the next step is to obtain estimates for small areas. The model selection and simulation stages may be separated into different scripts to avoid re-running unnecessary processes in case of errors in only one of the stages.28 Model selection (see Stata script in section 4.5.1) is discussed in greater detail in Chapter 6, where attention is given to the different steps involved.
The Monte Carlo simulation process can be seen in the Stata script of section 4.5.2 and follows the
model selection stage. The production of small area estimates requires ﬁrst importing the census data.
In Stata, it is imported into a Mata (Stata’s matrix programming language) data ﬁle, which will allow
users to work with larger data sets than what is usually possible.29 The user must specify the vectors
26 Heteroskedasticity may be checked by plotting residuals against predicted values.
27 In terms of alternatives, the R package mcmcsae (Boonstra and Baltissen 2021) can also model heteroskedasticity in
terms of covariates.
28 It is advisable to split the processes across multiple do-files, because model selection, as well as simulating welfare in the census and obtaining small area estimates for all areas, is not computationally negligible.
29 Note Stata cannot open this data ﬁle as a regular dta ﬁle.
she wishes to import, including household expansion factors (typically household size) within the census
data. The census data should have variable names equal to those in the survey. It is recommended to
import only the necessary variables from the census to avoid creating larger data ﬁles than needed for
the process.
The SAE process can be broken up into two steps. The ﬁrst step entails producing the point estimates
through Monte Carlo simulation. This step is relatively fast, and the recommended number of simulations is at least 100. The estimation of the MSE is considerably more time-consuming and will usually take
several hours, depending on the size of the census. Thus, it is recommended to ﬁrst obtain small
area estimates without specifying bootstrap simulations for MSE estimation. Once users have done their
checks and determined the quality of their point estimates, the MSE can be estimated. It is recommended
to use at least 200 bootstrap replications to estimate MSEs.
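The two-step split described above can be sketched schematically as follows. This Python fragment is purely illustrative, with a toy DGP loosely following the simulation design of section 4.4.1, not the sae command itself: the point is that the fast Monte Carlo point-estimation step is run and checked first, and only then is the far more expensive bootstrap MSE step launched.

```python
import numpy as np

rng = np.random.default_rng(12345)

def area_poverty_rates(n_areas=5, n_hh=200, z=12.0):
    """One simulated census of welfare -> FGT0 (headcount) by area."""
    eta = rng.normal(0, 0.15, size=n_areas)        # area effects
    e = rng.normal(0, 0.5, size=(n_areas, n_hh))   # household errors
    y = np.exp(3.0 + eta[:, None] + e)             # welfare in levels
    return (y < z).mean(axis=1)

# Step 1: point estimates -- fast; average over at least 100 draws
fgt0_point = np.mean([area_poverty_rates() for _ in range(100)], axis=0)

# Step 2 (run only after checking step 1): bootstrap MSE with at least
# 200 replicates, each comparing an estimate to its replicate's "truth"
```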
4.3 Pros and Cons of Unit-Level Models
This section presents a convenient list of pros and cons for each method, along with each method's data requirements. The section borrows heavily from, and adds to, the work presented in Molina (2019).
4.3.1 ELL
Model requirements:
1. Microdata from a household survey and census at the household level, with variables measured and
questions asked similarly.
2. A set of variables related to the welfare variable of interest. These variables should have similar
distributions between the two data sources.
3. Areas in the survey and the census should have identifiers that can be linked across the two data sources.
Pros:
1. The model is based on household-level data, which provide more detailed information than area-level data and have a considerably larger sample size.
2. Can provide estimates for non-sampled areas.
3. Estimates from any indicator that is a function of welfare can be obtained without relying on
diﬀerent models.
4. The estimates are unbiased under the model if the model is true and the model parameters are
known.
5. It is possible to get estimates at any level of aggregation.
6. Oﬀers a software-implemented model for heteroskedasticity.
Cons:
1. The method, as is currently implemented, may yield rather noisy (ineﬃcient) estimates and may
even be worse than direct estimates (obtained using only the area-speciﬁc information from the
survey), if heterogeneity between areas is not explained adequately.
(a) The noise of the traditional ELL estimates is large but it is also underestimated under the
ELL bootstrap procedure (see Corral, Molina, and Nguyen (2021) for details).
(b) To reduce the above noise to the largest possible extent, it is always advisable to include all the potentially significant covariates that may affect welfare, including contextual covariates, which may be taken from data sources different from the census.
2. Estimates are the outcome of a model and thus, the model needs to be appropriately checked.
3. The traditional ELL model speciﬁes the location eﬀects for the clusters (PSUs), typically nested
within the areas where the estimation is desired. Estimations at higher aggregation levels than
those at which the location eﬀects are speciﬁed can underestimate noise if the area-level variables
included in the model fail to explain the between-area heterogeneity.
4.3.2 Empirical Best and CensusEB
Model requirements:
1. Microdata from a household survey and census at the household level with variables measured and
questions asked in a similar manner.
2. A set of variables related to the welfare variable of interest. These variables should have similar
distributions between the two data sources.
3. Areas in the survey and the census should have identifiers that can be linked across the two data sources.
Pros:
1. Based on household-level data which provide more detailed information than area-level data and
have a considerably larger sample size.
2. Can provide estimates for non-sampled areas.
3. Estimates from any indicator that is a function of welfare can be obtained without relying on
diﬀerent models.
4. Unbiased under the model, if the model is true and model parameters are known.
5. Optimal in the sense that they minimize the MSE under the model.
6. Considerably less noisy than ELL when unexplained heterogeneity across areas is considerable. For
non-sampled areas, ELL and EB estimators are similar.
7. Oﬀers a software-implemented model for heteroskedasticity (choosing H3 CensusEB in the Stata
sae package).
Cons:
1. They are the outcome of a model and thus the model needs to be properly checked.
2. It may be affected by deviations from the model's assumptions. For example, deviations from normality or outliers can have detrimental effects on SAE, although isolated unit-level outliers may not have much impact if the sample size is large.
3. Original EB estimates do not consider the sampling design and are not design-unbiased and hence
may be aﬀected by informative sampling designs (i.e., when the sample selection probabilities
depend on the outcome values - Rao and Molina 2015).
(a) In the Stata sae package, the REML option obtains the original EB, but H3 CensusEB incorporates the sampling weights and should not be noticeably affected by informative sampling.
4. The bootstrap method for MSE estimation is computationally intensive.
4.3.3 Alternative Software Packages for Unit-Level Models and Extensions
• Packages that implement the one-fold nested error model EB approach from Molina and Rao (2010):
– The sae package in R (Molina and Marhuenda 2015).
– After the updates presented in Corral, Molina, and Nguyen (2021) the sae package in Stata
(Nguyen et al. 2018).
– The emdi package in R (Kreutzmann et al. 2019).
• Empirical best prediction that incorporates the survey weights (Pseudo EB):
– The emdi package in R (Kreutzmann et al. 2019) has been extended to include the Pseudo
EB method presented in Guadarrama, Molina, and Rao (2018).
– After the updates presented in Corral, Molina, and Nguyen (2021) the sae package in Stata
(Nguyen et al. 2018) includes the model-ﬁtting method that incorporates the survey weights
presented in Van der Weide (2014).
∗ The implemented method only obtains CensusEB estimates.
• Two-fold nested error model EB approach as presented in Marhuenda et al. (2017):
– After the updates presented in Corral, Molina, and Nguyen (2021), the sae package in Stata
(Nguyen et al. 2018).
∗ The implemented method only obtains CensusEB estimates.
• Estimation of area means (without transformation of welfare) based on Hierarchical Bayes unit-level
models:
– The hbsae package in R (Boonstra 2015).
• ELL approach:
– The sae package in Stata (Nguyen et al. 2018).
– PovMap, which is a free stand-alone software (Zhao 2006).
4.4 Unit-Level Models – Technical Annex
4.4.1 The Assumed Model
The nested error model used for small area estimation by ELL (2003) and Molina and Rao (2010) was
originally proposed by Battese, Harter and Fuller (1988) to produce county-level corn and soybean crop
area estimates for the American state of Iowa. For the estimation of poverty and welfare, the ELL and
MR methods assume the transformed welfare ych for each household h in the population within each
location c is linearly related to a 1 × K vector of characteristics (or correlates) xch for that household,
according to the nested error model:
ych = xchβ + ηc + ech ,   h = 1, . . . , Nc ,   c = 1, . . . , C.   (4.2)
Here ηc and ech are respectively location and household-specific idiosyncratic errors, assumed to be independent from each other, satisfying:
ηc ~ iid N(0, ση²),   ech ~ iid N(0, σe²),
where the variances ση² and σe² are unknown. Here, C is the number of locations into which the population is divided and Nc is the number of households in location c, for c = 1, . . . , C. Finally, β is the K × 1 vector of coefficients.
To illustrate the diﬀerences between the methods, understanding the model is essential. First, note that
the random location eﬀect is the same for all households within a given area. Second, note that the
random location eﬀect is not necessarily 0 for a given location, despite the random location eﬀect being
drawn from a normal distribution with mean 0. Consequently, for any given realization of the population,
the random location eﬀect is unlikely to be equal to 0, but its expected value is 0 across all the possible
realizations. To see this, a simulation where the random location effects for 80 areas are drawn from a normal distribution, ηc ~ iid N(0, 0.15²), is shown in the Stata code snippet below. This is repeated across 10,000 simulated populations.
. //Do file simulates random location effects for 80 areas
. set more off
. clear all
. version 14
. set maxvar 10100
.
. //Seed for replicability
. set seed 232989
. //Necessary macros
. local sigmaeta = 0.15 //Sigma eta
. local Npop = 10000 //number of populations
. local Narea = 80
. //Create simulated data
. set obs `Narea'
number of observations (_N) was 0, now 80
. //Simulate random location effects for 80 areas, 10000 populations
. forval z=1/`Npop'{
2. gen double eta_`z' = rnormal(0,`sigmaeta')
3. }
. //Look at values for 1 population
. sum eta_1, d //very close to 0
eta_1
Percentiles Smallest
1% -.3601142 -.3601142
5% -.2164503 -.3536255
10% -.1782243 -.235858 Obs 80
25% -.1081445 -.2219594 Sum of Wgt. 80
50% .0134193 Mean .0047588
Largest Std. Dev. .1484272
75% .1098011 .249745
90% .1700735 .2672989 Variance .0220306
95% .2327892 .3286754 Skewness .007834
99% .4155954 .4155954 Kurtosis 2.991039
. list eta_1 if _n==23 //not 0
eta_1
23. -.22195943
. //The expected value of the random location effect across populations
. egen double E_eta = rmean(eta_*)
. sum E_eta,d
E_eta
Percentiles Smallest
1% -.003775 -.003775
5% -.0018793 -.0031299
10% -.0016342 -.0023112 Obs 80
25% -.0008481 -.0019355 Sum of Wgt. 80
50% .0004301 Mean .0002009
Largest Std. Dev. .0014468
75% .0010856 .0025296
90% .0023057 .0026368 Variance 2.09e-06
95% .0025 .0029182 Skewness -.1558987
99% .0033486 .0033486 Kurtosis 2.744393
. //The expected value for one area across all populations
. list E_eta if _n==23 //approximating 0
E_eta
23. .00242892
.
end of do-file
Note how in the results, the value of the random location eﬀect for a given area and population is not
necessarily equal to 0. However, across all the generated populations, the mean value of the random
location eﬀect for all areas is very close to 0.
This aspect is important, because it highlights the key diﬀerence between the methodology from ELL
(2003) and the EB approach from MR (2010). Under ELL, when simulating welfare for the census, the
location effect is drawn exactly as assumed under the model, that is, as ηc ~ iid N(0, ση²). In essence, under ELL, for any given area present in the sample, the ELL estimator of the census area mean ȳc is obtained by averaging the actual area means ȳc*(m) = X̄cβ + ηc*(m) + ēc*(m), m = 1, . . . , M, across the M simulated populations; that is, the ELL estimator is (1/M) Σ_{m=1}^{M} ȳc*(m), which approximates E(ȳc). But note that E[ηc] = 0 and E[ech] = 0. Hence, the ELL estimator reduces to the regression-synthetic estimator, X̄cβ (MR, 2010). On the other hand, under MR (2010), conditioning on the survey sample ensures that the estimator includes the random location effect, since E[ȳc | ηc] = X̄cβ + ηc. Haslett (2016) suggests that
the inclusion of contextual variables somewhat attenuates the issue. Nevertheless, unless the contextual
variables together with the household-level ones fully explain the variation in welfare across areas, there
will always be gains from using EB compared to ELL. Actually, the EB approach from Molina and Rao
(2010) gives approximately the best estimator in the sense that it yields the estimator with the minimum
mean squared error (MSE) for target areas. Consequently, EB estimators are considerably more eﬃcient
than ELL (see Molina and Rao 2010; Corral, Molina, and Nguyen 2021; and Corral et al. 2021).
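The contrast can be illustrated numerically. In the hedged Python sketch below (a single area with known parameters; purely illustrative, not the packages' implementations), averaging over unconditional draws of ηc collapses toward the regression-synthetic value, while conditioning on the area's sample retains a predicted location effect through a shrinkage factor γc.

```python
import numpy as np

rng = np.random.default_rng(2019)
beta0, sigma_eta, sigma_e, n_c = 3.0, 0.15, 0.5, 50

# One sampled area: a realized location effect and n_c observations
eta_true = rng.normal(0, sigma_eta)
y_sample = beta0 + eta_true + rng.normal(0, sigma_e, size=n_c)

# ELL-style: average over fresh, unconditional draws of eta; since
# E[eta] = 0 this collapses to the regression-synthetic value beta0
ell_mean = np.mean([beta0 + rng.normal(0, sigma_eta) for _ in range(1000)])

# EB-style: condition on the sample, predicting eta via shrinkage
gamma = sigma_eta**2 / (sigma_eta**2 + sigma_e**2 / n_c)
eb_mean = beta0 + gamma * (y_sample.mean() - beta0)
```

Here eb_mean moves toward the area's own sample mean (by the fraction γc ≈ 0.82 under these parameters), while ell_mean stays at the synthetic value regardless of what the sample shows.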
A census data set of N = 20,000 observations is created, similar to MR (2010), to observe how conditioning on the survey sample affects the resulting small area estimates. Under the created data, all observations are uniformly spread among C = 80 areas, labeled from 1 to 80. This means that every area consists of Nc = 250 observations. The location effects are simulated as ηc ~ iid N(0, 0.15²); note that every observation within a given area will have the same simulated effect. Then, values of two right-hand-side binary variables are simulated. The first one, x1, takes a value of 1 if a generated random uniform value between 0 and 1 is less than or equal to 0.3 + 0.5(c/80). This means that observations in areas with a higher label are more likely to get a value of 1. The next one, x2, is not tied to the area's label. This variable takes the value 1 if a simulated random uniform value between 0 and 1 is less than or equal to 0.2. The census welfare vectors yc = (yc,1 , . . . , yc,Nc )T for each area c = 1, . . . , C are then created from the model as follows:
ln(ych ) = 3 + 0.03x1,ch − 0.04x2,ch + ηc + ech ,
where household-level errors are generated under the homoskedastic setup, as ech ~ iid N(0, 0.5²). The
poverty line is ﬁxed at z = 12, corresponding to roughly 60 percent of the median welfare of a generated
population. The steps to creating such a population in Stata are shown below:
. set more off
. clear all
. version 14
. set seed 648743
.
. local obsnum = 20000 //Number of observations in our "census"
. local areasize = 250 //Number of observations by area
. local outsample = 20 //Sample size from each area (%)
. local sigmaeta = 0.15 //Sigma eta
. local sigmaeps = 0.5 //Sigma eps
.
.
. // Create area random effects
. set obs `=`obsnum'/`areasize''
number of observations (_N) was 0, now 80
. gen area = _n // label areas 1 to C
. gen eta = rnormal(0,`sigmaeta') //Generate random location effects
. expand `areasize' //leaves us with 250 observations per area
(19,920 observations created)
. sort area //To ensure everything is ordered - this matters for replicability
. //Household identifier
. gen hhid = _n
. //Household expansion factors - assume all 1
. gen hhsize = 1
. //Household specific residual
. gen e = rnormal(0,`sigmaeps')
. //Covariates, some are correlated to the area's label
. gen x1=runiform()<=(0.3+.5*area/`=`obsnum'/`areasize'')
. gen x2=runiform()<=(0.2)
. //Welfare vector
. gen Y_B = 3+ .03* x1-.04* x2 + eta + e
.
. preserve
. sort hhid
. sample 20, by(area)
(16,000 observations deleted)
. keep Y_B x1 x2 area hhid
. save "$mdata\sample.dta", replace
file C:\Users\Paul Corral\OneDrive\SAE Guidelines 2021\\3_Unit_level\\1_data\\s
> ample.dta saved
. restore
.
. save "$mdata\thepopulation.dta", replace
file C:\Users\Paul Corral\OneDrive\SAE Guidelines 2021\\3_Unit_level\\1_data\\t
> hepopulation.dta saved
.
end of do-file
The nested error model is then ﬁt to the population data using restricted maximum likelihood:
. mixed Y_B x1 x2 || area:, reml
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -14736.236
Iteration 1: log restricted-likelihood = -14736.236
Computing standard errors:
Mixed-effects REML regression Number of obs = 20,000
Group variable: area Number of groups = 80
Obs per group:
min = 250
avg = 250.0
max = 250
Wald chi2(2) = 44.41
Log restricted-likelihood = -14736.236 Prob > chi2 = 0.0000
Y_B Coef. Std. Err. z P>|z| [95% Conf. Interval]
x1 .0252974 .0074975 3.37 0.001 .0106026 .0399922
x2 -.0505327 .0088161 -5.73 0.000 -.067812 -.0332534
_cons 3.014335 .0201408 149.66 0.000 2.97486 3.05381
Random-effects Parameters Estimate Std. Err. [95% Conf. Interval]
area: Identity
var(_cons) .0298249 .0049059 .0216055 .0411712
var(Residual) .2518635 .0025238 .2469652 .256859
LR test vs. linear model: chibar2(01) = 1940.36 Prob >= chibar2 = 0.0000
Linear predictions, Xβ̂, can also be obtained:
. predict double xb, xb
Predicted random location effects, η̂c, can also be obtained:
. predict double eta_pred, reffects
. sum eta_pred
Variable Obs Mean Std. Dev. Min Max
eta_pred 20,000 -2.17e-13 .1687893 -.3371651 .3667746
Stata can also produce the linear prediction, Xβ̂, plus the estimated location effects, η̂c:
. predict double xb_eta, fitted
The expected values of the linear predictor Xβ̂ and of the linear predictor that includes the predicted location effects, Xβ̂ + η̂c, are the same. The difference between these is that Xβ̂ + η̂c includes the estimated location effect, and thus a larger share of the variance is explained, which leads to smaller estimation errors for the areas.
. sum Y_B xb xb_eta
Variable Obs Mean Std. Dev. Min Max
Y_B 20,000 3.017888 .5308399 1.027403 5.197237
xb 20,000 3.017888 .0240033 2.963802 3.039632
xb_eta 20,000 3.017888 .1702655 2.626637 3.406407
At the same time, the area-specific predictions are considerably closer to the observed values:
. collapse Y_B xb xb_eta, by(area)
. list Y_B xb xb_eta in 1/10
Y_B xb xb_eta
1. 2.693685 3.011817 2.7040804
2. 3.334479 3.0108076 3.3239027
3. 2.683372 3.0086839 2.6940016
4. 3.117312 3.0112101 3.1138448
5. 2.670211 3.0146496 2.6814654
6. 3.049783 3.0128294 3.0485756
7. 2.945565 3.0121218 2.9477392
8. 3.180324 3.0099986 3.1747587
9. 3.139087 3.0139408 3.1349977
10. 2.891956 3.01273 2.8959019
4.4.2 Monte Carlo Simulation and Bootstrap Procedures Under CensusEB
The best predictor is deﬁned as a conditional expectation. For indicators with a complex shape, the
conditional expectation deﬁning the best predictor may not have a closed form. Regardless of the shape,
the best predictor can be approximated by Monte Carlo simulation (Molina 2019). ELL and MR’s
EB approach clearly diﬀer in the computational procedures used to estimate the indicators of interest
and their noise. On the other hand, the original implementation of EB in PovMap and the sae Stata
command used a similar computational approach to ELL. Under ELL and the original EB implementation
in PovMap, noise and indicator estimates are all obtained with a single computational procedure. Corral,
Molina, and Nguyen (2021) show that the noise estimates of the original ELL (referred to as variance by
the authors) underestimate the true MSE of the indicators, despite ELL estimates being much noisier.
On the other hand, the parametric bootstrap procedure from González-Manteiga et al. (2008) seems to
estimate the noise (MSE) of EB estimators correctly.
For an in-depth discussion on how ELL and EB computational procedures diﬀer, see Corral, Molina, and
Nguyen (2021). For ease of interpretation, the expositions presented here do not consider survey weights or heteroskedasticity.30
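In generic terms, the Monte Carlo approximation of a best predictor looks as follows. This Python sketch is illustrative only; the function names and the toy conditional distribution are assumptions, not the sae or PovMap implementations. Any indicator, however complex, is evaluated on draws from the conditional distribution of welfare and the draws are averaged.

```python
import numpy as np

def best_predictor_mc(draw_conditional, indicator, n_draws=100):
    """Monte Carlo approximation of E[ indicator(y) | sample ].

    draw_conditional : draws one census welfare vector from the
                       conditional distribution given the sample data
    indicator        : maps a welfare vector to the target indicator
    """
    draws = [indicator(draw_conditional()) for _ in range(n_draws)]
    return np.mean(draws, axis=0)

# Toy usage: FGT0 for one area, with a fixed predicted location effect
rng = np.random.default_rng(7)
draw = lambda: np.exp(3.1 + rng.normal(0, 0.5, size=250))
fgt0 = best_predictor_mc(draw, lambda y: float(np.mean(y < 12.0)))
```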
4.4.2.1 Molina and Rao’s (2010) Monte Carlo Simulation Procedure for Point Estimates
The EB method from MR (2010) conditions on the survey sample data and thus makes more eﬃcient
use of the survey data, which contains the only available (and hence precious) information on the actual
welfare. Conditioning on the survey data requires matching households across the survey and census, as in the original EB approach introduced in MR (2010). However, the census and sample households cannot be matched in practice, except in some countries. When linking the two data sources is not possible, the sample may instead be appended to the population for areas where there is a sample. This is the approach taken in the original EB implementation in the sae R package by Molina and Marhuenda (2015). To obtain a Monte Carlo approximation to the EB estimates using this software, estimates are obtained as a weighted average between a simulated census poverty rate for the area and the direct estimate of poverty obtained using the area's sample. When the sample size relative to the population size for a given area is considerable, for example, 20%, using the sample observations of welfare may yield a considerable gain in MSE. However, as noted in Figure 4.5 in section 4.2.1, the gain in MSE of EB over CensusEB approaches 0 as the sample size by area shrinks. When the sample is 4% of the population per area, already quite large and rarely encountered in real-world scenarios, the difference between the approaches is nearly zero.
To illustrate how CensusEB estimates are implemented, a 20% sample by area of the data created in
section 4.4.1 is used as a survey to which the model 4.2 is ﬁt. To obtain point estimates under Molina
30 For the full exposition, see Corral, Molina and Nguyen (2021).
and Rao (2010), the authors assume that the considered unit-level model is the one that generates the data. Thus, the estimates for β, ση², and σe² obtained from the fitted model are kept fixed in the Monte Carlo simulation procedure used to obtain EB point estimates.31 Additionally, the predicted random location effects, η̂c, are also kept fixed as a result of conditioning on the sample observations of welfare. The first step consists in fitting the model and obtaining the parameter estimates, θ̂0 = (β̂0, σ̂η0², σ̂e0²), which are later used to simulate vectors of welfare for the whole population of households (a census of welfare).
. //MonteCarlo simulation to obtain CensusEB estimates
. use "$mdata\sample.dta", clear
.
. //fit model on the sample
. mixed Y_B x1 x2 || area:, reml
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -3009.9023
Iteration 1: log restricted-likelihood = -3009.9023
Computing standard errors:
Mixed-effects REML regression Number of obs = 4,000
Group variable: area Number of groups = 80
Obs per group:
min = 50
avg = 50.0
max = 50
Wald chi2(2) = 7.31
Log restricted-likelihood = -3009.9023 Prob > chi2 = 0.0258
Y_B Coef. Std. Err. z P>|z| [95% Conf. Interval]
x1 .028721 .0167756 1.71 0.087 -.0041585 .0616005
x2 -.0423625 .0201979 -2.10 0.036 -.0819496 -.0027753
_cons 3.010993 .0241635 124.61 0.000 2.963633 3.058353
Random-effects Parameters Estimate Std. Err. [95% Conf. Interval]
area: Identity
var(_cons) .0339989 .0062132 .0237635 .048643
var(Residual) .2521756 .0056975 .2412523 .2635934
LR test vs. linear model: chibar2(01) = 338.23 Prob >= chibar2 = 0.0000
.
. //Obtain the necessary parameters
. local sigma_eta2 = (exp([lns1_1_1]_cons))^2
. local sigma_e2 = (exp([lnsig_e]_cons))^2
With these estimates in hand, it is possible to predict the random location effects, ηc, as well as the shrinkage parameter, γc = ση²(ση² + σe²/nc)⁻¹, and the variance of the random location effect, ση²(1 − γc). Notice that γc will be between 0 and 1; hence, the variance of the location effect will be smaller than ση²,
31 Note that this is considerably diﬀerent from the approach that was implemented in PovMap (Corral, Molina, and
Nguyen 2021).
which is the variance of the location eﬀect under ELL. These model parameter estimates will then be
used as true values to simulate the welfare vectors for the whole population of households.
. //Let's calculate the linear fit
. predict xb, xb
.
. //Get the residual to calculate the random location effect
. gen residual = Y_B - xb
.
. //Number of observations by area
. gen n_area = 1 //we'll add it later
.
. groupfunction, mean(residual) sum(n_area) by(area)
. //Gamma or adjustment factor
. gen double gamma = `sigma_eta2'/(`sigma_eta2'+`sigma_e2'/n_area)
. //Produce eta
. gen double eta = gamma*residual
. //variance of random location effects
. gen double var_eta = `sigma_eta2'*(1-gamma)
.
. keep area eta var_eta
.
. tempfile mylocs
. save `mylocs'
file C:\Users\PAULCO~1\AppData\Local\Temp\ST_2bd8_000003.tmp saved
The next step consists in applying the estimated parameters, θ̂0, to generate the population vectors of welfare. Note how the same β̂0 is used to produce the linear fit in every Monte Carlo simulation. Notice that the location effects are predicted from the sample of the corresponding area; thus, it is necessary to match areas across the sample and population. Then, 100 vectors of welfare are generated for the population units (a census of welfare); notice that here, the natural logarithm of welfare is reversed by taking the exponential. Welfare for each household, h, is drawn from ln(ych) ~ N(xchβ̂0 + η̂c0, σ̂η0²(1 − γ̂c) + σ̂e0²).
. //Set seed for replicability
. set seed 9374
.
. //Bring in the population
. use x1 x2 area hhid using "$mdata\thepopulation.dta", clear
. //obtain linear fit - e(b) is still in memory
. predict xb, xb
.
. //Include eta and var_eta
. merge m:1 area using `mylocs'
(note: variable area was float, now double to accommodate using data's values)
Result # of obs.
not matched 0
matched 20,000 (_merge==3)
. drop if _m==2
(0 observations deleted)
. drop _m
. //to ensure replicability
. sort hhid
. //generate 100 vectors of welfare in the population
. forval z=1/100{
2. //Take the exponential to obtain welfare
. gen double Y_`z' = exp(rnormal(xb + eta, sqrt(`sigma_e2'+var_eta)))
3. }
Finally, FGT indicators (Foster, Greer, and Thorbecke 1984) are calculated for each area from the
simulated census – this is done for the 100 simulated censuses. Then, the CensusEB estimate is obtained
by averaging across the 100 FGT indicators obtained for the area.
. //Indicate poverty line
. gen povline = 12
. //Obtain FGTs for each area under every simulated vector
. sp_groupfunction, poverty(Y_*) povertyline(povline) by(area)
. //Average over simulations to obtain the EB estimate
. groupfunction, mean(value) by(measure area)
. //Reshape to obtain wide data at the area level
. qui:reshape wide value, i(area) j(measure) string
. //Save CensusEB
. save "$mdata\CensusEBfgt.dta", replace
file C:\Users\Paul Corral\OneDrive\SAE Guidelines 2021\\3_Unit_level\\1_data\\CensusEBfgt.dta saved
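For reference, the FGT family being averaged here is FGTα = (1/N) Σh 1(yh < z) ((z − yh)/z)^α, with α = 0 giving the headcount ratio, α = 1 the poverty gap, and α = 2 poverty severity. A minimal Python sketch (illustrative; not the sp_groupfunction implementation):

```python
import numpy as np

def fgt(y, z, alpha=0):
    """Foster-Greer-Thorbecke poverty measure FGT_alpha."""
    y = np.asarray(y, dtype=float)
    poor = y < z
    # normalized shortfalls, raised to alpha, counted only for the poor
    return float(np.mean(np.where(poor, ((z - y) / z) ** alpha, 0.0)))

welfare = np.array([5.0, 10.0, 15.0, 30.0])
fgt0 = fgt(welfare, z=12)            # 0.5: two of four households poor
fgt1 = fgt(welfare, z=12, alpha=1)   # average normalized poverty gap
```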
Note that this is quite different from the method used for ELL, as implemented in PovMap and the Stata sae package.32 The steps detailed above align with those applied by the command sae sim reml to obtain CensusEB estimates and are very similar to the approach under sae sim h3.
4.4.2.2 Parametric Bootstrap
The parametric bootstrap used to estimate the MSE of the EB estimates was introduced by González-
Manteiga et al. (2008). It is the approach used by Molina and Rao (2010) and the one used in the sae
R package from Molina and Marhuenda (2015), as well as in the updated Stata sae package.33 The
procedure is computationally more demanding than the bootstrap procedure used in PovMap which was
inspired by the MI literature. However, the parametric bootstrap procedure from González-Manteiga
et al. (2008) tracks the real MSE values.
The process consists in, first, creating a population vector of welfare (i.e., a census) using θ̂0 = (β̂0, σ̂η0², σ̂e0²). Note that θ̂0 = (β̂0, σ̂η0², σ̂e0²) is obtained from the original sample and matches the estimates used in the previous step for computing CensusEB point estimates.
. //Macros used below
. global povline = 12
. global bs = 1 //Bootstrap replicates
.
.
32 See Corral, Molina, and Nguyen (2021) for more details.
33 See Corral et al. (2021) for more details.
. //González-Manteiga et al. (2008) Bootstrap for estimating MSE
. use "$mdata\sample.dta", clear
.
. //fit model on the sample
. mixed Y_B x1 x2 || area:, reml
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -3009.9023
Iteration 1: log restricted-likelihood = -3009.9023
Computing standard errors:
Mixed-effects REML regression Number of obs = 4,000
Group variable: area Number of groups = 80
Obs per group:
min = 50
avg = 50.0
max = 50
Wald chi2(2) = 7.31
Log restricted-likelihood = -3009.9023 Prob > chi2 = 0.0258
Y_B Coef. Std. Err. z P>|z| [95% Conf. Interval]
x1 .028721 .0167756 1.71 0.087 -.0041585 .0616005
x2 -.0423625 .0201979 -2.10 0.036 -.0819496 -.0027753
_cons 3.010993 .0241635 124.61 0.000 2.963633 3.058353
Random-effects Parameters Estimate Std. Err. [95% Conf. Interval]
area: Identity
var(_cons) .0339989 .0062132 .0237635 .048643
var(Residual) .2521756 .0056975 .2412523 .2635934
LR test vs. linear model: chibar2(01) = 338.23 Prob >= chibar2 = 0.0000
.
. //Obtain the necessary parameters
. local sigma_eta2 = (exp([lns1_1_1]_cons))^2
. local sigma_e2 = (exp([lnsig_e]_cons))^2
. //Let's calculate the linear fit
. predict xb, xb
. dis sqrt(`sigma_eta2')
.18438803
. dis sqrt(`sigma_e2')
.50217086
.
. tempfile sample
. save `sample'
file C:\Users\PAULCO~1\AppData\Local\Temp\ST_1690_000003.tmp saved
Using $\hat{\theta}^0 = \left(\hat{\beta}^0, \hat{\sigma}^2_{\eta 0}, \hat{\sigma}^2_{e0}\right)$ as the vector of true values, $B$ population vectors of welfare are created
following the DGP from Eq. 4.2. Each area’s desired set of indicators is obtained from each population
vector. These indicators are regarded as the true values in each bootstrap replicate and serve as the
benchmark against which CensusEB estimates for that generated population are compared. The estimate of the
MSE is the mean across the B replicates of the squared diﬀerences between the bootstrap replicate’s
estimate and the bootstrap replicate’s “true” value. In the code below, the steps for the ﬁrst bootstrap
replicate, b = 1, in this parametric bootstrap procedure are shown.
. *===============================================================================
. //The second stage consists in "extracting the sample" and obtaining estimates
. // via MonteCarlo simulation. For each bootstrap replicate...this is b=1
. *===============================================================================
. use x1 x2 xb hhid area using `sample', clear
. //Include the etas
. //note that these are the same used in the population
. merge m:1 area using `etas'
Result # of obs.
not matched 0
matched 4,000 (_merge==3)
. drop _m
. sort hhid
. //Generate the welfare vector, in exactly the same manner as in the
. //population
. gen lny = rnormal(xb + eta_1,sqrt(`sigma_e2'))
.
. local seedstage `c(rngstate)'
.
. // Now with the new welfare vector we can obtain CensusEB estimates
. //for the bootstrap replicate using sae
. sae sim reml lny x1 x2, area(area) mcrep(50) bsrep(0) lny ///
> seed(`seedstage') pwcensus(hhsize) indicators(FGT0 FGT1 FGT2) ///
> aggids(0) uniq(hhid) plines($povline) matin("$mdata\mypop_mata")
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -3004.5851
Iteration 1: log restricted-likelihood = -3004.5851
Computing standard errors:
Mixed-effects REML regression Number of obs = 4,000
Group variable: area Number of groups = 80
Obs per group:
min = 50
avg = 50.0
max = 50
Wald chi2(2) = 17.06
Log restricted-likelihood = -3004.5851 Prob > chi2 = 0.0002
lny Coef. Std. Err. z P>|z| [95% Conf. Interval]
x1 .0484545 .0167536 2.89 0.004 .015618 .081291
x2 -.0596082 .0201734 -2.95 0.003 -.0991472 -.0200691
_cons 2.954358 .0239952 123.12 0.000 2.907328 3.001387
Random-effects Parameters Estimate Std. Err. [95% Conf. Interval]
area: Identity
sd(_cons) .182708 .0167331 .1526867 .2186321
sd(Residual) .5015719 .0056661 .4905885 .5128011
LR test vs. linear model: chibar2(01) = 331.29 Prob >= chibar2 = 0.0000
file C:\Users\PAULCO~1\AppData\Local\Temp\ST_2bd8_000006.tmp saved
Number of observations in target dataset:
20000
Number of clusters in target dataset:
80
Number of simulations: 50
Each dot (.) represents 1 simulation(s).
1 2 3 4 5
.................................................. 50
Number of bootstraps: 0
1 2 3 4 5
.
. //MSE is calculated as squared difference between the true and CensusEB.
. //Bring in the true poverty for this bootstrap replicate
. rename Unit area
. merge 1:1 area using `true'
Result # of obs.
not matched 0
matched 80 (_merge==3)
. drop _m
. //Get squared difference, after B bootstrap replicates this is our
. //MSE estimate
. forval a = 0/2{
2. gen sq_diff_fgt`a' = ((avg_fgt`a' - valuefgt`a')^2)/$bs
3. }
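For readers who want the bootstrap logic outside of Stata, the procedure can be sketched in Python. This is an illustrative skeleton only: each replicate generates a "true" census welfare vector from the fitted parameters, computes the true FGT0 per area, compares it with a simplified plug-in estimate, and averages the squared differences over B. All function and variable names here are invented for the sketch; a real CensusEB replicate would re-fit the model on a bootstrap sample and condition on that sample's data.

```python
import numpy as np

def fgt0(welfare, z):
    """Headcount ratio: share of units with welfare below the poverty line z."""
    return float(np.mean(np.asarray(welfare) < z))

def bootstrap_mse_fgt0(xb, areas, sigma_eta2, sigma_e2, z, B=200, seed=1):
    """Parametric bootstrap MSE of a (heavily simplified) FGT0 predictor.

    xb: fitted fixed part for every census unit; areas: area id per unit.
    Each replicate draws area effects and unit errors from the fitted model,
    treats the resulting FGT0 per area as the replicate's "truth", compares
    it with the replicate's estimate, and accumulates squared differences.
    """
    rng = np.random.default_rng(seed)
    labels = np.unique(areas)
    sq = np.zeros(len(labels))
    for _ in range(B):
        eta = rng.normal(0.0, np.sqrt(sigma_eta2), len(labels))
        e = rng.normal(0.0, np.sqrt(sigma_e2), len(xb))
        welfare = np.exp(xb + eta[np.searchsorted(labels, areas)] + e)
        for i, a in enumerate(labels):
            mask = areas == a
            true_val = fgt0(welfare[mask], z)   # replicate's "true" value
            # Simplified plug-in estimate (stands in for the CensusEB step):
            draw = xb[mask] + rng.normal(0.0, np.sqrt(sigma_eta2 + sigma_e2),
                                         int(mask.sum()))
            sq[i] += (fgt0(np.exp(draw), z) - true_val) ** 2
    return sq / B
```

In the Stata workflow this corresponds to looping the replicate shown above over b = 1, ..., B and averaging the sq_diff_fgt variables.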
4.5 Appendix
4.5.1 Model Selection Do-File
This do-ﬁle provides an example of the steps that may be followed to select a model for SAE, according
to the discussion from section 4.2. The steps for model checking follow those illustrated by UCLA’s
Statistical Consulting Group (2022).
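The backward elimination of non-significant covariates used in the do-file can be sketched outside Stata as follows. This is illustrative Python with plain OLS z-statistics (the do-file itself re-fits the Henderson III model at each pass and loops over progressively stricter p-value thresholds); all names are hypothetical.

```python
import numpy as np

def ols_z(X, y):
    """OLS coefficient estimates and their z-statistics (homoskedastic SEs)."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(xtx_inv) * sigma2)
    return beta, beta / se

def backward_eliminate(X, y, names, z_crit=1.96):
    """Drop, one at a time, the covariate with the smallest |z| until every
    remaining covariate clears z_crit."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        _, z = ols_z(X[:, keep], y)
        worst = int(np.argmin(np.abs(z)))
        if abs(z[worst]) >= z_crit:
            break
        keep.pop(worst)
    return [names[i] for i in keep]
```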
clear all
set more off
/*===============================================================================
Do-file prepared for SAE Guidelines
- Real world data application
- authors Paul Corral & Minh Nguyen
*==============================================================================*/
global main "C:\Users\\`c(username)'\OneDrive\SAE Guidelines 2021\"
global section "$main\3_Unit_level\"
global mdata "$section\1_data\"
global survey "$mdata\survey_public.dta"
global census "$mdata\census_public.dta"
//global with candidate variables.
global myvar rural lnhhsize age_hh male_hh piped_water no_piped_water ///
no_sewage sewage_pub sewage_priv electricity telephone cellphone internet ///
computer washmachine fridge television share_under15 share_elderly ///
share_adult max_tertiary max_secondary HID_* mun_* state_*
version 15
set seed 648743
*===============================================================================
// End of preamble
*===============================================================================
//load in survey data
use "$survey", clear
//Remove small incomes affecting model
drop if e_y<1
//Log shift transformation to approximate normality
lnskew0 double bcy = exp(lny)
//removes skewness from distribution
sum lny, d
sum bcy, d
//Data has already been cleaned and prepared. Data preparation and the creation
// of eligible covariates is of extreme importance.
// In this instance, we skip these comparison steps because the sample is
// literally a subsample of the census.
codebook HID //10 digits, every single one
codebook HID_mun //7 digits every single one
//We rename HID_mun
rename HID_mun MUN
//Drop automobile, it is missing
drop *automobile* //all these are missing
//Check to see if lassoregress is installed, if not install
cap which lassoregress
if (_rc) ssc install elasticregress
//Model selection - with Lasso
gen lnhhsize = ln(hhsize)
lassoregress bcy $myvar [aw=Whh], lambda1se epsilon(1e-10) numfolds(10)
local hhvars = e(varlist_nonzero)
global postlasso `hhvars'
//Try Henderson III GLS
sae model h3 bcy $postlasso [aw=Whh], area(MUN)
//Rename HID_mun
rename MUN HID_mun
//Loop designed to remove non-significant covariates sequentially
forval z= 0.5(-0.05)0.05{
qui:sae model h3 bcy `hhvars' [aw=Whh], area(HID_mun)
mata: bb=st_matrix("e(b_gls)")
mata: se=sqrt(diagonal(st_matrix("e(V_gls)")))
mata: zvals = bb':/se
mata: st_matrix("min",min(abs(zvals)))
local zv = (-min[1,1])
if (2*normal(`zv')<`z') exit
foreach x of varlist `hhvars'{
local hhvars1
qui: sae model h3 bcy `hhvars' [aw=Whh], area(HID_mun)
qui: test `x'
if (r(p)>`z'){
local hhvars1
foreach yy of local hhvars{
if ("`yy'"=="`x'") dis ""
else local hhvars1 `hhvars1' `yy'
}
}
else local hhvars1 `hhvars'
local hhvars `hhvars1'
}
}
global postsign `hhvars'
//Henderson III GLS - model post removal of non-significant
sae model h3 bcy $postsign [aw=Whh], area(HID_mun)
//Check for multicollinearity, and remove highly collinear (VIF>3)
reg bcy $postsign [aw=Whh],r
cap drop touse //remove vector if it is present to avoid error in next step
gen touse = e(sample) //Indicates the observations used
vif //Variance inflation factor
local hhvars $postsign
//Remove covariates with VIF greater than 3
mata: ds = _f_stepvif("`hhvars'","Whh",3,"touse")
global postvif `vifvar'
//VIF check
reg bcy $postvif [aw=Whh], r
vif
//Henderson III GLS - model post removal of non-significant
sae model h3 bcy $postvif [aw=Whh], area(HID_mun)
*===============================================================================
// 2.5 Model checks
*===============================================================================
reg bcy $postvif
predict cdist, cooksd
predict rstud, rstudent
reg bcy $postvif [aw=Whh]
local KK = e(df_m)
predict lev, leverage
predict eps, resid
predict bc_yhat, xb
//Let's take a look at our residuals
//Notice there is a downward sloping line, which seems to be the smallest eps for that xb
scatter eps bc_yhat
//so we can see what the figure looks like
sleep 15000
// so there's a bunch of small incomes that may be affecting our model!
scatter eps bc_yhat if exp(lny)>1
/* https://stats.idre.ucla.edu/stata/dae/robust-regression/
Residual: The difference between the predicted value (based on the regression ///
equation) and the actual, observed value.
Outlier: In linear regression, an outlier is an observation with large ///
residual. In other words, it is an observation whose dependent-variable ///
value is unusual given its value on the predictor variables. An outlier may ///
indicate a sample peculiarity or may indicate a data entry error or ///
other problem.
Leverage: An observation with an extreme value on a predictor variable is a ///
point with high leverage. Leverage is a measure of how far an independent ///
variable deviates from its mean. High leverage points can have a great ///
amount of effect on the estimate of regression coefficients.
Influence: An observation is said to be influential if removing the ///
observation substantially changes the estimate of the regression coefficients. ///
Influence can be thought of as the product of leverage and outlierness.
Cook's distance (or Cook's D): A measure that combines the information of ///
leverage and the residual of the observation.
*/
/* Rules of thumb:
Cooks -> >4/N, also according to "Regression Diagnostics: An Expository ///
Treatment of Outliers and Influential Cases, values over 1...
Abs(rstu) -> >2 We should pay attention to studentized residuals that exceed ///
+2 or -2, and get even more concerned about residuals that exceed +2.5 or ///
-2.5 and even yet more concerned about residuals that exceed +3 or -3. ///
leverage -> >(2k+2)/n
*/
hist cdist, name(diag_cooksd, replace)
hist lev, name(diag_leverage, replace)
hist rstud, name(diag_rstudent, replace)
twoway scatter cdist lev, name(diag_cooksd_lev, replace)
lvr2plot, name(lvr2)
rvfplot, name(rvf)
sum cdist, d
local max = r(max)
local p99 = r(p99)
reg lny $postvif [aw=Whh]
local myN=e(N)
local myK=e(rank)
//We have influential data points...
reg lny $postvif if cdist<4/`myN' [aw=Whh]
reg lny $postvif if cdist<`max' [aw=Whh]
reg lny $postvif if cdist<`p99' [aw=Whh]
gen nogo = abs(rstud)>2 & cdist>4/`myN' & lev>(2*`myK'+2)/`myN'
*===============================================================================
// Selecting the Alpha model
*===============================================================================
//Rename HID_mun
cap rename HID_mun MUN
//Henderson III GLS - add alfa model
sae model h3 bcy $postvif if nogo==0 [aw=Whh], area(MUN) ///
alfatest(residual) zvar(hhsize)
des residual_alfa //The dependent variable for the alfa model
// Macro holding all eligible vars
unab allvars : $myvar
//Macro with current variables
local nogo $postvif
//We want to only use variables not used
foreach x of local allvars{
local in = 0
foreach y of local nogo{
if ("`x'"=="`y'") local in=1
}
if (`in'==0) local A `A' `x'
}
global A `A' //macro holding eligible variables for alpha model
lassoregress residual_alfa `A' if nogo==0 [aw=Whh]
local alfa = e(varlist_nonzero)
global alfa `alfa'
reg residual_alfa $alfa if nogo==0 [aw=Whh],r
gen tousealfa = e(sample)
//Remove vif vars
mata: ds = _f_stepvif("$alfa","Whh",5,"tousealfa")
global alfa `vifvar'
//Alfa vars before removal of non-significant vars
global beforealfa `alfa'
local hhvars $alfa
forval z= 0.9(-0.1)0.1{
foreach x of varlist `hhvars'{
local hhvars1
qui: reg residual_alfa `hhvars' [aw=Whh], r
qui: test `x'
if (r(p)>`z'){
local hhvars1
foreach yy of local hhvars{
if ("`yy'"=="`x'") dis ""
else local hhvars1 `hhvars1' `yy'
}
}
else local hhvars1 `hhvars'
local hhvars `hhvars1'
}
}
global alfavars `hhvars'
//Henderson III Model with alpha model
sae model h3 bcy $postvif if nogo==0 [aw=Whh], area(MUN) zvar($alfavars)
*===============================================================================
// GLS model, one final removal of non-significant variables
*===============================================================================
//Loop designed to remove non-significant covariates sequentially
local hhvars $postvif
forval z= 0.5(-0.05)0.05{
qui:sae model h3 bcy `hhvars' if nogo==0 [aw=Whh], area(MUN) ///
zvar($alfavars)
mata: bb=st_matrix("e(b_gls)")
mata: se=sqrt(diagonal(st_matrix("e(V_gls)")))
mata: zvals = bb':/se
mata: st_matrix("min",min(abs(zvals)))
local zv = (-min[1,1])
if (2*normal(`zv')<`z') exit
foreach x of varlist `hhvars'{
local hhvars1
qui:sae model h3 bcy `hhvars' if nogo==0 [aw=Whh], area(MUN) ///
zvar($alfavars)
qui: test `x'
if (r(p)>`z'){
local hhvars1
foreach yy of local hhvars{
if ("`yy'"=="`x'") dis ""
else local hhvars1 `hhvars1' `yy'
}
}
else local hhvars1 `hhvars'
local hhvars `hhvars1'
}
}
global postalfa `hhvars'
*===============================================================================
// SAVE the data with the pertinent covariates and other info
*===============================================================================
sae model h3 bcy $postalfa if nogo==0 [aw=Whh], area(MUN) zvar($alfavars)
char _dta[rhs] $postalfa
char _dta[alpha] $alfavars
char _dta[sel] nogo
save "$mdata\mysvy.dta", replace
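The VIF screening performed above by _f_stepvif can be sketched as follows (illustrative Python; the internals of the sae package's _f_stepvif may differ, and this sketch ignores sampling weights):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R^2_j), where R^2_j comes
    from regressing column j on the remaining columns plus an intercept."""
    out = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        tss = ((y - y.mean()) ** 2).sum()
        out[j] = tss / (resid @ resid)   # equals 1 / (1 - R^2)
    return out

def drop_high_vif(X, names, threshold=3.0):
    """Repeatedly drop the column with the largest VIF above the threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        v = vif(X[:, keep])
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]
```

The do-file's cutoffs (VIF > 3 for the household model, VIF > 5 for the alpha model) map onto the threshold argument.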
4.5.2 SAE Simulation Stage
This do-ﬁle provides an example of the steps commonly followed to produce the ﬁnal CensusEB small
area estimates. It follows the discussion from section 4.2.4.
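The do-file requests the FGT indicators of Foster, Greer, and Thorbecke (1984) at a poverty line of 715. For reference, a generic FGT(α) computation looks like this (Python sketch; the welfare values in the usage note are invented):

```python
import numpy as np

def fgt(welfare, z, alpha):
    """Foster-Greer-Thorbecke poverty measure: the mean over all units of
    ((z - y)/z)^alpha for the poor (y < z) and 0 otherwise.
    alpha = 0 gives the headcount, 1 the poverty gap, 2 the severity."""
    y = np.asarray(welfare, dtype=float)
    poor = y < z
    if alpha == 0:
        return float(poor.mean())
    gap = np.where(poor, (z - y) / z, 0.0)
    return float(np.mean(gap ** alpha))
```

For example, with welfare (5, 10, 20, 40) and line z = 20, the headcount is 0.5 and the poverty gap is 0.3125.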
clear all
set more off
/*===============================================================================
Do-file prepared for SAE Guidelines
- Real world data application
- authors Paul Corral & Minh Nguyen
*==============================================================================*/
global main "C:\Users\\`c(username)'\OneDrive\SAE Guidelines 2021\"
global section "$main\3_Unit_level\"
global mdata "$section\1_data\"
global survey "$mdata\survey_public.dta"
global census "$mdata\census_public.dta"
version 15
local seed 648743
*===============================================================================
// End of preamble
*===============================================================================
use "$mdata\mysvy.dta", clear
char list
global hhmodel : char _dta[rhs]
global alpha : char _dta[alpha]
global sel : char _dta[sel]
//Add lnhhsize
use "$census"
gen lnhhsize = ln(hhsize)
tempfile census1
save `census1'
// Create data ready for SAE - optimized dataset
sae data import, datain(`census1') varlist($hhmodel $alpha hhsize) ///
area(HID_mun) uniqid(hhid) dataout("$mdata\census_mata")
*===============================================================================
// Simulation -> Obtain point estimates
*===============================================================================
use "$mdata\mysvy.dta", clear
drop if e_y<1
drop if $sel==1
rename MUN HID_mun
sae sim h3 e_y $hhmodel, area(HID_mun) zvar($alpha) mcrep(100) bsrep(0) ///
lnskew matin("$mdata\census_mata") seed(`seed') pwcensus(hhsize) ///
indicators(fgt0 fgt1 fgt2) aggids(0 4) uniqid(hhid) plines(715)
*===============================================================================
// Simulation -> Obtain MSE estimates
*===============================================================================
use "$mdata\mysvy.dta", clear
drop if e_y<1
drop if $sel==1
rename MUN HID_mun
sae sim h3 e_y $hhmodel, area(HID_mun) zvar($alpha) mcrep(100) bsrep(200) ///
lnskew matin("$mdata\census_mata") seed(`seed') pwcensus(hhsize) ///
indicators(fgt0 fgt1 fgt2) aggids(0 4) uniqid(hhid) plines(715)
save "$mdata\mySAE.dta", replace
References
Banerjee, Abhijit V, Angus Deaton, Nora Lustig, Kenneth Rogoﬀ, and Edward Hsu (2006). “An Evalu-
ation of World Bank Research, 1998-2005”. In: Available at SSRN 2950327.
Battese, George E., Rachel M. Harter, and Wayne A. Fuller (1988). “An Error-Components Model for
Prediction of County Crop Areas Using Survey and Satellite Data”. In: Journal of the American
Statistical Association 83.401, pp. 28–36. issn: 01621459. url: http://www.jstor.org/stable/2288915.
Bikauskaite, A, I Molina, and D Morales (2020). “Multivariate Mixture Model for Small Area Estimation
of Poverty Indicators”.
Boonstra, Harm Jan (2015). “Package ’hbsae’”. In: R Package Version 1.
Boonstra, Harm Jan and Grzegorz Baltissen (2021). “mcmcsae: Markov Chain Monte Carlo Small Area
Estimation”. In: R package version 0.7.0. url: https://CRAN.R-project.org/package=mcmcsae.
Box, GEP and DR Cox (1964). “An Analysis of Transformations”. In: Journal of the Royal Statistical
Society, Series B 26, pp. 211–252.
Corral, Paul, Kristen Himelein, Kevin McGee, and Isabel Molina (2021). “A Map of the Poor or a Poor
Map?” In: Mathematics 9.21. issn: 2227-7390. doi: 10.3390/math9212780. url: https://www.mdpi.com/2227-7390/9/21/2780.
Corral, Paul, Isabel Molina, and Minh Cong Nguyen (2021). “Pull Your Small Area Estimates up by
the Bootstraps”. In: Journal of Statistical Computation and Simulation 91.16, pp. 3304–3357. doi:
10.1080/00949655.2021.1926460. url: https://www.tandfonline.com/doi/abs/10.1080/00949655.2021.1926460.
Elbers, C and R van der Weide (2014). “Estimation of Normal Mixtures in a Nested Error Model with
an Application to Small Area Estimation of Poverty and Inequality”. In: World Bank Policy Research
Working Papers 6962.
Elbers, Chris, Tomoki Fujii, Peter Lanjouw, Berk Ozler, Wesley Yin, et al. (2007). “Poverty Alleviation
through Geographic Targeting: How Much Does Disaggregation Help?” In: Journal of Development
Economics 83.1, pp. 198–213.
Elbers, Chris, Jean O Lanjouw, and Peter Lanjouw (2003). “Micro–level Estimation of Poverty and
Inequality”. In: Econometrica 71.1, pp. 355–364.
Elbers, Chris, Jean Olson Lanjouw, and Peter Lanjouw (2002). “Micro-level Estimation of Welfare”. In:
World Bank Policy Research Working Paper 2911.
Foster, James, Joel Greer, and Erik Thorbecke (1984). “A Class of Decomposable Poverty Measures”. In:
Econometrica: Journal of the Econometric Society 52, pp. 761–766.
González-Manteiga, Wenceslao, Maria J Lombardía, Isabel Molina, Domingo Morales, and Laureano
Santamaría (2008). “Bootstrap Mean Squared Error of a Small-area EBLUP”. In: Journal of Statistical
Computation and Simulation 78.5, pp. 443–462.
Guadarrama, María, Isabel Molina, and JNK Rao (2018). “Small Area Estimation of General Parameters
under Complex Sampling Designs”. In: Computational Statistics & Data Analysis 121, pp. 20–40.
Harvey, Andrew C (1976). “Estimating Regression Models with Multiplicative Heteroscedasticity”. In:
Econometrica: Journal of the Econometric Society 44, pp. 461–465. url: https://www.jstor.org/stable/1913974.
Haslett, Stephen J. (2016). “Small Area Estimation Using Both Survey and Census Unit Record Data”.
In: Analysis of Poverty Data by Small Area Estimation. John Wiley & Sons, Ltd. Chap. 18, pp. 325–
348. isbn: 9781118814963. doi: https://doi.org/10.1002/9781118814963.ch18. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118814963.ch18. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118814963.ch18.
Henderson, Charles R (1953). “Estimation of Variance and Covariance Components”. In: Biometrics 9.2,
pp. 226–252.
Hobza, Tomáš and Domingo Morales (2013). “Small Area Estimation under Random Regression Coeﬃ-
cient Models”. In: Journal of Statistical Computation and Simulation 83.11, pp. 2160–2177.
Kreutzmann, Ann-Kristin, Sören Pannier, Natalia Rojas-Perilla, Timo Schmid, Matthias Templ, and
Nikos Tzavidis (2019). “The R Package emdi for Estimating and Mapping Regionally Disaggregated
Indicators”. In: Journal of Statistical Software 91.7, pp. 1–33. doi: 10.18637/jss.v091.i07. url:
https://www.jstatsoft.org/index.php/jss/article/view/v091i07.
Marhuenda, Yolanda, Isabel Molina, Domingo Morales, and JNK Rao (2017). “Poverty Mapping in Small
Areas under a Twofold Nested Error Regression Model”. In: Journal of the Royal Statistical Society:
Series A (Statistics in Society) 180.4, pp. 1111–1136. doi: 10.1111/rssa.12306.
Masaki, Takaaki, David Newhouse, Ani Rudra Silwal, Adane Bedada, and Ryan Engstrom (2020). “Small
Area Estimation of Non-monetary Poverty with Geospatial Data”. In: World Bank Policy Research
Working Paper 9383.
Molina, Isabel (2019). Desagregación De Datos En Encuestas De Hogares: Metodologías De Estimación
En áreas Pequeñas. CEPAL. url: https://repositorio.cepal.org/handle/11362/44214.
Molina, Isabel and Yolanda Marhuenda (2015). “Sae: An R Package for Small Area Estimation”. In: The
R Journal 7.1, pp. 81–98.
Molina, Isabel and JNK Rao (2010). “Small Area Estimation of Poverty Indicators”. In: Canadian Journal
of Statistics 38.3, pp. 369–385.
Nguyen, Minh Cong, Paul Corral, João Pedro Azevedo, and Qinghua Zhao (2018). “sae: A Stata Package
for Unit Level Small Area Estimation”. In: World Bank Policy Research Working Paper 8630.
Peterson, Ryan A and Joseph E Cavanaugh (2019). “Ordered Quantile Normalization: A Semiparametric
Transformation Built for the Cross-validation Era”. In: Journal of Applied Statistics 47.13-15, pp. 2312–2327.
doi: 10.1080/02664763.2019.1630372. url: https://www.tandfonline.com/action/showCitFormats?doi=10.1080/02664763.2019.1630372.
Rao, JNK and Isabel Molina (2015). Small Area Estimation. 2nd. John Wiley & Sons.
Rojas-Perilla, Natalia, Sören Pannier, Timo Schmid, and Nikos Tzavidis (2020). “Data-driven Transfor-
mations in Small Area Estimation”. In: Journal of the Royal Statistical Society: Series A (Statistics in
Society) 183.1, pp. 121–148.
Tzavidis, Nikos, Li-Chun Zhang, Angela Luna, Timo Schmid, and Natalia Rojas-Perilla (2018). “From
Start to Finish: A Framework for the Production of Small Area Oﬃcial Statistics”. In: Journal of the
Royal Statistical Society: Series A (Statistics in Society) 181.4, pp. 927–979.
UCLA: Statistical Consulting Group (2022). Robust Regression: Stata Data Analysis Examples. url:
https://stats.oarc.ucla.edu/stata/webbooks/reg/chapter2/stata-webbooksregressionwith-statachapter-2-regression-diagnostics/.
Van der Weide, Roy (2014). “GLS Estimation and Empirical Bayes Prediction for Linear Mixed Models
with Heteroskedasticity and Sampling Weights: A Background Study for the POVMAP Project”. In:
World Bank Policy Research Working Paper 7028. url: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2495175.
Zhao, Qinghua (2006). User Manual for PovMap. url: http://siteresources.worldbank.org/INTPGI/Resources/342674-1092157888460/Zhao_ManualPovMap.pdf.
Chapter 5
Poverty Mapping in Oﬀ-Census Years
Considerable attention has been given to producing reliable poverty maps in off-census years. An updated
poverty map is increasingly becoming an essential resource for improving the targeting of many social
protection programs around the world, a need underscored by the rapid onset and widespread impact
of the COVID-19 crisis. Even in the best-case scenarios, poverty maps that rely on unit-level small area
estimation techniques combining a census and survey can only be obtained once a decade. In oﬀ-census
years, the typical small area approach applied has been an area-level model, such as a Fay-Herriot or
a sub-area-level model, such as the one proposed by Torabi and Rao (2014). Nevertheless, since the
perceived gains in precision from area-level models are less than stellar, methods that combine unit- and
area-level models have been proposed (see Nguyen 2012; Lange, Pape, and Pütz 2018; Masaki et al. 2020).
These models are here called unit-context models and, although appealing, seem to yield considerably
biased estimates.
Beyond the small area estimation literature, the machine learning literature has made several contributions
to poverty mapping. Recent research in this area includes Chi et al. (2021) and Jean et al. (2016).
The authors of these two papers created highly disaggregated poverty maps by modeling the direct es-
timates of an asset index at a very low geographical level (e.g. villages or enumeration areas) using
satellite-derived covariates. The authors of those papers rely on machine learning approaches, such as
gradient boosting and ridge regression, to obtain estimates for small areas. These models provide point
estimates of poverty at a low geographical level, although they do not necessarily provide an adequate es-
timate of the method’s noise. The methods are attractive since they present the possibility of producing
a poverty map even when a contemporaneous or reliable census does not exist.
5.1 Unit-Context Models
Unit-context models attempt to model the population’s welfare distribution using only area-level co-
variates. More speciﬁcally, unit-context models combine unit and area-level information to model the
transformed household-level welfare (unit) using only area-level covariates (context). Since unit-context
models do not require census microdata, they have been proposed as an alternative approach for the
case when the available census microdata is too outdated to be considered for use under the conventional
model-based methods that include unit-level covariates.1
1 Another approach for cases where the census is outdated is to ﬁt a unit-level model considering only the covariates
with low (or even null) variability along time. This approach reduces (or may even solve) the problem of using an outdated
census.
Previous applications of unit-context models for small area estimation were proposed by Arora, Lahiri,
and Mukherjee (1997), who studied the number of trips home students have taken, and by Efron and
Morris (1975), who looked at batting averages and toxoplasmosis cases. In these applications, the method
appears to work well, although in both studies, the model with aggregate covariates is used to produce
estimates of the area means of the dependent variable in the model (no transformation is considered).
In the context of poverty, the target poverty indicators are typically complex nonlinear functions of the
dependent variables in the model. Hence, properly replicating the full welfare distribution is essential as
noted in Figure 4.1. At the area level, this is complicated since household characteristics are not used in
the model. Thus very little, if any, of the variation in welfare across households in the area is explained.
Even if simple area means of the welfare variable of interest are the target, the assumptions
embedded into the nested error models used in Chapter 4 imply that a transformation (such as log or log-shift) of
the welfare variable is used as the dependent variable in the model. The desired area means of
the untransformed welfare variable are then means of exponentials of the dependent
variable, which are again nonlinear functions. As is illustrated in the next section, when estimating indicators that are nonlinear functions of
the dependent variables in the model, unit-context models will likely produce small area estimators of
poverty with substantial bias.
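This nonlinearity is easy to verify numerically. In the following self-contained Python illustration (all numbers invented), the headcount implied by the area mean of log welfare alone differs sharply from the headcount computed over the full within-area distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
z = 12.0                               # poverty line
lny = rng.normal(2.4, 0.6, 100_000)    # household log welfare in one area

# Headcount from the full distribution of welfare:
fgt0_full = np.mean(np.exp(lny) < z)

# "Headcount" implied by the area mean alone (what a model explaining no
# household-level variation effectively works with):
fgt0_mean_only = float(np.exp(lny.mean()) < z)

# exp(mean of lny) understates mean of exp(lny) (Jensen's inequality), and
# the degenerate distribution yields a 0/1 headcount instead of a share.
print(fgt0_full, fgt0_mean_only)
```

Here the full distribution gives a headcount of roughly 0.56, while the degenerate mean-only distribution puts the entire area below the line.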
Nguyen (2012) ﬁrst considered unit-context models for poverty estimation in an application for Vietnam.
In this application, the dependent variable was the household-level logarithm of per capita expenditure
from the Vietnam Household Living Standard Survey from 2006, whereas all covariates are commune-
level means obtained from a dated (1999) census. Nguyen (2012) obtains ELL estimates of poverty
for small areas under that model and compares the performance with typical ELL poverty estimates
obtained using unit-level covariates from the Vietnam Household Living Standard Survey from 2006 and
the 2006 Rural Agriculture and Fishery Census. The author ﬁnds that provinces and districts hovering
around the middle of the distribution suﬀered considerable re-rankings across methods. However, those
at the top and the bottom were relatively stable.
A similar approach to the one from Nguyen (2012) was presented by Lange, Pape, and Pütz (2018) as
an alternative in cases when census and survey data are not from similar periods. However, the same
ineﬃciency issues noted in Chapter 4 regarding ELL estimates would likely persist when considering a
model using only area-level covariates. Improvements to the approach were seemingly made by Masaki
et al. (2020) by taking measures to address some of the shortcomings of a standard ELL approach and to
obtain EB estimators from Molina and Rao (2010). The authors conduct a design-based validation study
using census data for Sri Lanka and Tanzania for a wealth index constructed by principal component
analysis and suggest that the use of EB improves precision over ELL when implementing unit-context
models.
Although the unit-context approach is attractive in that it does not require a contemporaneous census
and can readily accommodate variables extracted from the emerging ﬁelds related to geospatial analysis,
there are serious concerns about bias in unit-context estimators, as noted in Corral et al. (2021) as well
as the concerns raised in the following section. The MSE from unit-context models is also likely to be
incorrectly estimated since the required parametric bootstrap procedure assumes the underlying model
using only area-level characteristics to be correct. In other words, the unit-context model’s assumptions
require household-level welfare to not depend on household-speciﬁc characteristics, which is unlikely to be
the case. Incorrect MSE estimates risk presenting a map with considerable bias as being overly precise.
Therefore, based on the currently available evidence, area-level models, like Fay-Herriot (Ch. 3), are
generally preferred over unit-context models (see the following section for more details).
In cases where neither area- nor unit-level models are advised due to data limitations, no clear consensus
has emerged on the best path forward or whether one even exists. In evaluating alternatives, practitioners
should choose methods that rely on assumptions realistic to the circumstances in which the
model will be employed, that are approximately unbiased (or whose bias does not exceed a certain limit),
and for which an accurate method exists to measure the small area estimators’ MSE. In cases where the
MSE cannot be adequately estimated, then at least it should be known in which (realistic) scenarios the
approach has limited bias. If these conditions cannot be reasonably met, it is preferable to not produce
a map than to produce one with potentially biased estimates, or one in which precision is overestimated,
or most worrisome, both. In the next section, the limitations of unit-context models are discussed.
5.1.1 Limitations of Unit-Context Models
Based on results from a validation study using model- and design-based simulations, Corral et al. (2021)
conclude that unit-context models, like those presented in Masaki et al. (2020), Lange, Pape, and Pütz (2018),
and Nguyen (2012), are not recommended except in exceptional cases due to the high likelihood of
bias in estimates.2
Nguyen’s (2012) application of a unit-context model to estimate poverty in small areas in Vietnam already
hints toward potential problems of the unit-context approach. The author compares the results from
unit-context models to those obtained via a standard unit-level method, ELL, and ﬁnds considerable
changes in rankings. Nguyen (2012) ﬁnds that diﬀerences in rankings are largest for locations around
the middle of the distribution.
Despite the use of EB, the unit-context application by Masaki et al. (2020) also provides hints of
potential methodological problems with the approach. A ratio procedure was used to benchmark the
values to ensure alignment between direct estimates at the level of representativeness and the estimates
obtained by the approach. The need to benchmark indicates considerable discrepancies between the sum
of estimated totals at the lower level and the estimated total at the higher level. The need to benchmark
also suggests that the model’s assumptions are not satisﬁed.
Figure 5.1: Empirical Bias and MSE for CensusEB based on a unit-level model and CensusEB based on
a unit-context model (UC-CensusEB) and ELL FGT0 estimates from model based simulations
Source: Corral et al. (2021). The figure is obtained from simulations based on 10,000 populations and samples as specified in Corral et al. (2021). The simulations illustrate that the unit-context (UC) model may yield FGT0 estimates that are upward biased, with MSEs that could be several orders of magnitude above those of CensusEB estimates based on the analogous unit-level model, and that for some areas may be almost as inefficient as ELL.
2 The method presents advantages over the traditional Fay-Herriot (Fay and Herriot 1979) models: 1) it may be an
alternative when there are multiple locations with very small samples, for which the sampling variance of the direct estimator
(used on the left-hand side of the Fay-Herriot model) becomes 0, and 2) it may be used to obtain multiple indicators from
a single model under reversible transformations.
Unit-context models appear to yield upward biased FGT0 estimates in model-based simulations, as
presented in Figure 5.1 (most areas show bias above 0). Since unit-context models are special cases of the
models used in ELL and EB procedures (see section 4.4.1), but without household-level characteristics,
the between-household variation of welfare is not adequately explained by the model. Corral et al. (2021)
suggest that part of the observed bias comes from this misspeciﬁcation, with eﬀects similar to omitted
variable bias (OVB) (see the appendix in Corral et al. 2021). Despite the bias, the empirical MSE of
unit-context models seems to outperform that of ELL estimates (Figure 5.1).
Like traditional unit-level models, unit-context models also assume normally distributed errors, and departures from normality may also produce bias (which might offset or compound the previous bias). This is why, in section 4.2.2, considerable emphasis is placed on data transformation, with the aim of approximating the normality assumption. Because of the poor model fit and potential deviations from normality, unit-context models also display considerable bias when estimating mean welfare (see Figure 5.2).
Figure 5.2: Empirical Bias and MSE for CensusEB based on a unit-level model and CensusEB based on
a unit-context model (UC-CensusEB) and ELL mean welfare estimates from model based simulations
Source: Corral et al. (2021). The ﬁgure is obtained from simulations based on 10,000 populations and samples as speciﬁed
in Corral et al. (2021). The simulations illustrate that unit-context models may yield mean welfare estimates that are
considerably biased and with MSEs that could be several orders of magnitude above those of CensusEB estimates based
on the analogous unit-level model and for some areas may be as ineﬃcient as ELL.
In the simulation results displayed in Figure 5.2, the dependent variable is the natural log of welfare. Nevertheless, what practitioners are often interested in is welfare itself, and hence they need to reverse the log transformation; note that exp [E (ln y)] ≠ E (y). The log-linear nested error model ln y_ch = x_ch β + η_c + e_ch implies that y_ch = exp (x_ch β) exp (η_c) exp (e_ch), and then E (y_ch | x_ch) = exp (x_ch β) E [exp (η_c)] E [exp (e_ch)]. Since it is assumed that η_c ∼ N(0, σ_η²) and e_ch ∼ N(0, σ_e²), then E [exp (η_c)] = exp (0.5 σ_η²) and E [exp (e_ch)] = exp (0.5 σ_e²). Because unit-context models do not include household-level covariates, the resulting estimate of σ_e² is likely much greater than the true σ_e². This implies that unit-context models may yield an upward biased prediction of mean household welfare for the area despite the use of EB methods, since unbiasedness is assured for the area means of ln y but not for those of y following back-transformation.
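The retransformation factor above can be checked numerically. The sketch below is illustrative: the true variance matches the 0.5² household error used in the appendix simulation, while the inflated value is a hypothetical estimate standing in for what a unit-context model might produce.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_welfare_factor(sigma2):
    """E[exp(u)] for u ~ N(0, sigma2): the lognormal mean factor exp(0.5*sigma2)."""
    return np.exp(0.5 * sigma2)

true_sigma_e2 = 0.25      # true household-level variance (0.5^2)
inflated_sigma_e2 = 0.50  # hypothetical inflated estimate when unit covariates are omitted

# Monte Carlo check of E[exp(e)] = exp(0.5 * sigma_e^2)
e = rng.normal(0.0, np.sqrt(true_sigma_e2), size=1_000_000)
print(np.exp(e).mean(), mean_welfare_factor(true_sigma_e2))  # both near exp(0.125)

# An inflated variance estimate inflates predicted mean welfare multiplicatively:
print(mean_welfare_factor(inflated_sigma_e2) / mean_welfare_factor(true_sigma_e2))
```

The second printed ratio is the multiplicative upward bias in the predicted area mean of welfare induced purely by the overestimated σ_e².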
To illustrate this last point, a simulation where the modeling and estimates are obtained using the same
source is conducted (see Appendix 5.5.1.2). This is done to remove the potential source of bias noted in
the Appendix of Corral et al. (2021). The empirical biases of the estimators in each of the 40 areas in the census data set are represented in box-plots. First, note that, when estimating the area means of the
dependent variable (similar to when no transformation is taken), the biases of the CensusEB estimators
based on the unit-context model are not that large (Figure 5.3, left). However, suppose one estimates the
means of the untransformed welfare variable (exponential of the model’s dependent variable). In that
case, the incorrect estimate of σ_η² + σ_e² plays a considerable role, and thus the estimates of mean welfare based on unit-context models are upward biased despite the use of EB methods (Figure 5.3, right). As
shown in Figure 5.3 (right), the average bias of CensusEB estimators based on unit-context models is
considerable. For some areas, the bias may be over 800 times the bias of the CensusEB estimators based
on the analogous unit-level models.
Figure 5.3: Empirical Bias of CensusEB prediction of the area means of ln(yais ), left, and yais , right,
obtained from a unit-level model and from a unit-context model
Note: Simulation based on 5,000 populations generated as described in section 5.5.1. The model is ﬁt to the whole
population and then welfare is simulated for the same population to yield 5,000 EB predictors for each area. Box-plots
represent the empirical bias for the 40 areas in the census data. Do-ﬁle to replicate simulation can be seen in section 5.5.1.1.
The source of bias discussed, i.e. due to biased estimators of the variance components under unit-
context models, is in addition to the bias due to the diﬀerence between the sample and census means
of the covariates noted in Corral et al. (2021). The latter source of bias is not relevant here as it is
related to sampling, and in this case, the entire census is taken as the sample. The bias due to variance
components will affect the estimation of mean welfare and headcount poverty. Consequently, estimates of headcount poverty under unit-context models will also be biased because of poor model fit. As shown
in Figure 5.4, the estimates from the unit-context models are upward biased for all areas, which occurs
because the full distribution of the population is not accurately replicated under unit-context models.
Figure 5.4: Empirical Bias of CensusEB predictors of FGT0 under unit-level model and under a unit-
context model
Note: Simulation based on 5,000 populations generated as shown in section . The model is ﬁt on the entire population
and then welfare is simulated on the same population to yield 5,000 EB predictors for each area. Box-plot represents the
empirical bias for the 40 areas in the census data. Do-ﬁle to replicate simulation can be seen in section 5.5.1.1.
A potentially significant difference between the results presented in Figures 5.3 and 5.4 and the applications of Masaki et al. (2020) and Lange, Pape, and Pütz (2018) is the considerably lower R² of the unit-context models presented here. To address the low R², the simulation experiment is repeated using a slightly modified data generating process leading to an increase in the model’s R² (Technical Annex 5.5.1). This modification leads to an R² of the unit-context model between 0.15 and 0.20, while for the unit-level model the R² exceeds 0.60. This adjustment in the data generating process reduces the bias
in the estimators of mean welfare based on the unit-context model (Figure 5.5), but the direction of
the bias is diﬀerent from that of the original simulation (Figure 5.2). This change of direction of the
bias also occurs when estimating poverty rates, and the direction of the bias seems to change with the
poverty threshold. Estimators based on unit-context models are upward biased for poverty rates under
a threshold of 13. If this threshold is increased to 28, the unit-context estimators of the poverty rate
become downward biased for every area (Figure 5.6). These results illustrate a crucial shortcoming of unit-context models: in a given application, a practitioner cannot know a priori the direction or the magnitude of the bias. Unit-context models seldom reproduce the true distribution accurately, and it is difficult to know in which cases they work properly.
Figure 5.5: Empirical Bias of CensusEB predictors under unit-level model and under a unit-context
model
Note: Simulation based on 5,000 populations generated as shown in section 5.5.1. The model is ﬁt on the entire population
and then welfare is simulated on the same population to yield 5,000 EB predictors for each area. Box-plot represents the
empirical bias for the 40 areas in the census data. Do-ﬁle to replicate simulation can be seen in section 5.5.1.2.
Building upon the previous simulation, a ﬁnal simulation is conducted to observe how unit-context models
perform under 99 diﬀerent poverty thresholds. The thresholds correspond to the 99 percentiles of the
ﬁrst population generated in the simulation (detailed in Appendix 5.5.2). In contrast to the previous
simulations, a sample from the population is taken here. This is done to compare unit-context estimators
to estimators based on an FH model (Ch. 3).
As expected, the unit-level CensusEB estimator yields the smallest bias across all areas and poverty lines
(Figure 5.7). The shape of the bias for unit-context models suggests that the method may perform well for some poverty lines while performing poorly for others. This agrees with what is noted in Figure 5.6, where the method could work for a given poverty line but show substantial bias at a different
poverty threshold. Moreover, for many areas and poverty lines, unit-context models seem to perform
worse than FH models, not just with respect to bias but also in terms of empirical MSE (Figure 5.8).
Figure 5.6: Empirical Bias of CensusEB predictors of FGT0 under unit-level model and under a unit-
context model
Note: Simulation based on 5,000 populations generated as shown in section 5.5.1. The model is ﬁt on the entire population
and then welfare is simulated on the same population to yield 5,000 EB predictors for each area. Box-plot represents the
empirical bias for the 40 areas in the census data. Do-ﬁle to replicate simulation can be seen in section 5.5.1.2.
The average absolute empirical bias can be used to rank the diﬀerent methods. Across all 99 poverty
lines, unit-context models have the highest average absolute empirical bias (Figure 5.9).3 Additionally,
the average empirical MSE (across the 100 areas) for each poverty line is surprisingly close between Fay-
Herriot and unit-context models. However, neither dominates the other (Figure 5.10), supporting the
conclusion that unit-context models should be used with caution and only when unit-level and area-level
options are not feasible. If unit-context models are used, all caveats should be made clear to the users
of the resulting estimates.
Figure 5.7: Empirical bias of diﬀerent methods for each area and poverty line
Note: Simulation based on 1,000 populations generated as described in this section. Each line corresponds to one of the
100 areas. The x-axis represents the percentile on which the poverty line falls, and the y-axis is the empirical bias. In
instances where direct estimates have a variance of 0 and hence FH estimates cannot be obtained, the direct estimate is
used as the FH estimate for that area. The do-ﬁle to replicate these results can be found in 5.5.2.1.
3 The average absolute empirical bias is the average across areas of the area-speciﬁc absolute biases.
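The metric defined in footnote 3 is straightforward to compute from simulation output. The sketch below uses hypothetical toy numbers for the estimates and true values.

```python
import numpy as np

def average_absolute_bias(estimates, truth):
    """Average across areas of the area-specific absolute empirical biases.

    estimates: (n_sims, n_areas) array of simulated small area estimates
    truth:     (n_areas,) array of true area values
    """
    empirical_bias = estimates.mean(axis=0) - truth   # empirical bias per area
    return np.abs(empirical_bias).mean()              # average over areas

# Toy example: 3 areas, 4 simulation replicates (hypothetical numbers)
est = np.array([[0.10, 0.20, 0.30],
                [0.12, 0.22, 0.28],
                [0.08, 0.18, 0.32],
                [0.10, 0.20, 0.30]])
true_vals = np.array([0.10, 0.25, 0.30])
print(average_absolute_bias(est, true_vals))  # mean of |0|, |-0.05|, |0|
```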
Figure 5.8: Empirical MSE of diﬀerent methods for each area and poverty line
Note: Simulation based on 1,000 populations generated as described in this section. Each line corresponds to one of the
100 areas. The x-axis represents the percentile on which the poverty line falls, and the y-axis is the empirical MSE. In
instances where direct estimates have a variance of 0 and hence FH estimates cannot be obtained, the direct estimate is
used as the FH estimate for that area. The do-ﬁle to replicate these results can be found in 5.5.2.1.
Figure 5.9: Average absolute empirical bias of diﬀerent methods for each poverty line
Note: Simulation based on 1,000 populations generated as described in this section. Each line corresponds to the average
across 100 areas of the absolute empirical bias for each method. The x-axis represents the percentile on which the poverty
line falls, and the y-axis is the average absolute empirical bias. In instances where direct estimates have a variance of
0 and hence FH estimates cannot be obtained, the direct estimate is used as the FH estimate for that area. The do-ﬁle to
replicate these results can be found in 5.5.2.1.
Figure 5.10: Average empirical MSE of diﬀerent methods for each poverty line
Note: Simulation based on 1,000 populations generated as described in this section. Each line corresponds to the average
across 100 areas of the mean squared error for each method. The x-axis represents the percentile on which the poverty
line falls, and the y-axis is the average MSE. In instances where direct estimates have a variance of 0 and hence FH
estimates cannot be obtained, the direct estimate is used as the FH estimate for that area. The do-ﬁle to replicate these
results can be found in 5.5.2.1.
In addition to bias, there are concerns with an overstatement of precision from a unit-context model. If welfare depends on household-level characteristics (which is expected), the MSE estimates of CensusEB estimators based on unit-context models, obtained using the parametric bootstrap procedure of González-Manteiga et al. (2008), would not be appropriate, as they may underestimate the method’s true MSE by a considerable amount. The parametric bootstrap procedure under the unit-context model proceeds as if the unit-context model were the true data generating process, and thus generates bootstrap populations from that model. However, this assumption is shown not to hold in the simulation experiment, leading to inaccurate MSE estimates (Figure 5.11). Consequently, under unit-context models, one can produce biased poverty estimates that may nevertheless be presented as very precise.
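For intuition, the logic of the parametric bootstrap can be sketched in highly simplified form: the fitted model is treated as the true data generating process, bootstrap populations are drawn from it, and the estimator's squared error is averaged across replicates. The predictor below is a bare stand-in for the full EB predictor, and all parameter values are illustrative, not taken from the González-Manteiga et al. (2008) paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def parametric_bootstrap_mse(xb, sigma_eta2, sigma_e2, n_areas, n_units, B=200):
    """Simplified parametric bootstrap MSE for area means of y = xb + eta_a + e_ah.

    The fitted model (xb, sigma_eta2, sigma_e2) is treated as the truth:
    bootstrap populations are generated from it, the bootstrap 'true' area
    means are recomputed, and the estimator is re-applied each time.
    """
    mse = np.zeros(n_areas)
    for _ in range(B):
        eta = rng.normal(0.0, np.sqrt(sigma_eta2), size=n_areas)
        e = rng.normal(0.0, np.sqrt(sigma_e2), size=(n_areas, n_units))
        y = xb[:, None] + eta[:, None] + e        # bootstrap population
        true_means = y.mean(axis=1)               # bootstrap 'truth' per area
        estimate = xb                             # simplified predictor
        mse += (estimate - true_means) ** 2
    return mse / B

mse = parametric_bootstrap_mse(np.full(40, 3.0), 0.1, 0.25,
                               n_areas=40, n_units=50, B=500)
# Here the bootstrap MSE should be near sigma_eta2 + sigma_e2 / n_units = 0.105
print(mse.mean())
```

The key point of the chapter is visible in this structure: if the generating model inside the loop is wrong (as under a unit-context model), the bootstrap MSE inherits that error and can badly understate the true MSE.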
Figure 5.11: Bias of parametric MSE estimate by method
Note: Simulation based on 1,000 populations generated as described in this section. Under each population, MSE estimates
are obtained using the parametric bootstrap procedure of González-Manteiga et al. (2008). The empirical MSE is compared
to the estimated MSE to obtain the bias of the MSE estimates of each method.
To summarize the concerns with unit-context models: given their limitations, which persist even under an ideal scenario where the model is fit to the whole set of population data and EB estimates are also obtained using the population data, unit-context models should be considered only if unit-level and area-level options have been exhausted. While it is possible that in real-world applications bias from multiple sources may offset for a given poverty threshold, there is no way for an analyst to determine whether biases are canceling or compounding each other, as the direction of the bias is not known a priori. A likely overstatement of precision further complicates this unknown bias. This situation recalls an original concern levied against ELL by Banerjee et al. (2006), where the authors state: “What we are most concerned about is the possibility that the Bank is making very attractive poverty maps, whose precision is not known, but which come with statements of precision that we suspect may be seriously misleading. In the worst case scenario, a map that may be worthless is presented as one that is extremely precise” (Banerjee et al. 2006, pg 61).
5.2 Gradient Boosting for Poverty Mapping
With the advent of increased computing power, machine learning approaches have gained popularity in
the literature as well as in policy circles. For example, poverty estimates for small areas derived from a
gradient boosting application by Chi et al. (2021) guided the expansion of social protection in Nigeria;
speciﬁcally, the estimates were used as an input to the Rapid Response Register for the COVID-19 Cash
Transfer Project.4
Gradient boosting methods rely on first creating a linear fit (usually a constant term) to the data at hand and then fitting a new model to the residuals. In contrast to averaging ensemble techniques, the multiple fits are not averaged. Instead, the predicted residuals (scaled by a learning rate) are added to the previous step’s prediction, successively taking small steps, using the same or different covariates at each step, toward the final prediction. This process repeats until the requested number of iterations is completed or there is no further gain in prediction.5 A complete exposition of the approach is beyond the scope of the Guidelines; interested readers can refer to Natekin and Knoll (2013) or Chen and Guestrin (2016) for a more nuanced description of the approach.
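The fit-residuals-update loop just described can be sketched with a toy base learner: a single median split standing in for a regression tree. Everything here, including the data, is illustrative.

```python
import numpy as np

def boost(x, y, n_steps=200, learning_rate=0.1):
    """Minimal gradient boosting with squared-error loss.

    Start from a constant fit, then repeatedly fit the residuals with a
    depth-1 'stump' (one threshold split) and add the scaled fit to the
    running prediction.
    """
    pred = np.full_like(y, y.mean())          # step 0: constant fit
    for _ in range(n_steps):
        residuals = y - pred                  # pseudo-residuals under squared error
        split = np.median(x)                  # base learner: one split on x
        left = x <= split
        fit = np.where(left, residuals[left].mean(), residuals[~left].mean())
        pred = pred + learning_rate * fit     # small step toward the target
    return pred

x = np.linspace(0, 1, 100)
y = (x > 0.5).astype(float)                   # toy step-function target
pred = boost(x, y)
print(np.abs(pred - y).max())                 # error shrinks geometrically in n_steps
```

With a squared-error loss the residuals are exactly the pseudo-gradients, which is why each small step reduces the loss; libraries such as XGBoost replace the stump with full trees and add regularization.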
A validation approach similar to the one done by Corral et al. (2021) is implemented to test how well
gradient boosting works in a poverty mapping context. Speciﬁcally, 500 samples taken from the census
created from the Mexican Intracensal Survey are used to conduct a design-based simulation experiment
to validate the method.6 The Mexican Intracensal Survey is uniquely suited for this validation since
it includes a measure of income at the household level and is representative at the municipal level and for localities with 50,000 or more inhabitants. The survey was modified to obtain a census dataset of 3.9
million households and 1,865 municipalities (Corral et al. 2021). Using a sampling approach similar to
the one used for Living Standards Measurement Study (LSMS) surveys, 500 samples are drawn from the
resulting census.
In each of the 500 samples, a model is fit. The model’s dependent variable is the direct estimator of the headcount poverty rate at the PSU (or municipality) level, and PSU (or municipality) aggregates from the census data are used as covariates. The models are fit using the XGBoost algorithm available
in Python and R.7 Introduced by Chen and Guestrin (2016), XGBoost is a scalable machine learning
4 Blumenstock et al. (2021), https://blogs.worldbank.org/opendata/using-big-data-and-machine-learning-locate-poor-nigeria
5 The method relies on a squared-error loss function where the sequential ﬁts are added until there is no improvement
in the loss function. For a detailed description of gradient boosting, refer to Natekin and Knoll (2013).
6 Corral et al. (2021) provides a detailed explanation of how this dataset was created.
7 The results shown here were obtained from Python.
system for tree boosting and is available as an open-source software library. To make the comparison
fair, for each of the 500 samples, a model is selected using lasso regression as detailed in section 6.2
and ﬁt for the considered small area estimation methods. This is diﬀerent from the approach usually
taken under design-based validation, where one model is considered the true model and is used across all
samples. The results for the XGBoost and the SAE methods are illustrated in Figure 5.12. The results illustrate that, in the case of the Mexican data, XGBoost yields unbiased estimates of poverty. Moreover, the empirical MSE compares favorably with that of CensusEB methods, which are more computationally and data intensive. Additionally, in this Mexican scenario, the gradient boosted estimates are superior to those from the unit-context method discussed in the previous section (labeled U-C CensusEB).
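The sample-level workflow can be sketched as follows. This sketch uses scikit-learn's GradientBoostingRegressor in place of XGBoost to keep the example self-contained (the xgboost.XGBRegressor interface is analogous), and the data are synthetic stand-ins: rows represent PSUs, columns represent census aggregates, and the target is a noisy direct estimate of the FGT0 rate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the application described in the text
n_psu, n_covariates = 300, 5
X = rng.normal(size=(n_psu, n_covariates))              # PSU-level census aggregates
true_rate = 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1]))  # underlying FGT0 per PSU
y = np.clip(true_rate + rng.normal(0, 0.05, n_psu), 0, 1)  # noisy direct estimates

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
model.fit(X, y)               # dependent variable: direct FGT0 estimates
pred = model.predict(X)
print(np.mean(np.abs(pred - true_rate)))  # fit relative to the underlying rate
```

In practice predictions may fall outside [0, 1] and are typically clipped; note also that, as the text cautions, this workflow produces point estimates only, with no accompanying MSE estimate.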
Figure 5.12: Empirical bias and MSE of FGT0 for diﬀerent methods
Source: Data from Corral et al. (2021). The simulations are based on 500 samples taken from the Mexican Intracensal survey, which is treated as a census of 3.9 million households. Under each sample, predictors for each of the methods are obtained and then compared to the true values obtained from the Mexican Intracensal survey. The figure illustrates that in the Mexican Intracensal case, XGBoost yields FGT0 estimates that are very close in performance, in terms of bias and MSE, to those from CensusEB estimators based on a unit-level model. H3-CBEB is the clustered bootstrap EB discussed in detail in Corral et al. (2021); the method was the EB approach implemented in PovMap and the original sae Stata package.
Despite the gradient boosting method’s performance, it also carries some caveats. First, though the
method performs well with the Mexican data, there is uncertainty as to the degree to which these results
can be extrapolated to other contexts.8 Second, there are currently no implemented software options for
estimating the method’s MSE.9 Finally, despite XGBoost being based on open-source software, it is still
somewhat of a black box and not easily interpretable. Therefore, further research is required to properly
assess how well the method works, though early results are encouraging.
5.3 Pros and Cons of Methods for Poverty Mapping in Oﬀ-Census
Years
This section presents a convenient list of pros and cons for each method discussed in this chapter. It also notes the data requirements of each method. The section borrows from Molina (2019).
8 The quality of the covariates and how well these predict poverty at the modeling level determine the overall quality
of the estimates obtained.
9 What is shown in Figure 5.12 is the empirical MSE, not an estimate of the MSE.
5.3.1 Unit-Context Models
Model requirements:
1. Microdata from a household survey (only the model’s dependent variable) and administrative, satellite-derived, or any other area- or subarea-level data.
(a) Masaki et al. (2020) recommend using data that is at least one level below the level at which
results are to be presented, although the more disaggregated, the better.
2. The population size/count at the area level is needed, at least the number of dwellings in the area.
3. Areas and sub-areas in the survey and the census should have identiﬁers that can be linked.
Pros:
1. Based on household-level welfare from a survey and area-level data from any point in time. It may
be used in off-census years, avoiding the use of outdated census data.
2. Unlike Fay-Herriot area-level models (Fay and Herriot 1979), it can be applied for areas with an
estimated sampling variance equal to 0.
3. May provide estimates for non-sampled areas.
Cons:
1. The welfare model is presumably incorrectly speciﬁed unless household-level welfare is dependent
only on area-level characteristics. Hence, estimates are expected to be biased, and the direction
and magnitude of the bias are unknown a priori.
2. The bootstrap method for MSE estimation is computationally intensive.
3. The parametric bootstrap approach from González-Manteiga et al. (2008) under the unit-context
model is likely to yield an inaccurate measure of MSE. In many instances, the MSE may be
considerably underestimated.
5.3.2 Gradient Boosting
Model requirements:
1. Direct estimates of indicators of interest for the areas considered, τ̂_d^DIR (from the survey).
2. Aggregate data at the area level of all necessary covariates for the model for every area considered,
xd , d = 1, . . . , D.
3. Areas in the survey and the census should have identifiers that can be linked.
Pros:
1. Based on direct estimates from a survey and area-level data from any point in time. It may be
used in off-census years, avoiding the use of outdated census data.
2. Unlike Fay-Herriot area-level models (Fay and Herriot 1979), it can be applied for areas with an
estimated sampling variance equal to 0.
3. The method’s dependent variable and the target indicators are the same. In a design-based simulation using the Mexican Intracensal Survey, the method yields estimates of comparable quality to CensusEB and with better performance than unit-context models.
4. May provide estimates for non-sampled areas.
Cons:
1. The method requires validation exercises in more scenarios beyond the one conducted in this
chapter. There is no guarantee that the method will work as well with covariates with considerably
lower predictive power than the ones from the example provided in section 5.2.
2. The method currently lacks an approach for obtaining noise estimates (MSE). Consequently, it is
diﬃcult to assess the precision of the ﬁnal estimates.
5.4 Unit-Context Models – Technical Annex
5.4.1 Producing Estimators Based on Unit-Context Models
The production of estimators based on unit-context models is similar to that based on regular unit-level models, except that unit-level covariates are not used. This implies that the share of welfare variation across households explained by the model’s covariates is expected to be lower and, within many areas, welfare may be poorly explained. Still, unit-context models may be regarded as an approximation to the true underlying data generating process. In fact, they are particular cases of unit-level models (Eq. 4.2); consequently, normality and linearity assumptions need to be checked similarly with the corresponding covariates. The focus of this section is on the unit-context models as presented in Masaki et al. (2020) and not those from Lange, Pape, and Pütz (2018) and Nguyen (2012). The reason for this choice is that the approaches of Lange, Pape, and Pütz (2018) and Nguyen (2012) rely on ELL’s method, which suffers from the issues noted in previous work (see Molina and Rao 2010; Corral, Molina, and Nguyen 2021; Corral et al. 2021, among others). In addition, Masaki et al. (2020) tested different methods, including EB, and concluded that EB provides a considerable gain in accuracy and efficiency over the other methods.
Unit-context versions (i.e., those with aggregated covariates only) may be specified for either a one-fold nested error model or a two-fold nested error model. A possible unit-context model follows:
y_sach = z_sac α + t_sa ω + g_s λ + η_sa + ε_sach
where s indexes an aggregation level above the target areas (a super-area), a indexes areas, c indexes subareas, and h indexes households. Hence, z_sac contains subarea-level characteristics, t_sa includes area-level characteristics, and g_s is composed of super-area-level characteristics (which may include super-area fixed effects). The regression coefficients across these levels are respectively denoted α, ω, and λ. The random effects, η_sa, are specified in this model at the area level. Note that, among the set of covariates in this model, none is at the unit level; covariates only vary at the subarea level.
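Why the missing unit-level covariates matter can be illustrated with a small synthetic population: regressing welfare on the subarea mean of a household covariate, rather than on the covariate itself, pushes the unexplained household variation into the residual variance. All numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population: 40 areas x 10 subareas x 50 households
A, C, H = 40, 10, 50
n = A * C * H
area = np.repeat(np.arange(A), C * H)
subarea = np.repeat(np.arange(A * C), H)

x_household = rng.normal(size=n)          # unit-level welfare driver
eta = rng.normal(0, 0.3, A)[area]         # area random effects
e = rng.normal(0, 0.5, n)                 # household-level errors
y = 1.0 + 0.8 * x_household + eta + e     # log-welfare depends on the unit covariate

# A unit-context model can only use aggregates, e.g. the subarea mean of x:
x_context = np.bincount(subarea, x_household)[subarea] / H

resid_var = {}
for label, x in [("unit-level", x_household), ("unit-context", x_context)]:
    slope, intercept = np.polyfit(x, y, 1)
    resid_var[label] = np.var(y - (slope * x + intercept))

print(resid_var)  # the unit-context residual variance absorbs the 0.8*x variation
```

The inflated residual variance is exactly the mechanism behind the biased back-transformed mean welfare discussed in section 5.1.1.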
Model selection may be implemented following the approach described in section 6.2, except that only contextual variables will be in the pool of eligible covariates. Data transformation is also important in unit-context models and is emphasized by Masaki et al. (2020). In contrast to the data transformation used in section 6.2, Masaki et al. (2020) recommend transforming the dependent variable with ordered quantile normalization. Nevertheless, this transformation cannot be used for the most common poverty and inequality indicators beyond headcount poverty because the transformation is not reversible. EB point and noise estimates are obtained via Monte Carlo simulation and parametric bootstrap procedures, respectively, similar to the conventional application of the EB method under a unit-level model, as detailed in the technical annex (sections 4.4.2.1 and 4.4.2.2). Finally, Masaki et al. (2020) also recommend adjusting the model-based estimators to match direct estimates, usually at the level where the survey is representative (benchmarking).
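The ordered quantile (rank-based inverse normal) transformation can be sketched as follows. This minimal version ignores tie handling and uses the rank offset r/(n+1), one common convention; the data are synthetic.

```python
import numpy as np
from statistics import NormalDist

def ordered_quantile_normalize(y):
    """Map each value to the standard normal quantile of its rank.

    Ranks are shifted to r/(n+1) so the arguments stay strictly inside (0, 1).
    The transform is monotone, so headcount poverty (FGT0) is preserved, but
    it is not reversible for other welfare-based indicators.
    """
    y = np.asarray(y)
    ranks = np.argsort(np.argsort(y)) + 1          # ranks 1..n (ignoring ties)
    inv = NormalDist().inv_cdf
    return np.array([inv(r / (len(y) + 1)) for r in ranks])

y = np.random.default_rng(3).lognormal(0, 1, 1000)  # skewed welfare-like variable
z = ordered_quantile_normalize(y)
print(round(z.mean(), 6), round(z.std(), 3))        # approximately standard normal
```

Because the transform discards the original scale, only rank-preserving indicators such as the headcount ratio can be recovered after estimation, which is the limitation noted in the paragraph above.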
Benchmarking is not recommended unless publication requirements include that estimates of totals at a
lower aggregation level add up to the estimated total at a higher level (e.g., the national level). The need
to benchmark due to substantial discrepancies between the sum of estimated totals at the lower level
and the estimated total at the higher level may indicate that the model assumptions are not satisﬁed.
EB estimators based on a correct model are approximately model-unbiased and optimal in terms of minimizing the MSE for a given area; thus, when adjusted afterwards for benchmarking, so that they match the usual estimates at higher aggregation levels, the optimal properties are lost and the estimators usually become worse in terms of bias and MSE under the model.10 When benchmarking adjustments are large, as those likely required by estimators derived from unit-context model variants, this is an indication that the model does not hold for the data. Note that a significant bias in the final estimates may lead to considerable re-ranking of locations in terms of poverty estimates. Consequently, a limit on the acceptable bias should usually be determined according to needs. This is particularly important when determining priorities across areas based on small area estimates: if an area’s true poverty rate is 50% and the method yields an estimate of 10% due to an incorrect model, there is a real risk that this area may not be assisted when needed. Molina (2019) suggests 5 or 10 percent absolute relative bias as an acceptable threshold.
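As a reference point, a simple ratio benchmarking adjustment looks as follows; the exact adjustment used in a given application may differ, and all numbers are hypothetical.

```python
import numpy as np

def ratio_benchmark(estimates, pop_sizes, benchmark_rate):
    """Ratio (proportional) benchmarking: scale model-based area rates so their
    population-weighted average matches the direct estimate at the level where
    the survey is representative."""
    weighted_avg = np.average(estimates, weights=pop_sizes)
    factor = benchmark_rate / weighted_avg
    return estimates * factor, factor

# Hypothetical model-based poverty rates and populations for 4 areas
est = np.array([0.10, 0.30, 0.20, 0.40])
pop = np.array([1000, 2000, 1500, 500])
adjusted, factor = ratio_benchmark(est, pop, benchmark_rate=0.22)
print(factor)                             # a factor far from 1 signals trouble
print(np.average(adjusted, weights=pop))  # now matches the benchmark rate
```

A scaling factor far from 1 is precisely the diagnostic discussed above: it suggests the model does not hold for the data, rather than being a routine adjustment.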
An additional problem for unit-context models in many applications is that it may not be possible to match census and survey PSUs. In some cases, this is due to confidentiality reasons and, in others, to different sampling frames. The latter problem will likely affect applications where the census and survey correspond to different years. Fay-Herriot and other area- or subarea-level models that use the same aggregated variables are an alternative to unit-context models when the census is outdated; for these models, the specification is not necessarily in question, since they may be correctly specified. Of course, model checking is also needed.
10 Beyond unit-context models, benchmarking in many instances may be necessary to ensure aggregate estimates are
aligned to oﬃcial published estimates.
5.5 Appendix
5.5.1 Simulation Experiment 1 for Unit-Context Models
A simulation experiment is conducted to illustrate the inherent bias of the resulting CensusEB estimators based on unit-context models due to biased estimators of the model parameters. To remove one source of bias of estimators based on these models, due to differences between the sample and census means of covariates as shown in the Appendix of Corral et al. (2021), the model is fit to the whole population data and small area estimates are also calculated from the same population data. The simulation is inspired by those conducted by Marhuenda et al. (2017), where the true data generating process is a two-fold nested error model. This model better accommodates the usual applications of poverty mapping, where household surveys use two-stage sampling. A two-fold structure also allows for the inclusion of contextual variables at the cluster level while the random location effect is specified at the area level, as in Masaki et al. (2020). The creation of the census data set is similar to the one shown in section 3 of Corral et al. (2021).
A census data set of N = 20,000 observations is created, where observations are allocated among A = 40
areas (a = 1, . . . , A). Within each area, observations are uniformly allocated over 10 clusters
(c = 1, . . . , Ca), each labeled from 1 to 10. Each cluster c consists of Nac = 50 observations. The assumed
model contains both cluster and area effects. Cluster effects are simulated as ηac ∼ N(0, 0.05²) i.i.d.,
area effects as ηa ∼ N(0, 0.1²) i.i.d., and household-specific residuals as e_ach ∼ N(0, 0.5²) i.i.d., where
h = 1, . . . , Nac; c = 1, . . . , Ca; a = 1, . . . , A. Covariates are simulated as follows:11
1. x1 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at
the household level, is less than or equal to 0.3 + 0.5(a/40) + 0.2(c/10).
2. x2 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at
the household level, is less than or equal to 0.2.
3. x3 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at
the household level, is less than or equal to 0.1 + 0.2(a/40).
4. x4 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at
the household level, is less than or equal to 0.5 + 0.3(a/40) + 0.1(c/10).
5. x5 is a discrete variable, simulated as the rounded integer value of the maximum between 1 and a
random Poisson variable with mean λ = 3(1 − 0.1(a/40)).
6. x6 is a binary variable, taking value 1 when a random uniform value between 0 and 1 is less than
or equal to 0.4. Note that the values of x6 are not related to the area's label.
7. x7 is a binary variable, taking value 1 when a random uniform number between 0 and 1 is greater
than or equal to 0.2 + 0.4(a/40) + 0.1(c/10).
The welfare vector for each household within a cluster within an area is created from the model with
these covariates, as follows:
yach = 3 + 0.09x1ach − 0.04x2ach − 0.09x3ach + 0.4x4ach − 0.25x5ach + 0.1x6ach + 0.33x7ach + ηa + ηac + each ,
(5.1)
11 Covariates are simulated following Corral et al. (2021) who follow the approach from Molina and Rao (2010) and
Marhuenda et al. (2017), with slight modiﬁcations.
The dependent variable, yach, is the log of the welfare variable of interest. The poverty line in this
scenario is fixed at z = 12. The generation process is repeated 5,000 times, yielding 5,000 true poverty
rates for each area.
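The data generating process can be sketched compactly. The Python fragment below is an illustrative reimplementation, not the chapter's Stata code: it keeps only x1 among the covariates, draws one population from the two-fold nested error model, and computes each area's true headcount rate.

```python
import numpy as np

rng = np.random.default_rng(734137)

A, C, N_ac = 40, 10, 50              # areas, clusters per area, households per cluster
area = np.repeat(np.arange(1, A + 1), C * N_ac)
clus = np.tile(np.repeat(np.arange(1, C + 1), N_ac), A)

# One covariate tied to the area and cluster labels (x2..x7 omitted for brevity)
x1 = (rng.uniform(size=area.size) <= 0.3 + 0.5 * area / A + 0.2 * clus / C).astype(float)

# Two-fold nested error structure: area, cluster, and household-level errors
eta_a = np.repeat(rng.normal(0, 0.10, A), C * N_ac)
eta_ac = np.repeat(rng.normal(0, 0.05, A * C), N_ac)
e_ach = rng.normal(0, 0.50, area.size)

y = 3 + 0.09 * x1 + eta_a + eta_ac + e_ach   # log-welfare under the truncated model
welfare = np.exp(y)

# True FGT0 (headcount ratio) by area at poverty line z = 12
z = 12.0
fgt0 = np.array([(welfare[area == a] < z).mean() for a in range(1, A + 1)])
```

Repeating this draw 5,000 times, as the do-file below does, yields the 5,000 true poverty rates per area against which the estimators are compared.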
As noted above, to show that estimators based on unit-context models are still biased even when the
source of bias noted in Corral et al. (2021) is removed, the models are fit to the whole census data set
rather than to a sample drawn from the population; this eliminates that source of bias. The unit-context
model includes the cluster means of the 7 covariates. In each of the 5,000 simulations, the following
quantities are computed for the poverty rates and gaps in each area:
1. True poverty indicators τa , using the “census”.
2. CensusEB estimators τ̂a^CEB, presented in Corral, Molina, and Nguyen (2021), based on a nested
error model with only area random effects and including the unit-level values of the covariates,
obtained using a Monte Carlo approximation with M = 50 replicates. The R2 of this unit-level
model is slightly below 0.5.
3. Unit-context CensusEB estimators τ̂a^UC-CEB, based on a nested error model with random effects
at the area level, obtained using a Monte Carlo approximation with M = 50 replicates. This
estimator follows the approach of Masaki et al. (2020) and uses only cluster means for all of the
covariates. The R2 of this unit-context model is below 0.05.
The averages across the 5,000 simulations of the estimation errors for each area represent the empirical
biases of the considered area estimators. The Stata script to replicate these simulations can be found
below (section 5.5.1.1).
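The bias and MSE bookkeeping in the simulation loop reduces to running means of the estimation errors and squared errors across simulations. A minimal sketch of that accumulation, with placeholder random draws standing in for the true and estimated poverty rates:

```python
import numpy as np

rng = np.random.default_rng(0)
A, S = 40, 5000                       # areas, simulations

bias = np.zeros(A)
mse = np.zeros(A)
for s in range(S):
    # placeholders for the true and estimated area poverty rates in simulation s
    tau_true = rng.uniform(0.1, 0.5, A)
    tau_hat = tau_true + rng.normal(0, 0.02, A)
    err = tau_hat - tau_true
    bias += err / S                   # running mean of errors = empirical bias
    mse += err ** 2 / S               # running mean of squared errors = empirical MSE
```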
One could argue that, in this scenario, the R2 of unit-context models is much lower than that in the
applications of Masaki et al. (2020) and of Lange, Pape, and Pütz (2018). For this reason, the
simulation experiment is repeated with a slight modification of the data generating process to increase
the R2. Specifically, in this experiment the covariate x7 is now generated from a random Poisson variable
with mean λ = 3(c/20 − a/100 + u), where u is a random uniform value between 0 and 1, and σe is
increased from 0.5 to 0.6. This modification leads to an R2 of the unit-context model between 0.15 and
0.20, while for unit-level models the R2 exceeds 0.60. The Stata script to replicate these simulations can
be found in section 5.5.1.2.
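The low R2 of unit-context specifications reflects the loss of within-area covariate variation once unit values are replaced by their means. The sketch below, on synthetic data rather than the chapter's DGP, illustrates that drop:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
A, n = 40, 500                        # areas, households per area
area = np.repeat(np.arange(A), n)
x = rng.normal(size=A * n)            # household-level covariate
y = 1 + 0.8 * x + rng.normal(0, 0.8, A * n)

# Unit-context specification: replace x by its area mean
x_bar = np.bincount(area, weights=x) / n
r2_unit = r_squared(x[:, None], y)
r2_uc = r_squared(x_bar[area][:, None], y)
# aggregation discards within-area variation, so r2_uc falls well below r2_unit
```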
5.5.1.1 Unit-Context Models – Validation
The do-file below reproduces the simulation experiment described in section 5.4. Note that the model is
fit to the population and estimates are then obtained by simulating onto that same population.
set more off
clear all
global main "C:\Users\\`c(username)´\OneDrive\SAE Guidelines 2021\"
global section "$main\3_Unit_level\"
global mdata "$section\1_data\"
global myfigs "$section\3_figures\"
/*
Author: Paul Corral
Do file below is a test for a two fold nested error model. It follows the method
illustrated in the paper from Marhuenda et al. (2017) and others in the link
below.
We start off by creating a fake data set as illustrated in that same paper.
https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12306
*/
/*
Purpose of file is to test SAE model performance by imputing on to the
population instead of a sample. This should remove all other sources of bias.
*/
*===============================================================================
// Parameters for simulated data set
*===============================================================================
version 15
set seed 734137
global numobs = 20000
global outsample = 50
global areasize = 500
global psusize = 50
//We have 2 location effects below
global sigmaeta_psu = 0.05
global sigmaeta_area = 0.1
//We have household specific errors
global sigmaeps = 0.5
//Poverty line fixed at 12
global pline = 12
global lnpline = ln(12)
//locals
local obsnum = $numobs
local areasize = $areasize
local psusize = $psusize
local total_sim = 5000
*===============================================================================
//1.Create simulated data
*===============================================================================
//Start off with # of observations
set obs `=`obsnum´/`areasize´´
gen area = _n
lab var area "Area identifier"
//expand to create 10 psu per area
expand `=`areasize´/`psusize´´
sort area
//PSUs labelled from 1 to 10 within each area
gen psu = _n - (area-1)*10
lab var psu "PSU identifier"
//expand to create 50 observations by psu
expand `psusize´
sort area psu
//Household id
gen hhid = _n
lab var hhid "Household identifier"
//Covariates, some are correlated with the area and psu labels
gen x1=runiform()<=(0.3+.5*area/(`obsnum´/`areasize´) + ///
0.2*psu/(`areasize´/`psusize´))
gen x2=runiform()<=(0.2)
gen x3= runiform()<=(0.1 + .2*area/int(`obsnum´/`areasize´))
gen x4= runiform()<=(0.5+0.3*area/int(`obsnum´/`areasize´) + ///
0.1*psu/int(`areasize´/`psusize´))
gen x5= round(max(1,rpoisson(3)*(1-.1*area/int(`obsnum´/`areasize´))),1)
gen x6= runiform()<=0.4
gen x7= runiform()>=(0.2+0.4*area/int(`obsnum´/`areasize´) + ///
0.1*psu/int(`areasize´/`psusize´))
//note that this matches the model from eq. 3 of Corral et al. (2021)
gen XB = 3+ .09* x1-.04* x2 - 0.09*x3 + 0.4*x4 - 0.25*x5 + 0.1*x6 + 0.33*x7
lab var XB "Linear fit"
//Create psu level means...
groupfunction, mean(x*) merge by(area psu)
//Indicate first area observation
bysort area: gen area_1st = 1 if _n==1
//Indicate first psu observation
bysort area psu: gen psu_1st = 1 if _n==1
sort hhid
//We need weights for SAE command
gen hhsize = 1
lab var hhsize "HH size for command"
//Save population´s Xs and linear fit
save "$mdata\popX.dta", replace
*===============================================================================
//2. Import data for SAE
*===============================================================================
sae data import, datain("$mdata\popX.dta") varlist( mean_x1 mean_x2 mean_x3 ///
mean_x4 mean_x5 mean_x6 mean_x7 x1 x2 x3 x4 x5 x6 x7 hhsize) ///
area(area) uniqid(hhid) dataout("$mdata\census")
*===============================================================================
//3. Run the simulations
*===============================================================================
/*
Now, we will run 5,000 simulations where we follow the model's assumptions.
Under each simulation we will add to XB the psu and area effects, as well
as the household specific error.
Then, under each population we will obtain CensusEB estimates under
unit-level CensusEB and unit-context models, each with and without the
log transformation. For each population and the EB estimates obtained we
will calculate the difference between the true poverty rate and the
estimate, and the squared difference.
After 5,000 simulations these are our empirical bias and MSE.
*/
//Add random location effects and household errors
forval z=1/`total_sim´{
use "$mdata\popX.dta", clear
//random area effects
gen double eta_a = rnormal(0,$sigmaeta_area) if area_1st==1
replace eta_a = eta_a[_n-1] if missing(eta_a)
gen double eta_p = rnormal(0,$sigmaeta_psu) if psu_1st ==1
replace eta_p = eta_p[_n-1] if missing(eta_p)
//household errors
gen eps = rnormal(0,$sigmaeps)
//Y, normally distributed
egen double Y = rsum(XB eta_a eta_p eps)
tempfile myPop
save `myPop´
//Seed stage for simulations, changes after every iteration!
local seedstage `c(rngstate)´
gen double e_y = exp(Y)
//Create true values
forval a = 0/2{
gen fgt`a´ = (e_y<$pline)*(1-e_y/$pline)^`a´
}
preserve
//true values by area
groupfunction [aw=hhsize], mean(fgt* e_y Y) by(area)
rename e_y mean
tempfile true
save `true´
restore
//Bring in the 20K pop and use it as a survey
use `myPop´, clear
//Obtain UC SAE
preserve
sae sim h3 Y mean_x1 mean_x2 mean_x3 mean_x4 mean_x5 mean_x6 ///
mean_x7, area(area) mcrep(50) bsrep(0) matin("$mdata\census") ///
lny seed(`seedstage´) pwcensus(hhsize) indicators(FGT0 FGT1 FGT2) ///
aggids(0) uniq(hhid) plines($pline)
rename avg_fgt* uc_fgt*
rename Unit area
rename Mean uc_mean
tempfile h3area
save `h3area´
restore
//Obtain UC SAE, without transforming
preserve
sae sim h3 Y mean_x1 mean_x2 mean_x3 mean_x4 mean_x5 mean_x6 ///
mean_x7, area(area) mcrep(50) bsrep(0) matin("$mdata\census") ///
seed(`seedstage´) pwcensus(hhsize) indicators(FGT0 FGT1 FGT2) ///
aggids(0) uniq(hhid) plines($lnpline)
rename avg_fgt* ucn_fgt*
rename Unit area
rename Mean ucn_Y
tempfile h3arean
save `h3arean´
restore
//Obtain CensusEB SAE
preserve
sae sim h3 Y x1 x2 x3 x4 x5 x6 x7, area(area) mcrep(50) ///
bsrep(0) matin("$mdata\census") lny seed(`seedstage´) ///
pwcensus(hhsize) indicators(FGT0 FGT1 FGT2) aggids(0) ///
uniq(hhid) plines($pline)
rename avg_fgt* ceb_fgt*
rename Unit area
rename Mean ceb_mean
tempfile h3eb
save `h3eb´
restore
//Without transforming...CensusEB
preserve
sae sim h3 Y x1 x2 x3 x4 x5 x6 x7, area(area) ///
mcrep(50) bsrep(0) matin("$mdata\census") seed(`seedstage´) ///
pwcensus(hhsize) indicators(FGT0 FGT1 FGT2) aggids(0) ///
uniq(hhid) plines($lnpline)
rename avg_fgt* cebn_fgt*
rename Unit area
rename Mean cebn_Y
tempfile h3ebn
save `h3ebn´
restore
//Open true point estimates
use `true´, clear
//Merge in the model based estimates
merge 1:1 area using `h3area´, keepusing(uc_*)
drop _m
merge 1:1 area using `h3eb´ , keepusing(ceb_*)
drop _m
merge 1:1 area using `h3arean´, keepusing(ucn_*)
drop _m
merge 1:1 area using `h3ebn´ , keepusing(cebn_*)
drop _m
//Calculate bias and MSE
foreach j in fgt0 fgt1 fgt2 mean{
foreach i in ceb uc cebn ucn{
if ("`j´"=="mean" & ("`i´"=="cebn"|"`i´"=="ucn")) local j Y
gen double `i´_bias_`j´ = (`i´_`j´-`j´)/`total_sim´
gen double `i´_mse_`j´ = ((`i´_`j´-`j´)^2)/`total_sim´
}
}
keep area *_bias_* *_mse_*
//For first sim we rename the vector to *T
if (`z´==1){
rename *_bias_* *_bias_*T
rename *_mse_* *_mse_*T
tempfile Stats
save `Stats´
}
else{ //After the first sim, we add the bias and MSE to *T
merge 1:1 area using `Stats´
drop _m
foreach j in fgt0 fgt1 fgt2 mean{
foreach i in ceb uc cebn ucn{
if ("`j´"=="mean" & ("`i´"=="cebn"|"`i´"=="ucn")) local j Y
replace `i´_bias_`j´T = `i´_bias_`j´T + `i´_bias_`j´
replace `i´_mse_`j´T = `i´_mse_`j´T + `i´_mse_`j´
drop `i´_bias_`j´ `i´_mse_`j´
}
}
tempfile Stats
save `Stats´
}
}
save "$mdata\bias_in_mymodel.dta", replace
5.5.1.2 Unit-Context Models – Validation with Better Model Fit
The do-file below reproduces the simulation experiment described in section 5.4, but considering unit-context
models with a better R2 and producing estimates for 2 different poverty thresholds. Note that
the model is fit to the whole population data set and estimates are then obtained by simulating onto
the same population.
set more off
clear all
global main "C:\Users\\`c(username)´\OneDrive\SAE Guidelines 2021\"
global section "$main\3_Unit_level\"
global mdata "$section\1_data\"
global myfigs "$section\3_figures\"
/*
Author: Paul Corral
Version @2 differs from previous one in that we create a model where
UC models have a better fit (R2 ~ 0.18), also welfare is somewhat more skewed
We start off by creating a fake data set illustrated in Marhuenda et al. (2017).
https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12306
*/
/*
Purpose of file is to test SAE model performance by imputing on to the
population instead of a sample. This should remove all other sources of bias.
*/
*===============================================================================
// Parameters for simulated data set
*===============================================================================
version 15
set seed 734137
global numobs = 20000
global areasize = 500
global psusize = 50
//We have 2 location effects below
global sigmaeta_psu = 0.05
global sigmaeta_area = 0.1
//We have household specific errors
global sigmaeps = 0.6
//Poverty lines fixed at 13 and 28
global pline = 13
global lnpline = ln($pline)
global pline1 = 28
global lnpline1 = ln($pline1)
local lines $pline $pline1
//locals
local obsnum = $numobs
local areasize = $areasize
local psusize = $psusize
local total_sim = 1000
*===============================================================================
//1.Create simulated data
*===============================================================================
//Start off with # of observations
set obs `=`obsnum´/`areasize´´
gen area = _n
lab var area "Area identifier"
//expand to create 10 psu per area
expand `=`areasize´/`psusize´´
sort area
//PSUs labelled from 1 to 10 within each area
gen psu = _n - (area-1)*`=`areasize´/`psusize´´
lab var psu "PSU identifier"
//expand to create 50 observations by psu
expand `psusize´
sort area psu
//Household id
gen hhid = _n
lab var hhid "Household identifier"
//Covariates, some are correlated with the area and psu labels
gen x1=runiform()<=(0.3+.5*area/(`obsnum´/`areasize´) + ///
0.2*psu/(`areasize´/`psusize´))
gen x2=runiform()<=(0.2)
gen x3= runiform()<=(0.1 + .2*area/int(`obsnum´/`areasize´))
gen x4= runiform()<=(0.5+0.3*area/int(`obsnum´/`areasize´) + ///
0.1*psu/int(`areasize´/`psusize´))
gen x5= round(max(1,rpoisson(3)*(1-.1*area/int(`obsnum´/`areasize´))),1)
gen x6= runiform()<=0.4
gen x7=rpoisson(3)*(1*psu/int(`areasize´/`psusize´)- 1*area/int(`obsnum´/`areasize´) ///
+ 1*uniform())
//note that this matches the model from eq. 3 of Corral et al. (2021)
gen XB = 3+ .09* x1-.04* x2 - 0.09*x3 + 0.4*x4 - 0.25*x5 + 0.1*x6 + 0.33*x7
lab var XB "Linear fit"
//Create psu level means...
preserve
collapse (mean) x*, by(area psu)
rename x* meanpsu_x*
tempfile psumeans
qui save `psumeans´
restore
preserve
collapse (mean) x*, by(area)
rename x* meanarea_x*
tempfile areameans
qui save `areameans´
restore
merge n:1 area psu using `psumeans´, assert(3) nogen
merge n:1 area using `areameans´, assert(3) nogen
//Indicate first area observation
bysort area: gen area_1st = 1 if _n==1
//Indicate first psu observation
bysort area psu: gen psu_1st = 1 if _n==1
sort hhid
//We need weights for SAE command
gen hhsize = 1
lab var hhsize "HH size for command"
//Create hierarchical identifier
gen uno = 100+area
gen dos = 100+psu
gen double HID = real(string(uno)+string(dos))
//Save population´s Xs and linear fit
save "$mdata\popXT.dta", replace
*===============================================================================
//2. Import data for SAE
*===============================================================================
unab themeans : mean*
sae data import, datain("$mdata\popXT.dta") ///
varlist(`themeans´ x1 x2 x3 x4 x5 x6 x7 hhsize) ///
area(area) uniqid(hhid) dataout("$mdata\census")
*===============================================================================
//3. Run the simulations
*===============================================================================
/*
Now, we will run 1,000 simulations where we follow the model's assumptions.
Under each simulation we will add to XB the psu and area effects, as well
as the household specific error.
Then, under each population we will obtain CensusEB estimates under
unit-level CensusEB and unit-context models. For each
population and the EB predictions obtained we will calculate the difference
between the true poverty rate and the predicted one, and the squared difference.
After 1,000 simulations these are our empirical bias and MSE.
*/
// For each simulation we need to add random location effects and
// household errors
forval z=1/`total_sim´{
qui{
use "$mdata\popXT.dta", clear
//random area effects
gen double eta_a = rnormal(0,$sigmaeta_area) if area_1st==1
replace eta_a = eta_a[_n-1] if missing(eta_a)
gen double eta_p = rnormal(0,$sigmaeta_psu) if psu_1st ==1
replace eta_p = eta_p[_n-1] if missing(eta_p)
//household errors
gen eps = rnormal(0,$sigmaeps)
//Generate Y adding the XB and the drawn errors
egen double Y = rsum(XB eta_a eta_p eps)
gen double e_y = exp(Y)
tempfile myPop
save `myPop´
if (`z´==1){
reg Y x*
predict res, res
reg Y meanpsu_x1 meanpsu_x2 meanpsu_x3 meanpsu_x4 meanpsu_x5 ///
meanpsu_x6 meanpsu_x7
predict resA, res
twoway (kdensity res) (kdensity resA)
}
//Seed stage for simulations, changes after every iteration!
local seedstage `c(rngstate)´
//Create true values
gen fgt0_$pline = (e_y<$pline)*(1-e_y/$pline)^0
gen fgt0_$pline1 = (e_y<$pline1)*(1-e_y/$pline1)^0
preserve
//true values by area
groupfunction [aw=hhsize], mean(fgt* e_y Y) by(area)
rename e_y mean
tempfile true
save `true´
restore
//Bring in the 20K pop and use it as a survey
use `myPop´, clear
//Do model selection for Area
if (`z´==1){
lnskew0 y1 = e_y
lassoregress y1 mean*, numfolds(5)
local vv = e(varlist_nonzero)
global area_lnskew `vv´
drop y1
bcskew0 y1 = e_y
lassoregress y1 mean*, numfolds(5)
local vv = e(varlist_nonzero)
global area_bc `vv´
drop y1
lassoregress Y mean*, numfolds(5)
local vv = e(varlist_nonzero)
global area_vars1 `vv´
}
//Obtain UC SAE, without transforming
preserve
sae sim h3 e_y $area_lnskew, area(area) ///
mcrep(50) bsrep(0) matin("$mdata\census") seed(`seedstage´) lnskew ///
pwcensus(hhsize) indicators(FGT0) aggids(0) uniq(hhid) plines($pline $pline1)
rename avg_fgt* uc_fgt*
rename Unit area
rename Mean uc_mean
tempfile h3area
save `h3area´
restore
preserve
sae sim h3 Y $area_vars1, area(area) ///
mcrep(50) bsrep(0) matin("$mdata\census") seed(`seedstage´) lny ///
pwcensus(hhsize) indicators(FGT0) aggids(0) uniq(hhid) plines($pline $pline1)
rename avg_fgt* ucn_fgt*
rename Unit area
rename Mean ucn_mean
tempfile h3arealn
save `h3arealn´
restore
preserve
sae sim h3 e_y $area_bc, area(area) ///
mcrep(50) bsrep(0) matin("$mdata\census") seed(`seedstage´) bcox ///
pwcensus(hhsize) indicators(FGT0) aggids(0) uniq(hhid) plines($pline $pline1)
rename avg_fgt* ucb_fgt*
rename Unit area
rename Mean ucb_mean
tempfile h3areabc
save `h3areabc´
restore
//CensusEB
preserve
sae sim h3 Y x1 x2 x3 x4 x5 x6 x7, area(area) lny ///
mcrep(50) bsrep(0) matin("$mdata\census") seed(`seedstage´) ///
pwcensus(hhsize) indicators(FGT0) aggids(0) uniq(hhid) plines($pline $pline1)
rename avg_fgt* ceb_fgt*
rename Unit area
rename Mean ceb_mean
tempfile h3eb
save `h3eb´
restore
//Open true point estimates
use `true´, clear
//Merge in the model based estimates
merge 1:1 area using `h3area´, keepusing(uc_*)
drop _m
merge 1:1 area using `h3eb´ , keepusing(ceb_*)
drop _m
merge 1:1 area using `h3arealn´ , keepusing(ucn_*)
drop _m
merge 1:1 area using `h3areabc´ , keepusing(ucb_*)
drop _m
//Calculate bias and MSE
local j mean
foreach i in ceb ucn uc ucb{
gen double `i´_bias_`j´ = (`i´_`j´-`j´)/`total_sim´
gen double `i´_mse_`j´ = ((`i´_`j´-`j´)^2)/`total_sim´
}
foreach line of local lines{
foreach i in ceb ucn uc ucb{
gen double `i´_bias_fgt0_`line´ = ///
(`i´_fgt0_`line´-fgt0_`line´)/`total_sim´
gen double `i´_mse_fgt0_`line´ = ///
((`i´_fgt0_`line´-fgt0_`line´)^2)/`total_sim´
}
}
keep area *_bias_* *_mse_*
//For first sim we rename the vector to *T
if (`z´==1){
rename *_bias_* *_bias_*T
rename *_mse_* *_mse_*T
tempfile Stats
save `Stats´
}
else{ //After the first sim, we add the bias and MSE to *T
merge 1:1 area using `Stats´
drop _m
local j mean
foreach i in ceb ucn uc ucb{
replace `i´_bias_`j´T = `i´_bias_`j´T + `i´_bias_`j´
replace `i´_mse_`j´T = `i´_mse_`j´T + `i´_mse_`j´
}
foreach line of local lines{
foreach i in ceb ucn uc ucb{
replace `i´_bias_fgt0_`line´T = ///
`i´_bias_fgt0_`line´T + `i´_bias_fgt0_`line´
replace `i´_mse_fgt0_`line´T = ///
`i´_mse_fgt0_`line´T + `i´_mse_fgt0_`line´
drop `i´_bias_fgt0_`line´ `i´_mse_fgt0_`line´
}
}
tempfile Stats
save `Stats´
}
}
dis as error "Sim num `z´"
}
save "$mdata\bias_in_mymodel_twolines.dta", replace
5.5.2 Simulation Experiment 2 for Unit-Context Models
To further explore the bias of unit-context models and compare their performance to other methods,
a final simulation is conducted following the data generation described in section 5.5.1.2, but with some
modifications. First, the population size is N = 500,000, and observations are allocated among
A = 100 areas (a = 1, . . . , A). Within each area a, observations are uniformly allocated over Ca = 20
clusters (c = 1, . . . , Ca). Each cluster c consists of Nac = 250 observations. In this simulation experiment,
we take a simple random sample of nac = 10 households per cluster, and this sample is kept fixed across
simulations. Using a sample, we can also compare with estimators based on the FH model (discussed in
Chapter 3). The model that generates the population data contains both cluster and area effects.
Cluster effects are simulated as ηac ∼ N(0, 0.1²) i.i.d., area effects as ηa ∼ N(0, 0.15²) i.i.d., and
household-specific residuals as e_ach ∼ N(0, 0.5²) i.i.d., where h = 1, . . . , Nac; c = 1, . . . , Ca;
a = 1, . . . , A. Finally, x7 is generated from a random Poisson variable with mean λ = 3(c/20 − a/100 + u),
where u is a random uniform value between 0 and 1. In this experiment, we take a grid of 99 poverty
thresholds, corresponding to the 99 percentiles of the very first population generated. In total, 1,000
populations are generated. In each of the 1,000 populations, the following quantities are computed for
each of the 99 poverty lines in each area:
1. True poverty indicators τa , using the “census”.
2. CensusEB estimators τ̂a^CEB, presented in Corral, Molina, and Nguyen (2021), based on a nested
error model with only area random effects and including the unit-level values of the covariates,
obtained using a Monte Carlo approximation with M = 50 replicates. The R2 for this model
is roughly 0.60.
3. Unit-context CensusEB estimators τ̂a^UC-CEB, based on a nested error model with random effects
at the area level, obtained using a Monte Carlo approximation with M = 50 replicates. This
estimator follows the approach of Masaki et al. (2020) and uses a model selected using lasso, as
described in section 6.2. The R2 of the resulting model hovers around 0.17.
4. Area-level FH estimators τ̂a^FH, based on the model described in section 3.4. In this case, a separate
model is needed for each of the 99 different poverty lines; hence, the R2 depends on the poverty
line, ranging from 0.15 to 0.70.
The averages across the 1,000 simulations represent the empirical biases for each area. The Stata script
to replicate these simulations can be found below (section 5.5.2.1).12
12 Depending on the computing power, this may take longer than 2 days to run.
5.5.2.1 Unit-Context Models – Validation Across All Poverty Lines
The Stata code below produces the simulations described in section 5.4. Here, a sample is drawn from the
population and estimates are then obtained for 99 different poverty lines, each corresponding to a
percentile of the very first generated population.
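The grid of 99 poverty lines is simply the percentile vector of welfare in the first simulated population. A hedged sketch of that step, using a lognormal stand-in for the welfare distribution rather than the chapter's DGP:

```python
import numpy as np

rng = np.random.default_rng(2)
welfare = np.exp(rng.normal(3, 0.6, 500_000))   # stand-in for the first population

# 99 poverty lines: the 1st..99th percentiles of welfare in the first population
plines = np.round(np.percentile(welfare, np.arange(1, 100)), 2)

# Headcount ratio (FGT0) for the whole population at each line
fgt0_curve = np.array([(welfare < z).mean() for z in plines])
```

Since the lines are nondecreasing, the population headcount curve is nondecreasing as well, tracing the welfare CDF at 99 points.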
set more off
clear all
global main "C:\Users\\`c(username)´\OneDrive\SAE Guidelines 2021\"
global section "$main\3_Unit_level\"
global mdata "$section\1_data\"
global myfigs "$section\3_figures\"
/*
Author: Paul Corral
Version @2 differs from previous one in that we create a model where
UC models have a better fit (R2 ~ 0.18), also welfare is somewhat more skewed
We start off by creating a fake data set illustrated in Marhuenda et al. (2017).
https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12306
*/
/*
Purpose of file is to test SAE model performance by obtaining
estimates at different points of the true distribution.
*/
*===============================================================================
// Parameters for simulated data set
*===============================================================================
version 15
set seed 734137
global numobs = 500000
global areasize = 5000
global psusize = 250
//We have 2 location effects below
global sigmaeta_psu = 0.1
global sigmaeta_area = 0.15
//We have household specific errors
global sigmaeps = 0.5
//Poverty line fixed at 26
global pline = 26
global lnpline = ln($pline)
//locals
local obsnum = $numobs
local areasize = $areasize
local psusize = $psusize
local total_sim = 1000
local myrep = 50
local from0 = 1
*===============================================================================
//1.Create simulated data
*===============================================================================
//Start off with # of observations
set obs `=`obsnum´/`areasize´´
gen area = _n
lab var area "Area identifier"
//expand to create 20 psus per area
expand `=`areasize´/`psusize´´
sort area
//PSUs labelled from 1 to 20 within each area
gen psu = _n - (area-1)*`=`areasize´/`psusize´´
lab var psu "PSU identifier"
//expand to create 250 observations per psu
expand `psusize´
sort area psu
//Household id
gen hhid = _n
lab var hhid "Household identifier"
//Covariates, some are correlated with the area and psu labels
gen x1=runiform()<=(0.3+.5*area/(`obsnum´/`areasize´) + ///
0.2*psu/(`areasize´/`psusize´))
gen x2=runiform()<=(0.2)
gen x3= runiform()<=(0.1 + .2*area/int(`obsnum´/`areasize´))
gen x4= runiform()<=(0.5+0.3*area/int(`obsnum´/`areasize´) + ///
0.1*psu/int(`areasize´/`psusize´))
gen x5= round(max(1,rpoisson(3)*(1-.1*area/int(`obsnum´/`areasize´))),1)
gen x6= runiform()<=0.4
gen x7=rpoisson(3)*(1*psu/int(`areasize´/`psusize´)- 1*area/int(`obsnum´/`areasize´)+ ///
1*uniform())
//note that this matches the model from eq. 3 of Corral et al. (2021)
gen XB = 3+ .09* x1-.04* x2 - 0.09*x3 + 0.4*x4 - 0.25*x5 + 0.1*x6 + 0.33*x7
lab var XB "Linear fit"
//Create psu level means...
preserve
collapse (mean) x*, by(area psu)
rename x* meanpsu_x*
tempfile psumeans
qui save `psumeans´
restore
preserve
collapse (mean) x*, by(area)
rename x* meanarea_x*
tempfile areameans
qui save `areameans´
restore
merge n:1 area psu using `psumeans´, assert(3) nogen
merge n:1 area using `areameans´, assert(3) nogen
//Indicate first area observation
bysort area: gen area_1st = 1 if _n==1
//Indicate first psu observation
bysort area psu: gen psu_1st = 1 if _n==1
sort hhid
//We need weights for SAE command
gen hhsize = 1
lab var hhsize "HH size for command"
//Create hierarchical identifier
gen uno = 100+area
gen dos = 100+psu
gen double HID = real(string(uno)+string(dos))
preserve
sample 10, by(HID) count
tempfile msvy
save `msvy´
restore
merge 1:1 hhid using `msvy´
gen svy = _m==3
drop _m
//Save population´s Xs and linear fit
sort hhid
save "$mdata\popX.dta", replace
*===============================================================================
//2. Import data for SAE
*===============================================================================
unab themeans : mean*
sae data import, datain("$mdata\popX.dta") varlist(`themeans´ x1 x2 x3 x4 x5 x6 x7 hhsize) ///
area(area) uniqid(hhid) dataout("$mdata\thecensus")
*===============================================================================
//3. Simulation
*===============================================================================
local seedstage `c(rngstate)´
if (`from0´==1) local start 1
else{
use "$mdata\bias_in_mymodel_allplines.dta", clear
local seedstage = `_dta[note1]´
local start = subinstr(`_dta[note2]´,"Sim num ","",.)
local start = `start´+1
global plines `_dta[note3]´
dis "`seedstage´"
dis "`start´"
}
forval z=1/`total_sim´{
set seed `seedstage´
use "$mdata\popX.dta", clear
gen double eta_a = rnormal(0,$sigmaeta_area) if area_1st==1
replace eta_a = eta_a[_n-1] if missing(eta_a)
gen double eta_p = rnormal(0,$sigmaeta_psu) if psu_1st ==1
replace eta_p = eta_p[_n-1] if missing(eta_p)
//household errors
gen eps = rnormal(0,$sigmaeps)
//Generate Y adding the XB and the drawn errors
egen double Y = rsum(XB eta_a eta_p eps)
if (`z´==1){
pctile double y_p = Y, nq(100)
replace y_p = round(exp(y_p),0.01)
levelsof y_p, local(mp)
local thelines
local aa = 1
foreach y of local mp{
if (`y´<10) local nm = substr("`y´",1,4)
if (`y´>=10 & `y´<100) local nm = substr("`y´",1,5)
if (`y´>100) local nm = substr("`y´",1,6)
local thelines `thelines´ `nm´
local aa= `aa´+1
}
global plines `thelines´
}
gen double e_y = exp(Y)
//Create true values
foreach j of global plines{
local linea = subinstr("`j'",".","_",.)
forval a = 0/0{
gen fgt`a'_`linea' = (e_y<`j')*(1-e_y/`j')^`a'
}
}
preserve
timer on 1
//true values by area
groupfunction [aw=hhsize], mean(fgt* e_y Y) by(area)
rename e_y mean
tempfile true
save `true'
timer off 1
timer list
restore
keep if svy==1
if (`z'==1){
lassoregress Y mean*, numfolds(5)
local vv = e(varlist_nonzero)
global area_vars `vv'
reg Y $area_vars
gen touse = e(sample)
mata: ds = _f_stepvif("$area_vars","hhsize",3,"touse")
global area_vars `vifvar'
reg Y $area_vars
}
tempfile msvy
save `msvy'
preserve
//R2 of roughly 0.16
sae sim h3 Y $area_vars, area(area) ///
mcrep(`myrep') lny bsrep(0) matin("$mdata\thecensus") seed(`seedstage') ///
pwcensus(hhsize) indicators(FGT0) aggids(0) uniq(hhid) plines($plines)
rename avg_fgt* ucln_fgt*
rename Unit area
rename Mean ucln_mean
tempfile h3arealn
save `h3arealn'
restore
preserve
sae sim h3 Y x1 x2 x3 x4 x5 x6 x7, area(area) ///
mcrep(`myrep') lny bsrep(0) matin("$mdata\thecensus") seed(`seedstage') ///
pwcensus(hhsize) indicators(FGT0) aggids(0) uniq(hhid) plines($plines)
rename avg_fgt* cebln_fgt*
rename Unit area
rename Mean cebln_mean
tempfile h3ebln
save `h3ebln'
restore
foreach pp of global plines{
preserve
local nm = subinstr("`pp'",".","_",.)
qui: proportion fgt0_`nm', over(area)
mata: fgt0 = st_matrix("e(b)")
mata: fgt0 = fgt0[`=1+`obsnum'/`areasize''..`=2*`obsnum'/`areasize'']'
mata: fgt0_var = st_matrix("e(V)")
mata: fgt0_var = ///
diagonal(fgt0_var)[`=1+`obsnum'/`areasize''..`=2*`obsnum'/`areasize'']
gen num_obs = 1
groupfunction, rawsum(num_obs) mean(fgt0_`nm') first(meanarea*) by(area)
sort area //ordered to match proportion output
//Pull proportion's results
getmata dir_fgt0 = fgt0 dir_fgt0_var = fgt0_var
gen double mifgt0 = dir_fgt0
replace dir_fgt0_var = . if dir_fgt0_var==0
replace dir_fgt0 = . if missing(dir_fgt0_var)
qui: gen meanarea_x7sq = meanarea_x7^2
if (`z'==1){
fhsae dir_fgt0 meanarea_x7 meanarea_x7sq, ///
re(dir_fgt0_var) method(chandra)
test meanarea_x7sq
local sq = r(p)
test meanarea_x7
local regular = r(p)
if (`regular'<0.1 & `sq'<0.1) global fh_`nm' meanarea_x7 ///
meanarea_x7sq
else global fh_`nm' meanarea_x7
}
fhsae dir_fgt0 ${fh_`nm'}, re(dir_fgt0_var) method(reml) fh(fh_fgt0_`nm')
//For cases where the direct estimate is 1 or 0, the FH
// estimate will be missing. In these cases we make it
// equal to the direct estimate
qui: replace fh_fgt0_`nm' = mifgt0 if missing(fh_fgt0_`nm')
keep fh_fgt0* area
tempfile fh_`nm'
save `fh_`nm''
restore
}
//Open true point estimates
use `true', clear
//Merge in the model-based estimates
merge 1:1 area using `h3arealn', keepusing(ucln_*)
drop _merge
merge 1:1 area using `h3ebln', keepusing(cebln_*)
drop _merge
foreach pp of global plines{
local nm = subinstr("`pp'",".","_",.)
merge 1:1 area using `fh_`nm''
drop _merge
}
foreach k of global plines{
local linea = subinstr("`k'",".","_",.)
foreach g in cebln ucln fh{
qui{
gen double `g'_bias_fgt0_`linea' = ///
(`g'_fgt0_`linea'-fgt0_`linea')/`total_sim'
gen double `g'_absbias_fgt0_`linea' = ///
abs(`g'_fgt0_`linea'-fgt0_`linea')/`total_sim'
gen double `g'_mse_fgt0_`linea' = ///
((`g'_fgt0_`linea'-fgt0_`linea')^2)/`total_sim'
}
}
}
keep area *_bias_* *_mse_* *_absbias_*
//For the first sim we rename the vectors to *T
if (`z'==1){
rename *_bias_* *_bias_*T
rename *_absbias_* *_absbias_*T
rename *_mse_* *_mse_*T
tempfile Stats
save `Stats'
}
else{ //After the first sim, we add the bias and MSE to *T
merge 1:1 area using `Stats'
drop _merge
local todrop
foreach g in cebln ucln fh{
foreach k of global plines{
local linea = subinstr("`k'",".","_",.)
qui{
replace `g'_bias_fgt0_`linea'T = ///
`g'_bias_fgt0_`linea'T + `g'_bias_fgt0_`linea'
replace `g'_absbias_fgt0_`linea'T = ///
`g'_absbias_fgt0_`linea'T + `g'_absbias_fgt0_`linea'
replace `g'_mse_fgt0_`linea'T = ///
`g'_mse_fgt0_`linea'T + `g'_mse_fgt0_`linea'
local todrop `todrop' `g'_bias_fgt0_`linea' ///
`g'_absbias_fgt0_`linea' `g'_mse_fgt0_`linea'
}
}
}
local todrop: list uniq todrop
drop `todrop'
tempfile Stats
save `Stats'
}
local seedstage `c(rngstate)'
if (mod(`z',50)==0){
//seed to begin the next sim, in case the run stops beforehand
note: "`seedstage'"
note: "Sim num `z'"
note: "$plines"
save "$mdata\bias_in_mymodel_allplines.dta", replace
}
dis as error "SIMULATION NUMBER : `z'"
}
local a=1
foreach x of global plines{
local nm = subinstr("`x'",".","_",.)
rename *_bias_fgt0_`nm'T *_bias_fgt0_ptile`a'
rename *_absbias_fgt0_`nm'T *_absbias_fgt0_ptile`a'
rename *_mse_fgt0_`nm'T *_mse_fgt0_ptile`a'
local a = `a'+1
}
save "$mdata\bias_in_mymodel_allplines.dta", replace
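The bookkeeping in the loop above adds each simulation's (estimate − truth)/`total_sim' to a running total, so that after the final simulation the *T variables hold the empirical bias, absolute bias, and MSE of each estimator. A minimal numerical sketch of that arithmetic (in Python, with made-up values, since the logic is language-agnostic):

```python
import random

random.seed(1)
T = 200        # number of simulations (the script's `total_sim')
truth = 0.25   # true FGT0 for one area (made-up value)

bias_T = absbias_T = mse_T = 0.0
for _ in range(T):
    est = truth + random.gauss(0.01, 0.05)   # estimator with a built-in bias of 0.01
    bias_T    += (est - truth) / T           # accumulates to the mean error
    absbias_T += abs(est - truth) / T        # accumulates to the mean absolute error
    mse_T     += (est - truth) ** 2 / T      # accumulates to the mean squared error

print(abs(bias_T - 0.01) < 0.02, mse_T >= bias_T ** 2)
```

Dividing each term by T as it is accumulated, rather than dividing the final sum, is exactly what the gen/replace pattern in the do-file does.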
References
Arora, Vipin, Partha Lahiri, and Kanchan Mukherjee (1997). “Empirical Bayes Estimation of Finite Population Means from Complex Surveys”. In: Journal of the American Statistical Association 92.440, pp. 1555–1562.
Banerjee, Abhijit V, Angus Deaton, Nora Lustig, Kenneth Rogoff, and Edward Hsu (2006). “An Evaluation of World Bank Research, 1998-2005”. In: Available at SSRN 2950327.
Blumenstock, Joshua, Jonathan Lain, Isabella Smythe, and Tara Vishwanath (2021). Using Big Data and Machine Learning to Locate the Poor in Nigeria. url: https://blogs.worldbank.org/opendata/using-big-data-and-machine-learning-locate-poor-nigeria.
Chen, Tianqi and Carlos Guestrin (2016). “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Chi, Guanghua, Han Fang, Sourav Chatterjee, and Joshua E Blumenstock (2021). “Micro-estimates of Wealth for All Low- and Middle-income Countries”. In: arXiv preprint arXiv:2104.07761.
Corral, Paul, Kristen Himelein, Kevin McGee, and Isabel Molina (2021). “A Map of the Poor or a Poor Map?” In: Mathematics 9.21. issn: 2227-7390. doi: 10.3390/math9212780. url: https://www.mdpi.com/2227-7390/9/21/2780.
Corral, Paul, Isabel Molina, and Minh Cong Nguyen (2021). “Pull Your Small Area Estimates up by the Bootstraps”. In: Journal of Statistical Computation and Simulation 91.16, pp. 3304–3357. doi: 10.1080/00949655.2021.1926460. url: https://www.tandfonline.com/doi/abs/10.1080/00949655.2021.1926460.
Efron, Bradley and Carl Morris (1975). “Data Analysis Using Stein’s Estimator and Its Generalizations”. In: Journal of the American Statistical Association 70.350, pp. 311–319.
Fay, Robert E and Roger A Herriot (1979). “Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data”. In: Journal of the American Statistical Association 74.366a, pp. 269–277.
González-Manteiga, Wenceslao, Maria J Lombardía, Isabel Molina, Domingo Morales, and Laureano Santamaría (2008). “Bootstrap Mean Squared Error of a Small-area EBLUP”. In: Journal of Statistical Computation and Simulation 78.5, pp. 443–462.
Jean, Neal, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon (2016). “Combining Satellite Imagery and Machine Learning to Predict Poverty”. In: Science 353.6301, pp. 790–794. url: https://www.science.org/doi/abs/10.1126/science.aaf7894.
Lange, Simon, Utz Johann Pape, and Peter Pütz (2018). “Small Area Estimation of Poverty under Structural Change”. In: World Bank Policy Research Working Paper 9383.
Marhuenda, Yolanda, Isabel Molina, Domingo Morales, and JNK Rao (2017). “Poverty Mapping in Small Areas under a Twofold Nested Error Regression Model”. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 180.4, pp. 1111–1136. doi: 10.1111/rssa.12306.
Masaki, Takaaki, David Newhouse, Ani Rudra Silwal, Adane Bedada, and Ryan Engstrom (2020). “Small Area Estimation of Non-monetary Poverty with Geospatial Data”. In: World Bank Policy Research Working Paper 9383.
Molina, Isabel (2019). Desagregación De Datos En Encuestas De Hogares: Metodologías De Estimación En áreas Pequeñas. CEPAL. url: https://repositorio.cepal.org/handle/11362/44214.
Molina, Isabel and JNK Rao (2010). “Small Area Estimation of Poverty Indicators”. In: Canadian Journal of Statistics 38.3, pp. 369–385.
Natekin, Alexey and Alois Knoll (2013). “Gradient Boosting Machines, a Tutorial”. In: Frontiers in Neurorobotics 7, p. 21.
Nguyen, Viet Cuong (2012). “A Method to Update Poverty Maps”. In: The Journal of Development Studies 48.12, pp. 1844–1863. doi: 10.1080/00220388.2012.682983. url: https://doi.org/10.1080/00220388.2012.682983.
Torabi, Mahmoud and JNK Rao (2014). “On Small Area Estimation under a Sub-area Level Model”. In: Journal of Multivariate Analysis 127, pp. 36–55.
Chapter 6
Model Diagnostics
Small area estimation models used in Chapters 3 and 4 are particular cases of linear mixed models.1
Under model-based SAE, using either unit-level (Chapter 4) or area-level (Chapter 3) models, a series of useful model diagnostics can help verify model assumptions and assess model fit. These checks also include residual analysis to detect deviations from the assumed model and detection of influential observations (Rao and Molina 2015). Performing thorough and rigorous model diagnostics as part of the SAE exercise is crucial for ensuring the validity of small area estimates.
This chapter describes the underlying model considered in Section 6.1. It then provides brief recommendations for model and variable selection in Section 6.2 and concludes with regression diagnostics in Section 6.3.
6.1 The nested-error model
The models used for small area estimation of poverty and welfare, such as the ones proposed by Elbers, Lanjouw, and Lanjouw (2003) and Molina and Rao (2010), assume that the transformed welfare ych of each household h within each location c in the population is linearly related to a 1 × K vector of characteristics (or correlates) xch for that household, according to the nested error model:
ych = xch β + ηc + ech , h = 1, . . . , Nc , c = 1, . . . , C, (6.1)
where ηc and ech are location and household-specific idiosyncratic errors, respectively, assumed to be independent of each other and distributed as
ηc ∼ iid N(0, σ²η), ech ∼ iid N(0, σ²e),
where the variances σ²η and σ²e are unknown. Here, C is the number of locations into which the population is divided and Nc is the number of households in location c, for c = 1, . . . , C. Finally, β is the K × 1 vector of coefficients.
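To make the data-generating process in model (6.1) concrete, the following sketch draws welfare from the nested error model (Python is used only for illustration; the dimensions, coefficient, and variance values are all made up). The key feature is that every household in a location shares the same draw of ηc:

```python
import random

random.seed(0)
C, Nc = 50, 20                       # locations and households per location (made-up)
beta, sigma_eta, sigma_e = 2.0, 0.5, 1.0

y = []
for c in range(C):
    eta_c = random.gauss(0, sigma_eta)      # one location effect, shared within location c
    for h in range(Nc):
        x_ch = random.random()              # a single covariate, for simplicity
        e_ch = random.gauss(0, sigma_e)     # household-specific error
        y.append(x_ch * beta + eta_c + e_ch)

print(len(y))  # C * Nc households in total
```

Because ηc is drawn once per location, welfare values within a location are correlated, which is precisely what distinguishes this model from an ordinary linear regression.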
As illustrated in Chapter 4, the assumption of normality plays a considerable role in EB methods. Deviations from this assumption may lead to biased and noisier estimates, as shown in Corral et al. (2021). Isolated deviations from the model (outliers), influential observations, or even outlying locations may also exist. Thus, the following sections provide some insights toward selecting a suitable model, as well as checks that may be done to ensure that the chosen model for SAE is adequate.
1 Except for the gradient boosting application shown.
6.2 Model and variable selection
The objective of model selection is to determine the relevant covariates out of a pool of candidate model variables such that the resulting SAE model generates the most precise estimates possible, noting that it must be possible to measure this precision accurately. Classic approaches to model selection include lasso or stepwise regression. Other literature on the subject includes the fence method, described in Pfeffermann (2013), which involves selecting a model out of a subgroup of correct models that fulfill specific criteria. Rao and Molina (2015) also elaborate on other methods, such as Akaike Information Criterion (AIC) type methods, which rely on the marginal restricted log-likelihood based on normality of the random effects and the errors. For variable selection under area-level models, particularly Fay-Herriot models, Lahiri and Suntornchost (2015) propose a method where the approximation error converges to zero in probability for large sample sizes.2
Since the aim is to arrive at the “true” model, removing all non-significant covariates from the model is recommended, as they may introduce noise. It is important not to confuse the estimated noise with the true noise of a given small area estimate. The most common uncertainty measure for an area-specific prediction is the MSE (Tzavidis et al. 2018). In applications of SAE, such as those illustrated for unit-level models, an estimate of the MSE for the small area estimator is obtained through a parametric bootstrap procedure (see Section 4.4.2.2). This method differs considerably from the one used in the ELL method, where a single computational algorithm is used to assess the uncertainty. Corral, Molina, and Nguyen (2021), through model-based simulations, present evidence that the single computational algorithm used in ELL to estimate noise could underestimate the actual MSE of the method.3
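The parametric bootstrap idea can be illustrated with a stripped-down toy: treat the fitted variance components as truth, repeatedly generate a new area effect and area sample mean from the model, recompute a shrinkage predictor, and average the squared errors across replicates. This is only a sketch of the principle (Python, with made-up parameter values and no refitting step), not the sae implementation:

```python
import random

random.seed(42)
B = 500                                   # bootstrap replicates
sigma_eta, sigma_e, n_c = 0.5, 1.0, 10    # "fitted" sd's and area sample size (made-up)
gamma = sigma_eta**2 / (sigma_eta**2 + sigma_e**2 / n_c)   # shrinkage factor

mse_boot = 0.0
for _ in range(B):
    eta = random.gauss(0, sigma_eta)                    # bootstrap "true" area effect
    ybar = eta + random.gauss(0, sigma_e / n_c**0.5)    # area sample mean (fixed part set to 0)
    pred = gamma * ybar                                 # shrinkage predictor of eta
    mse_boot += (pred - eta) ** 2 / B                   # average squared error over replicates

print(0.04 < mse_boot < 0.11)  # near the theoretical (1 - gamma) * sigma_eta^2 ≈ 0.071
```

A full parametric bootstrap would also re-estimate the model parameters in every replicate, so that the uncertainty from parameter estimation is reflected in the MSE estimate; that step is omitted here for brevity.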
The script example provided in Section 4.5.1 of Chapter 4 uses a lasso approach for model selection that initially includes all suitable covariates and uses 10-fold cross-validation with a shrinkage parameter λ that is within one standard error of the one that minimizes the cross-validated average squared prediction error. The lasso approach employed here does not consider the nested error structure of the model; that is, it is applied to the corresponding linear model without the random area effects. Practitioners who are comfortable with R may rely on the glmmLasso R package, which may be used for model selection in the model with the assumed nested error structure (see Groll and Tutz (2014) and Groll (2017)).
The lasso selection process yields a selected set of covariates, although some of the included covariates may be non-significant. Consequently, the next step is to remove all the non-significant covariates sequentially. The process starts by removing the most non-significant covariate. When a covariate is removed, the significance of the other covariates may change; thus, after each removal, the model is fit again to determine which covariate to remove next, until the remaining ones are all significant. Note that the process used here ignores the magnitude of the coefficients and thus could be further improved.
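The removal loop just described — drop the single most non-significant covariate, refit, repeat until everything remaining is significant — can be sketched as follows (Python; the covariate names and p-values are made up, and unlike a real refit the mock p-values do not change between passes):

```python
# Toy backward elimination. In practice each call to fit() would re-estimate the
# model (e.g., `reg` in Stata) and return fresh p-values; here a fixed lookup
# table of made-up p-values stands in for the refit.
def fit(covariates):
    pvalues = {"x1": 0.001, "x2": 0.03, "x3": 0.40, "x4": 0.25}
    return {c: pvalues[c] for c in covariates}

covs = ["x1", "x2", "x3", "x4"]
alpha = 0.05
while covs:
    pvals = fit(covs)                      # refit after every removal
    worst = max(pvals, key=pvals.get)      # most non-significant covariate
    if pvals[worst] < alpha:
        break                              # all remaining covariates are significant
    covs.remove(worst)                     # drop exactly one covariate per pass

print(covs)  # ['x1', 'x2']
```

Dropping one covariate per pass, rather than all non-significant ones at once, matters because significance levels are recomputed after each removal.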
Finally, it is recommended to remove highly collinear covariates. Once these are removed, the following steps are to identify outliers and influential observations, which may lead to considerably different estimated model parameters (see the next section). For the Fay-Herriot model of Chapter 3, a different approach from the one detailed here was taken: the model started with all possible covariates, and covariates were removed one at a time, starting with the least significant ones. The removal was done considering the random effects (see Section 3.2).
2 The authors define the approximation error as the difference between the standard variable selection criterion and the ideal variable selection criterion, where it is assumed that the direct estimates are not measured with noise.
3 See Corral et al. (2021) for a detailed discussion of the previous ELL bootstrap. Also see Elbers, Lanjouw, and Lanjouw (2002) for the sources of noise.
6.3 Regression diagnostics
After the fitting process, checking whether the assumptions of the underlying model are satisfied is recommended. Regression diagnostics for linear mixed models (LMMs)4 are more difficult to interpret than those of standard linear models, since these models include random effects, which lead to a different covariance structure. Moreover, tools for diagnostics of linear mixed models are less common in software packages, including Stata. Thus, practitioners must either code their own diagnostics or rely on diagnostics often used for linear regression. The model assumptions to keep in mind are:5
1. Linearity: the dependent variable ych is a linear function of the selected vector of covariates xch.
2. Normality: random effects ηc and errors ech are normally distributed.
3. Homoskedasticity: the errors’ variance σ²e is constant, although this assumption may be relaxed by modeling the heteroskedasticity using a model such as the alpha model specification provided by ELL (2003). This can be done when choosing Henderson’s Method III fitting in Stata’s sae package.
4. Independence: errors ech are independently distributed. Under the nested error model, the assumption is extended so that ech in location c is unrelated to the corresponding location effect ηc, as well as to all other location effects ηl, l ≠ c.
Other issues to consider are the detection of influential observations and outliers and, as discussed above, multicollinearity. Although these are not part of the assumptions, their presence may affect the precision of estimates of model parameters and, in turn, model predictions.
The following subsections include some formal and informal ways to check if the assumptions of the underlying model are satisfied. The process starts by eliminating covariates with high multicollinearity, followed by influence analysis and, finally, residual analysis, which encompasses most assumptions.6 Model diagnostics based on residuals and influence measures for special cases of the general linear mixed model can be found in Rao and Molina (2015).
6.3.1 Multicollinearity
Collinearity reduces the accuracy of estimates of regression coefficients, leading to larger standard errors of the coefficients. Under the multiple imputation (MI) inspired bootstrap methods, such as the one often used in the ELL method (see Ch. 4), larger standard errors of the coefficients typically led to larger estimates of noise for the estimators of the indicators of interest. Thus, care was taken to avoid collinearity and multicollinearity. A simple way to detect collinearity is via a correlation matrix, where large absolute correlation coefficients may suggest collinearity problems (James et al. 2013). However, collinearity may occur between more than a pair of variables, leading to multicollinearity, which cannot be detected from the pairwise correlation matrix (ibid.).
Under the presence of multicollinearity, the inspection of the correlation matrix is no longer sufficient, and one must instead compute the variance inflation factors (VIFs) (James et al. 2013). The smallest possible value for a VIF is 1. There are multiple rules of thumb as to what is an acceptable VIF; according to James et al. (2013), values exceeding 5 or 10 may require action. After the model is fit, the command estat vif may be used to check the variance inflation factors of the covariates included in the model.
4 LMMs are an extension of general linear models which include both fixed and random effects.
5 These assumptions are similar to the ones of a classic linear regression, with additional assumptions for the included random effects.
6 Many of the commands provided in the following subsections can be easily reviewed by typing help regress postestimation in Stata’s command window.
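The VIF formula itself is simple: regressing each covariate xj on the remaining covariates gives an R²j, and VIFj = 1/(1 − R²j). A quick numerical sketch for the two-covariate case, where R²j reduces to the squared correlation (Python; the data are simulated purely for illustration):

```python
import random

random.seed(3)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.5) for a in x1]      # built to be collinear with x1

def r_squared(u, v):
    # squared correlation = R^2 of the simple regression of v on u
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    ssu = sum((a - mu) ** 2 for a in u)
    ssv = sum((b - mv) ** 2 for b in v)
    return cov * cov / (ssu * ssv)

# With a single other covariate, VIF_j = 1 / (1 - R^2_j)
vif_x2 = 1 / (1 - r_squared(x1, x2))
print(vif_x2 > 3)  # flagged under the threshold of 3 used in the example below
```

With more than two covariates the R²j comes from a multiple regression of xj on all the others, which is what estat vif computes after a fitted model.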
Example 1:
Variance inflation factor values above 10 might require special attention, since the variable in question could be a linear combination of other independent variables. In the example below, a special Mata function,7 _f_stepvif(), is used to remove covariates with a VIF above a specified threshold; in this case, the chosen value is 3.8 The function expects the covariate list as its first argument, followed by the sample weights, the threshold, and the name of a marker variable indicating the estimation sample. After the removal of high-VIF covariates, the resulting covariate list is returned in a Stata local macro called vifvar.
*===============================================================================
// Collinearity
*===============================================================================
reg y $postsign [aw=Whh], r
//Check for multicollinearity, and remove highly collinear covariates (VIF>3)
cap drop touse //remove vector if it is present to avoid error in next step
gen touse = e(sample) //Indicates the observations used
estat vif //Variance inflation factor
local hhvars $postsign
//Remove covariates with VIF greater than 3
mata: ds = _f_stepvif("`hhvars'","Whh",3,"touse")
global postvif `vifvar'
//VIF check
reg y $postvif [aw=Whh], r
estat vif
// For illustration
// Henderson III GLS - model after removal of non-significant covariates
sae model h3 y $postsign [aw=Whh], area(HID_mun)
// Henderson III GLS - model after removal of high-VIF covariates
sae model h3 y $postvif [aw=Whh], area(HID_mun)
6.3.2 Influence analysis
Influence analysis is used to detect observations that may have a considerable impact on the estimates of the parameters and, consequently, on model predictions. These observations include outliers (observations with a large residual, i.e., observations poorly predicted by the model) and influential observations (the omission of which considerably changes the point estimates of β). These observations can be identified by measuring how far an observation’s value for a predictor variable is from the mean, or by the size of its studentized residual. Influence analysis is recommended prior to calculating any small area estimates, as these observations may affect the bias and noise of the final small area estimates.
7 The Mata function is included in Stata’s sae package.
8 The specific threshold value is up to the practitioner, although it is not recommended to use thresholds above 10.
Cook’s distance (Cook 1977), also known as Cook’s D, measures the effect on the estimated coefficients of leaving an observation out (Rao and Molina 2015). Under ordinary least squares regression, Cook’s distance can be obtained easily in Stata with the cooksd option of the predict command. Nevertheless, under regular OLS, Cook’s distance focuses solely on isolated observations, whereas, under the nested error model used for SAE shown in equation 6.1, the analysis of the influence of particular locations may be more relevant. The Stata package mlt by Möhring and Schmidt-Catran (2013) may be used to assess the influence of particular locations on the estimated parameters.9 The mlt command estimates Cook’s D empirically, making it computationally intensive. The rule of thumb for classifying influential locations is an absolute Cook’s D value greater than 4/C, where C is the number of locations into which the population is divided. The mlt command will also calculate DFBETAs, which measure the influence of a single location on the coefficient of each covariate. A DFBETA represents the standardized difference between the coefficient with and without the given location (Möhring and Schmidt-Catran 2013). The rule of thumb for classifying locations as influential is a DFBETA absolute value above 2/√C, although this should be applied with caution.
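For plain OLS, the quantity behind Stata’s cooksd option can be computed by hand, which makes the rule of thumb concrete. The sketch below (Python; simulated data with one deliberately aberrant point) computes the observation-level Cook’s D for a simple regression and flags values above 4/n:

```python
import random

random.seed(7)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + random.gauss(0, 1) for a in x]
x[0], y[0] = 5.0, -10.0                 # plant one influential point

# Fit y = a0 + b*x by OLS
xb = sum(x) / n
sxx = sum((a - xb) ** 2 for a in x)
b = sum((a - xb) * t for a, t in zip(x, y)) / sxx
a0 = sum(y) / n - b * xb
res = [t - (a0 + b * a) for a, t in zip(x, y)]
k = 2                                    # parameters: intercept and slope
s2 = sum(e * e for e in res) / (n - k)   # residual variance

flagged = []
for i in range(n):
    h = 1 / n + (x[i] - xb) ** 2 / sxx                  # leverage of observation i
    d = res[i] ** 2 / (k * s2) * h / (1 - h) ** 2       # Cook's distance
    if d > 4 / n:                                       # observation-level rule of thumb
        flagged.append(i)

print(0 in flagged)  # the planted point is caught
```

The location-level version computed by mlt applies the same idea but deletes an entire location at a time, hence the 4/C threshold.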
Leverage measures the influence of a given observation on the fitted values. Unfortunately, software packages that can obtain leverage under the assumed model are not available. Cameron and Trivedi (2005) present an alternative for handling influential observations under OLS by using the post-estimation command dfits, which shows the difference in fits (predictions) with and without the unusual observation. The command combines outliers and leverage into a single statistic. A rule of thumb is to flag observations with |dfits| > 2√(k/n), where k is the number of covariates and n is the number of observations. Nevertheless, the removal of observations always entails a loss of information, and thus much care should be taken before deciding on removal; it should be done only when, after inspecting the offending observation, it is determined to be erroneous.
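The DFITS statistic combines the studentized residual with leverage: DFITSi = rstudenti × √(hi/(1 − hi)). A self-contained sketch (Python; data simulated for illustration, with one unusual point planted) reproduces it for a simple regression and applies the 2√(k/n) cutoff:

```python
import random

random.seed(11)
n = 40
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 3 * a + random.gauss(0, 1) for a in x]
x[5], y[5] = 4.0, 30.0                  # unusual point: high leverage and large residual

# Fit y = a0 + b*x by OLS
xb, yb = sum(x) / n, sum(y) / n
sxx = sum((a - xb) ** 2 for a in x)
b = sum((a - xb) * (t - yb) for a, t in zip(x, y)) / sxx
a0 = yb - b * xb
res = [t - (a0 + b * a) for a, t in zip(x, y)]
k = 2
s2 = sum(e * e for e in res) / (n - k)

cutoff = 2 * (k / n) ** 0.5             # rule of thumb: |dfits| > 2*sqrt(k/n)
big = []
for i in range(n):
    h = 1 / n + (x[i] - xb) ** 2 / sxx                              # leverage
    s2_i = ((n - k) * s2 - res[i] ** 2 / (1 - h)) / (n - k - 1)     # leave-one-out variance
    rstud = res[i] / (s2_i * (1 - h)) ** 0.5                        # studentized residual
    dfits = rstud * (h / (1 - h)) ** 0.5
    if abs(dfits) > cutoff:
        big.append(i)

print(5 in big)  # the planted point exceeds the cutoff
```

Because DFITS uses the leave-one-out residual variance, a single extreme point cannot mask itself by inflating the overall residual variance.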
Example 2:
After fitting the model and calculating residuals, problematic observations are identified using several rules of thumb: Cook’s D > 4/n, leverage > (2k + 2)/n, and |rstudent| > 2.
// Step 1
reg y $postvif
// After the regression without weights...
// Calculate measures to identify influential observations
predict cdist, cooksd // calculates Cook's D influence statistic
predict rstud, rstudent // calculates the studentized (jackknifed) residuals
// Step 2
reg y $postvif [aw=Whh]
// Predict leverage and residuals
predict lev, leverage // calculates the diagonal elements of the
// projection ("hat") matrix
predict r, resid // calculates the residuals
// Save useful locals
local myN=e(N) // # observations
local myK=e(rank) // rank or k
local KK =e(df_m) // degrees of freedom (k-1)
sum cdist, d
* return list
local max = r(max) // max value
local p99 = r(p99) // percentile 99
// Step 3
// For illustration...
// We have influential data points...
reg lny $postvif if cdist<4/`myN' [aw=Whh]
reg lny $postvif if cdist<`p99' [aw=Whh]
reg lny $postvif if cdist<`max' [aw=Whh]
// Identify influential observations / outliers
gen nogo = abs(rstud)>2 & cdist>4/`myN' & lev>(2*`myK'+2)/`myN'
count if nogo==1 // these are the obs that we want to eliminate
9 https://www.stata.com/meeting/germany12/abstracts/desug12_moehring.pdf
A graphical representation of the squared normalized residual versus leverage (lvr2plot), before and after the elimination of influential observations, is an easy way to identify potentially influential observations and outliers (Fig. 6.1). The two reference lines are the means of the leverage and of the squared normalized residual. Many points lie beyond these two reference lines and should be scrutinized before deciding to remove them.
Figure 6.1: Squared normalized residual vs leverage
Source: own elaboration from code in Appendix 6.4.2. The red lines mark the mean leverage on the Y axis and the mean squared normalized residual on the X axis. Observations that are far away from these reference lines should be inspected more closely.
6.3.3 Residual analysis
Most techniques for residual analysis rely on visual inspection of the graphed residuals. Residuals should
be checked for linearity, normality, and constant variance in case homoskedasticity is assumed.
6.3.3.1 Linearity
The nested error model used in SAE (Eq. 6.1) assumes that the outcome variable y is a linear function
of the covariates.10 If a single covariate is used, a scatter plot of the residual versus the covariate is
enough to see if a linear relationship exists. When several covariates are used, checking for linearity is
somewhat more complex.
To assess linearity in a simple regression, use scatter to produce a plot of y versus x, lfit to ﬁt a
regression line, and lowess to show a smoothed ﬁt.
. reg depvar indepvar
. twoway (scatter depvar indepvar)(lfit depvar indepvar)(lowess depvar indepvar)
For multiple regression:
. reg depvar indepvars
. predict r, residual
. scatter r indepvar1
. scatter r indepvar2
Other checks for non-linearity include acprplot and cprplot, which produce an augmented component-plus-residual plot and a component-plus-residual plot, respectively.
. reg depvar indepvars
. acprplot indepvar1, lowess lsopts(bwidth(1))
. acprplot indepvar2, lowess lsopts(bwidth(1))
When a clear non-linear pattern is observed, transformations of the independent variable might help.
. graph matrix depvar indepvars, half
. kdensity indepvar, normal
. gen logvar=log(indepvar)
. kdensity logvar, normal
// Linearity
reg y $postsign, r
// Augmented component-plus-residual plot; works better than the
// component-plus-residual plot for identifying nonlinearities in the data.
acprplot age_hh, lowess lsopts(bwidth(1)) graphregion(color(white)) msize(small)
graph export "$figs\acprplot_age_hh.png", as(png) replace
10 Much of the information in this section is borrowed from UCLA: Statistical Consulting Group (2022); https://stats.idre.ucla.edu/stata/webbooks/reg/chapter2/stata-webbooksregressionwith-statachapter-2-regression-diagnostics/
// Kernel density plot for age_hh with a normal density overlaid
kdensity age_hh, normal graphregion(color(white)) msize(small)
graph export "$figs\kdensity_age_hh.png", as(png) replace
// log transformation
gen log_age_hh =log(age_hh)
// Kernel density plot for log_age_hh with a normal density overlaid
kdensity log_age_hh, normal graphregion(color(white)) msize(small)
graph export "$figs\kdensity_log_age_hh.png", as(png) replace
6.3.3.2 Tests for normality of residuals and random eﬀects
Model errors and random eﬀects are assumed to be normally distributed under the nested error model
used for SAE. Deviations from normality may lead to considerable bias in the ﬁnal small area estimates.
Scatter plots of residuals against fitted values and normal Q-Q plots of residuals provide a natural way to identify outliers or influential observations that might affect the precision of the estimates. West, Welch, and Galecki (2014) mention that the random effects vector is assumed to follow a multivariate normal distribution. Thus, the information from the observations sharing the same random effect is used to predict (rather than estimate) the value of that random effect in the model.
The usual predictors of the random eﬀects under linear mixed models are known as Empirical Best Linear
Unbiased Predictors (EBLUPs), since they are the most precise linear unbiased predictors. They use the
weighted least squares estimates of β , and the variance parameters are replaced by suitable estimates
(Robinson 1991). Rao and Molina (2015) provide a comprehensive formal derivation of the EBLUPs,
and applications beyond small area estimation can be found in West, Welch, and Galecki (2014).
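For the nested error model, in the special case of a simple within-location sample mean, the predictor of a location effect takes a shrinkage form: η̂c = γc(ȳc − x̄c β̂), with γc = σ̂²η/(σ̂²η + σ̂²e/nc), where nc is the location's sample size. A tiny sketch of the shrinkage factor (Python; the parameter values are made up):

```python
def gamma(sigma2_eta, sigma2_e, n_c):
    # Shrinkage weight the predictor puts on the location's own data:
    # grows with the between-location variance and with the location sample size.
    return sigma2_eta / (sigma2_eta + sigma2_e / n_c)

# More households in a location -> the predicted effect shrinks less toward zero.
print(round(gamma(0.25, 1.0, 5), 3), round(gamma(0.25, 1.0, 50), 3))  # 0.556 0.926
```

This is why locations with few sampled households have predicted effects pulled strongly toward zero, while well-sampled locations retain most of their observed deviation.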
The xtmixed or mixed command may be used to check the validity of a linear mixed model in Stata.
Deviations from normality can be observed in a normal Q-Q plot of residuals (Figure 6.2), which displays sample quantiles of unit-level residuals against the theoretical quantiles of a normal distribution:
mixed depvar indepvars || area:, reml
predict res, residual
qnorm res
It is also important to check the assumptions regarding the random eﬀect’s distribution. This may be
done after ﬁtting the linear mixed model and obtaining the predicted random eﬀects (Figure 6.3):
predict eta, reffects
bysort area: gen first = _n
qnorm eta if first==1
Plots of empirical quantiles against theoretical quantiles of the normal distribution (normal Q-Q plots) are helpful for detecting deviations from normality. Transformations of the dependent variable are often applied to ensure that the empirical quantiles are better aligned with the theoretical ones. As Marhuenda et al. (2017) note, in real-life applications, an exact fit to a distribution is hardly ever met.
Figure 6.2: Sample quantiles of residuals against theoretical quantiles of a normal distribution
Source: own elaboration from code in Appendix 6.4.2. Deviations from normality can be observed in the figure, particularly at the bottom. Marhuenda et al. (2017) note that, in applications using real data, an exact fit to a distribution is hardly ever met; it is recommended that practitioners apply several transformations to the dependent variable and select the one that provides the best approximation to the nested error model’s assumptions.
Figure 6.3: Sample quantiles of predicted random effects against theoretical quantiles of a normal distribution
Source: own elaboration.
6.4 Appendix
6.4.1 Useful deﬁnitions and concepts
Deﬁnitions are based on Molina and Marhuenda (2015), Ghosh and Rao (1994), Rao and Molina (2015),
and Cochran (2007).
• Small area: a domain (area) is regarded as small if the domain-speciﬁc sample size is not large
enough to support direct estimates of adequate precision. That is to say, a small area is any domain
of interest for which direct estimates of adequate precision cannot be produced.
• Large area: an area or domain where a direct estimate of the parameter of interest for the area
has a suﬃciently large sample to yield estimates with the desired precision.
• Domain: domain may be deﬁned by geographic delimitations/territories (state, county, school
district, health service area), socio-demographic groups, or both (speciﬁc age-sex-race group within
a large geographic area) or even other types of sub-populations (set of ﬁrms belonging to a census
division by industry group).
• Direct (domain) estimators: estimators based only on the domain-specific sample data. These estimators are typically design-unbiased but tend to have low precision in small areas.
• Indirect (domain) estimators: estimators that use information from other areas, under the assumption that there exists some homogeneity relationship between them.
• Target parameter: indicator to be estimated. Some examples are population mean, proportion,
and rate.
• Eﬃciency/precision: 1/variance when an estimator is unbiased; 1/MSE otherwise.
• Sampling error: error from using a sample from the population rather than the whole population.
For a better understanding of statistical inference, the following concepts/deﬁnitions are necessary. Def-
initions and observations are from Molina and García-Portugues (2021) and Rao and Molina (2015).
Note that, when dealing with SAE, the bias for each area c = 1, . . . , C is of interest, not the bias on
average over all the areas.
• Unbiased estimator: an estimator whose bias is equal to zero for all parameter values, that is, whose expected value equals the parameter. The estimator ϑ̂c of the parameter ϑc is unbiased if and only if E(ϑ̂c − ϑc) = 0.
• Estimation error: the estimation error ϑ̂c − ϑc is typically different from zero even if the estimator is unbiased. The bias is the mean estimation error: Bias(ϑ̂c) = E(ϑ̂c − ϑc).
• Mean squared (estimation) error (MSE): also called mean squared prediction error (MSPE) or prediction mean squared error (PMSE). MSE(ϑ̂c) = E[(ϑ̂c − ϑc)²].
• Standard error (SE) of ϑ̂c: the standard deviation of the sampling distribution of ϑ̂c.
• Coefficient of variation (CV) of ϑ̂: cv(ϑ̂) = s(ϑ̂)/ϑ, the standard error of the estimator over the true value of ϑ. It is also known as the relative standard deviation (RSD). The estimated CV is then cv̂(ϑ̂) = s(ϑ̂)/ϑ̂.
• Consistency: when increasing the sample size n, the probability that the estimator diﬀers from
the true value ϑc by more than ε vanishes for every ε > 0.
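The identity MSE = variance + bias², implicit in the definitions above, can be verified numerically. The sketch below (Python, illustrative; the shrinkage-type estimator is hypothetical) compares an unbiased estimator with a biased but less variable one, showing that the biased estimator can still have the smaller MSE — the trade-off that underlies indirect small area estimators.

```python
import numpy as np

rng = np.random.default_rng(2022)
theta = 5.0                              # true parameter
# 100,000 simulated draws of two estimators of theta:
# an unbiased one, and a biased shrinkage-type one with smaller variance
unbiased = theta + rng.normal(0.0, 1.0, size=100_000)
shrunk = 0.8 * unbiased + 0.2 * 4.0      # shrinks toward 4 => bias of -0.2

results = {}
for name, est in [("unbiased", unbiased), ("shrunk", shrunk)]:
    bias = est.mean() - theta
    mse = np.mean((est - theta) ** 2)    # MSE about the true value
    results[name] = (bias, est.var(), mse)
    print(f"{name:>9}: bias={bias:+.3f}  var={est.var():.3f}  mse={mse:.3f}")
```

The shrinkage estimator trades a bias of −0.2 for a variance of 0.64, yielding an MSE of roughly 0.68 against 1.0 for the unbiased estimator; in each row, MSE equals variance plus squared bias up to floating-point error.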
6.4.2 Regression diagnostics do-ﬁle
The following do-ﬁle provides an example of how to do regression diagnostics. Note that the checks are
not exhaustive and only serve as a guide for the practitioner.
clear all
set more off
/*===============================================================================
Do-file prepared for SAE Guidelines
- Regression Diagnostics
- authors Paul Corral, Minh Nguyen & Sandra Segovia
*==============================================================================*/
global main "C:\Users\\`c(username)'\OneDrive\SAE Guidelines 2021"
global section "$main\4_Model_selection"
global data "$section\1_data"
global dofile "$section\2_dofiles"
global figs "$section\3_figures"
local survey "$data\survey_public.dta"
*local census "$main\3_Unit_level\1_data\census_public.dta"
//global with candidate variables.
global myvar rural lnhhsize age_hh male_hh piped_water no_piped_water ///
no_sewage sewage_pub sewage_priv electricity telephone cellphone internet ///
computer washmachine fridge television share_under15 share_elderly ///
share_adult max_tertiary max_secondary HID_* mun_* state_*
version 15
set seed 648743
local graphs graphregion(color(white)) xsize(9) ysize(6) msize(small)
*===============================================================================
// End of preamble
*===============================================================================
// First part is just as the model selection dofile
// Load in survey data
use "`survey'", clear
//Remove small incomes affecting model
drop if e_y<1
// Kernel density plot for e_y with a normal density overlaid
kdensity e_y, normal `graphs'
graph export "$figs\kdensity_e_y.png", as(png) replace
//Log shift transformation to approximate normality
lnskew0 double bcy = exp(lny)
// Kernel density plot for lny with a normal density overlaid
kdensity lny, normal `graphs'
graph export "$figs\kdensity_lny.png", as(png) replace
// Kernel density plot for bcy with a normal density overlaid
kdensity bcy, normal `graphs'
graph export "$figs\kdensity_bcy.png", as(png) replace
// removes skewness from distribution
sum e_y, d
sum lny, d
sum bcy, d
// Data has already been cleaned and prepared. Data preparation and the creation
// of eligible covariates is of extreme importance.
// In this instance, we skip these comparison steps because the sample is
// literally a subsample of the census.
codebook HID //10 digits, every single one
codebook HID_mun //7 digits every single one
//We rename HID_mun
rename HID_mun MUN
//Drop automobile, it is missing
drop *automobile* //all these are missing
//Check to see if lassoregress is installed, if not install
cap which lassoregress
if (_rc) ssc install elasticregress
//Model selection - with Lasso
gen lnhhsize = ln(hhsize)
lassoregress bcy $myvar [aw=Whh], lambda1se epsilon(1e-10) numfolds(10)
local hhvars = e(varlist_nonzero)
global postlasso `hhvars'
//Try Henderson III GLS
sae model h3 bcy $postlasso [aw=Whh], area(MUN)
//Rename HID_mun
rename MUN HID_mun
//Loop designed to remove non-significant covariates sequentially
forval z= 0.5(-0.05)0.05{
	qui:sae model h3 bcy `hhvars' [aw=Whh], area(HID_mun)
	mata: bb=st_matrix("e(b_gls)")
	mata: se=sqrt(diagonal(st_matrix("e(V_gls)")))
	mata: zvals = bb':/se
	mata: st_matrix("min",min(abs(zvals)))
	local zv = (-min[1,1])
	if (2*normal(`zv')<`z') exit
	foreach x of varlist `hhvars'{
		local hhvars1
		qui: sae model h3 bcy `hhvars' [aw=Whh], area(HID_mun)
		qui: test `x'
		if (r(p)>`z'){
			local hhvars1
			foreach yy of local hhvars{
				if ("`yy'"=="`x'") dis ""
				else local hhvars1 `hhvars1' `yy'
			}
		}
		else local hhvars1 `hhvars'
		local hhvars `hhvars1'
	}
}
global postsign `hhvars´
*===============================================================================
// Regression diagnostics
*===============================================================================
/*This is not a complete diagnostic; it is just a preview, steps & repetitions
depend on the underlying model. Check all vars, check different transformations,
do not forget a model for heteroskedasticity (alpha model) if needed */
rename bcy y
*===============================================================================
// Collinearity
*===============================================================================
reg y $postsign [aw=Whh],r
//Check for multicollinearity, and remove highly collinear (VIF>3)
cap drop touse //remove vector if it is present to avoid error in next step
gen touse = e(sample) //Indicates the observations used
estat vif //Variance inflation factor
local hhvars $postsign
//Remove covariates with VIF greater than 3
mata: ds = _f_stepvif("`hhvars'","Whh",3,"touse")
global postvif `vifvar'
//VIF check
reg y $postvif [aw=Whh], r
estat vif
// For illustration
// Henderson III GLS - model post removal of non-significant
sae model h3 y $postsign [aw=Whh], area(HID_mun)
// Henderson III GLS - model post removal of non-significant
sae model h3 y $postvif [aw=Whh], area(HID_mun)
*===============================================================================
// Residual Analysis
*===============================================================================
// Linearity
reg y $postsign, r
// Augmented component-plus-residual plot; works better than the
// component-plus-residual plot for identifying nonlinearities in the data.
acprplot age_hh, lowess lsopts(bwidth(1)) `graphs'
graph export "$figs\acprplot_age_hh.png", as(png) replace
// Kernel density plot for age_hh with a normal density overlaid
kdensity age_hh, normal `graphs'
graph export "$figs\kdensity_age_hh.png", as(png) replace
// log transformation
gen log_age_hh =log(age_hh)
// Kernel density plot for log_age_hh with a normal density overlaid
kdensity log_age_hh, normal `graphs'
graph export "$figs\kdensity_log_age_hh.png", as(png) replace
// Normality
reg y $postvif [aw=Whh],r
predict resid, resid
// Kernel density plot for residuals with a normal density overlaid
kdensity resid, normal `graphs'
graph export "$figs\kdensity_resid.png", as(png) replace
// Standardized normal probability
pnorm resid , `graphs'
graph export "$figs\pnorm.png", as(png) replace
// Quantiles of a variable against the quantiles of a normal distribution
qnorm resid , `graphs'
graph export "$figs\qnorm.png", as(png) replace
// Numerical Test: Shapiro-Wilk W test for normal data
swilk resid
// Heteroscedasticity
reg y $postvif
// Residuals vs fitted values with a reference line at y=0
rvfplot , yline(0) `graphs'
graph export "$figs\rvfplot_1.png", as(png) replace
// Cameron & Trivedi's decomposition of IM-test / White test
estat imtest
// Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest
*===============================================================================
// Influence Analysis
*===============================================================================
// Graphic method < before >
reg y $postvif [aw=Whh]
// residuals vs fitted vals
rvfplot , yline(0) `graphs'
graph export "$figs\rvfplot_2.png", as(png) replace
// normalized residual squared vs leverage
lvr2plot , `graphs'
graph export "$figs\lvr2plot.png", as(png) replace
// Numerical method
// Step 1
reg y $postvif
// After regression without weights...
// Calculate measures to identify influential observations
predict cdist, cooksd // calculates the Cook's D influence statistic
predict rstud, rstudent // calculates the Studentized (jackknifed) residuals
// Step 2
reg y $postvif [aw=Whh]
// Predict leverage and residuals
predict lev, leverage // calculates the diagonal elements of the
// projection ("hat") matrix
predict r, resid // calculates the residuals
// Save useful locals
local myN=e(N) // # observations
local myK=e(rank) // rank or k
local KK =e(df_m) // degrees of freedom (k-1)
sum cdist, d
* return list
local max = r(max) // max value
local p99 = r(p99) // percentile 99
// Step 3
// For illustration...
// We have influential data points...
reg lny $postvif if cdist<4/`myN' [aw=Whh]
reg lny $postvif if cdist<`p99' [aw=Whh]
reg lny $postvif if cdist<`max' [aw=Whh]
// Identified influential / outlier observations
gen nogo = abs(rstud)>2 & cdist>4/`myN' & lev>(2*`myK'+2)/`myN'
count if nogo==1 // these are the obs that we want to eliminate
// Graphic method < after >
reg y $postvif [aw=Whh] if nogo==0
// residuals vs fitted vals
rvfplot , yline(0) `graphs'
graph export "$figs\rvfplot_2_after.png", as(png) replace
// normalized residual squared vs leverage
lvr2plot , `graphs'
graph export "$figs\lvr2plot_after.png", as(png) replace
*===============================================================================
// Model Specification tests
*===============================================================================
reg y $postvif
// Wald test for omitted vars <will compare with previous regression>
boxcox y $postvif, nolog // Box - Cox model
// Functional form of the conditional mean
reg y $postvif
estat ovtest // performs regression specification error test (RESET) for omitted variables
linktest //performs a link test for model specification
// Omnibus tests + Heteroscedasticity tests
reg y $postvif
estat imtest // Cameron & Trivedi's decomposition of IM-test / White test
estat hettest // Breusch-Pagan / Cook-Weisberg test for Heteroscedasticity
*===============================================================================
// Diagnostics for random effects
*===============================================================================
// Multilevel mixed-effects linear regression
mixed y $postvif || state:, reml
predict res, residual
predict eta, reffects
bysort state: gen first=_n
// One observation per state (random effects are constant within each state)
qnorm eta if first==1 , `graphs'
graph export "$figs\qnorm_mixed_1.png", as(png) replace
// Quantiles of a variable against the quantiles of a normal distribution
qnorm res , `graphs'
graph export "$figs\qnorm_mixed.png", as(png) replace
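The influence-analysis step in the do-file flags observations with |studentized residual| > 2, Cook's distance above 4/n, and leverage above (2k+2)/n. Those same cutoffs can be reproduced outside Stata; the sketch below (Python with simulated data — the planted outlier and variable names are purely illustrative) computes leverage, jackknifed residuals, and Cook's distance from first principles.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
Z = rng.normal(size=(n, 3))
Z[0] = 5.0                       # plant a high-leverage point...
X = np.column_stack([np.ones(n), Z])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=n)
y[0] += 8.0                      # ...that is also a gross outlier in y

k = X.shape[1]                   # rank of X, including the constant
H = X @ np.linalg.solve(X.T @ X, X.T)   # projection ("hat") matrix
lev = np.diag(H)                         # leverage: diagonal of H
resid = y - H @ y                        # OLS residuals
s2 = resid @ resid / (n - k)             # residual variance

# Studentized (jackknifed) residuals use the leave-one-out error variance
s2_i = (resid @ resid - resid**2 / (1 - lev)) / (n - k - 1)
rstud = resid / np.sqrt(s2_i * (1 - lev))
# Cook's distance
cooks = resid**2 / (k * s2) * lev / (1 - lev) ** 2

# Same joint rule as the do-file's "nogo" indicator
nogo = (np.abs(rstud) > 2) & (cooks > 4 / n) & (lev > (2 * k + 2) / n)
print("flagged observations:", np.flatnonzero(nogo))
```

Only the planted observation satisfies all three conditions jointly; as in the do-file, requiring the cutoffs to hold simultaneously keeps the rule conservative, so that merely unusual but harmless points are not discarded.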
References
Cameron, A Colin and Pravin K Trivedi (2005). Microeconometrics: Methods and Applications. Cambridge University Press. isbn: 978-0-521-84805-3.
Cochran, William G (2007). Sampling Techniques. John Wiley & Sons.
Cook, R Dennis (1977). “Detection of Inﬂuential Observation in Linear Regression”. In: Technometrics
19.1, pp. 15–18.
Corral, Paul, Kristen Himelein, Kevin McGee, and Isabel Molina (2021). “A Map of the Poor or a Poor Map?” In: Mathematics 9.21. issn: 2227-7390. doi: 10.3390/math9212780. url: https://www.mdpi.com/2227-7390/9/21/2780.
Corral, Paul, Isabel Molina, and Minh Cong Nguyen (2021). “Pull Your Small Area Estimates up by the Bootstraps”. In: Journal of Statistical Computation and Simulation 91.16, pp. 3304–3357. doi: 10.1080/00949655.2021.1926460. url: https://www.tandfonline.com/doi/abs/10.1080/00949655.2021.1926460.
Elbers, Chris, Jean O Lanjouw, and Peter Lanjouw (2003). “Micro–level Estimation of Poverty and
Inequality”. In: Econometrica 71.1, pp. 355–364.
Elbers, Chris, Jean Olson Lanjouw, and Peter Lanjouw (2002). “Micro-level Estimation of Welfare”. In:
World Bank Policy Research Working Paper 2911.
Ghosh, Malay and JNK Rao (1994). “Small Area Estimation: An Appraisal”. In: Statistical Science 9.1, pp. 55–76.
Groll, Andreas (2017). “glmmLasso: Variable Selection for Generalized Linear Mixed Models by L1-
penalized Estimation”. In: R package version 1.1, p. 25.
Groll, Andreas and Gerhard Tutz (2014). “Variable Selection for Generalized Linear Mixed Models by L
1-penalized Estimation”. In: Statistics and Computing 24.2, pp. 137–154.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). An Introduction to Statistical Learning. Vol. 112. Springer. isbn: 978-1-0716-1418-1. url: https://link.springer.com/book/10.1007/978-1-0716-1418-1?noAccess=true.
Lahiri, Partha and Jiraphan Suntornchost (2015). “Variable Selection for Linear Mixed Models with
Applications in Small Area Estimation”. In: Sankhya B 77.2, pp. 312–320.
Marhuenda, Yolanda, Isabel Molina, Domingo Morales, and JNK Rao (2017). “Poverty Mapping in Small
Areas under a Twofold Nested Error Regression Model”. In: Journal of the Royal Statistical Society:
Series A (Statistics in Society) 180.4, pp. 1111–1136. doi: 10.1111/rssa.12306.
Möhring, Katja and Alexander Schmidt-Catran (2013). MLT: Stata Module to Provide Multilevel Tools.
url: https://econpapers.repec.org/software/bocbocode/s457577.htm.
Molina, Isabel and Eduardo García-Portugues (2021). A First Course on Statistical Inference. Accessed:
2010-06-22. Bookdown.org. url: https://bookdown.org/egarpor/inference/.
Molina, Isabel and Yolanda Marhuenda (2015). “Sae: An R Package for Small Area Estimation”. In: The
R Journal 7.1, pp. 81–98.
Molina, Isabel and JNK Rao (2010). “Small Area Estimation of Poverty Indicators”. In: Canadian Journal
of Statistics 38.3, pp. 369–385.
Pfeﬀermann, Danny (2013). “New Important Developments in Small Area Estimation”. In: Statistical
Science 28.1, pp. 40–68.
Rao, JNK and Isabel Molina (2015). Small Area Estimation. 2nd. John Wiley & Sons.
Robinson, George K (1991). “That BLUP Is a Good Thing: The Estimation of Random Effects”. In: Statistical Science 6.1, pp. 15–32.
Tzavidis, Nikos, Li-Chun Zhang, Angela Luna, Timo Schmid, and Natalia Rojas-Perilla (2018). “From
Start to Finish: A Framework for the Production of Small Area Oﬃcial Statistics”. In: Journal of the
Royal Statistical Society: Series A (Statistics in Society) 181.4, pp. 927–979.
UCLA: Statistical Consulting Group (2022). Regression with Stata Chapter 2 – Regression Diagnostics.
url: https://stats.oarc.ucla.edu/stata/dae/robust-regression/.
West, Brady T, Kathleen B Welch, and Andrzej T Galecki (2014). Linear Mixed Models: A Practical Guide Using Statistical Software. 2nd. Chapman and Hall/CRC. doi: 10.1201/b17198. url: https://www.taylorfrancis.com/books/mono/10.1201/b17198/linear-mixed-models-brady-west-kathleen-welch-andrzej-galecki.
Chapter 7
Concluding Remarks
The objective of these Guidelines is to illustrate the application of small area estimation methods to
obtain estimates of welfare indicators at a geographical level below that achieved with regular sample
surveys. They concentrate on the production of small area estimates of poverty and related indicators
through model-based small area estimation – both area- and unit-level models. The Guidelines are
expected to serve as a practical guide to individuals working on the production of small area estimates
of poverty so that informed choices can be made.
The main contribution of the Guidelines is to provide a clear and easy-to-follow set of recommendations. Toward that goal, they organize some of the existing evidence on best practices for poverty mapping under the considered methodologies, helping practitioners make informed choices given their data limitations. The Guidelines also provide easy-to-follow, replicable scripts for much of the analysis presented, so that interested readers may deepen their understanding of the methods. Finally, the advantages and disadvantages of each method are presented to facilitate the user's decision-making.
In summary, the Guidelines present evidence on which approach may be best suited to different data environments. Evidence suggests that area-level models such as Fay-Herriot are quite useful when access to census microdata is not possible, although the gains in precision may be limited. When census microdata are accessible and the survey and census data are comparable, unit-level methods such as CensusEB will likely yield estimates of the highest quality. The importance of transforming the data to approximate the model's assumptions has also been highlighted, as has the importance of estimating location effects at the same level at which estimates will be reported: evidence shows that when the level of the random location effects and the level of reporting are not aligned, the resulting estimates are noisier.
The Guidelines also note that the small area estimation of poverty is an active research ﬁeld, and
methods are constantly being improved. Some of the more recent modeling approaches discussed in
the Guidelines require further methodological work. Interesting approaches have been suggested for
instances where the census and survey data are not aligned. However, there is still much work to be done
to reﬁne these methodologies and address the concerns highlighted in these guidelines and elsewhere.
In particular, hybrid alternatives such as unit-context models are likely to yield biased estimates because they are unable to replicate the full welfare distribution. Moreover, since the noise estimates for unit-context models are likely incorrect, their precision cannot be adequately evaluated. Similarly, the
Guidelines also suggest that the lack of adequate noise estimates for machine learning approaches such as
gradient-boosting limits their usefulness for the purposes of small area estimation of poverty, despite the
method showing promising results in a validation using the Mexican Intracensal survey. Future research
should focus on resolving these methodological concerns and constraints, as this would facilitate the
mainstreaming of these methods and further enhance the practitioners’ ability to produce reliable small
area estimates of poverty when access to census microdata is not available.