WPS7028 Policy Research Working Paper 7028 GLS Estimation and Empirical Bayes Prediction for Linear Mixed Models with Heteroskedasticity and Sampling Weights A Background Study for the POVMAP Project Roy van der Weide Development Research Group Poverty and Inequality Team September 2014 Policy Research Working Paper 7028 Abstract This note adapts results by Huang and Hidiroglou (2003) the poverty mapping approach put forward by Elbers et al. on Generalized Least Squares estimation and Empirical (2003). The estimators presented here have been imple- Bayes prediction for linear mixed models with sampling mented in version 2.5 of POVMAP, the custom-made weights. The objective is to incorporate these results into poverty mapping software developed by the World Bank. This paper is a product of the Poverty and Inequality Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at rvanderweide@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team GLS Estimation and Empirical Bayes Prediction for Linear Mixed Models with Heteroskedasticity and Sampling Weights A background study for the POVMAP project Roy van der Weide∗ Keywords: linear mixed models, small area estimation, Empirical Bayes, sampling weights, poverty, inequality JEL Classification: I32, C31, C43, C53 ∗ World Bank Research Department. Email: rvanderweide@worldbank.org. A big thank you goes to Chris Elbers for providing comments on an earlier version of this note. 1 Introduction The poverty mapping approach put forward by Elbers et al. (2003; henceforward ELL) makes it possible to estimate poverty and inequality at a highly disaggregated level. Depending on the geography of the country of interest, estimates of poverty might be obtained for areas as small as a city or community, which greatly facilitates the targeting of the poor among other applications (see e.g. Elbers et al., 2007). ELL achieve this by means of a massive out-of-sample prediction exercise that “imputes” income or consumption data for every household recorded in a population census. Once estimates of consumption are available for all households in the population this data can then be aggregated at almost any desired level of aggregation. The household consumption model used for prediction is estimated to data from a household income survey where the independent variables are restricted to those that are available in both the survey and the census. A linear mixed model is assumed which is standard in the small area estimation literature (see e.g. Rao, 2003). Spatial correlation between the residuals is accounted for by means of a nested error structure that consists of a random area effect and an idiosyncratic household effect. ELL believed their approach would be most convincing if the assumptions about the errors are kept to a minimum. Specifically, the household errors are allowed to be heteroskedastic and by default no assumptions are made about the shape of the error distribution functions. The ELL approach has been applied to obtain poverty maps in over 60 countries worldwide. Part of this success may arguably be attributed to its implementation in POVMAP, a custom-made software package developed by the World Bank that can be downloaded from the public domain at no cost.1 The POVMAP project has made ELL, a computationally intensive approach, available to a large audience of applied users (and has thereby greatly lowered the threshold for adopting ELL). The first version of POVMAP (i.e. POVMAP 1.0) ran under MS-DOS. A graphical user interface was added with the second version (POVMAP 2.0). Both versions of POVMAP closely follow the procedures from the original ELL publication. A decade has past since the original publication, a good time to take stock of new developments. The developments that we will focus on in this note are Empirical Bayes (EB) prediction married with the ELL approach (see Molina and Rao, 2010) while accounting for unequal sampling probabilities in the income survey (see Huang and Hidiroglou, 2003). EB prediction utilizes the survey data to narrow down the random area effects while non-EB prediction (i.e. conventional ELL) makes no such attempt. 1 POVMAP2 can be downloaded from: iresearch.worldbank.org/PovMap/ 2 As such, EB will only make a difference for areas that are represented in the survey (for other areas EB reduces to conventional ELL prediction).2 The objective of this note is to adapt the results by Huang and Hidiroglou (2003) on EB prediction for generalized linear mixed models with sampling weights to the ELL framework. Note that while the original paper by Molina and Rao (2010) implements EB prediction by assuming homoskedastic errors, this assumption is easily relaxed, as can be seen in this note. The introduction of sampling weights (as probability weights) also concerns the estimation of the model parameters, which in this case involves a modification to Generalized Least Squares (GLS). Our note functions as a background study for a milestone upgrade of POVMAP to version 2.5. Following Huang and Hidiroglou (2003) and Molina and Rao (2010) we assume nor- mally distributed errors when we present Empirical Bayes prediction. For a treatment of EB under less restrictive assumptions, see the recent study by Elbers and van der Weide (2014). In POVMAP 2.5, the user will have a choice between normal EB (Molina and Rao, 2010) and non-normal non-EB (ELL). The relative performance of these two options will depend on: (a) the size of the random area effect; (b) the number of small areas represented in the survey; and (c) the degree of non-normality of the errors. Nor- mal EB prediction is expected to do well if there are relatively large random area effects, if many of the small areas are covered by the survey, while the error distrbutions can be reasonably well approximated by a normal distribution. The outline of the note is as follows. Section 2 introduces the model framework and some notation. Section 3 presents the modification to the GLS estimator due to the introduction of probability weights. EB prediction is presented in Section 4, where we explicitly allow both for sampling weights and heteroskedasticity. 2 Model and notation Suppose that the (log) consumption data can be described by the following nested error regression model: yah = β T xah + ua + εah , (1) where the subscript ah refers to household h in area a, where yah denotes (log) household per capita consumption, xah denotes a vector containing m independent variables, and where ua and εah represent the area error and the household specific error with zero 2 The challenge is to identify the conditional distribution for the area error. When both the area and the household errors are normally distributed, it follws that the area error conditional on the survey data will also be normally distributed. If we allow the errors to be non-normally distributed however, then working out the conditonal distribution will no longer be a trivial exercise (see Elbers and van der Weide, 2014). 3 2 2 means and variances denoted by σu and σε,ah , respectively. The two errors are assumed 2 independent from each other. Note that σε,ah is permitted to vary between households, 2 while σu is assumed to be a constant. For ease of exposition, we will assume that the variance parameters are known.3 Let na denote the number of households sampled in area a, so that n = a na denotes the total sample size. Let wah denote the sampling weight for household ah. Let us also define W as the diagional matrix with the sampling weights wah along the diagonal (sorted by area), and define Ω as a diagonal matrix with the following matrices h wah along its diagonal (sorted by area): Ωa = 2 Ina , where Ina denotes the identity h wah matrix of dimension na . We will at times also represent the model in matrix notation: y = Xβ + u + ε. (2) Let R = E [εεT ] denote the diagonal matrix with the household error variances σε,ah 2 along the diagonal (sorted by area). We will denote the diagonal block of R correspond- ing to area a by Ra . Similarly, let Q = E [uuT ] be the block-diagonal matrix where the 2 blocks are given by Qa = σu 1na 1T na , where 1na denotes the unit vector of length na . 3 Estimation of β using GLS You and Rao (2002) derive a GLS estimator for β with sampling weights under the 2 2 assumption that σε,ah = σε for all households by solving weighted moment conditions. Huang and Hidiroglou (2003) have relaxed this assumption by permitting heteroskedas- 2 ticity, i.e. a non-constant σε,ah . Their GLS estimator reduces to the estimator of You and Rao (2002) if one were to insert constant variances (which we will confirm). The weighted GLS estimator for β from Huang and Hidiroglou (2003) satisfies: ˆw = (X T V β ˆ −1 y, ˆ −1 X )−1 X T V (3) w w with: ˆ + ΩQ, ˆw = W −1 R V ˆ (4) where the two matrices W and Ω are functions of the sampling weights only (see Section 2). 3 2 2 2 See the Annex for the estimation of σu and σε = E [σε,ah ]. For the estimation of the conditional 2 variances σε,ah we refer the reader to Elbers et al. (2003). 4 ˆw can be estimated by: The variance of β ˆw ] = (X T V var[β ˆ −1 V ˆ −1 X )−1 (X T V ˆV ˆ −1 X )−1 , ˆ −1 X )(X T V (5) w w w w with: ˆ + Q. ˆ =R V ˆ (6) Note that Vˆ and Vˆw are two different matrices. Also note that β ˆw reduces to the conventional GLS estimator if we insert constant sampling weights. 3.1 ˆw ] ˆw and var[β Expanding the expressions for β In this subsection we will attempt to further work out the expressions for β ˆw and var[βˆw ] with the objective to ease implementation. Note that β ˆw is a function of V ˆw−1 . We will drop the “hat” to ease notation. Due to the block-diagonal nature of Vw , we have that −1 its inverse Vw too will be block-diagonal where its blocks solve the inverse of the blocks of Vw . ˆw as follows: This allows us to re-write the expression for β ˆw = (X T V −1 X )−1 X T V −1 y β (7) w w −1 T −1 T −1 = Xa Va,w Xa Xa Va,w ya , (8) a a where Va,w denotes the area a block of Vw , and where Xa and ya denote the corre- sponding area a “blocks” of X and y , respectively, containing only the rows from area a. To further expand this expression let us work out the inverse of Va,w . Note that −1 Va,w = Wa Ra + Ωa Qa , where Wa and Ra are both diagonal matrices of dimension 2 na with wah and σε,ah along their diagonal, respectively. Recall that Qa is defined as 2 Qa = σu 1na 1T na , where 1na denotes the unit vector of length na . It will be convenient to represent the blocks Va,w as follows: 2 wah Va,w = Ra,w + σu h 2 1na 1T na , (9) h wah 2 σε,ah where Ra,w is a diagonal matrix of dimension na with diagonal elements given by wah . The inverse of Va,w is then seen to solve: −1 −1 γa,w −1 −1 Va,w = Ra,w − −1 Ra,w 1na 1T na Ra,w , (10) 1T na Ra,w 1na 5 where: 2 σu γa,w = 2 (11) 2 + h wah −1 −1 σu wah (1T na Ra,w 1na ) h 2 σu = −1 . (12) 2 + 2 wah σu h wah h wah 2 h σε,ah −1 −1 Given this expression for Vw , let us work out what this means for X T Vw X and T −1 X Vw y separately, and then put these back together to obtain the alternative repre- ˆw . We begin with X T V −1 X . sentation for β w −1 T −1 X T Vw X = Xa Va,w Xa a T −1 γa,w −1 −1 = Xa Ra,w − T −1 1 Ra,w 1na 1T na Ra,w Xa a 1na Ra,w na −1 −1 T −1 −1 Ra,w Ra,w = Xa Ra,w Xa − γa,w (1T T na Ra,w 1na )Xa −1 1na 1T na −1 Xa a 1T na Ra,w 1na 1T na Ra,w 1na wah wah = 2 xah xT ah − γa,w 2 ¯a,w x x ¯Ta,w , a h σε,ah h σε,ah with: 1 wah ¯a,w = x wah 2 xah . (13) 2 h σε,ah h σε,ah −1 By similar logic we obtain the following expression for X T Vw y: −1 wah wah X T Vw y= 2 xah yah − γa,w 2 x ¯a,w ¯a,w y , (14) a h σε,ah h σε,ah with: 1 wah ¯a,w = y wah 2 yah . (15) 2 h σε,ah h σε,ah −1 −1 Combining the expressions we obtained for X T Vw X and X T Vw y yields the fol- ˆ lowing expression for βw : −1 ˆw = wah wah β 2 xah xT ah − γa,w 2 ¯a,w x x ¯Ta,w a h σε,ah h σε,ah wah wah × 2 xah yah − γa,w 2 ¯a,w y x ¯a,w . a h σε,ah h σε,ah 6 2 2 2 If we assume constant variance σε,ah = σε , we have that σε drops from the equation altogether, in which case our expression for β ˆw is seen to coincide with the expression obtained by You and Rao (2002) under the same assumptions. Let us next try to re-write the expression for the variance of βˆw in a way that will make it easier to compute. Due to the block-diagonal nature of both Vw and V , it ˆw ] can be written as: follows that var[β ˆw ] = (X T V −1 X )−1 (X T V −1 V V −1 X )(X T V −1 X )−1 var[β w w w w −1 −1 T −1 T −1 −1 T −1 = Xa Va,w Xa Xa Va,w Va Va,w Xa Xa Va,w Xa , a a a where for ease of notation we have dropped the “hat” from the right-hand-side (RHS). T −1 Note that we have already expanded Xa Va,w Xa when we revisited the expression for ˆw , which leaves only X V Va V Xa . Let us first examine the matrix V −1 Va V −1 . β T − 1 −1 a a,w a,w a,w a,w Writing out the matrix multiplication yields: −1 −1 −1 −1 2 −1 −1 γa,w −1 −1 −1 Va,w Va Va,w = Ra,w Ra Ra,w + σu Ra,w 1na 1T na Ra,w − T −1 1 Ra,w 1na 1T na Ra,w Ra Ra,w 1na Ra,w na 2 −1 −1 γa,w −1 −1 −1 − σu γa,w Ra,w 1na 1T na Ra,w − −1 Ra,w Ra Ra,w 1na 1T na Ra,w 1T na Ra,w 1na 2 γa,w −1 −1 −1 − σu T −1 1 Ra,w 1na 1T T na Ra,w 1na 1na Ra,w 1na Ra,w na 2 γa,w −1 −1 −1 −1 + −1 Ra,w 1na 1T T na Ra,w Ra Ra,w 1na 1na Ra,w 1T na Ra,w 1na 2 γa,w −1 −1 −1 + σu γa,w −1 Ra,w 1na 1T T na Ra,w 1na 1na Ra,w . 1T na Ra,w 1na After rearranging terms we obtain: −1 −1 2 −1 −1 Va,w Va Va,w = σu (1 − γa,w )2 Ra,w 1na 1T T na Ra,w + Ba Ra Ba , where: −1 γa,w −1 −1 Ba = Ra,w − −1 Ra,w 1na 1T na Ra,w . (16) 1T na Ra,w 1na 7 T −1 −1 Inserting this into Xa Va,w Va Va,w Xa yields: T −1 −1 2 T −1 −1 Xa Va,w Va Va,w Xa = σ u (1 − γa,w )2 Xa Ra,w 1na 1T T T na Ra,w Xa + Xa Ba Ra Ba Xa 2 2 2 wah = σu (1 − γa,w ) 2 ¯a,w x x ¯Ta,w h σε,ah 2 wah + 2 ˇa,h xah xT w ah − γa,w x ˇT ¯a,w x a,w − γa,w x ¯T ˇa,w x 2 ¯a,w x a,w + γa,w x ¯Ta,w , h σε,ah h where: ˇa,w = x ˇa,h xa,h , w h with: 2 wah 2 σε,ah ˇa,h = w 2 wah . 2 h σε,ah Putting the terms together gives us the following elaborate expression for the vari- ˆw : ance of β −1 −1 ˆw ] = var[β T −1 Xa Va,w Xa T −1 Xa −1 Va,w Va Va,w Xa T −1 Xa Va,w Xa a a a   2 2 wah = C σu (1 − γa,w )2 2 x ¯T ¯a,w x a,w  CT a h σε,ah 2 wah + C 2 ˇa,h xah xT w ah − γa,w x ˇT ¯a,w x a,w − γa,w x ¯T ˇa,w x 2 a,w + γa,w x ¯T ¯a,w x a,w CT . a h σε,ah h where: −1 wah wah C= 2 xah xT ah − γa,w 2 x ¯T ¯a,w x a,w . a h σε,ah h σε,ah 3.2 Probability weighted OLS nested as a special case The weighted OLS estimator is effectively obtained by setting Vw = σ 2 W and V = σ 2 In , where σ 2 denotes the variance of the total error term. This yields the following expression for β ˜w : ˜w = (X T W X )−1 X T W y β −1 = wah xah xT ah T wah xah yah . a h a h 8 The corresponding variance solves: ˜w ] = σ 2 (X T W X )−1 X T W 2 X (X T W X )−1 var[β −1 −1 2 = σ wah xah xT ah 2 wah xah xT ah wah xah xT ah . a h a h a h Note that in Stata this estimator can be obtained by including the sampling weights as “probability weights” in the regular regression function (without using robust standard errors). 4 Empirical Bayes prediction assuming normality Here we are interested in identifying the distribution of the area random error ua condi- tional on the residuals ea for the households sampled from area a.4 This task is greatly simplified by assuming that both ua and εah are normally distributed, as is done by Huang and Hidiroglou (2003) and You and Rao (2002). It then follows that the dis- tribution of ua conditional on ea too will be normal. What remains is to identify the mean and variance of this distribution. Huang and Hidiroglou (2003) offer an estimate of the conditional mean E [ua |ea ] for the general linear mixed model. Applying their results to our nested error regression 2 model with potentially non-constant variances σε,ah , we obtain the following: h wah 2 T −1 E [ua |ea ] = u ˆa = 2 σu 1na Va,w ea , (17) h wah where ea = (ea1 , . . . , eana )T denotes the vector of area a residuals coming out of the (weighted) GLS regression. Note that we dropped the “hat” from the RHS to ease −1 notation. Substituting the expression we derived for Va,w (see eq. (10)) into eq. (17) yields: wah 2 T −1 γa,w −1 −1 u ˆa = h 2 σu 1na Ra,w − T −1 1 Ra,w 1na 1T na Ra,w ea h wah 1na Ra,w na −1 1Tna Ra,w ea = γa,w −1 1T na Ra,w 1na  wah  h 2 σε,ah eah = γa,w  wah . 2 h σε,ah 4 For ease of exposition we will treat the residuals ea as if they were observed data, i.e. as if β was known. In practice of course we will be working with estimates of ea . 9 Huang and Hidiroglou (2003) unfortunately do not offer an estimate of the variance ˆa , which is what we would need to implement Empirical Best of ua conditional on u estimation. One way to compute var[ua |uˆa ] is to appeal to the law of total variance: ˆa ]] + var[E [ua |u var[ua ] = E [var[ua |u ˆa ]] = E [var[ua |u ua ]. ˆa ]] + var[ˆ wah wah ua ] it will be convenient to define αah = To compute var[ˆ 2 σε,ah /( 2 h σε,ah ): ua ] = var[γa,w var[ˆ αah eah ] h 2 = γa,w var[ua + αah εah ] h 2 2 2 2 = γa,w σu + αah σε,ah . h Inserting this into eq. (18) gives us: 2 2 2 2 2 E [var[ua |u ˆa ]] = σu − γa,w σu + αah σε,ah . (18) h 2 2 It can be verified that under the assumption of constant variance σε,ah = σε , we have σ 2 2 2 2 2 2 2 that σu + h αah σε,ah simplifies to σu + σε h αah = γa,w . In this case the conditional u 2 variance is seen to take the form: var[ua |u ˆa ] = (1 − γa,w )σu , which coincides with the expression derived by You and Rao (2002) under the same assumptions. Interestingly, 2 var[ua |uˆa ] will be of the same form when we allow σε,ah to vary but assume the sampling weights to be constant. This representation of the conditonal variance does not apply 2 to the more general case however where both σε,ah and the sampling weights will vary across households. 10 2 2 5 Annex: Estimation of σu and σε A variation of Henderson’s method III estimator for the variance parameters that per- mits the use of sampling weights can be found in Huang and Hidiroglou (2003). Let us define (borrowing notation from Huang and Hidiroglou, 2003): SSE = ¯a,w )2 − wah (yah − y wah (yah − y ¯a,w )T × ¯a,w )(xah − x ah ah −1 T × ¯a,w )(xah − x wah (xah − x ¯a,w ) ¯a,w )(xah − x wah (yah − y ¯a,w ) ah ah   −1 t2 = tr  ¯a,w )T ¯a,w )(xah − x wah (xah − x 2 wah (xah − x ¯a,w )T  ¯a,w )(xah − x ah ah   −1 t3 = tr  wah xah xT ah 2 wah xah xT ah  ah ah   −1 t4 = tr  wah xah xT ah ( wah )2 x ¯T ¯a,w x a,w ,  ah a h where y ¯a,w = h wah xah /( h wah ) denote the weighted ¯a,w = h wah yah /( h wah ) and x mean of yah and xah , respectively. (Note however that these weighted mean variables are different from those defined in the main text for they use different weights.) 2 2 The estimators for the unconditional variances σu and σε can then be obtained as: 2 SSE ˆε,w σ = 2 wah ah wah − − t2 h a h wah 2 −1 2 ah wah yah −( ah wah yah xT ah ) ah wah xah xT ah ( ah wah yah xah ) − ( ah wah − t3 )ˆ 2 σε,w ˆu,w σ = . ah wah − t4 You and Rao (2002) find that the use of sampling weights makes little difference for the estimation of the variance parameters (they opt for leaving out the sampling weights for this purpose). 11 References Elbers, C., Fujii, T., Lanjouw, P., Ozler, B. and Yin, W. (2007). Poverty alleviation through geographic targeting: How much does disaggregation help? Journal of Development Economics, 83, 198–213. Elbers, C., Lanjouw, J. and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71, 355–364. Elbers, C. and van der Weide, R. (2014). Estimation of normal mixtures in a nested error model with an application to small area estimation of poverty and inequality. Policy Research World Bank Working Paper, no. 6962. Huang, R. and Hidiroglou, M. (2003). Design consistent estimators for a mixed linear model on survey data. Joint Statistical Meetings, Section on Survey Research Methods 1897–1904. Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38, 369–385. Rao, J. (2003). Small area estimation. London: Wiley. You, Y. and Rao, J. (2002). A pseudo-empirical best linear unbiased prediction ap- proach to small area estimation using survey weights. Canadian Journal of Statistics, 30, 431–439. 12