Policy Research Working Paper 11107 Non-Normal Empirical Bayes Prediction of Local Welfare Chris Elbers Roy van der Weide Development Economics Development Research Group April 2025 Policy Research Working Paper 11107 Abstract Estimates of the area- and household idiosyncratic error of local welfare is found to increase precision relative to distributions from household income and consumption normal-Empirical Best estimation. Although the gains regression models across 142 household surveys from 16 in precision range between meaningful and marginal, it different countries, the type of models that underpin pov- is always positive. Given that non-normal-Empirical Best erty maps, points to significant deviations from normality. estimation is furthermore easy to implement, there is no Accounting for non-normality in Empirical Best estimation downside to using it. This paper is a product of the Development Research Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at rvanderweide@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team Non-Normal Empirical Bayes Prediction of Local Welfare Chris Elbers and Roy van der Weide* * Chris Elbers (c.t.m.elbers@vu.nl) is at VU University Amsterdam and the Tinbergen Institute. Roy van der Weide (rvanderweide@worldbank.org) is at the World Bank. We gratefully acknowledge financial s upport from VU University Amsterdam and the World Bank’s Knowledge for Change Program (KCP II). A thank you also goes to Stephen Haslett, Peter Lanjouw, Isabel Molina, and David Newhouse for providing comments on earlier versions of this paper. 1 Introduction Elbers et al. (2003; henceforward ELL) have popularized the estimation of poverty and inequal- ity at the small area level, also referred to as “poverty mapping”.1 Poverty here measures the share of individuals whose income or consumption falls below a given poverty line. Collecting household income and consumption data is expensive. This is particularly true for developing countries where much of the income does not come from wage employment. Consequently, income and consumption data is typically only available in the form of household income and consumption expenditure surveys. The sample size of these surveys is sufficient for the estima- tion of national and possibly sub-regional welfare, but too small to estimate welfare at the level of small areas. Small areas here refer to administrative areas that are below the province level, think of district or possibly even the municipality level. Estimating poverty at the small area level is made possible by combining household survey data with unit record population census data.2 The census has data on variables such as demographics, education, employment and housing, which serve as predictors of household income, but not the household income variable itself. Crucially, data on these predictors are also collected by the income survey. The income survey is used to train a household income model that is then used to predict income for every household in the population census. These predicted incomes can subsequently be aggregated to obtain estimates of poverty at the small area level. This approach has been applied to obtain maps of poverty in over 60 countries worldwide. For reviews of the literature on small area estimation, see e.g., Haslett (2013, 2016), Rao and Molina (2015), Tzavidis et al. (2018), Das and Haslett (2019), and Corral et al. (2020). To accommodate spatial correlation in errors, a nested error structure is assumed where the total error is the sum of a household-specific error and a location error. The location error is typically modeled at the small area level. The total error observed for sampled households clearly carries information about the area random effect. Empirical Bayes (EB) estimation, also known as Empirical Best estimation, exploits this for the purpose of prediction, i.e. it uses the residuals observed in the survey to predict the location effects for sampled small areas, which in turn are used to predict local poverty rates. Working out the conditional distribution for the location error (conditional on observed residuals) requires making distributional assumptions. It is standard in the literature on EB 1 First attempts to estimate poverty at the small area level using aggregate data, dates back to the seminal work by Fay and Herriott (1979). 2 Population censuses provide complete coverage of a country and collects household and individual attributes that tend to be highly predictive of household welfare, think of household composition, education, employment, and dwelling characteristics, making it the ideal source of data for poverty mapping. An alternative source of data that similarly provides complete coverage of the country is remote-sensing data, think of night-time- lights, population density, land type and use, greenness, and local climate and pollution variables. The unit of observation at which these data can be connected to the household survey in this case would be the village level rather than the household level. For studies that have explored the use of remote-sensing data to obtain small area estimates of poverty, see e.g., Burke et al. (2021), Chi et al. (2022), Engstrom et al. (2022), Jean et al. (2016), Merfeld and Newhouse (2023), Newhouse et al. (2022), Newhouse (2023), Pokhriyal and Jacques (2017), Van der Weide et al. (2024), and the references therein. Studies vary in the resolution of imagery used, with higher resolution images allowing for a richer set of predictors that can be extracted from them (see e.g. Marx et al., 2019). 2 estimation to assume normally distributed errors, in which case the conditional distributions are also normal (see Molina and Rao, 2010). Assuming normality is however not without cost. Since poverty and inequality are non-linear functions of household consumption (and by extension non-linear functions of the errors), misspecification in the error distributions may introduce an error in the small area estimates of poverty. ELL makes minimal assumptions about the errors, motivated by their empirical application where the errors did not conform to a normal distribution. Unfortunately, the non-parametric estimator for the distribution functions adopted by ELL does not lend itself for Empirical Bayes (EB) estimation. Molina and Rao (2010; henceforward MR) instead prioritizes EB estimation under the assumption of normally distributed errors. Where ELL accept a loss in precision by not implementing EB estimation, MR accept a loss in precision that might stem from a misspecification of the error distribution functions. Throughout the paper, higher precision will refer to lower Root Mean Squared Error (RMSE). The approach put forward in this paper accommodates both EB estimation and non- normally distributed errors. We achieve this by fitting finite normal mixtures (NM) to the error distribution functions. Normal mixtures are extremely flexible; they are able to fit any well-behaved distribution function, and are ideally suited for accommodating EB estimation. With respect to the household idiosyncratic error term, we follow ELL by drawing from the empirical realizations of the idiosyncratic errors (with replacement).3 To verify what deviations from normality (or degrees of non-normality) are empirically relevant, we estimate the respec- tive error distributions using data from 142 household surveys across 16 countries from different regions and different income groups. Our findings are the following. An inspection of the estimated area- and idiosyncratic error distributions across the 142 household surveys reveals significant deviations from normality, especially for the household idiosyncratic component. Household errors have skewness ranging from -1.5 to 1.5 and kurtosis from 4 to 10 (the skewness of area errors ranges between -1 and 1, and kurtosis from 2 to 8, disregarding outliers). When the area error is small (2.5 percent of total error or less), the benefit of accounting for non-normality is seen to outweigh the loss stemming from foregoing Empirical Best prediction. As soon as the share of area error approaches 5 percent or higher, EB prediction dominates non-EB prediction across practically all empirically observed error distributions. Non-normal-EB prediction performs best in all cases. The improvement in performance relative to normal-EB is however comparatively marginal at higher levels of the location effect. Similarly, the improvement is marginal when compared to ELL at near-zero levels of the location effect. While the gains may range between marginal (near-zero reduction in RMSE) and meaningful (up to a 15-25 percent reduction in RMSE), they are always positive. Given that non-normal-EB prediction is easy to implement, there is no downside to using it. Highly disaggregated estimates of poverty and inequality have inspired a range of applica- 3 Another way of relaxing the normality assumption is explored in Bikauskaite et al. (2020), where a mixture is applied to the full nested error model, rather than fitting flexible mixture distributions to the respective error distributions separately. 3 tions. A natural use of poverty maps is for the targeting of social assistance programs, see e.g., Banerjee et al. (2025) and Smythe and Blumenstock (2022). Elbers et al. (2007) show in empirical applications to Madagascar, Cambodia, and Ecuador, that reductions in poverty achieved with uniform transfers can be realized with less than half the budget if the transfers could be targeted on the basis of district or municipality level estimates of poverty. Small area estimates are also increasingly used as regressors in empirical analysis, see e.g. Araujo et al. (2008), Baird et al. (2013), Bazzi (2017), Crost et al. (2014), Demombynes and Ozler (2005), Elbers et al. (2005), Maloney and Caicedo (2015), and Mendez and Van Patten (2022).4 To highlight some examples, Araujo et al. (2008), using data from Ecuador, find that villages with higher levels of inequality are less likely to invest in public goods that would benefit the poor (which they attribute to elite capture). Demombynes and Ozler (2005) find that inequality at the small area level has a positive relationship with local crime rates in an empirical application to South Africa. Finally, the multiple imputation methods underpinning small area estimation are also finding applications in the monitoring of poverty over time at a more aggregate level in contexts where household consumption surveys have been discontinued or interrupted, prompt- ing researchers to impute household consumption data into alternative surveys, see e.g., Tarozzi (2007) and Sinha Roy and Van der Weide (2025). The remainder of this paper is organized as follows. The problem statement is formulated in Section 2. In Section 3 we briefly summarize the two competing approach adopted in the existing literature: Non-normal-Non-EB and Normal-EB. The newly proposed approach, Non- normal-EB, is presented in Section 4. Sections 5 presents a simulation study based on empirical data from 142 household surveys across 16 countries, respectively. Finally, Section 6 concludes. 2 Problem statement: Small area estimation of welfare Let yah measure log per capita income (or expenditure) for household h residing in area a, and let sah denote the number of household members for that same household.5 Suppose that at the household level the data generating process (DGP) satisfies: yah = xT ah β + ua + εah , (1) where xah is a vector with independent variables, and where ua and εah are zero expectation error terms that are independent of each other.6 The subscripts indicate target area (or domain) 4 Data combination models can also be applied beyond small area estimation applications, see e.g., Graham et al. (2016). 5 Ideally the household income or consumption expenditure data is adjusted for spatial and temporal price differences, i.e., is measured in real terms. As this is not the focus of this study, we will be abstracting away from that, but refer the interested reader to e.g., e.g. Gibson et al. (2017) and van Veelen and van der Weide (2008). 6 The assumed nested error structure is a common and practical way of accounting for spatial correlation between errors. The nested error structure could be extended to a two-fold nested error model, see e.g., Marhuenda et al. (2017). Alternatively one could assume a spatial auto-correlation structure, see e.g., Bell and Bockstael (2000), Pratesi and Salvati (2008), and Kelejian and Prucha (2010). In that case, the correlation between errors could be modeled as a smooth decaying function of distance rather than the step function implied by the nested error structure. The regressors included in xah typically includes interactions and transformations of selected 4 a and household h.7 Let us assume that errors are homoskedastic, so that for each household h 2 + σ 2 . Throughout the paper it is assumed that consistent and area a we have: var[yah |xah ] = σu ε ˆu estimators for the variance parameters are available, which we shall denote by σ 2 and σ ˆε2 .8 We will not make any assumptions about the shape of the error distributions at this point. Let A be the total number of areas covered by the income survey and let na be the number A of households that have been sampled in area a, so that n = a=1 na equals the total sample size of the survey. We shall denote the total household error by: eah = yah − xT ah β , and its area a average by e ¯T ¯a − x ¯a = y ¯a = a β where e h eah /na . We will also use the notation ea = (ea,1 , . . . , ea,na ) which is a vector of length na that contains the residuals for all households from area a. With a slight abuse of terminology we ¯a as data (as if we know the parameter vector β ). Let will at times refer to the errors eah and e y(a) and e(a) denote vectors of length Na with elements yah and eah for all households from the population. Similarly, x(a) will denote a matrix with rows given by xT ah for all Na households. ya , ea and xa will denote the survey sample analogues. Finally, let the probability distribution functions (probability density functions) for ua and εah be denoted by Fu (p(u)) and Gε (p(ε)). The objective is to estimate: E [W (y(a) )|x(a) , ya ] = W (x(a) β + e)p(e|ea )de (2) ≃ W (x(a) β + u + ε)p(ε)p(u|ea )dεdu, (3) where p(u|ea ) is the conditional probability density function for the location area error u (con- ditional on observed residuals) and where the welfare function W will generally be non-linear. Popular choices of W are the share of individuals whose income falls below a pre-specified poverty line (also known as the head-count poverty rate) and the Gini index of income in- equality. Head-count poverty is a special case of the Foster–Greer–Thorbecke family of poverty indices, see Foster et al. (1984). In effect EB estimates are obtained by integrating out area errors ua using probability density p(ua |ea ), i.e. the probability density of ua conditional on ea observed from the survey. Non-EB estimates are obtained by using the unconditional density p(ua ). The challenge is to work out p(ua |ea ) along with the marginals p(ua ) and p(εah ) without imposing restrictions about their form so that the resulting conditional density p(ua |ea ) can also take on any form. Currently, the covariates to account for area heterogeneities (see Tarozzi and Deaton, 2009; Tarozzi, 2011). See also Das and Chambers (2017) and Lahiri and Salvati (2023). Model selection methods such as Lasso as well as machine learning methods are also increasingly used to identify the relevant model variables (including interactions and transformations). See e.g., Corral et al. (2024) and the references therein for an overview on the use of machine learning in this context. 7 These areas may refer to geographic areas such as districts or municipalities, but also to non-geographic domains such as ethnic groups or age groups, say. 8 A commonly used estimator for the variance parameters from a nested error model is Henderson’s method III estimator (see Henderson, 1953; and Searle et al., 1992), which may be viewed as a method of moments estimator which does not require any assumptions about the shape of the error distributions. Alternative estimators are restricted maximum likelihood, minimum norm quadratic unbiased estimation (MINQUE; see e.g. Westfall, 1987; Searle et al., 1992), and so-called spectral decomposition estimation (see e.g. Wang and Yin, 2002; and Wu et al., 2009). 5 literature on EB estimation avoids this challenge by assuming normally distributed marginals, in which case the conditional distribution will be normal too. Estimation of β is obviously of primary importance when estimating E [W ], but in this paper we will concentrate on the error distributions and assume that β is known. Even if β is estimated perfectly, getting the error distribution wrong still has the potential to introduce a bias. For non-linear W , the expected value E [W ] will be a function of the higher moments of the distributions of both u and ε. Getting these moments wrong (i.e., getting the distribu- tions wrong) is one source of bias. Errors in the conditional density p(u|ea ), stemming from misspecification of the error distributions, is another source of bias. The latter bias persists even for linear W . Recently, Skrondal and Rabe-Hesketh (2009) and McCulloch and Neuhaus (2011) have explored the magnitude of the bias that is introduced by getting the first moment of p(u|ea ) wrong and found it to be modest in the case of linear W . Accordingly, our proposed estimator is anticipated to provide more value added in the case of nonlinear W . Standard errors can be obtained by means of simulation which is ideally suited for estimating quantities that are non-linear functions of the random variables at hand, as is the case with measures of poverty and inequality. Let R denote the number of simulations. The estimator then takes the form: R 1 (r) ˆ= µ ˜(a) , s(a) , W y (4) R r=1 (r) (r) ˜(r) where y ˜(a) denotes the r-th simulated (or predicted) income vector with elements y˜ah = xT ah β + (r) (r) ˜(r) and the errors u(r) (r) ˜a + ε u ˜ah . With each simulation, both the model parameters β ˜a and ε˜ are ah drawn from their estimated distributions.9 In the end this gives R simulated poverty rates. The point estimates and their corresponding standard errors are obtained by computing respectively the average and the standard deviation over these simulated values. 3 Two alternative approaches In this section we will briefly discuss the two most-commonly adopted approaches to date: (1) Empirical Best (EB) prediction advocated by Molina and Rao (2010; MR) which assumes that errors are normally distributed, and (2) the approach put forward by Elbers et al. (2003; ELL) that allows for non-normally distributed errors, but does not adopt EB prediction. Moving forward, we will also refer to these approaches as non-normal-non-EB and normal-EB, respec- tively. 9 Our preferred method is to draw β ˜(r) by re-estimating the model parameters using the r-th bootstrap version ˜ of the survey sample. Alternatively, β (r) may be drawn from its estimated asymptotic distribution. The difference between these two alternatives is expected to be modest, unless the survey sample is particularly small so that finite sample effects may play a role. 6 3.1 ELL: Non-normal-non-EB estimation ELL makes minimal assumptions about the errors ua and εah . They proceed by first obtaining estimates of the area errors ua and εah , which then allows them to sample from the empirical ˆa and ε errors u ˆah . ua is estimated as the simple area average of the total residuals (appropriately re-scaled so that the sample variance of u 2 that corrects for the ˆa equals the estimate of σu ˆa contribution of the area average of εah ). The estimate for εah is obtained by subtracting u 2 ).10 Note that u from the total residual (re-scaled so that its sample variance matches σε ˆa equals ¯a . The latter term will typically be small in magnitude. It is assumed that the sum of ua and ε its effect on the estimator for the distribution of ua will be negligible. The advantage of this approach is that it does not restrict the shape of the error distribu- tions. A shortcoming is that it does not take full advantage of all the available data, i.e., it does not lend itself for Empirical Bayes (EB) estimation. Specifically, it does not offer a means of evaluating the conditional distribution of ua (conditional on the mean residual for observed for area a in the event area a is included in the survey sample), i.e., p(u|ea ). Working out the conditional distribution is not a trivial exercise without making further distributional assump- tions. Consequently ELL decided to forego EB estimation altogether. In doing so, they have accepted a certain loss in efficiency by not fully utilizing all available information. 3.2 Normal-EB estimation MR put forward an approach that prioritizes EB estimation. The residuals ea (for area a covered by the survey) clearly carry information about the area random effect ua . In the extreme where ¯a will perfectly reveal ua . They closely follow ELL with the notable differ- na tends to infinity e ence that the area errors are drawn from the conditional distribution (for areas covered by the survey sample) under the assumption that the errors are normally distributed. As we have seen, ˜a is normal too and its parameter can be under this assumption the conditional distribution of u easily determined.11 Where ELL accept a loss in precision by not implementing EB estimation, Molina and Rao (2010) accept a loss in precision that might stem from a misspecification of the error distribution functions. ELL was motivated by estimating poverty and inequality in developing countries where the share of small areas (or domains) that are covered by the income surveys typically ranges between of 5 to 25 percent of all domains in the population. In this case the benefits of EB estimation may be more modest. However, in more developed countries, or countries where travel costs that are incurred when covering all small areas are manageable, income surveys often cover a much larger number of the domains. In fact, there are numerous examples where surveys cover the majority (up to 100 percent) of domains in the country. In those instances, there may be clear benefits to adopting EB estimation. 10 2 Note that ELL allows for heteroskedasticity, so that σε can be household-specific. 11 This is arguably why errors are generally assumed normal in the literature on EB estimation. 7 4 Non-Normal-EB estimation The approach presented in this paper builds on both ELL and MR. Like ELL we make no re- strictive assumptions about the error distributions. Unlike ELL however, we also accommodate EB estimation. We achieve this by fitting finite normal mixtures (NM) to the error distribu- tion functions. Normal mixtures are extremely flexible; they are able to fit any well-behaved distribution function, and are ideally suited for accommodating EB estimation. If the marginal distributions of ua and εah can be described by normal mixtures, then the conditional distribu- tion too can be described by a normal mixture with known parameters (that are functions of these parameters and of the data on which is being conditioned). There are several other studies that have explored different ways of relaxing the normality assumption in mixed linear models. Verbeke and Lesaffre (1996) is an early example that also considers normal mixtures as a “non-parametric” representation of the error distribution function. However, they impose a number of important restrictions. First, only the area random effects ua are allowed to be non-normal; a normal-mixture is fitted to the distribution of ua under the assumption that εah is normally distributed. Second, it is assumed that the “component distributions” that make up the normal mixture share a common variance, which noticeably simplifies estimation but at the same time significantly limits the flexibility of the normal mixture to fit any given distribution function. This approach has also been followed by Cordy and Thomas (1997) who work with the same setup and adopt the same set of restrictions. There is another strand of the literature that permits both error terms to be non-normally distributed by imposing an alternative parametric family for the distribution functions. See for example Zhou and He (2008) who fit skewed t-distributions to the nested errors of the linear mixed model.12 Recently, there have also been efforts to explore the impact of misspecifications in the error distributions for Empirical Bayes predictions derived from linear mixed models, see for example Skrondal and Rabe-Hesketh (2009) and McCulloch and Neuhaus (2011). They conclude that the bias is reasonably small. It should be noted however that those studies focus on prediction of the dependent variable itself, in which case the Empirical Bayes estimates of the area random effects ua denote the only source of bias. Misspecifications in the error distributions become considerably more important when predicting nonlinear functions of the dependent variable, such as measures of poverty and inequality, as we will show in this paper. 4.1 Specification Let us assume that Fu and Gε can be represented by mixture distributions: i=mu Fu = πi Fi (5) i=1 j =mε Gε = λ j Gj , (6) j =1 12 Diallo and Rao (2018) relax the normality assumption by considering skew-normal errors. 8 where the Fi ’s and Gj ’s denote a basis of distribution functions which we will also refer to as components or component distribution functions. The πi ’s and λj ’s denote unknown non- negative probabilities that satisfy i πi = 1 and j λj = 1, which we will also refer to as mixing probabilities.13 mu and mε denote the number of components used to represent Fu and Gε , respectively. We will denote the probability density functions associated with Fi and Gj by respectively fi and gj . Mixture distributions are remarkably well equipped to fit any well-behaved distribution function. For example, kernel density estimators are closely related to mixture distributions. We will be working with normal component distributions, so that the mixture distributions are normal mixtures. Assumption 1 The components Fi are normal distribution functions with mean µi and vari- 2 . Similarly, components G are normal distribution functions with mean ν and variance ance σi j j 2. ωj Note that the modeler is at liberty to work with a different basis of component distributions. This choice does not have real implications for the ability of the mixture to fit a given distribution function. If p(ua ) and p(εah ) are normal-mixtures, then p(ua |e ¯a ) is a normal mixture too. This is a powerful result as the integral in equation (3) will generally have to be computed by simulation, while sampling from normal mixtures is straightforward. Lemma 2 shows how the parameters that define p(ua |e ¯a ) can be obtained as a function of the parameters of the normal-mixtures p(ua ) and p(εah ). Implementing EB estimation is thus as easy as sampling the area errors from the normal-mixture p(ua |e ¯a is observed in the survey sample. ¯a ) whenever e ¯a , which we denote by p(ua |e Lemma 2 The probability density function of ua conditional on e ¯a ), is a normal-mixture with known parameters: i=mu j =mε k=mε p(ua |e ¯a ) = wijk φ ua ; mijk ; s2 ijk , (7) i=1 j =1 k=1 where: 2 2 + ω2 ωj σi k mijk = 2 + (ω 2 + ω 2 )/n ea − (νj + νk )/na ) + (¯ 2 + ω2 + n σ2 µi σi j k a ωj k a i 2 + ω 2 )σ 2 (ωj k i s2 ijk = 2 + ω2 + n σ2 , ωj k a i ˜ijk / and where wijk = w ijk ˜ijk with: w 2 2 2 ˜ijk = πi λj λk φ(¯ w ea ; µi + (νj + νk )/na ; σi + ( ωj + ωk )/na ), (8) 13 An alternative specification, where a mixture is applied to the full nested error model, is explored in Bikauskaite et al. (2020). In effect, this assumes that the normal mixture distributions for the area and house- hold errors share the same mixing probabilities, rather than specifying flexible normal mixtures for the area- and idiosyncratic error distributions separately. 9 2 ) and (λ , ν , ω 2 ) where φ denotes the normal probability density function, and where (πi , µi , σi j j j denote the parameters associated with the normal-mixture distributions Fu and Gε , respectively. ¯a to a If we apply the Central Limit Theorem to approximate the marginal distribution of ε 2 /n , the expression for p(u |e normal distribution with mean zero and variance equal to σε a a ¯a ) simplifies considerably: i=mu 2 2 −1 ¯a ) ≈ p(ua |e ¯a + (1 − γai )µi ; 1/σi αi φ ua ; γai e + na /σε , (9) i=1 2 /(σ 2 + σ 2 /n ), and where α = α with γai = σi ˜i/ i ε a i ˜i iα with: 2 2 ˜ i = πi φ(¯ α ea ; µi ; σi + σε /na ). (10) ¯a , given the density function p(ua |e The expected value of ua conditional on e ¯a ) from Lemma 2, is seen to solve: ¯a ] ≈ E [ua |e αi (¯ ¯a + (1 − γai )µi ) , ea ) (γai e (11) i where γai = 2 /(σ 2 σi + 2 /n ), σε ea ) denotes the mixing probabilities of p(ua |e and where αi (¯ ¯a ). i a Note that the standard assumption of normal errors is nested as a special case, where there 2 = σ 2 . The first and second moment of the normal is just one component with µi = 0 and σi u conditional density p(ua |e ¯a ) in this case are seen to solve: E [ua |e ¯a ¯a ] = γa e (12) 2 ¯a ] = (1 − γa )σu var[ua |e , (13) 2 /(σ 2 + σ 2 /n ) (see e.g. Molina and Rao, 2010).14 For non-normal errors, we have where γa = σu u ε a that E [ua |e ¯a , and var[ua |e ¯a ] is generally a non-linear function of e ¯a ] will generally be a function ¯a . of e 4.2 Estimation 2 ) and (λ , ν , ω 2 ), which determine the The objective is to estimate the parameters (πi , µi , σi j j j i=mu 2 ) and G (ε) = i=mε 2 ), re- normal mixtures (NM) Fu (u) = i=1 πi Fi (u; µi , σi ε j =1 λj Gj (ε; νj , ωj spectively, where mu and mε denote the number of components. Estimation of NM to observed data is a routine task. A commonly used method of estimation is the Expectation-Maximization (EM) algorithm, which is an iterative method to find local maximum likelihood estimates of the parameters of interest, see e.g. Dempster et al. (1977) and McLachlan and Peel, (2000). Suppose that some data y is drawn from a NM with m components and parameter vector θ. The EM algorithm introduces a latent random variable zij that equals 1 if observation yj has been drawn from component i, and 0 otherwise. Let the log-likelihood function that treats both 14 2 2 2 2 2 Note that the unconditional variance solves: var[ua ] = var[γa e¯a ] + (1 − γa )σu = γa (σ u + σε /na ) + (1 − γa )σu , 2 2 2 2 2 2 which equals var[ua ] = γa σu + (1 − γa )σu = σu , since σu + σε /na = σu /γa . 10 y and z as data be denoted by L(θ; y, z ), and its derivative with respect to θ by Lθ (θ; y, z ). The corresponding maximum-likelihood (ML) estimator solves the following moment condition: E [Lθ (θ; y, z )] = E [E [Lθ (θ; y, z )|y ]] = 0. (14) The EM algorithm essentially solves the same moment condition but iteratively: ˆ(k) ]] = 0, E [E [Lθ (θ; y, z )|y, θ (15) ˆ(k) denotes the iteration-k estimate for θ. Solving equation (15) yields θ where θ ˆ(k+1) . The ˆk ). The advantage of the EM procedure results in a non-decreasing sequence of likelihoods L(y ; θ algorithm is that solving equation (15) is often considerably easier than solving equation (14). Note that the unobserved data z is integrated out conditional on the observed data y and the current estimate of θ. 5 Simulation study 5.1 Setup To evaluate the relative performance of the different small area estimation methods in realistic settings, we implement a simulation study where the population census and household surveys are based on data from actual household income and expenditure surveys across 16 different countries and multiple time periods. Consider the DGP of equation (1) repeated below: yah = xah β + ua + εah . (16) For explanation of symbols and assumptions see the discussion following equation (1). The DGP specifies three statistically independent determinants of a household’s (log) per capita expenditure or income: a structural or ‘fixed’ part, xah β , an area random effect, ua , and an household-specific random effect εah . 5.1.1 Fixed part: xβ For the fixed part we use the expected household log per capita income from Brazil’s 2010 census, as predicted from linear regression. We argue that this is more appropriate to represent the fixed part, with empirically observed cross-area variation, than making arbitrary assumptions on the distribution of xah β . More specifically we use the data for Minas Gerais from the population census. This census covers more than 10% of the population. Uniquely, the census includes a household income variable. Log per capita household income is regressed on a set of explanatory variables that are commonly included in empirical applications of small area poverty estimation: household size, education, age, housing characteristics, assets, access to services, etc. This allows us to compute xah βˆ for all households in the Minas Gerais census sample. At 60% the R2 of the regression is comparatively high but the purpose of the regression is to compile a ‘fixed’ 11 part of the DGP in equation (1), to be used in otherwise synthetic data. The Minas Gerais census sample is further limited to municipalities with more than 1000 households in the sample resulting in a dataset of 345733 households from 128 municipalities. The municipality means of ˆ range from 5.7 to 6.67; the standard deviations range from 0.54 to 0.87. xah β 5.1.2 Random part: ua and εah The random parts, the area- and household error ua and εah , are based on empirical realizations of the errors derived from an inclusive set of income and expenditure regressions that spans 16 countries across multiple years, stratified by urban/rural, for a total of 142 surveys. The full list of household surveys used is documented in Appendix section 3. Taking stock of the range of error distributions observed in real data across a wide spectrum of different countries and time periods offers a first empirical contribution. Henderson’s method III is used to decompose the total residuals in a part common to all households from a given area (the area error term ua ) and a household-specific component (the idiosyncratic error term εah ).15 We proceed by estimating non-parametric density functions for the area- and household errors, and subsequently draw ua from the area-level density and εah from the household-level density. Given realizations of both the fixed and random components, household log income yah is computed as: ˆ + λ1 ua + λ2 εah , yah = xah β (17) where scaling parameters λ1,2 allow us to control the relative size of area and household effects in the simulations. The errors ua and εah in this case refer to normalized independent random variables with mean zero and unit variance, such that the location effect (share of total error variance due to the area error) equals: s = λ2 2 2 1 /(λ1 + λ2 ). The following shares of the location effect are considered: 0, 0.025, 0.05, 0.075, and 0.1, which represents the typical range of values observed in poverty mapping applications. A share of 0.1 would be high with the majority of empirically observed location shares ranging between 0.02 and 0.08. Keeping the size of the location effect fixed across different country-years helps isolate the role of deviations from normality as all variation observed across country-years is purely driven by variation in the degree of non-normality. The overall scale of λ1 ua + λ2 εah is chosen such that a regression ˆ is expected to result in an R2 of 50%, a value typically found with income of yah on xah β regressions. After a complete synthetic census of (log) incomes has been generated for households in the Minas Gerais census using equation (17), household per capita incomes are then used to classify households as poor or non-poor. From this, we compute area-level poverty head-count rates (in terms of households). Two choices of poverty lines are considered. One that yields a headcount rate of 12.5 percent and one that yields a rate of 25 percent. The results indicate that this choice is of lesser significance. Particular simulations are defined by poverty line, variance share 15 erez, Pe˜ See P´ na and Molina (2011). 12 of the location effect, country, year and urban status. 5.1.3 Drawing of the household survey Starting from the census dataset, we draw a balanced sample of households clustered by area (i.e., Minas Gerais municipality) with a cluster size of 20 households. A key parameter in survey sampling designs is the share of areas covered by the survey. Increasing this coverage will naturally benefit applications of Empirical Bayes estimation, but will also increase the costs of data collection (by increasing travel costs which constitutes a key component of total data collection costs). In practice this share can be as low as 10 and as high as 100 percent, with higher income countries often opting for higher coverage rates. In the original empirical application to Ecuador by ELL, the coverage rate was 17 percent (and the location effect as a share of total error is estimated at 0.02). Instead of varying the coverage rate, which would introduce another parameter, we consider the two limiting scenarios: (a) one where all areas are included in the sample (100 percent coverage), and (b) the other extreme where the number of areas covered by the survey is negligible, which is implemented by not employing EB estimation in any of the areas. By considering both extremes, we economize on the number of parameters to be varied in the simulation study. Given that the focus of this study is on relaxing the normality assumption in applications of Empirical Bayes estimation, the sampling design is assumed to be non-informative. Allowing for informative sampling designs, building on the work by e.g., Pfeffermann et al. (1998), Pfeffermann et al. (2006), and Pfeffermann and Sverchkov (2007), is a natural extension that could be considered in future work. 5.1.4 Small area estimation methods All methods work with the same estimates of the fixed part xβ .16 They only differ in how the errors ua and εah are imputed. These are then combined with the predicted fixed part of household log per capita income to yield household-level per capita income and area-level poverty predictions. 1. ELL: This approach is in effect non-normal-non-EB estimation. It was developed for a context where surveys provide incomplete coverage of small areas in which case the scope for applying Empirical Bayes is reduced. Area effects and household effects are simply drawn (with replacement) from the empirical estimates of ua and εah obtained using Henderson-method-3 to account for possible non-normality in errors. 2. Normal-EB: This method assumes that ua and εah are normally distributed. We as- sume zero mean distributions for the error components with variances as estimated by Henderson-method-3. Area effects are drawn conditionally on the residuals for survey households. 16 The coefficient estimates β used to impute the fixed part are subject to sampling error due to the particular survey that has been drawn. 13 3. Non-normal-EB: Like ELL this method draws each household effects directly (with re- placement) from the estimated idiosyncratic errors. Area effects ua are drawn from a conditional normal mixture according to Lemma 2. Conditioning is on the estimated area component for each area. Another candidate method is to resample total errors per area. Since there are as many as 20 households per area in the survey, it would be interesting to see if simply resampling total regression residuals per area (i.e., not attempting to split residuals in area and household components) performs well. It has the additional advantage of being robust to heteroskedasticity of error components across areas. We implemented this approach but do not report results for this method. While it outperforms ELL it is inferior to non-normal-EB, which is the main focus of this study. 5.1.5 Estimation of normal mixture distributions Obtaining estimates of the conditional normal mixtures for the area errors used in the non- normal-EB method involves two steps. First, fit a normal mixture to the estimated area errors ua (i.e., estimate the unconditional area error distribution). Second, apply Lemma 2, which uses the estimated unconditional normal-mixture as an input, to obtain estimates of the con- ditional normal mixtures for the area errors. For the estimation of the unconditional normal mixture distributions to the area errors, we adopt the EM algorithm. Identifying the number of components can require experimentation. To avoid this we have automated the search for mixtures, which involves fitting normal mixtures with 1, 2, and 3 components, as follows:17 a. As initial values use for the means 0 in the case of 1 component, -1 and 1 in the case of 2 components, and -1, 0, and 1 in the case of 3 components b. Run the EM algorithm until convergence or stop after 1000 iterations c. Integrate absolute difference between the estimated density function and the kernel density function of the normal-mixture approximation d. Choose the normal mixture with the smallest difference This automated procedure facilitates application across the large number of different surveys. Although this yields satisfactory approximations it can be argued that better approximations can be achieved with more sophisticated strategies (or tailoring specification to dataset at hand). In that sense whatever advantage our simulation method has over the other approaches can be considered a lower bound. Alternatively, one could consider restricting the variance of the components distributions to be the same across components and offset this restriction by increasing the number of components (akin to fitting a kernel density). The results obtained with this choice of normal mixture estimator (not reported here) are practically identical. 17 Note that 1 component corresponds to a normal distribution. 14 5.1.6 Performance metric All methods impute log per capita incomes at the household level, which translates to area-level head-count rates and, by comparison with the ‘actual’ head-count rates from the synthetic cen- sus, an overall Mean Squared Error (MSE). This study focuses on overall aggregate performance of the methods. A single survey is drawn from the synthetic census for each imputation, which is repeated 100 times. For each area, the 100 imputations yield an average simulated head-count poverty rate, ˆa , which is compared to the ‘true’ poverty rate µa from the synthetic census. Our main µ ˆa . Let na be the number of households in area a and performance indicator is the MSE of µ N= a na , then: 1 M SE = µ a − µa ) 2 . na (ˆ (18) N a A version of the MSE where areas are not population weighted is also implemented. This yields nearly identical results. 5.2 Empirically observed error distributions Let us first take stock of the empirical distributions for the area and household errors that are observed across the 142 different surveys (from 16 different counties spanning multiple years). The degree of non-normality is evaluated through estimates of the skewness and kurtosis. Figure 1 plots the kernel densities for the estimated skewness and kurtosis of ua and εah , respectively. The estimated moments point to significant deviations from normality (when skewness is zero and kurtosis 3), especially for the household idiosyncratic component. The skewness of area errors ranges between -1 and 1, and kurtosis from 2 to 8, disregarding outliers. Household errors have skewness ranging from -1.5 to 1.5 and kurtosis from 4 to 10. Figure 1: Kernel density for skewness (left) and kurtosis (right) The empirical relationship between the moments for the area- and the household-error com- ponents are shown in Figure 2. It can be seen that the two moments exhibit an U-shaped relationship, with minimum kurtosis if skewness is around zero. We can similarly verify whether the moments of the area error are correlated with the moments of the household error, i.e., whether a skewed (or heavy tailed) household error tends 15 Figure 2: Kurtosis versus skewness for area errors (left) and idiosyncratic errors (right) to coincide with a skewed (or heavy tailed) area error. Figure 3 confirms that skewness of household and area errors show a positive association. In summary, for the majority of surveys, the empirically observed error distributions show significant deviations from normality.18 Figure 3: Household error versus area error moments: skewness (left) and kurtosis (right) 5.3 Small area estimation results Normal-EB is expected to do well when the location effect is comparatively large and the degree of non-normality is small, while ELL is expected to do well when the location effect is small and the degree of non-normality is large. Non-normal-EB is expected to do well in either scenario. Our simulations confirm these expectations. By considering different values of the location effect and varying degrees of non-normality, we can assess at what point ELL and normal-EB are at par. Figure 4 plots the Root Mean Squared Error (RMSE) against the degree of non-normality of errors as measured by skewness of the idiosyncratic error in the case of a 12.5 headcount poverty rate.19 The graphs are smoothed summaries of simulation results obtained by fitting kernel 18 A Shapiro-Wilk normality test is rejected in all cases for the household error components and in 72% of the cases for the area components (p=0.05). 19 Given that the moments of the individual error and moments of the area error are highly correlated, our figures effectively show how RMSE varies with non-normality in errors (without isolating non-normality in individual errors from non-normality in area errors). Having said that, our results show what gains in RMSE one can reasonably expect in practical settings using empirically observed error distributions across a wide range of countries, where non-normality in individual errors and area errors are seen to co-vary. 16 regressions. Note that normal and ELL represent the counter-parts of normal-EB and non- normal-EB for areas that are not covered by the survey. Estimates of the RMSE corresponding to a headcount poverty rate of 25 percent are presented in Appendix section 4. Our results indicate that ELL outperforms normal-EB when the location effect is near-zero and errors are non-normally distributed, while normal-EB is seen to significantly outperforms ELL when the location effect is comparatively large (non-normality of errors notwithstanding). The two approaches are found to be at par for location effects in the range of 2.5 − 5 percent (with upper range corresponding to cases with higher degrees of non-normality observed in empirical data). ELL’s advantage dissipates quickly when location effects are more important. Figure 4 shows this for a location variance share of as little as 2.5%. Also, note that (as one would expect) RMSE is very sensitive to location effects, increasing more than threefold across the different choices of the location effect shares. Finally, the results obtained for the poverty that yields a headcount rate of 25 percent, which are qualitatively identical, are included in Appendix section 4. Our estimates also confirm that non-normal-EB records the lowest RMSE in all cases. The difference (in RMSE) when compared to normal-EB is however comparatively marginal at higher levels of the location effect. Similarly, the difference is marginal when compared to ELL at near-zero levels of the location effect. By the same token, there is no downside to adopting non-normal-EB. The gains of non-normal-EB over normal-EB and ELL are between marginal and notable, but never negative. Figure 4: RMSE versus skewness for different magnitudes of the location effect (as a percentage of total error) 17 One notable feature shown in Figure 4 is that negative skewness tends to result in sizable differences in RMSE, more than a positive skewness does. With kurtosis we see the difference increase if kurtosis is increasing. Such patterns are driven by the step nature of the headcount as a function of income on the one hand, and on the other hand by the distribution of xβ , the poverty line and the actual skewness (and other moments) of the error distributions used in the simulations. There is therefore no reason to expect that for e.g. skewness the pattern should be symmetric. To explore the different effects of positive and negative skewness we let skewness range from -0.453 to 0.453 by mixing a density estimator pε (ε) of the individual residuals from Bangladesh (rural Bangladesh survey from 2000) with a normal distribution as follows: pmix (ε; λ) = |λ|pε (sign(λ)ε) + (1 − |λ|)pnorm (ε), (19) and letting λ range between -1 (corresponding to skewness -0.453) and +1 (skewness +0.453). The average difference (weighted by xβ frequencies) between skewed and normal poverty prob- abilities is graphed in Figure 5 (left panel), showing indeed a larger difference for negative skewness. The overall squared difference is pictured in Figure 5 (right panel), with the same conclusion. Figure 5: Effect of skewness on expected headcount poverty 6 Concluding remarks Normal-EB prediction is expected to do well relative to ELL (non-normal-non-EB prediction) when: (i) the area error makes up a significant share of total error, (ii) the errors exhibit modest deviations from normality, and (iii) the household survey covers a large share of the total number of target areas. Using a large-scale simulation study based on actual household survey data from 142 different survey rounds across 16 different countries, thereby working with empirically relevant error distributions, we can confirm this prediction. In cases where the location effect makes up 5 percent or more of total error, normal-EB will outperform ELL (i.e., achieve lower RMSE) for all empirically observed degrees of non-normality. In those cases where the location effect is comparatively small and the deviations from normality are more 18 meaningful however, the ordering may be reversed. The non-normal-EB approach put forward in this paper is expected to do well in either of the different scenarios described above. This too is confirmed in the simulation study. Depending on the scenario, the increase in precision of non-normal-EB relative to both normal-EB and ELL can range from meaningful (up to a 15-25 percent reduction in RMSE) to negligible (near- zero reduction in RMSE), but it is always positive. Given that the proposed non-normal-EB approach is furthermore easy to implement, there is no downside to using it. Note that the approach is equally relevant for survey-to-survey prediction, where similar improvements in precision may be expected. An alternative means for handling non-normality in errors is to consider transformations of the dependent variable such that the transformed variable is more normally distributed, see for example Rojas Perilla et al. (2020) and the references therein. An investigation into combining this approach with the one presented in this paper is something that could be explored in future research. Another potential avenue for future research is to extend non-normal-EB estimation to non-linear models such as logistic regression models (see e.g., Farrell, 1997) and multiple regression contexts (see e.g., Isaki, 1990). References Araujo, M., Ferreira, F., Lanjouw, P. and Ozler, B. (2008). Local inequality and project choice: Theory and evidence from ecuador. Journal of Public Economics, 92, 1022–1046. Baird, S., McIntosh, C. and Ozler, B. (2013). The regressive demands of demand-driven development. Journal of Public Economics, 106, 27–41. Banerjee, A., Hanna, R., Olken, B. and Lisker, D. (2025). Social ptotection in the developing world. Journal of Economic Literature. Bazzi, S. (2017). Wealth heterogeneity and the income elasticity of migration. American Economic Journal: Applied Economics, 9, 219–255. Bell, K. and Bockstael, N. (2000). Applying the generalized-moments estimation approach to spatial problems involving microlevel data. Review of Economics and Statistics, 82, 72–82. Bikauskaite, A., Molina, I. and Morales, D. (2020). Multivariate mixture model for small area estimation of poverty indicators. Journal of the Royal Statistical Society Series A, 185, S724–S755. Burke, M., Driscoll, A., Lobell, D. and Ermon, S. (2021). Using satellite imagery to understand and promote sustainable development. Science, 371. Chi, G., Fang, H., Chatterjee, S. and Blumenstock, J. (2022). Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Acadamy of Sciences, 119, number 3, 1–11. 19 Cordy, C. and Thomas, D. (1997). Deconvolution of a distribution function. Journal of the American Statistical Association, 92, 1459–1465. Corral, P., Henderson, H. and Segovia, S. (2024). Poverty mapping in the age of machine learning. Journal of Development Economics. Corral, P., Molina, I. and Nguyen, M. (2020). Pull your small area estimates up by the bootstraps. World Bank Policy Research Working Paper, 9256. Crost, B., Felter, J. and Johnston, P. (2014). Aid under fire: Development projects and civil conflict. American Economic Review, 104, 1833–1856. Das, S. and Chambers, R. (2017). Robust mean-squared error estimation for poverty estimates based on the method of elbers, lanjouw and lanjouw. Journal of the Royal Statistical Society Series A, 180, 1137–1161. Das, S. and Haslett, S. (2019). A comparison of methods for poverty estimation in developing countries. International Statistical Review, 87, 368–392. Demombynes, G. and Ozler, B. (2005). Crime and local inequality in south africa. Journal of Development Economics, 76, 265–292. Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood estimation from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Serie B, 39, 1–38. Diallo, M. and Rao, J. (2018). Small area estimation of complex parameters under unit-level models with skew-normal errors. Scandinavian Journal of Statistics, 45, 1092–1116. Elbers, C., Fujii, T., Lanjouw, P., Ozler, B. and Yin, W. (2007). Poverty alleviation through ge- ographic targeting: How much does disaggregation help? Journal of Development Economics, 83, number 1, 198–213. Elbers, C., Lanjouw, J. and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71, 355–364. Elbers, C., Lanjouw, J. and Lanjouw, P. (2005). Crime and local inequality in south africa. Journal of Economic Geography, 5, 101–118. Engstrom, R., Hersh, J. and Newhouse, D. (2022). Poverty from space: Using high-resolution satellite imagery for estimating economic well-being. World Bank Economic Review, 36, 382–412. Farrell, P., MacGibbon, B. and Tomberlin, T. (1997). Empirical bayes small-area estimation using logistic regression models and summary statistics. Journal of Business and Economic Statistics, 15, 101–108. 20 Fay, R. and Herriot, R. (1979). Estimates of income for small places: An application of james-stein procedures to census data. Journal of the American Statistical Association, 74, 269–277. Foster, J., Greer, J. and Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica, 52, 761–766. Gibson, J., Le, T. and Kim, B. (2017). Prices, engel curves, and time-space deflation: Impacts on poverty and inequality in vietnam. World Bank Economic Review, 31, 504–530. Graham, B., Campos de Xavier Pinto, C. and Egel, D. (2016). Efficient estimation of data combination models by the method of auxiliary-to-study tilting (ast). Journal of Business and Economic Statistics, 34, 288–301. Haslett, S. (2013). Small area estimation of poverty using the ell/povmap method, and its alternatives. Poverty and Social Exclusion: Chapter 12 224–245. Haslett, S. (2016). Small area estimation using both survey and census unit record data. Handbook on Analysis of Poverty Data by Small Area Estimation: Chapter 18 328–348. Henderson, C. (1953). Estimation of variance and covariance components. Biometrics, 9, 226–253. Isaki, C. (1990). Small-area estimation of economic statistics. Journal of Business and Economic Statistics, 8, 435–441. Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B. and Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353, 790–794. Kelejian, H. and Prucha, I. (2010). Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics, 157, 53–67. Lahiri, P. and Salvati, N. (2023). A nested error regression model with high-dimensional parameter for small area estimation. Journal of the Royal Statistical Society Series B, 85, 212–239. Maloney, W. and Caicedo, F. (2015). The persistence of (subnational) fortune. Economic Journal, 126, 2363–2401. Marhuenda, Y., Molina, I., Morales, D. and Rao, J. (2017). Poverty mapping in small areas under a twofold nested error regression model. Journal of the Royal Statistical Society Series A: Statistics in Society, 180, number 4, 1111–1136. Marx, B., Stoker, T. and Suri, T. (2019). There is no free house: Ethnic patronage in a kenyan slum. American Economic Journal: Applied Economics, 11, 36–70. McCulloch, C. and Neuhaus, J. (2011). Misspecifying the shape of a random effects distribution: Why getting it wrong may not matter. Statistical Science, 26, 388–402. 21 McLachlan, G. and Peel, D. (2000). Finite Mixture Model. New York: John Wiley & Sons. Mendez, E. and Van Patten, D. (2022). Multinationals, monopsony, and local development: Evidence from the united fruit company. Econometrica, 90, 2685–2721. Merfeld, J. and Newhouse, D. (2023). Improving estimates of mean welfare and uncertainty in developing countries. World Bank Policy Research Working Paper, 10348. Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38, 369–385. Newhouse, D. (2023). Small area estimation of poverty and wealth using geospatial data: What have we learned so far? Calcutta Statistical Association Bulletin, 76, 7–32. Newhouse, D., Merfeld, J., Ramakrishnan, A., Swartz, T. and Lahiri, P. (2022). Small area estimation of monetary poverty in mexico using satellite imagery and machine learning. World Bank Policy Research Working Paper No. 10175. Perez, B., Pena, D. and Molina, I. (2011). Robust henderson iii estimators of variance compo- nents in the nested error model. Universidad Carlos III De Madrid Working Paper, 11-43-32. Pfeffermann, D., Da Silva Moura, F. and Do Nascimento Silva, P. (2006). Multi-level modelling under informative sampling. Biometrika, 93, 943–959. Pfeffermann, D., Skinner, C., Holmes, D., Goldstein, H. and Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society, Series B, 60, 23–40. Pfeffermann, D. and Sverchkov, M. (2007). Small-area estimation under informative proba- bility sampling of areas and within the selected areas. Journal of the American Statistical Association, 102, 1427–1439. Pokhriyal, N. and Jacques, D. (2017). Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Acadamy of Sciences. Pratesi, M. and Salvati, N. (2008). Small area estimation: the eblup estimator based on spatially correlated random area effects. Statistical Methods and Applications, 17, 113–141. Rao, J. and Molina, I. (2015). Small area estimation. New York: John Wiley and Sons. Rojas Perilla, N., Pannier, S., Schmid, T. and Tzavidis, N. (2020). Data-driven transformations in small area estimation. Journal of the Royal Statistical Society Series A, 183, 121–148. Searle, S., Casella, G. and McCulloch, C. (1992). Variance components. New York: Wiley. Sinha Roy, S. and van der Weide, R. (2025). Estimating poverty for india after 2011 using private-sector survey data. Journal of Development Economics, 172. 22 Skrondal, A. and Rabe-Hesketh, S. (2009). Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society, 172, 659–687. Smythe, I. and Blumenstock, J. (2022). Geographic microtargeting of social assistance with high-resolution poverty maps. Proceedings of the National Acadamy of Sciences, 119, number 32, 1–10. Tarozzi, A. (2007). Calculating comparable statistics from incomparable surveys, with an application to poverty in india. Journal of Business and Economic Statistics, 25, 314–336. Tarozzi, A. (2011). Can census data alone signal heterogeneity in the estimation of poverty maps? Journal of Development Economics, 95, 170–185. Tarozzi, A. and Deaton, A. (2009). Using census and survey data to estimate poverty and inequality for small areas. Review of Economics and Statistics, 91, 773–792. Tzavidis, N., Zhang, L., Luna, A., Schmid, T. and Rojas Perilla, N. (2018). From start to finish: a framework for the production of small area official statistics. Journal of the Royal Statistical Society Series A, 181, 927–979. Van der Weide, R., Blankespoor, B., Elbers, C. and Lanjouw, P. (2024). How accurate is a poverty map based on remote sensing data? an application to malawi. Journal of Development Economics, 171. Van Veelen, M. and van der Weide, R. (2008). A note on different approaches to index number theory. American Economic Review, 98, 1722–1730. Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random effects population. Journal of the American Statistical Association, 91, 217–221. Wang, S. and Yin, S. (2002). A new estimate of the parameters in linear mixed models. Science in China Series A: Mathematics, 45, 1301–1311. Westfall, P. (1987). A comparison of variance component estimates for arbitrary underlying distributions. Journal of the American Statistical Association, 82, 866–874. Wu, M., Yu, K. and Liu, A. (2009). Estimation of variance components in the mixed effects models: A comparison between analysis of variance and spectral decomposition. Journal of Statistical Planning and Inference, 139, 3962–3973. Zhou, T. and He, X. (2008). Three-step estimation in linear mixed models with skew-t distri- butions. Journal of Statistical Planning and Inference, 138, 1542–1555. 23 Appendix 1 A lemma and corollary that are used in the proofs of Lemmas 2 2 Lemma 3 Let v1 and v2 be two independent normal random variates, with Evi = µi , var vi = σi for i = 1, 2. Let v3 = v1 + v2 . Denote the normal density function with expectation µ and variance σ 2 by φ(·; µ, σ ). Then for the joint density p(v1 , v3 ) and the conditional density p(v1 |v3 ) we have p(v1 , v3 ) = φ(v1 ; µ1 , σ1 )φ(v3 − v1 ; µ2 , σ2 ) = p(v1 |v3 )p(v3 ) 2 σ1 2 σ2 σ1 σ2 = φ v1 ; 2 + σ2 (v 3 − µ 2 ) + 2 + σ 2 µ1 , 2 + σ2 σ1 2 σ1 2 σ1 2 ×φ v3 ; µ1 + µ2 , 2 + σ2 . σ1 2 Proof This can be directly verified. Alternatively, note that both v3 and v1 |v3 are normally distributed. The distribution of v1 |v3 has expectation 2 σ1 2 σ2 E (v1 |v3 ) = 2 + σ2 (v 3 − µ 2 ) + 2 + σ 2 µ1 σ1 2 σ1 2 and variance σ12σ2 2 var (v1 |v3 ) = 2 + σ2 . σ1 2 □ Corollary 4 Let x and y be independent random variables distributed as normal mixtures, and let q = x + y . Then (i) q is also distributed as a normal mixture, and (ii) the conditional distributions p(x|q ) and p(y |q ) are also normal mixtures. Moreover, the mixing parameters of p(q ), p(x|q ) and p(y |q ) are closed expressions in terms of the distributional parameters of x and y . In particular, if the density function of x is a f (x) = πk φ(x; µk , σk ), k=1 (with πk > 0 and k πk = 1) and the density function of y is b g (y ) = λj φ(y ; νj , τj ), j =1 (with λj > 0 and j λj = 1) then the conditional distribution of x, given q is h(x|q ) = κjk φ(x; mjk , sjk ), jk 24 where the mixing probabilities κjk are proportional to κjk ∝ πk λj φ q ; µk + νj , 2 + τ2 σk j and 2 2 τj σk mjk = 2 + τ2 (q − νj ) + 2 + τ 2 µk σk j σk j 2τ 2 σk j s2 jk = 2 + τ2 σk j Proof The random variable x can be expressed as a x= zk x k , k=1 where the xk are independent (latent) random variables, and also independent of (z1 , ..., za ). xk is normally distributed according to component distribution k , with density φ(xk ; µk , σk ). The zk s are indicator random variables taking values 0 and 1 with probability πk . One and only one indicator takes the value 1, the others are 0.20 Likewise y can be expressed as b x= wj y j , j =1 where the (wj , yj ) independent of the (zk , xk ). Note that zk = zk j wj and wj = wj k zk , so that q =x+y = zk x k + wj yj = zk wj (xk + yj ). k j jk Part (i) of the Corollary now immediately follows; the mixing probabilities for q are P [zk wj = 1] = πk λj and the component distribution for component (k, j ) has expectation µk + νj and 2 + τ 2. variance σk j For the conditional distribution p(x|q ) we have p(x|q ) ∝ p(x, q ) = πk λj φ(x; µk , σk )φ(q − x; νj , τj ) jk   2 2 τj σk σk τj = πk λj φ x; 2 + τ2 (q − νj ) + 2 + τ2 µk ,  σk j σk j 2 + τ2 σk jk j ×φ q ; µk + νj , 2 + τ2 . σk j The last equality follows from the Lemma. 20 In other words, the (z1 , ..., za ) have a multinomial distribution with class probabilities πk and a single draw. 25 The conditional distribution p(x|q ) is therefore again a normal mixture with mixing proba- bilities for the (k, j ) component proportional to πk λj φ q ; µk + νj , 2 + τ2 σk j and corresponding component distribution   2 2 τj σk σk τj φ x; 2 2 (q − νj ) + 2 2 µk , . σk + τj σk + τj 2 σk + 2 τj The distribution of p(y |q ) follows in exactly the same fashion. □ Appendix 2 Proofs ¯a , which we denote by Lemma 2: The probability density function of ua conditional on e p(ua |e ¯a ), is a normal-mixture with known parameters: i=mu j =mε k=mε p(ua |e ¯a ) = wijk φ ua ; mijk ; s2 ijk , (20) i=1 j =1 k=1 where: 2 2 + ω2 ωj σi k mijk = 2 + (ω 2 + ω 2 )/n ea − (νj + νk )/na ) + (¯ 2 + ω2 + n σ2 µi σi j k a ωj k a i 2 + ω 2 )σ 2 (ωj k i s2 ijk = 2 + ω2 + n σ2 , ωj k a i ˜ijk / and where wijk = w ijk ˜ijk with: w 2 2 2 ˜ijk = πi λj λk φ(¯ w ea ; µi + (νj + νk )/na ; σi + ( ωj + ωk )/na ), (21) 2 ) and (λ , ν , ω 2 ) where φ denotes the normal probability density function, and where (πi , µi , σi j j j denote the parameters associated with the normal-mixture distributions Fu and Gε , respectively. Proof The result stated in this lemma follows from a direct application of Corollary 4. □ Appendix 3 Household surveys used for random components 3.1 Data sources The household income and expenditure survey data used to obtain estimates of the area and household level random components across a wide range of countries and years are listed in Table 1. For each of the included countries, the table documents both the specific survey and 26 years used. These data correspond to the official nationally representative household surveys that the World Bank also uses to estimate poverty levels and consumption growth for the respective countries. Table 1: List of countries’ survey(s) and rounds Country Survey(s) Years/rounds Bangladesh Household Income and Expenditure Survey 2000, 2005, 2010 and 2016 (HIES) Brazil ılios Pesquisa Nacional por Amostra de Domic´ 2001, 2002, 2003, 2004, 2005, 2006, (PNAD) 2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016 and 2017 Chile on Socioecon´ Encuesta de Caracterizaci´ omica 2000, 2006, 2011, 2013, 2015 and Nacional (CASEN) 2017 Colombia Encuesta Continua de Hogares (ECH); and Gran ECH: 2001, 2002, 2003, 2004 and Encuesta Integrada de Hogares (GEIH) 2005; GEIH: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 and 2017 Ethiopia Household Income, Consumption and Expendi- 2010 and 2015 ture Survey (HICES) Ghana Ghana Living Standards Survey (GLSS) 2005, 2012 and 2016 India National Sample Survey (NSS) 1993, 2004, 2009 and 2011 Kazakhstan Household Budget Survey (HBS) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 and 2017 Kenya Integrated Household Budget Survey (IHBS) 2005 and 2015 Mexico Encuesta Nacional de Ingresos y Gastos de los 2000, 2002, 2004, 2005, 2006, 2008, Hogares (ENIGH) 2010, 2012, 2014 and 2016 Pakistan Pakistan Integrated Household Survey (PIHS); PIHS: 2001; PSLM: 2004, 2005, and Pakistan Social and Living Standards Mea- 2007, 2010, 2011, 2013 and 2015 surement Survey (PSLM) Philippines Family Income and Expenditure Survey (FIES) 2006, 2009, 2012 and 2015 Poland Household Budget Survey (HBS) 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015 and 2016 Rwanda Integrated Household Living Conditions Survey 2010, 2013 and 2016 (EICV) South Africa Income and Expenditure Survey (IES); and Liv- IES: 2010; LCS: 2014 ing Conditions Survey (LCS) Sri Lanka Household Income and Expenditure Survey 2002, 2006, 2009, 2012 and 2016 (HIES) 3.1.1 Harmonization The survey data are obtained from the Global Monitoring Database (GMD). The GMD is a global comparable micro database hosted by the World Bank that harmonizes household survey data across selected countries and years. The list of harmonized variables includes household income and consumption expenditures, demographics, educational attainment, employment, ownership of durable assets, and housing among others. 27 Appendix 4 RMSE for headcount rate of 25 percent Figure 1: RMSE versus skewness for different magnitudes of the location effect (as a percentage of total error) 28