Policy Research Working Paper 10917

Statistically Matching Income and Consumption Data
An Evaluation of Energy and Income Poverty in Romania

Britta Rude
Monica Robayo-Abril

Poverty and Equity Global Practice
September 2024

Abstract

To design effective policy instruments that target the energy poor in Romania, it is crucial to understand who the energy poor are. However, these types of analyses are limited by the current data environment. While monetary energy poverty estimates rely on data from expenditure surveys, traditional welfare indicators and detailed information on access to social protection programs form part of the EU-SILC. Samples of both surveys differ; consequently, record linkage of both surveys is impossible. This paper proposes an alternative solution to combine information from both surveys, namely statistical matching techniques. It applies several imputation models to impute information on energy spending shares from the HBS into the EU-SILC based on a set of matching variables, compares the performance of these models, and applies the best-performing one. Based on the resulting matched dataset, the results show that nearly all the monetary poor are also energy poor, but that a significant additional share of the population in Romania is energy poor. Energy poverty rates are higher at the lower end of the welfare distribution. This result has significant welfare implications.

This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at mrobayo@worldbank.org and brude@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Statistically Matching Income and Consumption Data: An Evaluation of Energy and Income Poverty in Romania¹

Britta Rude
Monica Robayo-Abril

JEL classification: O13, P28, Q42, D12, C15, C52
Keywords: energy poverty, statistical matching, poverty, data fusion, imputation, EU-SILC, Romania

¹ This paper was prepared as part of the Poverty and Equity Program for Romania in the World Bank's Global Poverty and Equity Practice. The study was carried out by a team composed of Mónica Robayo-Abril (Senior Economist, World Bank) and Britta Rude (Young Professional, World Bank). We thank Sergio Oliviera (Senior Economist, Statistician, World Bank) for helpful comments and revisions.

I. Introduction

Within the EU, crafting evidence-based policies to combat energy poverty, which has been exacerbated, for example, by the energy crisis induced by the Russian Federation's invasion of Ukraine, is hindered by the current data environment. Effective policy decisions, grounded in evidence, require robust data infrastructure and strong statistical capacities. Unfortunately, informed policy making is frequently restricted due to limited data availability. This limitation is starkly evident in the ongoing EU energy crisis and broader efforts to address energy poverty.
The sharp rise in energy prices resulting from Russia's invasion of Ukraine has put immense pressure on households. Understanding which households are most affected and identifying the overlap between those in monetary poverty and energy poverty requires a dataset that measures both for each household. Regrettably, such comprehensive data is scarce in most EU countries. Welfare aggregates are typically measured through the EU-SILC, while detailed expenditure data, encompassing energy spending, is collected in household budget surveys. Consequently, jointly observing the energy poverty and monetary poverty status at the household level remains challenging.

This is also the case in Romania, where no data source currently contains reliable joint information on income and expenditure. The current data environment in the European Union, including Romania, is characterized by one survey used to measure household expenditure, the Household Budget Survey (HBS), and one used to measure monetary poverty and income, the EU-SILC. While the Household Budget Survey collects data on income, the information gathered is not used to produce official poverty measures, which are based on disposable income from the EU-SILC. Consequently, no survey has joint information on income-based welfare measures and expenditure. Measuring expenditure by income deciles or poverty status in Romania is, therefore, challenging.

The Statistical Office of the European Union, Eurostat, and a growing body of literature explore data fusion methods, or statistical matching, to address the lack of joint income and expenditure data (Lamarche et al., 2020). Traditionally, data sets are merged based on record linkage. This approach relies on a common individual identifier, which can be used to combine data sets at the individual level. There is no common identifier in the case of EU-SILC and HBS, and households interviewed as part of the EU-SILC might not coincide with those interviewed as part of HBS.
Consequently, one must rely on data fusion methods to combine information from both data sets.

A growing body of scholars has investigated the use of statistical matching to merge datasets that lack a shared identifier, particularly within the European Union. For instance, Donatiello et al. (2016b) applied this method to data from Italy, while Serafino and Tonkin (2017) utilized it with data from six European countries. Lamarche et al. (2020) examined the broader European context, while Schaller (2021) focused on Germany, France, and the Netherlands. Additionally, Emmenegger et al. (2022) employed statistical matching in Germany.

In this paper, we apply statistical matching techniques to generate a unique expenditure and income dataset for Romania. Our goal is to produce a dataset that contains information on expenditure, specifically the energy expenditure share of households, and reliable information on households' income and individuals' access to social protection. The resulting dataset allows us to characterize the energy poor, analyze the welfare implications of rising energy prices, and propose effective policy interventions that target energy poverty. To this end, we apply statistical matching techniques between the EU-SILC 2020, with income reference year 2019, and the HBS 2019.² Our ultimate goal is to use the resulting dataset to overlay monetary poverty with energy poverty.

We estimate multiple imputation models following the approach developed by Rubin (1986), which concatenates data sets. Statistical matching methods borrow from the concept of imputation, which is often used for missing observations (Bacher and Prander, 2018). Statistical matching encompasses various approaches, which can be categorized as parametric, nonparametric, or mixed methods, as outlined by Lewaa et al. (2021). We adopt the methodology introduced by Rubin (1986) and select matching variables to append the two datasets.
We identify potential matching variables that are present in both surveys and aid in identifying similar households. However, not all overlapping variables should be considered as matching variables; only those that are relevant to the target variable and exhibit similar distributions across surveys (Serafino and Tonkin, 2017). In the first step, we harmonize the overlapping variables. We then employ lasso regressions to determine the most significant variables for explaining energy spending shares. Subsequently, we concatenate the two datasets based on these selected variables and run a variety of different imputation models: linear regression imputation models, predictive mean matching (PMM), and truncated regression imputation models.

² EU-SILC questions on income refer to the 12-month period of the previous year.

We identify the best-performing imputation model based on three criteria. First, we assess how well the estimates of the imputed energy spending share align with the observed distribution, overall and by subgroups. Second, we analyze the variation in mean estimates and standard deviations across simulations to evaluate the consistency of results. We do so for the overall population and also look at subgroups from a random set of matching variables. Third, we examine how each model performs when duplicating the household budget survey and assuming that energy spending shares are missing in the duplicated data.

These performance measures indicate that the predictive mean matching (PMM) imputation technique yields the best results. Moreover, predictive mean matching has several advantages over other methods: it is easy to use, versatile, and less vulnerable to model misspecification than other imputation methods (Van Buuren, 2012). Additionally, we compare three different model specifications and find that a weighted PMM slightly outperforms the unweighted PMM.
We utilize the matched dataset to demonstrate a strong association between monetary poverty and energy poverty, revealing that the majority of individuals experiencing monetary poverty are also affected by energy poverty. We find that energy spending shares are considerably higher among individuals in the lower segments of the welfare distribution. By overlaying the estimates of energy spending shares and monetary energy poverty with the measurements of monetary poverty obtained from the EU-SILC, we observe that almost the entire population at risk of poverty is also experiencing energy poverty. Additionally, a significant proportion of the Romanian population is identified as energy poor, suggesting that traditional welfare indicators may overlook the issue of energy poverty in the country. This highlights the necessity for additional policy measures to address energy poverty in Romania. Furthermore, our analysis demonstrates that households with lower welfare status allocate more of their income toward energy expenditures. Consequently, these households may bear a heavier burden resulting from escalating energy prices, exacerbating their existing hardships.

Several important caveats apply to our analysis. We find substantial differences in matching variables between the two datasets, indicating potential dissimilarities in the characteristics of the sampled individuals. Additionally, there are discrepancies in the sampling methods employed in each survey. Moreover, the time horizons of the surveys also vary, which introduces further complexity. These systematic differences can significantly impact the effectiveness of imputation models in accurately capturing the actual energy spending shares of households included in the EU-SILC. Most importantly, similar to previous studies, we cannot test the validity of the conditional independence assumption in the specific empirical example under investigation.
These limitations should be taken into account when interpreting the results presented in this paper.

Our paper adds to the existing studies on statistical matching techniques that have been conducted in the European Union (EU). In recent years, there has been a growing interest in the EU in exploring the potential of statistical matching. Previous studies on this topic include Leulescu and Agafitei (2013), Serafino and Tonkin (2017), and Moretti and Shlomo (2022). Our paper specifically focuses on the application of statistical matching techniques to analyze energy expenditure shares and energy poverty in Romania. To the best of our knowledge, this is the first study exploring data fusion methods in this country. In addition, as far as we know, our study is the first to investigate the use of statistical matching techniques for imputing energy poverty indicators and energy spending shares. Previous studies have primarily focused on indicators related to income or education.

Enhancing the harmonization of surveys and improving data integration within a subset of the EU-SILC would facilitate future efforts of statistical matching. In the context of the two surveys discussed in this paper, various entry points can facilitate the application of statistical matching techniques in Romania. First, we suggest a more robust harmonization process between different household surveys, specifically the EU-SILC and the HBS, to enable data fusion methods based on matching variables. This harmonization effort would enhance the compatibility of data sources and facilitate the utilization of matching techniques. Furthermore, harmonizing the HBS across countries would simplify the application of methodologies developed in one country to other EU member states. Lastly, we recommend including expenditure information in the EU-SILC for a subsample of households or at an aggregated level.
This inclusion would enable the validation of matching techniques' performance and provide a means to assess the accuracy and reliability of the results. Having expenditure information available within the EU-SILC would enhance the feasibility of validating and refining matching methodologies.

The paper is organized as follows. Section II describes the methodology and data. Section III describes the empirical application for the case of Romania and how energy and income poverty overlap. Section IV concludes.

II. Methodology and Data

Two overall approaches are mainly used for data fusion: distance-based and model-based. Table 1 describes these methods in more detail. Distance-based methods minimize a distance function between the recipient data R=(Y, Z) and the donor data D=(X, Z). Conceptually speaking, this approach searches for similar observations based on the overlapping variables Z. Examples are statistical matching with or without resampling techniques, nearest-neighbor techniques, hot-deck techniques, or cluster analyses. The values of X observed in the donor data are used to fill in the missing observations in the recipient data based on the similarity in Z. This approach relies on several strong conditions, such as no measurement errors in Z and an appropriate choice of distance measure (Bacher and Prander, 2018).

Model-based methodologies first estimate a functional form between X and Z in the donor data and then apply this functional form to the recipient data. These methodologies assume the same functional form of X = f(Z) in donor and recipient data.
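To make the contrast between the two families concrete, the following is a minimal sketch on toy data, not the procedure used in this paper. All records, variable choices, and function names (`nearest_neighbour_impute`, `fit_simple_ols`) are hypothetical; the distance-based part copies X from the closest donor on unscaled Z, and the model-based part fits a one-regressor linear model on the donor data and applies it to the recipient data.

```python
import math

# Hypothetical toy records; all names and numbers are illustrative only.
# Z = (age, household size) is observed in both files;
# X (energy spending share) is observed only in the donor file.
donor = [
    ((30.0, 2.0), 0.08),
    ((45.0, 4.0), 0.12),
    ((70.0, 1.0), 0.20),
    ((55.0, 3.0), 0.14),
]
recipient_z = [(32.0, 2.0), (68.0, 1.0)]


def nearest_neighbour_impute(donor_records, z):
    """Distance-based: copy X from the donor whose Z is closest (Euclidean).

    Z is left unscaled here for brevity, so age dominates the distance.
    """
    _, x = min(donor_records, key=lambda rec: math.dist(rec[0], z))
    return x


def fit_simple_ols(donor_records, col):
    """Model-based: estimate X = b0 + b1 * Z[col] on the donor file."""
    zs = [rec[0][col] for rec in donor_records]
    xs = [rec[1] for rec in donor_records]
    z_bar = sum(zs) / len(zs)
    x_bar = sum(xs) / len(xs)
    b1 = sum((z - z_bar) * (x - x_bar) for z, x in zip(zs, xs)) / sum(
        (z - z_bar) ** 2 for z in zs
    )
    return x_bar - b1 * z_bar, b1


# Distance-based imputation: each recipient receives an observed donor value.
nn_imputed = [nearest_neighbour_impute(donor, z) for z in recipient_z]

# Model-based imputation: the donor-estimated function is applied to recipient Z.
b0, b1 = fit_simple_ols(donor, col=0)
model_imputed = [b0 + b1 * z[0] for z in recipient_z]
```

The distance-based values are always values actually observed in the donor file, whereas the model-based values lie on the fitted line and are only as good as the assumed functional form.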
Table 1: Data fusion methods

| | Distance-based methodologies | Model-based methodologies |
|---|---|---|
| Approach | Based on overlapping variables Z, search for similar observations; minimization of distance function between R=(Y, Z) and D=(X, Z); often, segmentation by socio-economic characteristics or geography | Estimation of functional form between X and Z in donor data |
| Examples | Statistical matching with/without resampling techniques; nearest-neighbor techniques; hot-deck techniques; cluster analysis | Single imputations; multiple imputations |
| Result | Variable values of donor value X are used to identify missing observations in recipient data | Use of resulting functional form to estimate missing values of X in recipient based on Z |
| Condition | No measurement errors in Z; the chosen distance measure is appropriate | The same functional form of X = f(Z) in donor and recipient data |

Source: Own elaboration of authors based on Bacher and Prander (2018).

Several researchers have tried to compare these approaches, but an overall assessment is missing. Schaller (2021), for example, compared the hot-deck approach to predictive mean matching and concluded that predictive mean matching might outperform hot-deck approaches, which Eurostat traditionally recommends. Several researchers show that multiple imputation techniques are well suited for robust data fusion (Rässler (2004); Bacher and Prander (2018); Todosijević (2012)).

For our purpose, we rely on model-based methodologies. Following Bacher and Prander (2018), empirically, one first estimates a functional relationship between X and Z in the donor dataset, in our case the HBS, as follows:

$$X = f(Z; \beta; \varepsilon)$$

The functional form of $f(Z; \beta; \varepsilon)$ depends on the population parameters $\beta$ and the imprecision with which Z might have been measured, denoted as $\varepsilon$. One example of a potential functional form would be a linear regression, such as:

$$X = \beta_0 + \beta_1 Z_1 + \cdots + \beta_p Z_p + \varepsilon$$

$\varepsilon$ is assumed to have a normal distribution with a mean value of 0 and a variance of $\sigma^2$.
Based on the estimated model parameters, missing values are estimated by:

$$\hat{X} = f(Z; \beta'; \varepsilon') = \beta'_0 + \beta'_1 Z_1 + \cdots + \beta'_p Z_p + \varepsilon'$$

where primes denote estimated quantities.

Importantly, data fusion methods are subject to several conditions, most notably the conditional independence assumption. The functional relationship between the overlapping variables Z and the missing variable X established in the donor data set must also hold in the recipient data set. Data fusion methods assume that the variables not jointly observed are conditionally independent given the overlapping variables (Rässler, 2004). However, this condition might not hold empirically. While this assumption is difficult to test empirically, one possibility is to choose the overlapping variables Z such that they have high explanatory power for X. Donatiello et al. (2016a) demonstrate that the conditional independence assumption is fulfilled when at least one overlapping variable exists that is perfectly correlated with either X or Y.

An important limitation of model-based methodologies is that the functional form of the imputation model relies on decisions taken by the researcher and might not reflect the true underlying functional form. Model-based methodologies rely on the functional form assumed in the first step of the estimation procedure. This functional form is imposed by the researcher and not driven by the data (Donatiello et al., 2016a).

III. Empirical Application: The Case of Romania

Data Fusion Methods in the Romanian Context

In the following, we apply data fusion methods to the EU-SILC and HBS for Romania to combine data on energy expenditure with information on welfare and access to social protection. For this purpose, we use the Stata module for multiple imputation (mi).³ We follow the framework developed by Bacher and Prander (2018) (Table 2).
We first make sure that the underlying assumptions of the model-based approach are satisfied in our setup and establish that both the HBS and EU-SILC draw from the same population.

³ For more information see this link: https://www.stata.com/manuals/mi.pdf

Table 2 describes the process behind the data fusion of the variables collected as part of the EU-SILC and HBS in Romania. While the HBS contains expenditure information C, the EU-SILC contains income information I. Both data sets gather information on a set of common variables X (often called overlapping variables), such as the sex, age, labor force status, and place of residence of household members, and the total number of household members, among others. The data set to which information is added is often denoted as the recipient, while the data set containing information on the missing variable is often denoted as the donor.

Table 2: Conceptualization for data fusion of EU-SILC and HBS

| Data set | Income I | Variables which form part of EU-SILC and HBS | Expenditure C |
|---|---|---|---|
| EU-SILC | EU-SILC with information on I | Socio-economic information (X) | (Imputation for expenditure C) |
| HBS | (Imputation for income I) | Socio-economic information (X) | HBS with information on C |

Source: Own elaboration of authors based on Bacher and Prander (2018).

We match datasets at the individual, not household, level. While the matching procedure could be performed at either the household or the individual level, we decided to perform the matching at the individual level, as several of the matching variables are at the individual level, and aggregating them to the household level could introduce unnecessary noise. Moreover, ultimately, we are interested in combining information on households' energy expenditure shares with individuals' access to social protection (not households' access to social protection, as several measures are at the individual level).

We rely on data from the EU-SILC 2020, with reference year 2019, and the HBS 2019.
We did not have access to more recent survey rounds of the EU-SILC at the time we conducted this research. In addition, for the underlying exercise, it might be more appropriate to rely on data from before the COVID-19 outbreak, given that the pandemic might also have adversely affected data collection, especially when considering social distancing measures. Additionally, consumption patterns during the COVID-19 pandemic might systematically differ from those normally observed in Romania.

Our outcome variable of interest is the energy expenditure share. Our ultimate goal is to impute information on energy expenditure shares into the EU-SILC to combine information on energy expenditure with information on households' welfare and access to social protection. We define the energy expenditure share as the total energy expenditure of a household divided by its total household income. Although the income information gathered as part of the HBS is not the best available indicator for poverty measures, it is sufficiently reliable to measure household income,⁴ and constructing the energy spending share using this information is a valid approach. While there are many different approaches to constructing this variable, using this measurement approach is most reasonable in the case of Romania (for a detailed overview of why we construct the measure in this way, see Robayo-Abril and Rude (2024)). We include the following spending components to calculate total energy expenditure: solid fuels, liquid fuels, natural gas, thermal energy, and electricity and renewable energy. Importantly, we annualize energy expenditure and income and replace values above one with the value one (95 cases).

Next, we identify variables that are included in both the HBS and EU-SILC. This comparison is limited by the harmonization status of the different surveys conducted in Romania.
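As a worked illustration of the energy expenditure share defined above (annualized energy expenditure over annualized income, capped at one), consider the following minimal sketch. The function name, the dictionary keys, the assumption that energy components are recorded monthly while income is passed as an annual amount, and all numbers are hypothetical and chosen only for illustration; they are not taken from the HBS.

```python
def energy_share(monthly_energy_spending, annual_income):
    """Annualized energy expenditure divided by annual income, capped at 1.

    Assumptions (hypothetical): energy components are monthly amounts,
    income is already annual, and the share is left undefined (None)
    when income is not positive.
    """
    annual_energy = 12 * sum(monthly_energy_spending.values())
    if annual_income <= 0:
        return None
    return min(annual_energy / annual_income, 1.0)


# Hypothetical household, with the five spending components listed in the text.
household = {
    "solid_fuels": 50.0,
    "liquid_fuels": 0.0,
    "natural_gas": 80.0,
    "thermal_energy": 0.0,
    "electricity_and_renewables": 120.0,
}

share_typical = energy_share(household, annual_income=24000.0)
share_capped = energy_share(household, annual_income=2000.0)
```

In the first case the household spends 3,000 per year on energy out of an income of 24,000, giving a share of 0.125; in the second case the raw ratio of 1.5 is replaced by one, mirroring the top-coding described above.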
While both surveys include extensive information on demographics, household composition, dwelling, labor, and income, these items are not measured in the same way. This might result in the need to relabel and recode some of the variables. Table A.6 details the common variables we identify in both surveys and their respective coding.⁵

⁴ Though official poverty is measured with the EU-SILC, income from the HBS is officially reported, together with consumption, by the Romania Institute of Statistics (Romania NSI, 2022).

⁵ For the detailed codebook on the EU-SILC see: https://www.gesis.org/missy/files/documents/EU-SILC/Codebook_EU-SILC-2019-cross-sec.pdf

The harmonization of variables involves several steps, such as adjusting the time horizon of the variables and the categories. Harmonization in the case of statistical matching can take several forms, such as harmonizing the definition of units, reference period, population, variables, classification, measurement error, missing data, and derivation of variables (Leulescu and Agafitei, 2013). We harmonize the reference period (annualizing variables in the household budget survey, where possible and appropriate, e.g., in the case of energy expenditures and household income) and the categorization of variables.

The harmonization of household characteristics is facilitated by the fact that the HBS and the EU-SILC have a similar definition of a household (Serafino and Tonkin, 2017). According to this definition, a household comprises individuals who live together in the same dwelling and share meals or jointly provide living conditions. In other words, a household is a group of people who reside in the same place and have a level of interdependence regarding their daily living arrangements, such as cooking and sharing meals.

Some deviations between both surveys persist after our data harmonization.
First, while the household reference person is identified in the HBS (the member older than 15 years who contributes most to the total income of the household (European Commission, 2015)), this is not the case in the EU-SILC (European Commission, 2021). These differences affect our variable on female household heads.

The sample design also differs between both surveys. The EU-SILC relies on a two-stage stratified sampling design with primary sampling units (census enumeration units) and secondary sampling units (dwellings) (Gesis, 2021). There is stratification in the first stage of sampling, based on 88 strata (urban and rural areas plus counties) (ibid). The sampling strategy of the HBS is similar, also using two-stage stratified sampling, but differs in some details. Primary sampling units are 792 survey centers from the Population and Housing Census, selected using a stratified and balanced extraction method within each layer based on county and residence area criteria, resulting in 88 layers. This formed the Multifunctional Sample of Territorial Areas (Master Sample EMZOT'2011), serving as a sampling frame for household surveys. The EMZOT included 450 urban and 342 rural survey centers distributed across all counties and districts of Bucharest Municipality. In the second step, 9504 permanent dwellings were systematically selected per quarter in three monthly waves of 3168 dwellings. These dwellings, treated as secondary sampling units, were chosen from each survey center, with 12 dwellings quarterly or four dwellings monthly (Romania NSI, 2022). These differences in the sampling design could affect the representativeness of the imputed energy spending share in the EU-SILC.

We analyze whether we should include all of these variables as matching variables because too many matching variables could generate unnecessary noise. As Donatiello et al.
(2016b) noted, it is important to reduce the number of matching variables to the minimum possible and employ the most parsimonious approach. We apply a lasso regression to follow a data-driven approach to select the included matching variables from the full universe of available variables (for details on the lasso regression, see Meinshausen and Bühlmann (2006)). Our outcome variable is the energy expenditure share (total energy expenditure as a ratio of total income). The lasso regression does not omit any variables in this case. Including these variables in a regression on the outcome of interest results in an R-squared, often used as a measure of explanatory power, of 0.155 (0.16 when including regional fixed effects). We note that we can increase the R-squared by including the following additional variables: age, not having any education, primary schooling, lower and upper secondary schooling, tertiary education, having a Ph.D., the number of household members, and the number of children in a household.

We follow the approach taken by Donatiello et al. (2016b) and include income measured in the HBS as a matching variable. Like Donatiello et al. (2016b), we address the conditional independence assumption in this way. While the information collected on income as part of the HBS is less reliable for poverty measurement than the one measured in the EU-SILC, the income measured as part of the HBS and household income derived from data collected as part of the EU-SILC are strongly correlated. Our analysis shows that the distribution of household income is different for both surveys. Still, the correlation is likely close to 1, and poor households in the EU-SILC also tend to report low income in the HBS.

We follow Rubin (1986) and concatenate both data sets.
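The lasso-based selection of matching variables described above can be illustrated with a minimal coordinate-descent sketch. This is not the estimation routine used in the paper: the data, the candidate variable names, the penalty `lam`, and the helper functions are all hypothetical, and the toy design is deliberately constructed so that one weak candidate is dropped, whereas in the paper's application the lasso retained all candidates.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * ||y - mean(y) - X b||^2 + lam * ||b||_1.

    Assumes the columns of X are already centered (mean zero).
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]  # residuals at b = 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z_j = sum(c * c for c in col) / n
            # Covariance of column j with the residual that excludes j's own fit.
            rho = sum(c * (r + beta[j] * c) for c, r in zip(col, resid)) / n
            new_b = soft_threshold(rho, lam) / z_j
            resid = [r - (new_b - beta[j]) * c for c, r in zip(col, resid)]
            beta[j] = new_b
    return beta


# Hypothetical candidates, coded as standardized balanced dummies (+1 / -1).
names = ["low_income", "urban", "has_tv"]
X = [
    [1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
    [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1],
]
# Toy energy shares: strong income effect, moderate urban effect, tiny TV effect.
y = [0.10 + 0.04 * a - 0.02 * b + 0.002 * c for a, b, c in X]

beta = lasso_coordinate_descent(X, y, lam=0.005)
selected = [name for name, b in zip(names, beta) if b != 0.0]
```

Variables whose association with the outcome falls below the penalty receive a coefficient of exactly zero and drop out of `selected`; the retained coefficients are shrunk toward zero by the penalty.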
The approach developed by Rubin (1986) relies on concatenating different datasets, each containing a different set of missing variables of interest and a set of common control (matching) variables.⁶ The result of this approach is one concatenated dataset with missing information for some variables, in our case, the energy expenditure share.

We then apply multiple imputation procedures to impute the missing values of our outcome variable of interest in the EU-SILC. Using the available data, multiple imputation models estimate the relationship between the matching variables and the variable to be imputed in the donor dataset. The imputation model can follow several estimation procedures, such as regression models, propensity score matching, or machine learning algorithms. The imputation predicts the missing values in the recipient dataset based on the values of the matching variables, utilizing the relationship estimated in the donor dataset. Multiple imputation follows an iterative approach, meaning missing values are imputed several times. The advantage of multiple imputation is that it factors in the uncertainty related to each single imputation (Johnson and Young, 2011). Multiple imputation involves three steps: filling in missing information m times, analyzing each of the m resulting complete data sets, and pooling the resulting parameters.

Before conducting the imputation, we analyze whether there are auxiliary variables. Auxiliary variables improve imputations, as they are correlated with the missing variable or with missingness itself (Johnson and Young, 2011). A rule of thumb often applied is that auxiliary variables are those variables with a correlation coefficient higher than 0.4. In the case of this paper, there are no auxiliary variables, but still, some variables report higher correlation coefficients.
Income, for example, has a correlation coefficient of -0.351, followed by being an employee (-0.333) and having a PC (-0.305). When restricting the analysis to positive correlation coefficients, the highest correlations are reported for urban (0.221), age (0.222), and the number of elderly per household (0.197). Table A5 presents the correlation coefficients.

⁶ The approach developed by Rubin (1986) has been applied previously in the literature. Alpman (2016), for example, implements Rubin's multiple-imputation method for statistical matching in Stata. Moriarity and Scheuren (2003) evaluate Rubin's approach to statistical matching in more detail.

We demonstrate that, on average, several matching variables systematically differ between both datasets. Table 3 compares the matching variables for both data sets. The table reveals some significant differences in the chosen characteristics between both data sets. There also seem to be some systematic differences in the sampling strategy between both surveys. For example, the population shares between macro-regions differ between both surveys, as does the share of the urban population. We also examine how closely the distributions of the prospective matching variables agree across the two surveys by comparing histograms.⁷ The resulting graphs, shown in Appendix 1, confirm noteworthy differences. These systematic differences are an important caveat in our analysis.
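The kind of mean comparison reported in Table 3 can be sketched as a large-sample two-sample test for a difference in proportions. This is only an illustration under stated assumptions: it uses the rounded shares and observation counts printed in Table 3, ignores survey weights and the sampling design, and relies on a normal approximation, so it is not expected to reproduce the table's p-values exactly. The function name `two_sample_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(p1, n1, p2, n2):
    """Unweighted large-sample z-test for a difference in two proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Rounded values as printed in Table 3 (HBS 2019 vs. EU-SILC 2020).
n_hbs, n_silc = 30769, 7463

z_urban, p_urban = two_sample_z_test(0.47, n_hbs, 0.60, n_silc)  # share urban
z_tv, p_tv = two_sample_z_test(0.99, n_hbs, 0.99, n_silc)        # share with a TV
```

With these inputs the urban share differs strongly between the two surveys, while the rounded TV shares are identical and therefore yield no detectable difference.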
Table 3: Summary statistics of matching variables (HBS 2019 versus EU-SILC 2020)

| Variable | (1) Non-missing mean | (2) Missing mean | (3) T-test p |
|---|---|---|---|
| Female | 0.53 | 0.48 | 0.00 |
| Married | 0.47 | 0.41 | 0.00 |
| Divorced | 0.06 | 0.07 | 0.00 |
| Widowed | 0.18 | 0.20 | 0.00 |
| EU citizen | 0.00 | 0.00 | 0.01 |
| Non-EU citizen | 0.00 | 0.00 | 0.04 |
| Age | 47.77 | 48.83 | 0.00 |
| No schooling | 0.04 | 0.01 | 0.00 |
| Primary schooling | 0.03 | 0.07 | 0.00 |
| Employee | 0.34 | 0.33 | 0.08 |
| Employer | 0.00 | 0.01 | 0.00 |
| Unemployed | 0.01 | 0.01 | 0.38 |
| Student | 0.12 | 0.04 | 0.00 |
| Domestic/care responsibility | 0.05 | 0.06 | 0.00 |
| Household members | 2.59 | 2.14 | 0.00 |
| No. of children | 0.43 | 0.39 | 0.00 |
| No. +65 | 0.48 | 0.49 | 0.00 |
| No. labor force | 1.71 | 1.69 | 0.00 |
| No. of female HH members | 1.33 | 1.10 | 0.00 |
| Female-headed HH | 0.31 | 0.49 | 0.00 |
| Urban | 0.47 | 0.60 | 0.00 |
| Has toilet | 0.77 | 0.77 | 0.05 |
| Has bathroom | 0.81 | 0.78 | 0.03 |
| No. of rooms | 2.85 | 2.84 | 0.00 |
| Has washing machine | 0.78 | 0.91 | 0.00 |
| Has car | 0.38 | 0.45 | 0.00 |
| Has PC | 0.54 | 0.60 | 0.00 |
| Has TV | 0.99 | 0.99 | 0.87 |
| Has phone | 0.96 | 0.97 | 0.00 |
| Can afford new furniture | 0.06 | 0.20 | 0.00 |
| Can afford new clothes | 0.27 | 0.55 | 0.00 |
| Afford a week of travel | 0.19 | 0.42 | 0.00 |
| Energy unaffordability | 0.10 | 0.10 | 0.30 |
| Rent paid monthly | 23.80 | 16.58 | 0.00 |
| Spending on housing (monthly) | 217.15 | 435.51 | 0.00 |
| Income p.c. (monthly, gross) | 1858.79 | 2222.16 | 0.00 |
| Macroregion 1 | 0.29 | 0.25 | 0.00 |
| Macroregion 2 | 0.25 | 0.29 | 0.00 |
| Macroregion 3 | 0.22 | 0.27 | 0.00 |
| Macroregion 4 | 0.24 | 0.19 | 0.00 |
| Observations | 30769 | 7463 | 38232 |

Source: The table shows summary statistics for the matching variables. The first three columns refer to the HBS 2019, while the last three columns refer to the EU-SILC 2020, with income reference year 2019.

⁷ Although certain scholars advocate for supplementary approaches, such as the computation of the Hellinger Distance (HD), there exists no established criterion delineating the degree of similarity that would be suitable. Consequently, we refrain from incorporating these methods into our analysis, as emphasized by Serafino and Tonkin (2017).

To take survey weights into account, we use several approaches.
It is crucial to account for the survey weights when conducting statistical matching of survey data. One possibility is to include them as an additional matching variable (Quartagno et al., 2020). Performing weighted imputations is another potential solution (ibid.). Weight calibration (see, for example, Jauslin and Tillé (2023)) might result in more precise estimates, but we abstract from it in this paper. Future research could explore these survey-matching strategies in more detail.

We apply the following imputation methods:8

8 We perform 20 different imputations and set the seed to 12345.

• Multivariate normal. This method assumes that the variables follow a multivariate normal distribution and that the missing values are Missing Completely at Random (MCAR) or Missing at Random (MAR). It applies techniques like maximum likelihood estimation to estimate the mean and covariance matrix from the observed values. It then draws random values from the estimated multivariate normal distribution to impute missing values and preserve the correlation structure among the variables.
• Linear regression. This method assumes a linear relationship between the variable with missing values and the variables without. The imputed values are calculated based on this linear relationship.
• Predictive mean matching (PMM). This method assumes that the missing values are missing at random (MAR). It imputes missing values by matching similar cases. More specifically, PMM first predicts the variable of interest for the records in the recipient dataset, using a prediction model estimated on the donor dataset, where the variable is observed. Several prediction models can be applied for this step. The next step is to match each recipient record to donor records whose predicted values are similar to its own and to impute a value observed for one of these donors. This matching can be based on various similarity measures, such as Euclidean or Mahalanobis distance. This approach might be especially suitable for non-normal distributions and non-linear relationships.
• Unweighted PMM. In this version of the PMM, we abstract from survey weights.
• Unweighted PMM that includes survey weights as a matching variable.

We evaluate the performance of these imputation methods by comparing the distribution, mean, and standard deviation of the imputed energy spending share with those of the observed energy spending share. We do so for the total population and by subgroups. There are several possibilities for evaluating the performance of statistical matching techniques (Leulescu and Agafitei, 2013). We start by comparing the distributions for the total population. If the distributions are close, we compare, as a further validation test, the distributions for population subgroups (by gender and income group). We also compare the imputed mean and standard deviation with the observed mean and standard deviation of energy spending shares. We do so for the overall population and for subgroups defined by a random set of matching variables, including income quintiles. The latter is crucial because it shows how the method performs in distributional analyses.

Additionally, to test the performance of our final imputation model, we impute energy spending shares in the HBS by duplicating the HBS and treating energy spending shares in the duplicated version as missing. Ideally, one would compare imputed with observed energy spending shares in the EU-SILC, but this is not possible because expenditure data are not available in the EU-SILC. While imputing energy spending shares in a duplicated version of the HBS is less rigorous, it can still shed light on the performance of the imputation method.

Model Comparison and Imputation Performance

The weighted PMM results in an estimator of energy spending shares that most closely resembles the true underlying distribution of observed energy spending shares.
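To make the predictive mean matching step from the list of imputation methods more concrete, the following is a minimal sketch in Python, not the Stata routine used for the paper. It assumes a single matching variable (income), a weighted linear prediction model, and a donor pool of the k = 3 nearest donors; the function names, the toy data, and k are illustrative assumptions. Unlike a full multiple-imputation PMM, it does not redraw the regression parameters for each imputation.

```python
import random


def fit_weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(donor_x, donor_y, donor_w, recipient_x, k=3, seed=12345):
    """Impute y for each recipient by drawing from its k nearest donors."""
    rng = random.Random(seed)
    a, b = fit_weighted_line(donor_x, donor_y, donor_w)
    donor_pred = [a + b * x for x in donor_x]
    imputed = []
    for x in recipient_x:
        pred = a + b * x
        # Donors whose predicted value is closest to the recipient's prediction.
        nearest = sorted(range(len(donor_pred)),
                         key=lambda i: abs(donor_pred[i] - pred))[:k]
        # The imputed value is an observed donor value, never a model prediction.
        imputed.append(donor_y[rng.choice(nearest)])
    return imputed


# Toy data (invented for illustration): monthly income and energy spending share.
donor_income = [800, 1200, 1500, 2000, 2600, 3200, 4000, 5000]
donor_share = [0.22, 0.18, 0.15, 0.12, 0.10, 0.08, 0.07, 0.05]
donor_weight = [1.0, 2.0, 1.5, 1.0, 1.0, 2.0, 1.0, 0.5]
recipient_income = [900, 2100, 4500]

imputed_share = pmm_impute(donor_income, donor_share, donor_weight, recipient_income)
print(imputed_share)
```

Because every imputed value is copied from a donor record, the imputed shares cannot fall outside the range of observed shares, which is one reason PMM avoids the negative values produced by the normal and linear regression models.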
Figures 1 and 2 reveal that the distributions of observed and imputed energy spending shares differ significantly for the multivariate normal and linear regression imputation methods (independently of whether survey weights are used). These imputation methods produce energy expenditure shares that take values below zero and are significantly more dispersed. When applying weighted PMM techniques, the resulting distribution looks very similar (Figure 3). While the weighted distributions look slightly different, they are still very similar (Figure 4). Distributions of imputed and observed values also look similar for different subgroups in the case of the weighted PMM. To further test the quality of this method, we analyze the distribution for subgroups. The figures in Appendix 3 demonstrate that the distributions look similar across sexes and income groups. Importantly, when comparing the estimated and true average energy spending shares across income quintiles, the values differ, but not substantially. Figure 4 presents unweighted distributions resulting from an unweighted PMM. The distribution is close but not as close as in the case of the weighted PMM. Lastly, the unweighted PMM that uses survey weights as an additional matching variable (Figure 5) resembles the true distribution of energy spending shares to a lesser extent than the weighted PMM or the unweighted PMM that does not include survey weights among the matching variables. We confirm that the weighted PMM is also the best-performing model when analyzing weighted distributions of energy spending shares in Appendix 2.
Figure 1: Distribution of energy spending shares (true versus imputed) – multivariate normal imputation method
Figure 2: Distribution of energy spending shares (true versus imputed) – linear regression imputation method
Figure 3: Distribution of energy spending shares (true versus imputed) – weighted PMM
Figure 4: Distribution of energy spending shares (true versus imputed) – unweighted PMM
Figure 5: Distribution of energy spending shares (true versus imputed) – unweighted PMM with weight as control
Notes: The graphs show the distribution of true energy spending shares in green and imputed values in orange. Distributions are not weighted by survey weights in this case.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

The mean imputed energy spending share and its standard deviation are generally similar across imputations in all the different imputation models but above the true observed values. Another indicator we use to assess the quality of the imputed energy spending share is the consistency and dispersion of the mean energy spending share across imputations. If the imputed values varied significantly across imputations, this might indicate that the imputation is unreliable and that another imputation method might be better suited. The figures in Appendix 4 reveal that, for most imputation models, the estimated mean energy spending share and its standard deviation barely vary across simulations. Moreover, all imputation models result in estimates slightly above the true observed mean, which suggests that the models might be subject to some limitations (e.g., because of the systematic differences demonstrated in Table 3 or because of differences in sampling strategies). When conducting a subgroup analysis for a random set of matching variables, some small differences between the imputed and observed average energy spending shares become evident. Appendices 5 and 6 present the results.
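The stability check across imputations described above can be sketched as follows. This is an illustrative Python fragment under assumed inputs, not the code behind Appendix 4: each inner list stands for one completed set of imputed energy spending shares, and the weights stand in for survey weights. All names and numbers are invented.

```python
from math import sqrt


def weighted_mean_sd(values, weights):
    """Survey-weighted mean and standard deviation of one imputation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, sqrt(var)


def stability_across_imputations(imputations, weights):
    """Range of the weighted mean and SD over all imputations."""
    stats = [weighted_mean_sd(values, weights) for values in imputations]
    means = [m for m, _ in stats]
    sds = [s for _, s in stats]
    return {"mean_range": (min(means), max(means)),
            "sd_range": (min(sds), max(sds))}


# Three toy imputations for three individuals (invented values).
weights = [1.0, 2.0, 1.0]
imputations = [
    [0.10, 0.12, 0.14],
    [0.11, 0.12, 0.13],
    [0.09, 0.12, 0.15],
]
summary = stability_across_imputations(imputations, weights)
print(summary)
```

A narrow range for both statistics corresponds to the pattern the text reports for most models; a wide range would point to an unstable imputation model.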
In addition, we increase the number of simulations to 100 and calculate the confidence interval of the estimates across the 100 imputed values.

We perform one last performance check by duplicating the HBS and assuming that the energy spending share in the duplicated version of the HBS is missing. We then impute the missing energy spending share in the duplicated HBS. Ideally, we would perform this kind of test on a subset of the EU-SILC for which energy spending shares are observed, but this type of data is not available in Romania. The test we perform on the HBS is less rigorous because it does not account for the systematic differences in the matching variables between the HBS and the EU-SILC. The results presented in Figures 6 to 8 show that PMM imputation models perform well. Nevertheless, the weighted PMM seems to slightly outperform the unweighted PMMs when weighting distributions by survey weights. Appendix 2 presents the related figures.

Figure 6: Imputed energy spending share of duplicated HBS – weighted PMM
Figure 7: Imputed energy spending share of duplicated HBS – unweighted PMM
Figure 8: Imputed energy spending share of duplicated HBS – unweighted PMM with survey weights
Notes: The graphs show the distribution of energy spending shares in green and imputed values in orange. Distributions are not weighted by survey weights in this case.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Final Model and Resulting Estimates

Based on these results, we chose the weighted PMM as our final imputation model.
While the unweighted PMM and the unweighted PMM with survey weights as a matching variable perform similarly, we choose the weighted PMM because we want to weight observations in future analyses that we plan to undertake with the fused dataset. Figure 9 shows the energy spending share's imputed mean and standard deviation when performing 20 imputations. The resulting estimate for the mean varies between 11.1 and 11.3 percent, and the standard deviation is between 0.134 and 0.154.

Figure 9: Imputed mean and standard deviation of energy spending shares - PMM
[Line chart; series: Mean, Std. Dev.; horizontal axis: imputation number]
Notes: The graph shows the means (in blue) and standard deviations (in orange) of imputed energy spending shares by imputation (0-20) when applying an unweighted PMM imputation. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

We can now overlay energy poverty with monetary poverty based on the imputed dataset. Figure 10 depicts the results. Only a small share of those who are monetary poor are not energy-poor at the same time. In addition, our results show that a significant share of the population that is not monetary poor is energy poor. Figure 11 reveals that, as expected, energy spending shares are higher in the lower income quintiles. Spending shares are nearly twice as high among the poor as among the non-poor.

Figure 10: Overlap of monetary and energy poor (2019/20)
[Bar chart; categories: Poor, Energy poor, Both, None]
Figure 11: Energy spending shares (in %) by income groups
[Bar chart by income group]
Notes: The left graph shows the overlap between monetary and energy poverty at the household level. Energy-poor households are all households with an energy spending share above 10 percent. Monetary poverty is the at-risk-of-poverty rate based on equivalized household income.
The right graph shows income groups' energy poverty rates (in %). Q1 is the lowest income quintile, based on per capita household income, and Q5 is the highest. Energy spending shares are the total household expenditure on energy (without car-related energy expenditure) as a ratio of total household income. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

IV. Conclusion

Evidence-based policy making in the EU that targets energy poverty, for example, due to the energy crisis induced by Russia’s invasion of Ukraine, is limited because household expenditure and welfare information is primarily collected in separate surveys from different samples. Evidence-based decision-making requires strong data environments and sound statistical capacity. Nevertheless, informed policy making is often restricted because of limited data availability. We show that this is the case for the energy crisis in the EU and, more broadly, for policy making that targets energy poverty. The increase in energy prices sparked by Russia’s invasion of Ukraine imposed significant pressure on households. To better understand which households are most affected and whether there is an overlap between the monetary poor and the energy poor, one would ideally use a dataset that measures each household's energy poverty and monetary poverty. However, this information is not available in most EU countries because welfare aggregates are measured in the EU-SILC, while expenditure information, including information on energy spending, forms part of household budget surveys. Consequently, it is not possible to jointly observe the energy poverty and monetary poverty status at the household level.

We bridge this data gap by applying statistical matching techniques to Romania's EU-SILC 2020 and HBS 2019.
In this study, we investigate the potential of statistical matching techniques to combine household living conditions and welfare data with household energy expenditure data from separate surveys. Specifically, we apply data fusion methods to the EU-SILC 2020, with an income reference year of 2019, and the HBS 2019 in Romania. The objective is to augment the EU-SILC with detailed information on energy spending shares. Given that the households selected for the HBS and EU-SILC do not overlap, traditional data linkage is not feasible, necessitating statistical matching. We begin by identifying several common variables that can function as matching variables. Subsequently, we select the most relevant matching variables using lasso regressions. Using these variables, we construct a consolidated dataset by harmonizing the HBS and EU-SILC information. Based on the resulting matched dataset, we can overlay each household's monetary poverty status with its monetary energy poverty status. This information can generate valuable insights into the welfare implications of rising energy prices in Romania.

We apply several imputation models and choose the best-performing one based on three criteria. First, we compare the distribution of resulting energy spending share estimates from the imputation to the distribution of the observed energy spending share. Second, we analyze to what extent the resulting mean estimates and standard deviations vary between simulations. We do so for the total population and for subgroups defined by a random subset of matching variables. Third, we investigate the performance of each model when assuming that energy spending shares are missing in the household budget survey. These performance measures show that the predictive mean matching imputation technique results in the best-performing model in the example.
We compare three different specifications and observe that a weighted PMM slightly outperforms the unweighted PMM.

We use the matched dataset to show that nearly all monetary poor are also energy-poor and that energy spending shares are significantly higher at the lower end of the welfare distribution. We overlay the resulting estimate of energy spending shares, and the resulting monetary energy poverty estimates, with monetary poverty measured as part of the EU-SILC. We show that nearly all of those at risk of poverty are also energy-poor but that an additional significant share of the population in Romania is energy-poor. Traditional welfare measures might, therefore, neglect the energy-poor population. Additional policy measures might be required to tackle energy poverty in Romania. Moreover, we show that energy spending shares are significantly higher at the lower end of the welfare distribution. Consequently, poorer households might suffer more due to rising energy prices, as they already face a more significant burden.

Our analysis is subject to important caveats. We show that matching variables significantly differ between both datasets and that there are differences in the sampling of individuals. Moreover, the time horizons of both surveys also differ. These systematic differences could significantly influence the degree to which imputation models can model the true underlying energy spending share of households sampled into the EU-SILC. In addition, like previous research, we cannot test whether the conditional independence assumption holds in the empirical example. This is a serious limitation in the analysis at hand. Lastly, the fact that we conduct imputations at the individual level while welfare is measured at the household level could introduce additional imprecision in the estimation procedure.

We recommend improving the harmonization of surveys and including expenditure information in a subset of the EU-SILC.
Several entry points could facilitate statistical matching techniques in Romania in the case of the two surveys discussed in this paper. First, better harmonizing different surveys, such as the EU-SILC and the HBS, is recommended to facilitate data fusion methods based on matching variables. In addition, harmonizing HBS between countries would make it easier to apply methods developed in one country to other countries in the EU. Lastly, including expenditure information in the EU-SILC, at least for a subsample or at an aggregated level, would facilitate the validation of matching techniques.

Finally, the results on energy and income poverty have several policy implications for targeting social programs, energy subsidies, and energy efficiency programs. First, as the monetary poor and energy poor overlap, policymakers may consider integrating energy assistance programs within existing social safety nets. This could help ensure that those struggling with monetary poverty also have access to affordable and reliable energy services, essential for their well-being and socio-economic development. Second, recognizing that not all energy-poor individuals are monetary poor, targeted energy subsidies could be implemented to support vulnerable populations who may not qualify for traditional monetary poverty assistance but still face challenges in accessing adequate energy services, especially in times of crisis. Finally, measures to encourage energy efficiency in the medium term can benefit both the monetary poor and energy-poor individuals. These initiatives can help reduce energy costs for low-income households and contribute to overall energy sustainability (see more details in Robayo-Abril et al. 2024). Therefore, policymakers should invest in comprehensive data collection and research to better understand the dynamics of energy poverty and its relationship with monetary poverty, as this information can lead to more effective and targeted policy interventions.
References

Alpman, A. (2016). Implementing Rubin's alternative multiple-imputation method for statistical matching in Stata. The Stata Journal, 16(3), 717-739.

Adăscăliţei, Dragoș and Cristina Rat and Marcel Spătari. Improving Social Protection in Romania. Friedrich Ebert Stiftung. Link: https://library.fes.de/pdf-files/bueros/bukarest/16834.pdf

Bacher, J., & Prandner, D. (2018). Datenfusion in der sozialwissenschaftlichen Wahlforschung–Begründeter Verzicht oder ungenutzte Chance? Theoretische Vorüberlegungen, Verfahrensüberblick und ein erster Erfahrungsbericht. Österreichische Zeitschrift für Politikwissenschaft, 47(2), 61-76.

Donatiello, G., D’Orazio, M., Frattarola, D., Rizzi, A., Scanu, M., & Spaziani, M. (2016a). The statistical matching of EU-SILC and HBS at ISTAT: where do we stand for the production of official statistics? In DGINS–Conference of the Directors General of the National Statistical Institutes (pp. 26-27).

Donatiello, G., D’Orazio, M., Frattarola, D., Rizzi, A., Scanu, M., & Spaziani, M. (2016b). The role of the conditional independence assumption in statistically matching income and consumption. Statistical Journal of the IAOS, 32(4), 667-675.

D’Orazio, M., Di Zio, M., & Scanu, M. (2006). Statistical matching: Theory and practice. Chichester: John Wiley (Wiley series in survey methodology). Link: http://dx.doi.org/10.1002/0470023554

European Commission (2015). Household Budget Survey 2015 Wave EU Quality Report. Link: 72d7e310-c415-7806-93cc-e3bc7a49b596 (europa.eu)

European Commission (2021). Methodological Guideline and Description of EU-SILC Target Variables. Link: 434b2180-33b3-0d8c-ed1e-2da912d6a685 (europa.eu)

Johnson and Young (2011). Towards Best Practices in analyzing Datasets with Missing Data: Comparisons and Recommendations. Journal of Marriage and Family, 73(5), 926-45.

GESIS (2021). Study: EU-SILC 2021. Link: GESIS: Missy - Study: EU-SILC 2021

GESIS (2022). European Union Statistics on Income and Living Conditions (EU-SILC). Link: https://www.gesis.org/gml/european-microdata/eu-silc

IEA Data Services (2022). Link: https://www.iea.org/countries/romania

Jauslin, R., & Tillé, Y. (2023). An efficient approach for statistical matching of survey data through calibration, optimal transport and balanced sampling. Journal of Statistical Planning and Inference, 225, 121-131.

Lamarche, P., Oehler, F., & Rioboo, I. (2020). European household's income, consumption and wealth. Statistical Journal of the IAOS, 36(4), 1175-1188.

Leulescu, A., & Agafitei, M. (2013). Statistical matching: a model based approach for data integration. Eurostat-Methodologies and Working papers, 10-2.

Moriarity, C., & Scheuren, F. (2003). A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 21(1), 65-73.

Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3), 1436-1462.

Quartagno, M., Carpenter, J. R., & Goldstein, H. (2020). Multiple imputation with survey weights: a multilevel approach. Journal of Survey Statistics and Methodology, 8(5), 965-989.

Rässler, S. (2004). Data fusion: identification problems, validity, and multiple imputation. Austrian Journal of Statistics, 33(1&2), 153-171.

Robayo-Abril, M., and Rude, B. (2024). Conceptualizing and Measuring Energy Poverty in Bulgaria. Policy Research Working Paper 10827. Washington, DC: World Bank. Link: http://hdl.handle.net/10986/41800

Robayo-Abril, Monica; Karver, Jonathan George; Rude, Britta Laurin; Tomio, Andrea Ailin; Silvestri, Alessandro; Cadena Kotsubo, Kiyomi Erika. “Romania Energy Poverty Assessment: Understanding and Addressing Energy Poverty in Romania - Exploring the Roles of Structural and Behavioral Constraints (English)”. Washington, D.C.: World Bank Group. Link: http://documents.worldbank.org/curated/en/099090924104039235/P1798901886cab0f5189e819f13e2a50387

Romania National Institute of Statistics (2022). Coordinates of living standards in Romania. Population Income and Consumption in 2021. Link: https://insse.ro/cms/en/tags/coordinates-living-standard-romania-population-income-and-consumption

Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 4(1), 87-94.

Schaller, J. (2021). Datenfusion von EU-SILC und Household Budget Survey–ein Vergleich zweier Fusionsmethoden. WISTA–Wirtschaft und Statistik, 73(4), 76-86.

Serafino and Tonkin (2017). Statistical matching of European Union statistics on income and living conditions (EU-SILC) and the household budget survey. Link: 3587dc1b-9f29-42cb-b0f9-0dfa21a47d41 (europa.eu)

StataCorp (2021). Stata: Release 17. Statistical Software. College Station, TX: StataCorp LLC.

Todosijević, B. (2012). Transfer of Variables between Different Data Sets, or Taking "Previous Research" Seriously. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 113(1), 20–39.

UCLA: Statistical Consulting Group (2022). Introduction to SAS. Link: https://stats.oarc.ucla.edu/sas/modules/introduction-to-the-features-of-sas/ (accessed August 22, 2021)

Van Buuren, S. (2012). Flexible Imputation of Missing Data (1st ed.). Chapman and Hall/CRC. Link: https://doi.org/10.1201/b11826

Appendix

Appendix 1 – Data Preparation

Preparing the HBS for Data Fusion

We prepare the HBS 2019 for the data fusion with the EU-SILC 2020 (with income reference year 2019). We first use household model S1 of the HBS to generate variables that describe individual-level characteristics of household members and household characteristics. We present these variables in Table A1. Next, we analyze the general conditions of the dwellings that households live in (Model S10A). Table A2 presents the resulting variables.
In addition, we generate variables on household living conditions (Model S11) (compare Table A3) and information related to housing (Model S6) (compare Table A4). As shown in Table A4, we annualize the spending variables on housing. We also annualize household income reported as part of the HBS (Figure A1). Similarly, we annualize households' energy spending. The energy spending is all spending on electricity (variable r511), thermal energy (variable r512), and natural gas (r513) of Model S6, as well as spending on non-durable forms of energy (product codes 340 to 344 (bottled gas, petroleum, liquid fuels for radiators, firewood, coal) in Model S5). We add these spending components at the household level and then annualize them. The resulting energy spending share is the amount spent on energy divided by household income (both annualized).

Table A1: Household and individual-level characteristics (HBS, 2019)

                                (1)      (2)       (3)      (4)   (5)
VARIABLES                       N        mean      sd       min   max
Female                          30,876   0.530     0.280    0     1
Married                         30,876   0.465     0.397    0     1
Divorced                        30,876   0.0586    0.207    0     1
Separated                       30,876   0.00788   0.0785   0     1
Widowed                         30,876   0.175     0.351    0     1
EU citizen                      30,876   0.00116   0.0251   0     1
Non-EU citizen                  30,876   0.00102   0.0276   0     1
Age                             30,876   47.77     19.57    7     100
No schooling                    30,876   0.0439    0.122    0     1
Primary schooling               30,876   0.0286    0.0912   0     1
Lower secondary schooling       30,876   0.0982    0.237    0     1
Upper secondary schooling       30,876   0.239     0.349    0     1
Secondary schooling             30,876   0.337     0.383    0     1
Vocational                      30,876   0.181     0.313    0     1
Technical training              30,876   0.249     0.349    0     1
Tertiary schooling              30,876   0.0507    0.174    0     1
PhD                             30,876   0.0285    0.137    0     1
Employee                        30,876   0.335     0.359    0     1
Employer                        30,876   0.00211   0.0335   0     1
Self-employed                   30,876   0.0297    0.123    0     1
Unemployed                      30,876   0.0145    0.0829   0     1
Student                         30,876   0.117     0.195    0     1
Domestic/care responsibility    30,876   0.0532    0.134    0     1
Household members               30,876   2.585     1.351    1     12
No. of children                 30,876   0.429     0.792    0     7
No. +65                         30,876   0.482     0.700    0     3
No. of unemployed               30,876   0.0432    0.232    0     4
No. labor force                 30,876   1.713     1.255    0     9
No. of female HH members        30,876   1.326     0.837    0     7
Female-headed HH                30,876   0.308     0.462    0     1
Macroregion 1                   30,876   0.290     0.454    0     1
Macroregion 2                   30,876   0.251     0.433    0     1
Macroregion 3                   30,876   0.220     0.414    0     1
Macroregion 4                   30,876   0.239     0.427    0     1
Urban                           30,876   0.468     0.499    0     1
No. of units                    30,876   1.008     0.0948   1     4

Source: HBS (2019) – Model S1

Table A2: Characteristics of dwellings (HBS, 2019)

                   (1)      (2)     (3)      (4)   (5)
VARIABLES          N        mean    sd       min   max
Washingmachine     30,876   0.775   0.418    0     1
Car                30,876   0.380   0.485    0     1
Pc                 30,876   0.537   0.499    0     1
Tv                 30,876   0.992   0.0869   0     1
Phone              30,876   0.959   0.199    0     1

Source: HBS (2019) – Model S10A

Table A3: Households' living conditions (HBS, 2019)

                            (1)      (2)      (3)     (4)   (5)
VARIABLES                   N        mean     sd      min   max
Can afford new furniture    30,876   0.0611   0.240   0     1
Can afford new clothes      30,876   0.269    0.443   0     1
Afford a week of travel     30,876   0.191    0.393   0     1
Energy unaffordability      30,876   0.0955   0.294   0     1

Source: HBS (2019) – Model S11

Table A4: Information on housing (HBS, 2019)

                                             (1)      (2)     (3)     (4)   (5)
VARIABLES                                    N        mean    sd      min   max
Rent charge and maintenance (electricity)    30,853   108.7   70.34   0     1,500
energy spending                              30,853   193.4   152.1   0     3,350
Annualized energy spending                   30,853   2,322   1,827   0     40,223
Rent paid monthly                            30,853   24.01   155.1   0     2,850
Annualized rent                              30,853   288.1   1,862   0     34,200
Spending on housing (monthly)                30,853   217.4   219.3   0     3,350
Annualized spending on housing               30,853   2,609   2,632   0     40,200

Source: HBS (2019) – Model S6

Figure A1: Histogram of household income from HBS (2019)
Figure A2: Histogram of household income from EU-SILC (2020)
Source: HBS (2019) and EU-SILC (2020, the reference year is 2019).

To analyze if any of the variables included in the HBS could serve as a matching variable, we analyze their correlation coefficients with energy spending shares. Table A5 presents the resulting coefficients.
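The correlation screening described above can be sketched as follows. This is a minimal, unweighted Python illustration with invented toy values; the variable names borrow from Table A5, but the numbers are not taken from the HBS. It covers only the correlation step, not the subsequent lasso selection.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_candidates(candidates, target):
    """Sort candidate variables by absolute correlation with the target."""
    corr = {name: pearson(values, target) for name, values in candidates.items()}
    return sorted(corr.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data for five households (invented for illustration).
energy_share = [0.20, 0.15, 0.12, 0.08, 0.05]
candidates = {
    "income": [900, 1400, 2000, 3000, 4500],
    "age": [70, 62, 45, 40, 33],
    "tv": [1, 1, 0, 1, 1],
}

ranked = rank_candidates(candidates, energy_share)
for name, r in ranked:
    print(f"{name:8s} {r:+.3f}")
```

Variables near the bottom of such a ranking, like the toy `tv` indicator here, carry little information about energy spending shares and are weak candidates for matching.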
Table A5: Correlation coefficients

Variable             Correlation Coefficient
age                   0.299807
widowed               0.244673
nelderly              0.203307
urban                 0.196951
vocational            0.177478
secondary             0.130471
femaleheaded          0.105601
impossible_energy     0.08193
sex                   0.080939
nrooms                0.051326
unemployed            0.023924
selfemployed          0.009049
separated             0.003417
stayathome            0.001387
eu                   -0.00441
numberhhs            -0.0082
tv                   -0.00927
employer             -0.00949
divorced             -0.01023
noneu                -0.01279
housing              -0.01337
noschool             -0.02388
phd                  -0.05984
tertiary             -0.06467
primary              -0.07231
rent                 -0.08397
furniture            -0.0919
nchildren            -0.10747
nfemale              -0.11286
married              -0.12194
student              -0.12991
phone                -0.1415
technical            -0.14182
clothes              -0.14976
washingmachine       -0.1857
members              -0.19179
weeklytravel         -0.19849
bath                 -0.22379
toilet               -0.22409
car                  -0.24776
nlaborforce          -0.26623
pc                   -0.29442
employee             -0.36188
Household income     -0.36946

Source: HBS (2019)

Preparing the EU-SILC for Data Fusion and Harmonization Process

We next prepare, in the EU-SILC data, all variables that are included in both the EU-SILC and the HBS. Table A6 gives an overview of the variables included in both datasets that can potentially be used as matching variables.
Table A6: Common variables and their coding (EU-SILC 2020 and HBS 2019)

HBS 2019 | Coding | EU-SILC 2020 | Coding
Number HH in dwelling | Continuous | Can be calculated | -
Region (regiune) | Discrete (1 to 8) | Region | Discrete (RO1, RO2, RO3, RO4)
Age (varsta) | Continuous (1 to 100) | Age (rx020 and rx010) | Continuous (1 to 81)
Sex | 1 (Male) and 2 (Female) | Gender (pb150) | 1 (Male) 2 (Female)
Family relationship to head | Discrete (1 to 10) | Person (RG_2) | Discrete (1 to 95)
Marital status | Discrete (1 to 6) | Marital status (PB190) | Discrete (1 to 5)
Country of origin | Discrete (1 to 3) | Country of birth (PB210) | String (EU, LOC, OTH)
Last graduation level (nive) | Discrete (1 to 10) | Highest ISCED obtained (PE040) | Discrete (0-800)
Main occupational status in the past 12 months (stocupan) | Discrete (1 to 14) | Self-defined current economic status (PL031) | Discrete (1 to 11)
Status of occupation of the place of residence (STOCUPL) | Discrete (1 to 7) | Tenure status (HH021) | Discrete (1 to 5)
Rent charge (R509) | Continuous | Current rent related to occupied dwelling (HH060) | Continuous
Can be calculated | - | Total housing cost | Continuous
Sewerage type (CANALIZ) | Discrete (0 and 3) | Bath or shower in the dwelling (HH081) | Discrete (1 to 3)
Toilet location (GRUPS) | Discrete (1 to 3) | Indoor flushing toilet for the sole use of household (HH091) | Discrete (1 to 3)
Replacement of the used/outdated furniture (LUX_02) | Discrete (0-2) | Replacing worn-out furniture (HD080) | Discrete (1 to 3)
Can be calculated | - | Household size (HX040) | Continuous
Can be calculated | - | Equivalised household size (HX050) | Continuous
Can be calculated | - | Household type (HX060) | Discrete (5 to 16)
Automatic washing machine (INZES_07) | Continuous (0-2) | Washing machine (HS100) | Discrete (0 to 3)
Motor Car (INZEs_26) | Continuous (0-4) | Car (HS110) | Discrete (0 to 3)
PC (INZES_16) | Continuous (0-5) | Computer (HS090) | Discrete (0 to 3)
Colour TV (INZES_24) | Continuous (0-5) | Colour TV (HS080) | Discrete (0 to 3)
Mobile Phone (INZES_25) | Continuous (0-8) | Telephone (HS070) | Discrete (0 to 3)
Sustenance of the residence (thermal energy, natural gas, etc.) | 0 and 2 | Ability to keep household warm (HH050) | 1 (Yes) 2 (No)
Capacity to afford a week away from home (LUX_01) | Dummy variable | Capacity to afford a week away from home (HS040) | 1 (Yes) 2 (No)
Capacity to buy new clothes (LUX_04) | 0 (No) 4 (Yes) | Replace worn-out clothes (PD020) | Discrete (1 to 3)

Notes: The table shows the variables included both in the EU-SILC and HBS. Rows in Red mean that the variables do not coincide in their labeling/coding. Rows in Green mean that these variables can be matched between both surveys without the need for additional data preparation.
Source: Compilation of authors based on EU-SILC (2020) and HBS (2019).

From these variables, we deduce the following potential matching variables and harmonize them:

1. Overarching variables:
• Number of households in dwelling
• Central region (1 to 4)
• Bath or shower in the dwelling
• Indoor flushing toilet for the sole use of household

2. Individual characteristics:
• Age
• Sex
• Marital status
• Country of origin
• Last graduation level
• Main activity status

3. Household characteristics:
• Household type (number of children, elderly, unemployed, those in labor force, number of female household members, female-headed)
• Household size
• Rent charged
• Housing costs (defined as rent and expenditure for housing maintenance)
• Replacement of furniture
• Replace worn-out clothes
• Capacity to afford a week away from home
• Ability to keep household warm
• Having a washing machine, car, computer, color TV, telephone

Table A7 presents the summary statistics of the resulting variables from the EU-SILC.
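As an illustration of what harmonizing the codings in Table A6 can involve, the sketch below recodes three differently coded items into common 0/1 dummies. It is a hypothetical Python fragment: the function names, the record keys for the HBS, and the choice to return None (missing) for any code not listed in Table A6 are assumptions, not the authors' procedure.

```python
def hbs_clothes_dummy(code):
    """HBS LUX_04 is coded 0 (No) / 4 (Yes) according to Table A6."""
    return {0: 0, 4: 1}.get(code)  # unlisted codes become None (assumption)


def silc_yes_no_dummy(code):
    """EU-SILC items coded 1 (Yes) / 2 (No), such as HS040."""
    return {1: 1, 2: 0}.get(code)  # unlisted codes become None (assumption)


def female_dummy(code):
    """Both surveys code sex as 1 (Male) / 2 (Female)."""
    return {1: 0, 2: 1}.get(code)


# Toy records; the key "sex" for the HBS is a placeholder name.
hbs_record = {"LUX_04": 4, "sex": 2}
silc_record = {"HS040": 2, "pb150": 1}

harmonized_hbs = {
    "clothes": hbs_clothes_dummy(hbs_record["LUX_04"]),
    "female": female_dummy(hbs_record["sex"]),
}
harmonized_silc = {
    "weeklytravel": silc_yes_no_dummy(silc_record["HS040"]),
    "female": female_dummy(silc_record["pb150"]),
}
print(harmonized_hbs, harmonized_silc)
```

Items whose categories do not line up one to one, such as the three-category EU-SILC clothing item PD020, need an explicit decision on how to collapse categories, which this sketch does not attempt.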
Table A7: Description of matching variables in EU-SILC (2020)

| VARIABLES | (1) N | (2) mean | (3) sd | (4) min | (5) max |
|---|---|---|---|---|---|
| Female | 7,356 | 0.480 | 0.317 | 0 | 1 |
| Married | 7,356 | 0.410 | 0.395 | 0 | 1 |
| Divorced | 7,356 | 0.0725 | 0.234 | 0 | 1 |
| Widowed | 7,356 | 0.204 | 0.375 | 0 | 1 |
| EU citizen | 7,356 | 4.32e-05 | 0.00465 | 0 | 0.500 |
| Non-EU citizen | 7,356 | 4.22e-05 | 0.00460 | 0 | 0.500 |
| Age | 7,356 | 48.83 | 18.88 | 10 | 81 |
| No schooling | 7,356 | 0.0106 | 0.0847 | 0 | 1 |
| Primary schooling | 7,356 | 0.0666 | 0.223 | 0 | 1 |
| Employee | 7,356 | 0.331 | 0.366 | 0 | 1 |
| Employer | 7,356 | 0.00796 | 0.0669 | 0 | 1 |
| Unemployed | 7,356 | 0.00896 | 0.0679 | 0 | 1 |
| Student | 7,356 | 0.0427 | 0.126 | 0 | 1 |
| Domestic/care responsibility | 7,356 | 0.0588 | 0.143 | 0 | 1 |
| Household members | 7,356 | 2.142 | 1.135 | 1 | 10 |
| No. of children | 7,356 | 0.395 | 0.793 | 0 | 8 |
| No. +65 | 7,356 | 0.487 | 0.680 | 0 | 4 |
| No. labor force | 7,356 | 1.692 | 1.355 | 0 | 9 |
| No. of female HH members | 7,356 | 1.105 | 0.702 | 0 | 8 |
| Female-headed HH | 7,356 | 0.488 | 0.500 | 0 | 1 |
| Urban | 7,356 | 0.598 | 0.490 | 0 | 1 |
| Has toilet | 7,356 | 0.770 | 0.421 | 0 | 1 |
| Has bathroom | 7,356 | 0.779 | 0.415 | 0 | 1 |
| No. of rooms | 7,356 | 2.838 | 1.032 | 1 | 6 |
| Has washing machine | 7,356 | 0.911 | 0.285 | 0 | 1 |
| Has car | 7,356 | 0.455 | 0.498 | 0 | 1 |
| Has PC | 7,356 | 0.598 | 0.490 | 0 | 1 |
| Has TV | 7,356 | 0.993 | 0.0811 | 0 | 1 |
| Has phone | 7,356 | 0.974 | 0.159 | 0 | 1 |
| Can afford new furniture | 7,356 | 0.205 | 0.403 | 0 | 1 |
| Can afford new clothes | 7,356 | 0.556 | 0.427 | 0 | 1 |
| Afford a week of travel | 7,356 | 0.423 | 0.494 | 0 | 1 |
| Energy unaffordability | 7,356 | 0.0970 | 0.296 | 0 | 1 |
| Rent paid monthly | 7,356 | 16.36 | 126.5 | 0 | 2,000 |
| Spending on housing (monthly) | 7,356 | 436.0 | 272.2 | 0 | 3,000 |

Source: EU-SILC (2020, income reference year 2019)

Before estimating multiple imputation models, we compare the distributions of the matching variables across the two surveys.

Figure A 3: Distribution of matching variables - EU-SILC versus HBS.
Notes: Observations are weighted by survey weights.
Source: Own estimation based on a harmonized, synthetic dataset consisting of HBS (2019) and EU-SILC (2020).

In addition to these comparisons of the matching variables, we also compare the distributions of disposable income between the 2020 EU-SILC and the 2019 HBS.
The histogram below shows that household disposable income for the income reference year 2019 follows a very similar pattern in the HBS and the EU-SILC dataset. A comparison of the moments of the two distributions also shows that they are relatively similar. This is consistent with past studies that have used imputations of the HBS into the EU-SILC.

[Figure: Household disposable income, monthly (equivalised). Vertical axis: density; horizontal axis: 0 to 10,000; series: SILC 2020 and HBS 2019.]

| | EU-SILC 2020 | HBS 2019 |
|---|---|---|
| Mean equivalised household disposable income (monthly, Lei) | 2,078 | 1,715 |
| Median equivalised household disposable income (monthly, Lei) | 1,495 | 1,848 |
| Sample size | 16,766 | 55,869 |

Notes: Observations are weighted by survey weights.
Source: Own estimation based on HBS (2019) and EU-SILC (2020).

Similar results are obtained when comparing equivalized household disposable income in the EU-SILC 2022 and the HBS 2022. Mean equivalized household income in the EU-SILC is 95% of that in the HBS. The ratio is similar across income deciles, except at the bottom of the distribution, where EU-SILC income is 56% of HBS income. These differences at the lower end of the distribution emphasize the importance of poverty measures using the EU-SILC.

Matched Dataset

To match the two datasets, we append the HBS data to the EU-SILC data. Importantly, we conduct the matching and imputations at the individual level because several of the variables included in both datasets are measured at the individual level. Imputing at the individual level may therefore yield more precise results. It also lets us abstract from the problem that the definition of a household in the EU-SILC might differ from the one in the HBS.
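As a rough illustration of this setup, the sketch below stacks toy HBS and EU-SILC records and fills the missing energy spending share with a simplified predictive mean matching (PMM) step. It is not the estimation used in this paper: it assumes a single matching variable (`housing_cost`), uses point-estimate regression coefficients rather than draws from their posterior, ignores survey weights, and runs on made-up numbers. The function names `fit_line` and `pmm_impute` are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def pmm_impute(stacked, predictor, target, k=5, seed=0):
    """Fill missing `target` values by a simplified predictive mean matching.

    Coefficients are point estimates, not posterior draws, so this
    understates imputation uncertainty compared with full PMM.
    """
    rng = random.Random(seed)
    donors = [r for r in stacked if r[target] is not None]
    a, b = fit_line([r[predictor] for r in donors], [r[target] for r in donors])
    donor_pred = [(a + b * r[predictor], r[target]) for r in donors]
    completed = []
    for r in stacked:
        r = dict(r)  # copy so the input records stay untouched
        r["imputed"] = r[target] is None
        if r["imputed"]:
            pred = a + b * r[predictor]
            # k donors whose predicted value is closest to the recipient's
            nearest = sorted(donor_pred, key=lambda d: abs(d[0] - pred))[:k]
            r[target] = rng.choice(nearest)[1]  # borrow an observed value
        completed.append(r)
    return completed


# Toy records (not survey data): HBS observes the energy share, EU-SILC does not.
hbs = [{"source": "HBS", "housing_cost": c, "energy_share": s}
       for c, s in [(200, 0.08), (350, 0.10), (500, 0.12), (700, 0.15), (900, 0.19)]]
silc = [{"source": "EU-SILC", "housing_cost": c, "energy_share": None}
        for c in (250, 800)]

stacked = hbs + silc  # appending the two surveys
completed = pmm_impute(stacked, "housing_cost", "energy_share", k=2)
```

Because PMM borrows observed donor values, every imputed share is a value that actually occurs in the donor survey, which keeps imputations inside the plausible range of the variable.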
Appendix 2 - Distribution of Energy Spending Shares - weighted distributions

Figure A 4: Distribution of energy spending shares (true versus imputed) – multivariate normal imputation method.

Figure A 5: Distribution of energy spending shares (observed versus imputed) – Regression imputation method.

Figure A 6: Distribution of energy spending shares (observed versus imputed) - Weighted PMM.

Figure A 7: Distribution of energy spending shares (observed versus imputed) - unweighted PMM.

Figure A 8: Distribution of energy spending shares (observed versus imputed) - unweighted PMM with weight as control.

Notes: The graphs show the distribution of energy spending shares in Green and imputed values in Orange. Distributions are weighted by survey weights in this case.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Appendix 3 - Distribution of Energy Spending Shares by Groups Weighted PMM

Figure A 9: Distribution of energy spending shares (observed versus imputed) - Weighted PMM - Female.

Figure A 10: Distribution of energy spending shares (observed versus imputed) - Weighted PMM - Male.

Figure A 11: Distribution of energy spending shares (observed versus imputed) - Weighted PMM – low-income

Figure A 12: Distribution of energy spending shares (observed versus imputed) - Weighted PMM – high-income

Notes: The graphs show the distribution of energy spending shares in Green and imputed values in Orange. Distributions are weighted by survey weights in this case.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Appendix 4 – Average and Standard Deviations of Imputed Energy Spending Shares

Figure A 13: Mean and standard deviation of energy spending shares across imputations - MVN.
Notes: The graph shows the means (in Blue) and standard deviations (in Orange) of imputed energy spending shares by imputation (0-10). 0 is the true observed value. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Figure A 14: Mean and standard deviation of energy spending shares across imputations – regression imputation method.
Notes: The graph shows the means (in Blue) and standard deviations (in Orange) of imputed energy spending shares by imputation (0-10). 0 is the observed value. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Figure A 15: Mean and standard deviation of energy spending shares across imputations – unweighted PMM.
Notes: The graph shows the means (in Blue) and standard deviations (in Orange) of imputed energy spending shares by imputation (0-10). 0 is the observed value. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Figure A 16: Mean and standard deviation of energy spending shares across imputations – weighted PMM.
Notes: The graph shows the means (in Blue) and standard deviations (in Orange) of imputed energy spending shares by imputation (0-10). 0 is the observed value. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Figure A 17: Mean and standard deviation of energy spending shares across imputations – unweighted PMM with survey weights as an additional matching variable.
Notes: The graph shows the means (in Blue) and standard deviations (in Orange) of imputed energy spending shares by imputation (0-10). 0 is the observed value. Observations are weighted by the respective survey weights.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Appendix 5 – Weighted Distribution of Imputation in Duplicated HBS

Figure A 18: Imputed energy spending share of duplicated HBS - weighted PMM.

Figure A 19: Imputed energy spending share of duplicated HBS - unweighted PMM.

Figure A 20: Imputed energy spending share of duplicated HBS - unweighted PMM with survey weights as control.

Notes: The graphs show the distribution of energy spending shares in Green and imputed values in Orange. Distributions are weighted by survey weights in this case.
Source: Matched dataset, consisting of EU-SILC (2020, income reference year 2019) and HBS (2019).

Appendix 6 – Mean Energy Spending Share by Selected Set of Matching Variables

Notes: Observations are weighted by survey weights.
Source: Own estimation based on a harmonized, synthetic dataset consisting of HBS (2019) and EU-SILC (2020).
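The survey-weighted group means reported in this appendix can be computed along the lines of the sketch below. It is a minimal illustration on made-up records, assuming each record carries a `source`, a survey `weight`, an `energy_share`, and one matching variable (here a hypothetical `urban` dummy). The function name `weighted_mean_by_group` is likewise an assumption, not the authors' code.

```python
from collections import defaultdict


def weighted_mean_by_group(records, group_key,
                           value_key="energy_share", weight_key="weight"):
    """Survey-weighted mean of `value_key` for each (source, group) cell."""
    num = defaultdict(float)  # sum of weight * value per cell
    den = defaultdict(float)  # sum of weights per cell
    for r in records:
        cell = (r["source"], r[group_key])
        num[cell] += r[weight_key] * r[value_key]
        den[cell] += r[weight_key]
    return {cell: num[cell] / den[cell] for cell in num}


# Toy records (not survey data): observed shares in HBS, imputed in EU-SILC.
records = [
    {"source": "HBS", "urban": 1, "energy_share": 0.10, "weight": 2.0},
    {"source": "HBS", "urban": 1, "energy_share": 0.16, "weight": 1.0},
    {"source": "HBS", "urban": 0, "energy_share": 0.20, "weight": 1.0},
    {"source": "EU-SILC", "urban": 1, "energy_share": 0.12, "weight": 3.0},
]
means = weighted_mean_by_group(records, "urban")
```

Comparing the HBS and EU-SILC cells for the same group value gives a quick check on whether the imputed shares reproduce the observed pattern within each category of a matching variable.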