Policy Research Working Paper 11039 Exploring Geospatial-Based Approaches to Develop a Pre-Census National Sampling Frame in Armenia Saida Ismailakhunova Avralt-Od Purevjav Tsenguunjav Byambasuren Sarchil Qader Poverty and Equity Global Department January 2025 Policy Research Working Paper 11039 Abstract The lack of a workable and accurate national sampling the pre-census enumeration areas tool, leveraging a semi-au- frame is one of the methodological constraints in conduct- tomatic and resource-efficient approach and the most recent ing representative national surveys. It undermines policy data available. The generated national sampling frame offers and research efforts in many developing countries, partic- several advantages over the existing potential sampling ularly those experiencing significant internal displacement frames in the country, which are based on census settle- and relocation due to territorial and military conflicts. This ments, electoral precincts, grid sampling, and old census paper addresses this issue by constructing the first digitized enumeration areas. This new frame is applied to conducting national sampling frame in Armenia, a conflict-affected the World Bank’s “Listening to Armenia” survey, demon- developing country, where reliable and accessible national strating its potential for other socioeconomic surveys in the sampling frames for household and individual surveys are country. The method can also be employed to generate and severely limited. The first-ever sensible digitized urban and update national sampling frames in other countries more rural classification boundaries in the country have been cre- efficiently, given that manually generating both the digi- ated for the sampling frame. The paper constructs a national tized and non-digitized national sampling frames requires sampling frame that meets international standards based on substantial financial and non-financial resources. This paper is a product of the Poverty and Equity Global Department. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at sismailakhunova@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team Exploring Geospatial-Based Approaches to Develop a Pre-Census National Sampling Frame in Armenia Saida Ismailakhunova Avralt-Od Purevjav Tsenguunjav Byambasuren Sarchil Qader1 JEL Classification: C55, C83, C80 Keywords: Sampling frame, Enumeration areas, Census, Geospatial, Household surveys, Developing countries 1 Ismailakhunova: Poverty and Equity Global Practice, The World Bank, Tbilisi, Georgia (email: sismailakhunova@worldbank.org); Purevjav: Poverty and Equity Global Practice, The World Bank, Washington, DC 20433, United States (email: apurevjav@worldbank.org); Byambasuren: Dyson School of Applied Economics and Management, Cornell University, 438 Warren Hall, Ithaca, NY 14853, United States & Poverty and Equity Global Practice, The World Bank, Washington, DC 20433, United States (email: tb497@cornell.edu); Qader: WorldPop, University of Southampton, University Road, Southampton SO17 1BJ, United Kingdom (email: s.qader@soton.ac.uk). Acknowledgments: We thank Obert Pimhidzai, Ambar Narayan, Carolin Geginat, Juan Muñoz, Walker Kosmidou-Bradley, Narine Tadevosyan, Hans Peterson, and the UNICEF Armenia team for helpful feedback and comments. We are also grateful to CRRC-Armenia for sharing the sampling frame based on the 2011 census settlements and the 2023 electoral precincts. 1 Introduction The national sampling frame, a list of primary sampling units (PSUs) such as enumeration areas from a recent census, is essential for collecting reliable data. This is a global challenge in conducting representative surveys necessary for high-quality research and policy analysis with minimum bias in many developing countries, especially in conflict-affected nations. Most countries, including the Republic of Armenia, have national sampling frames based on census enumeration areas developed for past population censuses. However, the census frames are often outdated, non-digital, incomplete, and restrictive or hardly accessible to parties other than local governmental organizations, such as researchers, academics, and international organizations. These issues are particularly severe in Armenia as the country recently experienced a large-scale refugee crisis and dramatic change in population distribution across regions over time due to the Nagorno-Karabakh Conflict. 2 The lack of accessible and workable population sampling frames due to these challenges undermines the initiatives and efforts to collect surveys for statistical, policy, and research purposes. This paper first systematically assesses the existing sampling frames in the country, focusing on those that leverage census settlements—villages in rural areas, towns in urban areas, districts in the capital city, Yerevan—and electoral precincts as PSUs. The strengths and limitations of these sampling frames and potential solutions to those challenges and drawbacks have also been discussed. To overcome the limitations of those existing and accessible national sampling frames in the country, this paper thus constructs an innovative national sampling frame based on pre-census enumeration areas (pre-EAs) using a novel approach and the most recent population estimates. The primary challenge with using census settlements as PSUs is the relatively few and large-sized PSUs, making the sampling extremely challenging and costly even in a stratified two-stage cluster sampling design via increasing the monetary and non-monetary cost of household listing. Especially, in this frame, Yerevan is split into only 12 districts with large populations and extensive territories. All districts will likely be selected in the first stage of the two-stage stratified cluster sampling design according to the probability proportional to size (PPS) method. In the second stage, thus, the survey conductor is likely to conduct the household listing, the ideal and preferred approach for the second stage of the two-stage sampling, in the entire Yerevan, which would take the costs and efforts required in the Census. So, under this sampling frame, household listing is almost impossible for household surveys. 3 Electoral precincts, on the other hand, offer relatively 2 Armenia and Azerbaijan have been in a territorial conflict over the Nagorno-Karabakh area which has a majority ethnic-Armenian population but was controlled by Azerbaijan before the collapse of the Soviet Union in the late 1980s. Following the collapse of the Soviet Union, the local parliament of Nagorno-Karabakh voted to become part of Armenia, which led to an ethnic clash between the two countries. Then the first Nagorno-Karabakh War started and ended in 1994 with an agreement stating that the area remains part of Azerbaijan but run by a self-declared republic of ethnic Armenians, backed by the Armenian government. However, Azerbaijan regained power over the area in 2020, and more than half of the estimated 120,000 ethnic Armenians fled from Nagorno-Karabakh to Armenia since the ceasefire (https://www.bbc.com/news/world-europe-66852070). 3 Although it is not ideal and not encouraged, one might consider using a random walk approach to select households in the second stage of the two-stage sampling as a last resort under pressing constraints. However, sample households will likely be concentrated in particular areas or demographic groups under this approach if the random walk is not properly executed, which happens most of the time, because the enumerators can walk around the starting point. Listing the households is preferred over random walk but requires time and financial resources. Survey conductors tend to use random walks if subject to those constraints, like the Listening to Armenia Survey. If the starting point and skipping steps are randomly selected, the random walk approach would be essentially the same as random sampling with equal probabilities. However, households selected via random walk can be concentrated around certain areas, leaving out most PSU territories. Consequently, representativeness suffers, and only a sub- population might be sampled. Due to these and other issues of the random walk method that can introduce serious biases, it is 2 more PSUs with smaller sizes; however, cartography or boundary information is often unavailable for this type of sampling frame like in Armenia and the PSU sizes are still quite large, indicating a costly procedure for household listing as well. 4 Hence, alternative cost-effective national sampling frames based on reasonably- sized PSUs with digital boundaries are worthwhile and necessary to facilitate conducting household surveys essential for policy making and research analysis. The national sampling frame is commonly based on three sources: Census enumeration areas (EAs), 5 sub- national administrative boundaries, and a gridded population sampling frame. The census is conducted every 5-10 years or more, depending on financial and administration costs and the nature of the census questions. For example, Armenia’s sampling frame based on census settlements relies on the 2011 Census, which is severely outdated due to demographic changes associated with population displacements and mobility across regions in response to territorial conflicts. 6 The Committee of the Republic of Armenia (ArmStat) conducted a Population Census using a combined approach of administrative data and sampled census in 2022, providing a potential up-to-date national sampling frame. Due to the COVID-19 pandemic, the census was held in a hybrid format, which was the case in many countries. However, this frame, based on a sample of 25% of the addresses in the State Population Register (SPR), is restrictive and unreachable. Because this dataset is inaccessible, it is crucial to have an alternative methodology to update the dataset from the previous census. 7 One of the early techniques to overcome this challenge is grid sampling, where cells—rectangular areas with population estimates—serve as PSUs. The sample frame relies on millions of grid cells publicly available from different data sources, such as the Gridded Population of the World (GPW), which can be accessed through platforms like Google Earth Engine. The grid size varies across data sources, and the choice of the grid size and data provider differs across countries. The grid sampling, however, is subject to some issues. First, grid boundaries are unnatural and can go through buildings and disregard visible features. Second, although the spatial size of the grids is equal, the population size of the grids is unequal. Sparsely populated grids need to be collapsed, and densely populated grids need to be segmented (Qader et al., 2020, 2021). 8 This paper thus provides Armenia’s first digital national sampling frame, successfully developed using various innovative geospatial techniques and datasets. Multiple datasets and maps have also been produced to facilitate the ground survey data collection. The proposed national sampling frame, where PSUs or EAs are automatically created, offers several advantages over other sampling frames (Qader et al., 2021). First, the EAs have natural boundaries like streets and natural elements like rivers. Second, administrative boundaries are strictly respected because the EAs are nested within administrative boundaries by design. While some not widely accepted as a standard approach. This paper does not discuss the method in detail since the focus is on the national sampling frame. 4 The sampling frame that relies on the list of electoral precincts has been used by the “Caucasus Research Resource Center – Armenia” Foundation (CRRC-Armenia), an independent, non-partisan research center, for the Caucasus Barometer Biennial Survey that studies citizens’ views on democracy, governance, political reforms, and socio-economic developments in the region. 5 The enumeration areas are the operational geographic units for the collection and dissemination of census data and are often used as a national sampling frame for various types of socio-economic surveys. 6 Figure A.1 visually demonstrates an example of significantly changing population distribution over time to show the magnitude of the shifting population, which indicates the limitation of sampling frames with outdated population information. 7 Several attempts have been made to access the State Population Register (SPR) but failed to obtain the sampling frame based on SPR addresses. An alternative approach thus has been explored due to the time constraint. 8 A sampling frame based on the traditional grid sampling method has been explored. However, several challenges are encountered for this approach. For example, it is challenging to identify the grids' settlement type (rural or urban). Identifying administrative units, e.g., districts or villages, to which a particular grid belongs is also challenging. In addition, this strategy has other limitations, such as grid boundaries cutting buildings or settlements. 3 automatically created boundaries can be unnatural, creating EAs via an automatic procedure avoids any geometric errors (e.g., pockets, disjoint sections, and overlap) and is much more resource-efficient than manual creation by a human cartographer. 9 Third, the population estimates are generally homogeneous, i.e., neither too small nor too large, even though they require further adjustments. Although some basic comparisons were carried out to evaluate the gridded population outputs in this paper, comprehensive validation of gridded population estimates is out of the scope of this paper due to a lack of data, time, and resources. The datasets used to create a semi-automatic mapping of pre-EAs include high-resolution gridded populations, the spatial distribution of settled areas, and publicly available natural and administrative boundaries from numerous sources such as OpenStreetMap (OSM) and WorldPop. The paper also implements cross-validation exercises to examine the applicability of the proposed sampling frame and compare it with other existing sampling frames in the country. For this purpose, several datasets have been leveraged, including (i) the 2011 Census with spatial information on marzes 10 and settlement type (urban or rural), (ii) Census settlements that are based on the 2011 Census, (iii) 2023 electoral precincts, and (iv) other population data from the ArmStat at the aggregate level, such as marz-settlement type (urban and rural regions). This paper contributes to the literature and the field of survey sampling in several ways. First, the paper develops a national sampling frame in Armenia based on pre-census enumeration areas (pre-EAs) as an exemplary case to demonstrate the applicability of the semi-automated spatial technique which can benefit many other countries. The method has been implemented and tested in several countries such as Somalia, which lacks a digital national sampling frame (Qader et al., 2020, 2021), Cameroon, which has a special demand for a customized national sampling frame for refugees, the Democratic Republic of Congo (Qader et al., 2023), and Burkina Faso (Qader et al., 2022). The tool has also been piloted in Uzbekistan as the first case in Central Asia. However, the implementation of the approach in Armenia is much more extensive. Second, the national sampling frame created in this paper contributes to the survey collection in Armenia. Depending on their availability, using various sampling frames within the country makes it challenging for researchers, policy makers, and users of household and individual survey datasets to compare results across surveys. This paper thus provides a methodological contribution by offering a national sampling frame based on publicly available datasets, which contributes to avoiding the use of frames with different population estimates. This supports all individuals and organizations conducting household surveys by providing a standardized frame that ensures consistency across surveys and decentralizes the sources of the national sampling frame. Third, a rigorous evaluation of various sampling frames, including the new national sampling frame, provided in this paper contributes to the survey sampling literature and practice in Armenia. To the best of our knowledge, no study has systematically compared various sampling frames yet, especially in Armenia. Our 9 The manual approach to delineate enumeration areas (EAs) can also be challenging due to security restrictions on accessing fragile, conflict-affected, and violent areas, like COVID-19 restrictions. 10 Armenia is subdivided into eleven administrative divisions. Of these, ten are provinces, known as marzer (մարզեր) or in the singular form marz (մարզ) in Armenian. The marz is the first administrative unit in Armenia, which is equivalent to a region or state. Yerevan is treated separately and granted special administrative status as the country's capital. 4 sampling frame does not necessarily substitute the existing sampling frames but could complement them, particularly the national census frame, and can be used as an alternative. Finally, this approach creates demand for public datasets like OpenStreetMap. The existence of such high- quality, public, geospatial data can return tens to hundreds of millions of dollars in societal value, even without counting their indirect benefits (Hansen and Schrøder, 2019). 2 Existing and Accessible Sampling Frames The household survey is one of the most popular methods for collecting population and socioeconomic indicators. An accurate sampling frame is essential to carry out household and other surveys at the national level. These frames are frequently created and updated during the national housing and population census. Some countries may not have a digital national sampling frame because their census was not digitized and more seriously some might not even have a physical map of the census frame, or it might be not accessible. Additionally, the frames might be outdated because demographic factors change quickly, and most censuses are conducted every ten years. In Armenia, there is no workable and accessible map or cartographic information that can be used for a national sampling frame. It is a significant obstacle to conducting nationally representative socioeconomic surveys. There is also no up-to-date digitalized national sample frame in the country. The last census completed in Armenia was conducted in 2011 (ArmStat, 2024), and there are currently no up-to-date enumeration areas that can be employed as national sampling frames for representative socioeconomic surveys. The spatial resolution of the census results in the current use is restricted to the provincial or district level (2nd and 3rd administrative units), making it difficult to understand how people are distributed at the facility, sub-district, or neighborhood levels—the levels where policy interventions are typically provided. In developing countries, creating a sampling frame for surveys using representative community samples usually involves manually delineating small geographic areas (or enumeration areas) on high-resolution satellite imagery. This is a standard method employed by multiple National Statistical Offices (NSOs) but is logistically complex, and requires substantial material resources, Geographic Information System (GIS) experts, and training (Qader et al., 2020, 2021). In addition, the process is extremely time- consuming and expensive and often leads to postponing the survey. For example, it can take 2-3 years to carry out a survey (Grosh and Glewwe, 1995). These factors demand a fast and inexpensive sampling frame and population enumeration methodologies. In this paper, a digital national sampling frame has been created using a semi-automatic and cost-effective approach. However, background on the existing sampling frames in the country is provided before discussing the new sampling frame. Three existing datasets that can serve as a potential sampling frame for household surveys include the 2022 Census with addresses from the State Population Register (SPR), the 2011 Census with settlements (villages in rural areas, towns in non-Yerevan urban regions, and districts in Yerevan), and 2023 election data with electoral precincts. The ArmStat conducted a Population Census in November 2022 using a combined approach based on administrative data from the State Population Register (SPR) and a sample of 25% of the addresses in the SPR (ArmStat, 2022). In this dataset, primary sampling units (PSUs) are the workload of SPR addresses for each enumerator during the census. Although these PSUs do not have identifiable boundaries like the traditional census enumeration areas (EAs), this dataset can be used as a sampling frame since they cover approximately 93% of all addresses in the country. The household listing in this dataset is as of October 5 2022; however, the addresses are distant because the EAs are large. Regardless of its strengths and limitations, this frame is inaccessible as the data is only available on a restricted computer at ArmStat (Pettersson, 2023). Therefore, our focus is on the latter two datasets, including the 2011 settlement-based frame and the 2023 electoral precinct-based frame, and their advantages and disadvantages are described below. 2.1 Census Settlements The most granular spatial information in this dataset is “settlement”. There are 1,037 settlements (980 villages, 45 towns, and 12 districts in Yerevan). The main advantage of this frame is that it identifies slightly over 1,000 geographical areas in the country, which is better than having only regions or marzes. Suppose that 400 settlements were selected in the first stage of the two-stage stratified cluster sampling design as primary sampling units (PSUs), and all 12 districts in Yerevan are likely to be selected via probabilities proportional to size (PPS) approach in the first stage due to the large population in these large districts. Supposedly, if 10 households were randomly selected from each district in the second stage, the sample size from Yerevan would be 120 households, only 3% of the total sample of 4,000 households. However, approximately 35% of the population and 38% of total households live in Yerevan according to the 2011 Armenia Census. One can select a disproportionate number of households from each PSU in the second stage to account for the variations in the size of the PSUs in the first stage. For example, selecting 100 households from each district in Yerevan would yield 1,200 households from Yerevan, which is 30% of the sample. It is better than having spatial information only at the level of administrative units or strata (strata is defined as a combination of marz and settlement types–urban or rural status–following other official and major surveys) like in several other easily accessible datasets. The primary and more serious disadvantage of this sampling frame, however, is the large-sized settlements, especially in Yerevan. Although the identification information is unavailable, the Census frame from the Committee of the Republic of Armenia (ArmStat) has about 12,000 enumeration areas. The settlements in this frame are thus 12 times larger than the Census enumeration areas on average. Using a few large PSUs would also conceptually undermine the two-stage sampling design, converging to a one-stage design. Large “settlements” thus must be segmented into smaller areas to make workable PSUs. 11 For example, large-sized PSUs were manually segmented into smaller PSUs in the past, such as in Nepal (Central Bureau of Statistics of Nepal, 1996); however, this traditional or manual method is costly and time-consuming in practice. Thus, an innovative technique has been proposed in this paper to divide these large areas into smaller areas. Another problem with this potential sampling frame is that the population based on the 2011 Census in this sample frame is outdated and misallocated. The outdated data is generally not a severe concern for national sampling frames because any survey fielded before the 2022 Armenia Census could be based on the 2011 Census frame. However, this problem can be significant in Armenia, given that it has been experiencing substantial household displacement and domestic migration due to territorial conflicts in recent years, i.e., the current spatial distribution of the population might be significantly different from that during the 11If these large census settlements were divided into grids, challenges related to the traditional gridded sampling approach would be raised, such as building-cutting boundaries. 6 previous census. One can update the population using population growth at some aggregate level. If the population is updated at a more granular level than administrative units, changes in the population distribution across areas can at least partly resolve problems due to outdated data. However, any adjustments will not change the PSU’s probability of being selected in the first stage if the adjustments are made at the administrative units or more aggregate level, such as regions. The reason is that the PPS is specific across administrative units, and any monotonic transformation of PSU size within the administrative units cannot change the PSU’s selection probability. Therefore, the population and number of households should be adjusted at a level more granular than administrative units. 2.2 Electoral Precincts The data on electoral precincts is publicly available and generally trustworthy as the official general election committee publishes the data. There are two advantages to using this dataset as a sampling frame for household surveys. First, the data is regularly updated and current. Second, there are 1,992 electoral precincts, more than the 1,037 settlements in the 2011 Census settlement data. Therefore, the spatial information is relatively more granular than the 2011 Census data, and problems associated with a few large-sized PSUs are less severe than those under a potential sampling frame based on census settlements. However, like the “settlements” in the 2011 Census data, employing electoral precincts as PSUs also poses a problem associated with large-sized PSUs. It is also worth mentioning that there are also some small-sized electoral precincts, which are less problematic. As stated in Pettersson (2023), the size of voting point areas in Armenia ranges from 7 addresses to 1,200 addresses. The small voting point (VP) areas can be combined with a neighboring area or neighboring areas, and large VP areas can be divided into smaller segments. The procedure for merging multiple VP areas should be straightforward; however, segmenting large areas could be costly as it requires some spatial analysis and probably cartographic tasks. Additionally, the boundaries of PSUs are needed because the enumerator might go beyond the targeted area if they do not know the boundaries. This feature is absent in both sampling frames under investigation in this section since there is no boundary (neither digitally nor physically) available for the Census settlements and electoral precincts. However, it is more disadvantageous for the precinct-based sampling frame because precincts are relatively smaller in area than settlements. The probability of enumerators walking into the non- selected PSUs that neighbor with the selected PSU is thus higher for precincts. Enumerators can walk outside the boundary for some very small electoral precincts, such as the one with 7 addresses, if not provided with a proper map while doing the fieldwork. The pros and cons of a frame based on the 2023 electoral precincts show that it is relatively better than the frame based on the 2011 Census settlements. Thus, the 2023 electoral precincts have been further evaluated as a sampling frame for representative individual- and household-level surveys by exploring the data on electoral precincts. The size of the electoral precinct is measured by the number of voters or adult population, which leaves out children or individuals younger than 18 years old. Table 1 shows the distribution of strata size using total and adult population. The strata are similarly defined as earlier, i.e., the combination of marz and settlement type—urban or rural status. The 2020 population is obtained from the WorldPop predicted based on the 2011 Census (Bondarenko et al., 2020). Then the 2020 population has been predicted further to 2022 as shown in Column 1 using strata-level population growth obtained from the Committee of the Republic of Armenia (ArmStat). The estimated total population in 2022 in this data is 2.9 million, close to the official statistics from ArmStat. Column 2 displays the 2023 number of voters (adult population who are 18 or older) obtained from the electoral precinct-based sampling frame. Column 3 presents the difference between the population and the adult population in subsequent years. Although these two population 7 numbers are in different years, there are some irregular instances where the adult population exceeds the total population by approximately 15,000 people, for example, in rural areas of Ararat. Table 1: Population and number of voters across strata Population Number of voters Difference (2022 estimate Settlement (18 or older, 2023 between total and Marz name Strata ID based on 2020 type election data) adult population WorldPop data) (1) (2) (3) = (1) - (2) Aragatsotn Urban 1 26,738 27,847 -1,109 Aragatsotn Rural 2 98,949 82,622 16,327 Ararat Urban 3 72,294 56,730 15,564 Ararat Rural 4 136,288 150,805 -14,517 Armavir Urban 5 82,953 76,205 6,748 Armavir Rural 6 183,009 137,823 45,186 Gegharkunik Urban 7 65,902 61,823 4,079 Gegharkunik Rural 8 161,812 109,803 52,009 Kotayk Urban 9 137,493 126,680 10,813 Kotayk Rural 10 108,771 94,876 13,895 Lori Urban 11 124,050 134,962 -10,912 Lori Rural 12 87,532 76,157 11,375 Shirak Urban 13 133,620 128,691 4,929 Shirak Rural 14 96,856 77,832 19,024 Syunik Urban 15 90,205 65,044 25,161 Syunik Rural 16 44,350 33,042 11,308 Tavush Urban 17 49,859 39,214 10,645 Tavush Rural 18 69,943 58,907 11,036 Vayots Dzor Urban 19 16,160 16,811 -651 Vayots Dzor Rural 20 31,501 25,678 5,823 Yerevan Urban 21 1,098,866 824,317 274,549 Armenia 2,917,151 2,405,869 511,282 Notes: The 2023 election data on the number of voters or adult population and other spatial information are obtained from the CRRC-Armenia. Column (3) presents the difference between the 2022 population updated based on WorldPop data and the 2023 number of voters or adult population from the election data. As shown in Column 2 of Table 1, the total number of voters or adult population is 2.4 million, which is quite close to the total population. It indicates approximately 18% of the population is children below 18 years old. However, children below 18 years of age account for around 23%-24% of Armenia’s population in other datasets. So, Armenia’s total number of adult populations from different and official data sources has been investigated to compare the aggregate numbers. Table 2 reports the findings. The share of the adult population from the 2011 Armenia Census (population 0-17 years old) is 77%, and the adult population share based on the combination of 2022 World Bank data (population 0-14 years old) and the 2011 Armenia Census (population 15-17 years old) is 76%, indicating that election data overstates the number of adult populations by 4%-5%. Despite these discrepancies, the PSU size measured by the number of adults or voters is not a serious concern since the total and adult populations across strata or administrative units are strongly and positively correlated with the correlation coefficient = 0.996 (-value = 0.000). 8 Table 2: Share of adult population in different datasets Adult population Adult population Number of voters (18 or older, 2011 (18 or older, 2022 World Bank data & (18 or older, 2023 Armenia Census) 2011 Armenia census) election data) (1) (2) (3) Share in total population 77% 76% 81% Notes: The 2011 Census data used in Column 1 reports age-specific population, and the share of the population with 18 or older is shown. In addition to the absolute value of PSU size, the distribution of the size measure across PSUs is also important. Figure 1 thus plots the distribution of the 2023 adult population across electoral precincts. It is ideal to have equally sized PSUs or the PSU size to be evenly distributed in the sample frames, and the population in census frames tends to be normally distributed with few small and large PSUs. However, the distribution of voters across electoral precincts is U-shaped. The precinct size ranges from 10 to 2,061 voters, with the mean (median) size of 1,208 (1,399), confirming that merging and segmentation are necessary to make the electoral precinct-based frame workable, consistent with the distribution of households in Pettersson (2023). Figure 1: The distribution of voters or adult population across electoral precincts in Armenia in 2023 Finally, no other major and official surveys use electoral precincts for sampling frame purposes in Armenia. Major surveys such as the Demographic and Health Survey (DHS) for Armenia leverage sample frames based on enumeration areas. 12 Table 3 lists the sampling frames of the major surveys in Armenia. The lack of a 12 Refer to Appendix A of Armenia’s DHS 2015-2016 Report for details about its sample design (https://dhsprogram.com/pubs/pdf/FR325/FR325.pdf). 9 usable and accessible sample frame based on enumeration areas demands a workable and up-to-date national sampling frame in the country. Thus, given the unsuitability of the existing sample frames, this paper develops a national sampling frame based on pre-defined enumeration areas (pre-EAs) in Armenia that conceptually prevails over the existing sampling frames evaluated in this section. Table 3: Sampling frames of major surveys in Armenia Surveys PSUs in the sampling frames Demographic and Health Survey (DHS) Enumeration areas in the Armenia Population and Housing Census Integrated Living Conditions Survey (ILCS) Population census enumeration areas UNICEF Multiple Indicator Cluster Survey Currently in search of a suitable sampling frame (MICS, first ever in Armenia) Notes: The PSUs stand for primary sampling units. 3 Data and Methods 3.1 Input Datasets This section describes the datasets obtained from different sources to establish the national sampling frame and support collecting field data in Armenia. 3.1.1 Gridded Population The gridded population for Armenia is obtained from WorldPop (Bondarenko et al., 2020) and is based on the 2020 population census or projection-based estimates for 2020. The population data contain an estimated total number of people per grid cell (panel (a) Figure 2). The data at a resolution of 3 arcs (about 100 meters at the equator) can be downloaded in Geotiff format with the projection of Geographic Coordinate System, WGS84. The estimated number of people is in units of one pixel. The values marked with “NoData” denote regions mapped as unpopulated according to the results of the Built-Settlement Growth Model (BSGM) developed by Nieves et al. (2020). The WorldPop gridded population dataset was produced by disaggregating the projected subnational population totals into grid cells using machine learning techniques and a variety of geospatial layers derived from satellite imagery (Stevens et al., 2015). 3.1.2 Digitized Features Visible from the Ground The outline of enumeration areas (EAs) should be positioned in line with prominent visible ground features to facilitate ground-based data collection. Extensive digitized visible ground features are needed to ensure that the outline of the pre-EA meets the criteria during the automatic creation of the national sample frame in Armenia. Thus, extensive digital features, both natural and man-created, mostly from OpenStreetMap (OSM) are compiled. Panel (b) of Figure 2 illustrates information obtained from the OSM data, including road networks, waterways, and railways. This data provides modifiable and updatable inputs to enable multiple iterations of EA generation. The figure depicts the input datasets where the two datasets have not been combined yet. The subsequent figures show how the input datasets have been used to divide the country with their estimated total population. 10 Figure 2: Population estimates and digitized visible ground features Notes: Panel (a) shows the gridded population estimates at ∼100×100 meters, while panel (b) depicts the digitized visible ground features from OSM. Basemap: ESRI Satellite Imagery. 3.1.3 Settlement Boundary and Classes Determining the precise location and boundary of each settlement is essential for creating pre-EAs and providing field direction. The settlement boundaries can facilitate identifying the outline of settled areas and preventing the mixing of pre-EAs among large and settled areas. Thus, the settlement boundary has been created and the settlements have been classified into urban and rural classes. First, using the Global Human Settlement Layers (GHSL) product, settlements across the country are identified and converted to a vector layer to generate the settlement boundary. Then, the settlements are classified into urban and rural classes with the help of GHSL classes. The GHSL provides various settlement information in different spatial and temporal resolutions with various informative classes. 13 The following two datasets are acquired from GHSL for settlement delineation. GHS-BUILT-S R2023A: The spatial raster dataset known as GHS-BUILT-S illustrates the distribution of built-up (BU) surface estimates in five-year intervals between 1975 and 2030, together with two functional use components: the overall BU surface and the non-residential (NRES) BU surface (Schiavina et al., 2023). Panel (a) of Figure 3 displays the data, which is created by spatially and temporally interpolating five observed sets of multiple-sensor and multi-platform satellite images (Landsat and Sentinel 2). GHS-SMOD R2023A: Settlements have been delineated and classified globally by the GHS Settlement Model layers (GHS-SMOD) using the logic of cell cluster population size, population, and built-up area 13 https://human-settlement.emergency.copernicus.eu/index.php 11 densities as recommended by the United Nations Statistical Commission and defined according to stage I of the Degree of Urbanization (Eurostat, 2021). The built-up surface, land layer, and a 1km2 population spatial raster dataset are the inputs for the GHSL SMOD. As shown in panel (b) of Figure 3, the GHS SMOD classifies the 1km2 grid cell at the first hierarchical level into three spatial entities, including “urban center”, “urban cluster”, and “rural grid cells” (Schiavina et al., 2023). Figure 3: Global human settlement layer Notes: Panel (a) presents the built-up area, and panel (b) shows the SMOD classes. Basemap: ESRI Satellite Imagery. 3.1.4 Administrative Boundary It is necessary to nest the produced enumeration area (EA) boundary inside administrative borders. The generated EA boundary should be nested within administrative boundaries. The HDX website (https://data.humdata.org/) provided the necessary administrative border, which is based on the 2011 census, for Armenia. 3.2 Semi-automatic Creation of pre-EAs This section describes the semi-automated process employed to generate the first digital national sampling frame in Armenia. The national sampling frame is automatically created, although some manual adjustments 12 may be required to improve the outputs due to inadequate and poor-quality input datasets. The process is categorized into three main parts presented in the following sections. 3.2.1 Urban and Rural Classification People living in and around cities are referred to as the urban population, which usually has a fairly high population density. In contrast, a rural population is dispersed over vast areas of land and is found in developing regions. The overall population and geographic area are primary determinants of the size of the enumeration areas (EAs). A balance between maximum population and area constraints is needed to produce EAs of manageable size. Because of the disparities in population density and distribution between urban and rural administrative units, different criteria should be applied while forming EAs. Armenia does not have well-defined boundaries to separate urban and rural areas. The following approaches have been used to establish the border between Armenia’s urban and rural classes as a first step in developing the national sampling frame: 1. Take the GHS-SMOD dataset and extract all urban classes; combine them into a single class. 2. Convert the raster dataset from the combined classes to polygon vector format. These polygons represent the urban area. 3. Extract the built-up area with values greater than 0 from the GHS-BUILT-S data. 4. Create polygons from the raster built-up area. The country’s settlement boundaries and extent are represented in this output. 5. Apply a 50-meter buffer to output 4 to account for recent urban growth and prevent cutting structures at the edge of the settlements. 6. Intersect output 5 with the administrative boundary in Armenia. 7. Make the necessary manual and automatic adjustments when necessary. For example, an administrative boundary should be regarded as urban if more than 90% of areas are urban (e.g., Yerevan). 8. Any polygon in output 7 that crosses the output 2 boundary is an urban area. The rural class encompasses the remaining portion of the administrative boundaries. 3.2.2 The preEA Tool The “preEA” is a powerful and flexible package that has been developed by WorldPop in close collaboration with GeoData and consultation with various governments, UN agencies, and experts around the world (Qader et al., 2021, 2023). Since its development is ongoing, the tool is not yet available to the public. It has been designed as a user-friendly QGIS plugin using the Python programming language. The implementation of the preEA tool is fully automatic; however, some preparation processing steps should be taken before using the tool. These processes will be elucidated below. Input Data Preparation: The data preparation consists of three steps. First, the input boundary datasets are reprojected into the projected UTM WGS 1984 coordinate system. The preEA tool is compatible with such projection to ensure that the output units are in familiar area units such as meters or feet. Second, the digitized boundary (roads, waterways, and railways) is masked by the extent of the administrative boundary. Third, to prevent the creation of sliver polygons in the outputs, double lines (such as motorways) are merged 13 in the road datasets. It was accomplished by utilizing a 25-meter distance on Merge Divided Road in ArcGIS Pro. A single line will be created from any roads that fall within 25 meters of each other and have the same road code or class (Figure 4). Figure 4: Preparation of road dataset Notes: Panel (a) shows the original road data, while panel (b) presents the road data after applying the Merge divided road technique. Basemap: ESRI Satellite Imagery. Create Uncrossable Features: Enumerators should avoid crossing major obstacles when collecting data on the ground to increase efficiency and save costs. For this purpose, different features were set uncrossable: (1) Uncrossable lines: In the OSM line features, the name class of the digitized features is recorded in the “fclass” column. After a considerable visual assessment, road classes such as primary, secondary, trunk, and tertiary, and waterway classes such as river were extracted, combined, and defined as uncrossable lines. (2) Uncrossable settlement boundary: It is always preferred if enumeration areas (EAs) from different cities, towns, or villages are kept apart from one another. Three procedures are used to establish uncrossable settlement boundaries to prevent this. First, the Zonal Stat technique was used to summarize the total population in each settlement polygon that was created in section 3.2.1. Second, a visual inspection approach was utilized to identify the minimum population threshold to define the uncrossable settlement boundaries. Consequently, all settlement boundaries with a population of more than 200 were declared uncrossable. Third, the polygons containing more than 200 people were transformed into lines and merged with additional uncrossable lines to get the final uncrossable features. Implementing the preEA Tool: In the first part of the process, the preEA tool divides the region into small geographic units by polygonizing all the input feature datasets. The user needs to define the hard constraints, which include maximum population and maximum area. Apart from the hard constraints on population and area, it is necessary to input administrative boundaries and uncrossable features, which the pre-EA boundaries formed must adhere to. In addition, the user should define soft constraints, including minimum length shared boundaries and different weighting coefficients, to guarantee that pre-EAs are created that meet user expectations and global standards. Following the establishment of all the parameters, 14 the small geographic units will be merged until one of the hard constraints is satisfied. The output is in vector format, and each generated pre-EA has necessary attribute information from the administrative boundary, total estimated population, area, and unique IDs (Random IDs and ID numbering using serpentine technique). Table 4 presents the parameter calibration used in the preEA tool. The user can adjust the merging process’s priority parameters. For instance, when the population in the tool is given more weight, the created pre-EAs’ total population will be homogeneous and near the maximum population threshold. In contrast, raising the shape factor coefficient can achieve a more compact shape. Table 4: Parameter calibration in the preEA tool Minimum Maximum Area Population Share factor Class Maximum area length shared population (coefficient) (coefficient) (coefficient) boundaries Urban 1000 10km2 20% 0 1 2 Rural 800 10km2 20% 0 1 2 Post-Processing: The quality of the input datasets primarily undermines the quality of the outputs from the preEA tool. Some of the generated pre-EA outputs need to be assessed and modified because of the limitation of the input datasets and the imposing of various constraints in the tool. This is also the primary rationale behind the output’s designation as pre-EA rather than EA. For example, pre-EAs with negligible populations and geographic areas were automatically merged with their neighbor via employing the “Eliminate Small Area” tool in the preEA tool package. Manual modifications were applied to pre-EAs spanning large geographic areas with high populations. 3.3 Further Manual Adjustments The national sampling frame created using the automatic approach requires additional modifications for several reasons. First, some adjustments are needed in areas where the existing datasets were inadequate to create smaller EAs. Second, the output contains several polygons with positive populations but no settlement of residents. The pre-EAs with positive populations that clearly fall within non-residential areas such as offices, buildings under construction, 14 recreational centers, factories, manufacturing zones, and agricultural lands are manually adjusted. Appendix B discusses the use of boundary information produced by the approach for creating field and offline maps to supplement the survey data collection process. 4 Results: A National Sampling Frame – Pre-EAs for Armenia This section presents the sampling frame in which the pre-census enumeration areas (pre-EAs) serve as primary sampling units (PSUs). 14 Note that the judgments are based on satellite images from Google Street Maps available at the time of this study, i.e., April 2024. So, settlements and residential buildings recently built might have been overlooked in the construction of this sampling frame. However, these measurement errors are more likely to be random, so this limitation should not introduce systematic bias in the frame. 15 4.1 Urban and Rural Classification Armenia’s first accessible and usable urban and rural digitized border has been produced in this paper (Figure 5). According to the generated boundary and WorldPop gridded population data for 2020, urban areas account for 20% of the land area, while rural areas cover the remaining 80%. In terms of population, more than 60% of Armenians live in urban areas, with the remaining less than 40 residing in rural areas. Figure 5: Generated urban and rural areas in Armenia Notes: The map shows areas classified as urban areas in red and rural areas in yellow and how those are nested within the subregion boundaries in Armenia (https://data.humdata.org/dataset/geoboundaries-admin-boundaries-for-armenia). Basemap: OSM Standard. 4.2 Population Estimates A critical ingredient in creating EAs and sampling frames is the availability of granular information on the number and location of the population, which helps to set EA size at a suitable level for enumerators to 16 cover the area in a reasonable timeframe. The grid-level population estimates are obtained from WorldPop data, which provides population density estimated using a global high-resolution population model (WorldPop, 2024). The population count is available from the 2011 Census data from the Committee of the Republic of Armenia (ArmStat) with limited spatial information, including marz and settlement types. Despite the limited spatial information in the census data, the population estimates from WorldPop have been compared with the actual population count at the aggregate level to examine the accuracy of the estimated population. The population distribution at the marz and urban/rural levels are highly comparable. The correlation coefficient between populations from the two sources is 0.99 (SE: 0.05, p-value: 0.00) for marz populations and 1 for urban/rural populations. Figure 6 compares the “urban” 15 population at the marz level. Figure 7 compares the “total” population at the urban/rural level. These suggest that the population estimates from the WorldPop data are consistent with the actual population, at least at the marz and urban/rural levels. More importantly, the automatic creation of urban and rural boundaries has produced an accurate classification. Figure 6: Urban population in marzes , 2011 (a) Absolute number (b) Share in total Figure 7: Urban and rural population, 2011 (a) Absolute number (b) Share in total 15The “total” population was not compared because marz names were missing for rural settlements in the WorldPop data used for this analysis. However, there is hardly a reason for rural populations to vary significantly across the two data sources. 17 4.3 Armenia’s National Sampling Frame Figure 8 illustrates the pre-EAs created in this paper using the method described in Section 3, showing that the pre-EAs cover the entire territory and the boundaries are drawn without any geometric mistakes. In the first stage of automatic production of the pre-EA, the preEA tool produced 130,378 building blocks (panel (a)). After the merging process, 7,413 pre-EAs were delineated for the entire Armenia of which 3,813 were in urban and 3,600 in rural areas (panel (b)). Nearly 60% of the pre-EAs, specifically 4,354, have a population strictly greater than zero after the manual adjustments, most of which, have an estimated population of between 100 and 1,000 people. The remaining 3,059 pre-EAs are unsettled, i.e., their population is zero as described in Section 3. Thus, although the map covers the entire country, approximately 41% of the area in the map would not be accounted for in the sampling designs because the probability of being selected is zero for empty PSUs. However, all individuals are included. Figure 8: PreEA tool outputs Notes: Panel (a) depicts building blocks before merging, while panel (b) shows pre-EA outputs after merging. Basemap: OSM Standard. Next, the PSU size is characterized to examine the sampling frame and analyze the population distribution across pre-EAs, focusing on those with a positive population. Figure 9 plots the distribution of 2022 population estimates across pre-EAs in the country. 18 Figure 9: Distribution of population across pre-EAs in Armenia Notes: The figure presents the distribution of population across pre-defined enumeration areas (pre-EAs) in 2022. The extreme values of the population have been winsorized at the 1st and 99th percentiles to account for the potential outliers, i.e., we set the low (high) values at the 1st percentile (99th percentile). The population distribution is fairly normal: 90% of the PSUs have a population below 1,156, and the remaining 10% have a population between 1,156 and 2,274 after winsorizing at the 1st and 99th percentiles to avoid potential outliers. The population of the pre-EA is not winsorized in the sample frame, and it is our suggestion to adjust those extreme values as the population might have been overestimated in those areas. However, the users of this sampling frame can make their judgments regarding such values. Despite some outliers at the top part of the distribution, the population across pre-EA is relatively more evenly distributed than the adult population across electoral precincts, which displays U-shaped patterns as shown in Figure 1. Figure 10 depicts examples of pre-EAs with positive populations that fall within non-residential areas. Since these pre-EAs do not contain residents, the frame is adjusted by setting the population estimates of those pre-EAs to zero. Before this modification, there were 1,601 pre-EAs already with zero populations, and the population of 1,458 additional pre-EAs was changed to zero. Thus, a total of 3,059 pre-EAs are considered as unsettled areas. After setting population estimates to zero for these pre-EAs, the population estimates of the remaining pre-EAs with positive populations have been adjusted by multiplying them with a strata-level factor to match the strata-level population counts. 19 Figure 10: Examples of pre-EAs that are located in non-residential areas yet have a population greater than zero Notes: Basemap: ESRI Satellite Imagery. Certain international guidelines must be followed during the development of pre-EA boundaries, as shown in Figure 11. Within the specified administrative limits, the pre-EA boundaries must be nested (panel (a)). Rivers and major roads are examples of uncrossable features that the pre-EA outlines must respect (panel (b)). The pre-EA boundaries should be aligned with ground-visible features such as roads and infrastructures. Panel (c) shows examples of the pre-EA boundaries in urban areas, whereas the pre-EA boundaries in rural 20 areas are shown in panel (d). The figures demonstrate how well the pre-EA’s outlines followed the discernible ground features. 16 Figure 11: The outline of pre-EA boundaries Notes: Basemap: OSM Standard and ESRI Satellite Imagery. 4.4 The First Application: Listening to Armenia This section presents an application of the proposed national sampling frame based on pre-EAs in Armenia, specifically describing the sampling design of the World Bank Group’s “Listening to Armenia” survey (L2Arm). The application of the proposed sampling frame in the L2Arm survey has proven to be a successful approach. It allowed for a nationally representative sample that met the objectives of capturing urban and rural distinctions across all regions of Armenia. The innovative use of pre-EAs optimizes resource allocation and ensures the feasibility of the survey within time and budget constraints. 16There are no boundaries of electoral precincts and Census settlements in Armenia, which is one of the limitations of those potential sampling frames in the country, as discussed in Section 2. 21 4.4.1 Overview of the Survey Listening to South Caucasus (L2SC) is an expansion of a collaborative effort that has been conducted in multiple countries in the Europe and Central Asian region. This initiative aims to comprehensively monitor the views and well-being of a representative group of people as the government introduces social and economic reforms that affect every business and citizen. By reflecting on the experience of this group over the years, the study provides an up-to-date understanding of how policies reflect on people’s daily lives. The study comprises a nationally representative baseline survey and a high-frequency panel survey of a subset of the baseline participant households. The information collected through the L2SC initiative informs reform efforts directly by raising the profile of citizens’ views and enabling in-depth economic analysis. While the L2SC survey covers Armenia and Georgia, this paper focuses on the baseline survey in Armenia— Listening to Armenia (L2Arm)—where the new national sampling frame based on pre-EAs has been proposed. 17 4.4.2 Sampling Design The sampling design optimizes the spatial allocation of the household sample to provide valid representativeness at the national level for both urban and rural areas. A two-stage stratified cluster sampling design is employed to select participating households, ensuring a balanced sample distribution across regions 18 and accounting for differences between urban and rural areas, survey budgets, and discrepancies in population estimates. The L2Arm survey’s implementation highlighted the robustness of the sampling frame, as it successfully captured the population distribution across diverse geographic and demographic strata. The use of probability proportional to size (PPS) sampling ensured that the selection process was equitable and aligned with population estimates, further validating the practicality of the proposed approach. In the first stage, a certain number of primary sampling units (PSUs) will be selected in each urban and rural stratum (urban and rural areas within each administrative region). In the second stage, the ultimate sampling units or the secondary sampling units (SSUs)—households in the case of L2SC—are randomly selected within each PSU. The survey is then implemented among the selected households. Given that our focus in this paper is on the first stage of the two-stage procedure, the sampling frame used for the survey in the first stage is highlighted. Sampling Frame: As mentioned, the sampling frame is based on pre-census enumeration areas (pre-EAs) providing the most accurate information on the geographic distribution of the population across Armenia. This allows for the most precise formulation of a sample design. Sample Size: The objective of any sample design is to achieve the highest precision in indicators of interest given survey parameters. The sampling design aims to efficiently allocate the given PSUs across strata and 17 The Listening to Georgia (L2Geo) leverages the national sampling frame based on the 2014 Census with population updated by 2022 and 9,514 enumeration areas as PSUs. The preference to use comparable frames in Georgia and Armenia was one of the motivations for developing a new national sampling frame in Armenia based on pre-EAs in this paper. 18 A balanced sample refers to a sample where the means of covariates are equal to the population means of those covariates, ensuring that the sample is representative of the population of interest (Brown et al., 2015). 22 the households across PSUs. For L2Arm, 400 PSUs are allocated across strata proportional to the population 19 (i.e., implicit allocation) with some adjustments. Ten households are targeted within each PSU. 20 Table 5 presents the proposed baseline sample allocation. In the first stage of the two-stage stratified cluster sampling design, PSUs in each stratum are randomly selected using systematic random sampling with probability proportional to size (PPS), size being the estimated population of the pre-EA. This method assigns each PSU’s likelihood of selection based on the PSU’s size within the stratum. Population size, rather than the number of households, is used due to a lack of data on the number of households at the PSU level in the sampling frame. Thus, each PSU’s likelihood of selection corresponds to the percentage of the stratum population residing in the PSU. In the second stage, a set number of households are randomly selected from each chosen PSU. 21 Table 5: Baseline sample design based on proportional sample allocation (2022) Baseline Stratum Estimated Allocated Target N of HHs Number of HHs Population Size Number of PSUs PSU size Yerevan 1,098,866 147 10 1470 Aragatsotn - Urban 26,738 4 10 40 Aragatsotn - Rural 98,949 13 10 130 Ararat - Urban 72,294 10 10 100 Ararat - Rural 186,983 25 10 250 Armavir - Urban 82,953 11 10 110 Armavir - Rural 183,703 25 10 250 Gegharkunik - Urban 65,902 9 10 90 Gegharkunik - Rural 162,809 22 10 220 Kotayk - Urban 137,493 18 10 180 Kotayk - Rural 116,364 16 10 160 Lori - Urban 124,050 17 10 170 Lori - Rural 87,532 12 10 120 Shirak - Urban 133,620 18 10 180 Shirak - Rural 96,856 13 10 130 Syunik - Urban 90,205 12 10 120 Syunik - Rural 44,350 6 10 60 Tavush - Urban 49,859 7 10 70 Tavush - Rural 69,943 9 10 90 Vayots Dzor - Urban 16,160 2 10 20 Vayots Dzor - Rural 31,501 4 10 40 Total 2,977,130 400 4000 Notes: The population estimates at the stratum level are by the end of 2022 and match with the population statistics from the Statistical Committee of the Republic of Armenia. Refer to https://armstat.am/file/doc/99538403.xlsx for the urban population and https://armstat.am/file/doc/99538413.xlsx for the rural population. 19 It is preferred to use total number of households as the size. However, there is no available information on the number of households in each pre-EA. 20 Although 400 PSUs and 10 households in each PSU, with a total target sample of 4,000 households, are targeted, the sample size can be smaller than the target if the total number of households in the selected PSU is less than the target or due to non- responsiveness. 21 Appendix C describes the calculation of survey weights. 23 The approach proposed in this paper can be applied to other countries across the Europe and Central Asia (ECA) region and beyond. The method can help survey implementation in countries where a reliable national sampling frame is unavailable which often causes the World Bank, UN, and other international organizations to face critical challenges in preparing socioeconomic surveys. In those countries, the National Statistical Offices (NSOs) are often unwilling to share their census sampling frame with international organizations although the sample frame might be available. Even if the census sampling frame is available, census EAs are often large and cover large population areas which require substantial resources and time to conduct the household listing. The resource and time-efficient approach we propose thus can save a significant amount of budget, which could be invested in other areas, in all sampling stages. 5 Discussion The national sampling frame based on pre-EAs provides several benefits that the existing and accessible potential sampling frames cannot offer. However, it could also introduce some practical challenges and methodological limitations. This section thus now discusses the additional benefits and potential concerns and proposes alternative solutions that have been previously tested to resolve some of those challenges in other developing countries where pre-EAs have been employed, such as Somalia (Qader et al., 2021) and the Democratic Republic of the Congo (Qader et al., 2023). Population Estimates as a Measure of EA Size: The main challenge of using the proposed national sampling frame for household surveys is that the size of the pre-EA is measured by population estimates, which can differ from the actual population and thus introduce bias in the probability of EAs being selected in the first stage of the two-stage design. Although it has been made clear that this study does not address the validation of the gridded population estimates, it is important to provide some clarity to understand the limitations of the data. The automatic creation of a national sampling frame requires granular information about the population to ensure that the output sampling units are manageable. This level of granularity does not exist in the existing census data in Armenia. Instead, gridded population data is leveraged. In developing countries, several gridded population datasets with various spatial resolutions are accessible, including Gridded Population of the World (GPWv4) (CIESIN, 2016), WorldPop (Bondarenko et al., 2020), High-Resolution Settlement Layer (HRSL) (Facebook and CIESIN, 2016), Demobase Population datasets (Azar et al., 2013), Global Human Settlement Population Grid (GHS-POP) (JRC, 2015), Global Rural-Urban Mapping Project (GRUMP) (CIESIN, 2011) and LandScan (Dobson et al., 2000). The quality of the input data model, which includes census data, satellite- derived covariates, and statistical model selection, is primarily responsible for the accuracy and quality of the gridded population data. The WorldPop-constrained gridded population data for Armenia in 2020 has been used to create the national sampling frame. It was WorldPop’s most recent gridded population available. It implies that there may be notable differences between the population’s size and distribution in 2020 and today. Since the user can update the national sampling frame’s population based on their desired population such as population registry, this should not be a major concern. This paper established an accessible and usable urban and rural classification in Armenia for the first time, and it fits within the concept of a national sampling frame. The output might not reflect the reality, despite a significant correlation between the aggregated population estimates from the census and gridded population estimates for urban and rural areas, as discussed in Section 5. It is primarily because the GHSL SMOD divides the world into urban and rural classes using gridded populations and built-up areas developed using various data sources. The algorithms used to generate the input data for both datasets and the date of the satellite imagery from which the covariates are extracted introduce certain biases. The severity of bias 24 due to using population estimates as PSU size depends on the size of the discrepancy between the actual population and population estimates and whether the difference is systematic. Our findings clearly show that the pre-EAs’ total populations vary, ranging from zero to a certain population size. In the preEA tool, the user has the option to define various constraints. The maximums of population and geographic areas are the two main hard constraints. These maximum thresholds may differ depending on the goal of the work or from one country to another. The primary goal of establishing maximum thresholds for both population and area is to strike a balance between the two limits to prevent the creation of unmanageable pre-EAs in areas with sparse populations. When one of these constraints is met, the aggregation will cease during the merging procedure. The size of pre-EAs in uninhabited areas (indicated by gridded population areas) is simply determined by the maximum geographic area; if the maximum area is reached, the aggregation will cease. As a result, several preEAs with zero or low population values might have been created. The main benefit of this strategy is that it eliminates the need to include uninhabited areas in your sampling surveys, which will result in significant time and cost savings. However, as the method primarily depends on gridded population data, a few inhabited areas may be overlooked if the data is unreliable. This can be resolved in the tool by ignoring the geographic area constraint, but doing so creates enormous pre-EAs that may be difficult to enumerate, especially in rural areas. Model estimates of gridded population data can be enhanced using a reliable approach and sufficient resources and data to identify non-residential buildings. Despite significant efforts, this still presents a challenge because of the complexity of the problems (such as the structures' resemblance and the coexistence of residential and commercial tenants in the same building). As a result, non-residential areas are not excluded in the population prediction of several gridded population datasets. Consequently, certain pre-EAs that are located within non-residential areas exhibit positive population values. It should be mentioned that when the pre-EAs were verified against high-resolution satellite imagery base maps from ESRI and Google, many of these validations were carried out by observation. Since the dates of the satellite imagery were not considered, these evaluations might not be impartial. Therefore, without a comprehensive ground check, these assumptions cannot be verified. Digitized Elements and Boundaries: Digitalized elements, both natural and man-made, are essential for the automatic generation of the national sample frame. The method in this paper made use of OpenStreetMap’s (OSM) extensive digital line availability on roads, railways, and waterways. However, the pre-EAs that were created exceeded the specified thresholds, such as population and geographic area, because of the poor quality and incomplete spatial coverage of the existing digitized boundaries. The primary causes of this are (i) incomplete and (ii) disconnected lines. If any of the natural and artificial features still need to be fully digitized, more work must be done either manually or automatically. A line should never be left open, and it should always be connected to other features as much as feasible. This is because disconnected lines will not be polygonized during the polygonization process, which leads to the creation of large pre-EAs. In addition to spatial coverage, the quality of the OSM attribute data is critical for the automatic creation of the national sample frame. Several line features, including major roads, rivers, and other impediments, were designated as uncrossable to facilitate the collecting of ground data and increase efficiency. The sole source that can determine the kinds of features on the ground is the attribute information. An inaccurate feature class in the attribute table may result in the national sampling frame’s uncrossable features being set incorrectly. The semi-automatic approach creates pre-EA with digitized visible ground features that are unlikely to intersect with buildings or other structures. But occasionally, this might happen. The administrative boundary 25 or the poorly entered visible digital lines may be the cause for this. Given that administrative boundaries cannot be changed or modified without consulting with pertinent government agencies, the user should exercise caution when determining the reasons behind cutting buildings and other structures. Figure 12, for example, shows a pre-EA output with an outline that cuts buildings. This is due to the administrative boundary of the municipality and thus cannot be modified. Figure 12: An example of pre-EA boundary-cutting buildings This method has solely utilized publicly accessible natural and man-made features, and settlement boundaries, such as OSM and GHSL, for reproducibility and worldwide application. Nonetheless, several government agencies can provide input datasets such as roads and waterways with higher quality and greater geographic coverage. In addition, future studies could also investigate leveraging the more modern and comprehensive commercial road network (Strano et al., 2017). If inadequate spatial coverage is a major concern, an alternative approach would be to use “mapathons”—a coordinated mapping event—to enhance the current open-source data on roads and rivers before implementing this method. One of the main challenges for collecting high-quality surveys in Armenia was the lack of boundaries for PSUs. To the best of our knowledge, it is questionable whether there is any digital map of census enumeration areas at the Statistical Committee of the Republic of Armenia (ArmStat). 22 Thus, this paper creates the first 22Although it is hardly accessible, the ArmStat has a physical map of census enumeration areas. It is however uncertain if there is a digital version and what information is embedded in the map. 26 accessible digitized national sampling frame for the country. The boundaries of the PSUs and other administrative units in our sampling frame offer several benefits. For example, it helps to avoid potential mistakes of households outside of the selected to be listed and covered in the survey data. If this potential mistake is not random, it might severely undermine the quality of the subsequent analysis using the collected data. However, this error could be non-random and more systematic as this mistake can be more frequently made for PSUs with larger areas and thus longer boundaries. So, this feature could play a critical role from the quality control perspective. 23 It may be challenging to directly compare the resources and budget of our automatic method to manual approaches because many countries do not release a breakdown of the costs associated with conducting censuses at different phases, especially when it comes to the resources needed to manually digitize national census enumeration areas. For example, the 2010 census mapping effort in Zambia was projected to cost US $7 million and take almost two years to complete (United Nations Secretariat, 2007). If the pre-census sample frame in Armenia had been digitalized manually, we would have needed to dedicate significant financial resources to the extensive training of a team of cartographers to teach them how to digitize all the units on high-resolution satellite imagery. In addition, all pre-enumeration areas were to be manually digitized by the team according to all specified requirements. It would have taken a lot of work to verify quality and correct several geometric errors because this was hand-drawn. The entire procedure is obviously time- and resource- consuming. Nonetheless, it took less than two months to complete the automatic creation of the pre-census sampling frame in Armenia, including the manual correction required due to a lack of spatial input data. One specialist carried out the entire procedure. The significant labor, time, and cost savings achieved by developing an automatic pre-census sample frame could be invested in various aspects of national surveys and census preparation. Despite limitations, the method was successfully implemented, and a national sampling frame was produced for Armenia. From financial, timeframe, and technological perspectives, the method performed better than the conventional manual techniques. Delineating a nationwide sampling frame by hand has historically required years and massive financial resources. In addition, the manual method is prone to several geometric issues, including gaps, overlaps, pockets, and disjoints since it is carried out by humans. These geometric problems could lead to bias being introduced into the frame and eventually into the collected data. Nevertheless, there are no geometric problems with the automatic method. This approach has several benefits over the gridded population sampling frame. Several researchers have directly employed gridded population data as a sample frame (Thomson et al., 2017; Cajka et al., 2018; Qader et al., 2020). The design of the sampling units is the primary distinction between our approach and the gridded population sampling frame. Houses and other structures are cut because the grid’s outline is not in line with features that can be seen on the ground. In contrast, the preEA tool creates pre-EA boundaries that adhere to observable elements like natural borders and roads. 23 A digital map of the enumeration areas also provides a significant amount of spatial information, such as settlement areas with residents within the PSU and their centroids, which can be used, for example, as starting points for random walks. If the household listing is possible for each pre-EA and all household addresses are available, a random walk procedure and thus starting point information are unnecessary. However, complete household listing is not available or hard to obtain in some developing countries, and the household listing operation—visiting each of the selected PSUs and listing all residential households in those PSUs—is costly and time-consuming. So, although not encouraged, some surveys employ random walks as a last resort to select households in the second stage if subject to pressing constraints, and thus, the digital map supports such a procedure. 27 Potential Applications of the Method in Different Countries: The lack of a national sample frame poses serious implementation-stage issues for many household surveys. An up-to-date digitized national sampling frame may not be available in some countries. The sampling frame might exist in other countries, but NSOs are unwilling to provide the World Bank and other international agencies access to the data. In other situations, the frame is based on the census enumeration areas, and because of their large spatial unit, conducting the second and third stages of sampling will cost a significant amount of money to achieve the necessary household size. Furthermore, if the sample selection is based on census enumeration areas, the household listing for the selected sampling units would also need additional time and resources because of their extensive spatial coverage and huge population totals. Our proposed method and strategy can be used to create a new national sampling frame if one of the previously described difficulties arises in future surveys conducted in other countries. 6 Conclusion This paper develops a new national sampling frame in the Republic of Armenia, serving as an example of developing nations with limited access to workable sampling frames for representative individual and household surveys. Specifically, pre-census enumeration areas (pre-EAs) have been created using an innovative approach for the automatic delineation of a national sampling frame, which offers strengths in various dimensions over other potential sources of sampling frames, such as those based on old census enumeration areas, census settlements, electoral precincts, and traditional gridded sampling. The national sampling frame constructed in this paper divides the country into about 7,500 pre-EAs, most of which have population estimates strictly greater than zero. These population estimates are most recently updated and homogeneous, mostly between 100 and 1,000 people. The digitized enumeration areas with clear boundaries support the household selection, e.g., by enabling to avoid selecting households outside the selected PSUs. Several methodological and practical contributions have been made to the survey sampling literature and entities that collect and leverage representative surveys, such as researchers and policy makers. The paper first extends the application of the semi-automatic approach for creating national sampling frames by generating the first digitized frame based on pre-EAs in Armenia as an alternative to traditional approaches to delineating national sampling frames. Our exercise and analysis on the new national sampling frame showcase the applicability of pre-EAs in other countries with similar challenges in the sampling frame. Second, our national sampling frame contributes to the survey collectors and users of household and individual surveys in Armenia by providing a standardized and decentralized frame. Third, the existing sample frames in the country have been systematically evaluated in comparison against the proposed frame by discussing their strengths and limitations. This suggests that our frame complements the existing potential sampling frames and can serve as an alternative. The paper concludes with a caveat and a direction for future research. A new and workable national sampling frame has been proposed in this paper, resolving a challenge often faced by survey conductors in the first stage of the two-stage sampling design. However, solutions to the challenges faced in the second stage of the two-stage design, e.g., strategies to perform household listing, are outside the scope of this paper. So, future research can explore innovative ways to list the households in the country, for example, when using the sampling frame proposed in this paper. 28 References ArmStat. 2022. “The Necessity of Population Census Conduction and Ensuring Its Legal Basis.” https://www.armstat.am/file/article/introduction.pdf. ArmStat. 2024. “The Results of 2011 Population Census of the Republic of Armenia (Indicators of the Republic of Armenia).” https://armstat.am/en/?nid=532. Azar, Derek, Ryan Engstrom, Jordan Graesser, and Joshua Comenetz. 2013. “Generation of Fine-Scale Population Layers using Multi-Resolution Satellite Imagery and Geospatial Data.” Remote Sensing of Environment, 130: 219–232. https://doi.org/10.1016/j.rse.2012.11.022. Bondarenko, Maksym, David Kerr, Alessandro Sorichetta, and Andrew Tatem. 2020. “Census/Projection- Disaggregated Gridded Population Datasets for 189 Countries in 2020 using Built-Settlement Growth Model (BSGM) Outputs.” University of Southampton https://dx.doi.org/10.5258/SOTON/WP00684 [Dataset]. Brown, J. A., B. L. Robertson, and Trent McDonald. 2015. Spatially Balanced Sampling: Application to Environmental Surveys. Procedia Environmental Sciences, 27: 6-9. https://doi.org/10.1016/j.proenv.2015.07.108. Cajka, James, Safaa Amer, Jamie Ridenhour, and Justine Allpress. 2018. “Geo-Sampling in Developing Nations.” International Journal of Social Research Methodology, 21(6): 729–746. https://dx.doi.org/10.1080/13645579.2018.1484989. Center for International Earth Science Information Network - CIESIN - Columbia University, International Food Policy Research Institute - IFPRI, The World Bank, and Centro Internacional de Agricultura Tropical - CIAT. 2011. “Global Rural-Urban Mapping Project, Version 1 (GRUMPv1): Population Count Grid.” Palisades, New York: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/H4VT1Q1H. Center for International Earth Science Information Network - CIESIN - Columbia University. 2016. “Gridded Population of the World, Version 4 (GPWv4): Administrative Unit Center Points with Population Estimates.” Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/H4F47M2C. Central Bureau of Statistics of Nepal. 1996. “Nepal Living Standard Survey – 1995: Basic Information Document.” https://microdata.worldbank.org/index.php/catalog/2301/download/34437. Facebook Connectivity Lab and Center for International Earth Science Information Network - CIESIN - Columbia University. 2016. “High Resolution Settlement Layer (HRSL).” Source imagery for HRSL © 2016 DigitalGlobe. https://www.ciesin.columbia.edu/data/hrsl/. Dobson, Jerome E., Edward A. Bright, Phillip R. Coleman, Richard C. Durfee, and Brian A. Worley. 2000. “LandScan: A Global Population Database for Estimating Populations at Risk.” Photogrammetric 29 Engineering and Remote Sensing, 66(7): 849–857. https://www.asprs.org/wp- content/uploads/pers/2000journal/july/2000_jul_849-857.pdf. Eurostat. 2021. “Applying the Degree of Urbanisation: A Methodological Manual to Define Cities, Towns and Rural Areas for International Comparisons.” Publications Office of the European Union. https://dx.doi.org/10.2785/706535. European Commission, Joint Research Centre (JRC). 2015. “GHS-POP R2015A - GHS population grid, derived from GPW4, multitemporal (1975, 1990, 2000, 2015) - OBSOLETE RELEASE.” European Commission, Joint Research Centre (JRC). https://data.europa.eu/89h/jrc-ghsl- ghs_pop_gpw4_globe_r2015a [Dataset]. Grosh, Margaret E., and Paul Glewwe. 1995. “A Guide to Living Standards Measurement Study Surveys and their Data Sets.” LSMS Working Paper No. 120. https://documents1.worldbank.org/curated/ar/270551468764720584/pdf/multi-page.pdf. Hansen, Henning S., and Lise Schrøder. 2019. “The Societal Benefits of Open Government Data with Particular Emphasis on Geospatial Information.” In Electronic Government and the Information Systems Perspective: 8th International Conference, EGOVIS 2019, Linz, Austria, August 26–29, 2019, Proceedings 8. eds. by A. Kő, E. Francesconi, G. Anderst-Kotsis, A. Tjoa, and I. Khalil: Springer International Publishing, 31-44. https://doi.org/10.1007/978-3-030-27523-5_3. Nieves, Jeremiah J., Maksym Bondarenko, Alessandro Sorichetta, Jessica E. Steele, David Kerr, Alessandra Carioli, Forrest R. Stevens, Andrea E. Gaughan, and Andrew J. Tatem. 2020. “Predicting Near-Future Built-Settlement Expansion using Relative Changes in Small Area Populations.” Remote Sensing, 12(10): 1545. https://www.mdpi.com/2072-4292/12/10/1545. Pettersson, Hans. 2023. “Sample Design and Sample Size for the Armenia MICS 2024.” UNICEF Consultancy Report. Unpublished manuscript. Qader, Sarchil H., Andrew Harfoot, Mathias Kuepie, Edith Darin, Sabrina Juran, Attila Lazar, and Andrew Tatem. 2022. “National Automatic preEnumeration Areas (preEAs) in Burkina Faso (2019), Version 1.0.” WorldPop, University of Southampton. https://data.worldpop.org/repo/wopr/BFA/preEAs/v1.0/. Qader, Sarchil H., Heather Chamberlain, Mathias Kuepie, Freja K. Hunt, Attila Lazar, and Andrew Tatem. 2023. “Field Testing of Pre-Enumeration Areas Created using Semi-Automated Delineation Approach, Democratic Republic of Congo.” WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00759. Qader, Sarchil H., Veronique Lefebvre, Andrew J. Tatem, Utz Pape, Warren Jochem, Kristen Himelein, Amy Ninneman, Philip Wolburg, Gonzalo Nunez-Chaim, Linus Bengtsson, et al. 2020. “Using Gridded Population and Quadtree Sampling Units to Support Survey Sample Design in Low-Income Settings.” International Journal of Health Geographics, 19: 1–16. https://dx.doi.org/10.1186/s12942-020-00205-5. 30 Qader, Sarchil, Veronique Lefebvre, Andrew Tatem, Utz Pape, Kristen Himelein, Amy Ninneman, Linus Bengtsson, and Tomas Bird. 2021. “Semi-Automatic Mapping of Pre-Census Enumeration Areas and Population Sampling Frames.” Humanities and Social Sciences Communications, 8(1): 1–14. https://dx.doi.org/10.1057/s41599-020-00670-0. Schiavina, M., M. Melchiorri, M. Pesaresi, P. Politis, S.M. Carneiro Freire, L. Maffenini, P. Florio, D. Ehrlich, K. Goch, A. Carioli, J. Uhl, P. Tommasi, and T. Kemper. 2023. “GHSL Data Package 2023.” Publications Office of the European Union. https://dx.doi.org/10.2760/098587. Stevens, Forrest R., Andrea E. Gaughan, Catherine Linard, and Andrew J. Tatem. 2015. “Disaggregating Census Data for Population Mapping using Random Forests with Remotely- Sensed and Ancillary Data.” PLoS ONE, 10(2): e0107042. https://doi.org/10.1371/journal.pone.0107042. Strano, Emanuele, Andrea Giometto, Saray Shai, Enrico Bertuzzo, Peter J. Mucha, and Andrea Rinaldo. 2017. “The Scaling Structure of the Global Road Network.” Royal Society Open Science, 4(10): 170590. https://doi.org/10.1098/rsos.170590. Thomson, Dana R., Forrest R. Stevens, Nick W. Ruktanonchai, Andrew J. Tatem, and Marcia C. Castro. 2017. “GridSample: An R Package to Generate Household Survey Primary Sampling Units (PSUs) from Gridded Population Data.” International Journal of Health Geographics, 16: 1–19. https://dx.doi.org/10.1186/s12942-017-0098-4. United Nations Secretariat. 2007. “Report of the Sub-regional Workshop on Census Cartography and Management.” ESA/STAT/AC.144/L.3. WorldPop. 2024. “WorldPop Data.” https://www.worldpop.org/datacatalog/. 31 Appendix A: Additional Figures Figure A.1: Change in population distribution in an area between 2011 and 2024 Appendix B: Field and Offline Maps To take advantage of the boundaries and other spatial information, the survey conductor can create field maps of selected pre-EAs for enumerators to feed their navigation when they collect the survey. Even in the presence of the field maps, it could still be challenging for enumerators to navigate themselves in the selected pre-EA especially, when the enumerators are not familiar with the area and cannot eyeball the boundaries from the information provided in the physical maps, like street addresses and some information about churches and schools. A potential solution to this problem could be offline maps, which inform the enumerators of their location live and signal if they overstep outside the selected pre-EAs. In many developing countries like Armenia, access to the internet is a substantial challenge, especially in rural and remote areas, so enumerators can benefit from the offline maps that operate well with the minimum requirement of only access to satellite. So, enumerators should be able to navigate smoothly and avoid the risk of going beyond the boundaries of selected PSUs unless they are, for example, under the tunnel or between narrow alleys in the mountain. In total, 400 field paper maps and 400 georeferenced offline maps were created using QGIS software. Multiple settled areas are probably present in various pre-EAs. Map definitions of these settled areas may be helpful for ground navigation. Several actions have been taken to display the settled region on the field maps. Administrative boundaries with incorporated urban and rural areas were intersected with settlement boundaries. Zonal Statistic Polygons in QGIS were then used to determine the population sum for the resultant intersected polygons based on the gridded population data. A point layer was created from the output. Additionally, the settlement locations’ maximum population values inside each pre-EA were extracted. The settlement with the highest population numbers is shown by the black circle surrounding settlement sites on the field map. This can help the enumerator to find the most densely populated area as a starting point. The X and Y coordinates were calculated for all the settlement points and were shown on the field map (Figures B.1 and B.2). 32 Figure B.1: An example of a detailed field paper map in rural areas Figure B.2: An example of a detailed field paper map in urban areas 33 Appendix C: Survey Weights Sampling weights account for the fact that different members of the population have different probabilities of being selected for interviews, represent various numbers of people in the overall population, and are necessary when computing the representative statistics at the level of the domain. Sampling weights are also adjusted accounting for non-responsive rates given the survey design if required. The dataset will have two sets of weights, including household and individual weights. The household weights are the inverse probability of selection of households and are calculated from the following two components in our two-stage sampling design. The first component is the sampling weight (inverse probability of selection) of PSU within the stratum, and the second component is the sampling weight of selection of households within the PSU. For calculating individual weights, the third component, sampling weights of individuals within the household, is added to compute the household weights. Each of the components is calculated as follows: Component 1: The inverse probability of selection of PSU within the stratum by using PPS (Probability Proportional to Size) is calculated as: 1 = = , × where is the probability of selection of PSU is the sampling weight of PSU within the stratum, within the stratum, is the number of selected PSUs within the stratum, is the size of selected PSU, and is the population of the stratum. The size measure can be the population, the number of households, the number of electors, or the school attendance, while the number of households would be the preferred option in most household surveys. Component 2: The inverse probability of selection of household within PSU is calculated as: 1 ℎℎ ℎℎ = = , ℎℎ ℎℎ where ℎℎ is the sampling weight of household within PSU, ℎℎ is the probability of selection of household within PSU, ℎℎ is the number of sampled (interviewed) households within PSU, and ℎℎ is the total number of households within PSU. Component 3: The inverse probability of selection of individual within the household is calculated by: 1 ℎℎ ℎℎ ℎℎ = = = , ℎℎ ℎℎ 1 where ℎℎ is the sampling weight of individual within the household, ℎℎ is the probability of selection of individual within the household, ℎℎ is the number of sampled (interviewed) individuals within the 34 household, equal to 1, as only one individual was allowed to be interviewed from each household, and ℎℎ is the size of the household surveyed (asked and recorded during the interview). Based on these components, household and individual weights are calculated as: × ℎℎ , ℎℎ = = ℎℎ × ℎℎ . 35