Second-Stage Sampling for Conflict Areas: Methods and Implications

The collection of survey data from war zones or other unstable security situations is vulnerable to error because conflict often limits the implementation options. Although there are elevated risks throughout the process, this paper focuses specifically on challenges to frame construction and sample selection. The paper uses simulations based on data from the Mogadishu High Frequency Survey Pilot to examine the implications of the choice of second-stage selection methodology on bias and variance. Among the other findings, the simulations show the bias introduced by a random walk design leads to the underestimation of the poverty headcount by more than 10 percent. The paper also discusses the experience of the authors in the time required and technical complexity of the associated back-office preparation work and weight calculations for each method. Finally, as the simulations assume perfect implementation of the design, the paper also discusses practicality, including the ease of implementation and options for remote verification, and outlines areas for future research and pilot testing.


Introduction
The collection of survey data from war zones or other unstable security situations provides important insights into the socioeconomic implications of conflict. Data collected during these periods, however, are vulnerable to error, because conflict often limits the options for survey implementation. For example, the traditional two-stage sample design for face-to-face surveys in most developing countries first selects census enumeration areas as the primary sampling unit (PSU) with probability proportional to size and then conducts a listing operation to create a frame of households from which a sample is selected. Such an approach, however, may not be feasible in conflict areas. At the first stage, updated counts are often not available, making probability proportional to size selection inefficient. At the second stage, the listing requires survey staff to canvass the entire selected area, which may be too dangerous in a conflict setting. As a result, many surveys of conflict areas are limited to qualitative work or resort to non-probability designs.
This paper uses simulations to explore several alternative sampling approaches considered for the baseline of the Mogadishu High Frequency Survey Pilot (MHFS). The baseline was a face-to-face household survey in Mogadishu, Somalia, conducted from October to December 2014 by the World Bank and Altai Consulting. A full listing (see Harter et al., 2010, for details) was deemed unsafe in Mogadishu because the additional time in the field and the predictable movements by interviewers would increase their exposure to robbery, kidnapping, and assault, and increase the likelihood that the local militias would object to their presence. The survey needed an alternative second-stage sample design that would minimize the time spent in the field outside the households, but also could be implemented without expensive equipment or extensive technical training. In addition, international supervisors from the consulting firm could not go to the field, necessitating a sample design in which quality could be verified ex post.
The implementing partner originally proposed a random walk procedure. While this methodology has the benefits of fast implementation and unpredictability of movement, the method is non-probability and the literature has shown the procedure to give biased results, even if implemented under perfect conditions (Bauer, 2014). Intuitively, a random walk would only be unbiased if the paths taken during the selection crossed each household once and only once, which is extremely unlikely in the field. Therefore the team considered four alternatives for household selection. The first option was to use a satellite map (of which many high quality options exist, due to the limited cloud cover and political importance of the region) to identify all structures in the PSU and select ten for the survey. The second option considered was to subdivide the selected PSUs into segments consisting of eight to ten households and ask enumerators to list and choose households from the segments. The segments would be of roughly equal size in terms of number of households but are likely to have irregular outlines reflecting the irregular layout of structures in Mogadishu. The third option considered was to lay a uniform grid over the PSU and ask enumerators to list and choose households from selected grid boxes. The final option considered was to start at a random point in the cluster and walk in a set direction, in this case the Qibla, or direction in which Muslims pray, until the interviewer encountered a structure.
The paper will make use of data from the completed Mogadishu pilot survey and geo-referenced maps of three example enumeration areas (EAs) to explore the following questions: (1) How might a given method be implemented in the field, given the information available and the security constraints? (2) What information is necessary to generate the sampling weights needed for representative estimates, and how should those weights be calculated? (3) What are the implications in terms of precision and bias for each of the methods described above? (4) What are the implementation concerns for each method, including the options for verification and the impact of non-household structures?
The next section briefly describes the literature as it relates to the questions above, followed by section 2 describing the data. Section 3 addresses research questions 1 and 2 by giving further detail on the methods considered. Section 4 presents simulation results covering questions 3 and 4. Section 5 concludes by offering some discussion of the overall performance and potential future applications.

Literature Review
The most common method for collecting household data in Sub-Saharan Africa is to use a stratified two-stage sample, with census enumeration areas selected proportional to size in the first stage and a set number of households selected with simple random sampling in the second stage (Grosh and Munoz, 1996). Since administrative records are often incomplete and most structures do not have postal addresses, as is the case in Mogadishu, a household listing operation is usually necessary prior to the second-stage selection. However, due to the security concerns cited above, listing was not feasible.
A number of alternatives for second stage selection can be used when household lists are not available. A common alternative used in both Europe (see Bauer, 2014, for recent examples) and in the developing world is a random walk. The Afrobarometer survey, which has been conducted in multiple rounds in 35 African countries since 1999, and the Gallup World Poll, which conducted surveys in 29 Sub-Saharan African countries in 2012, use random walk methodologies. Although the random walk methods do not necessarily produce equal probability samples, they do not collect any information with which to calculate probabilities of selection. For this reason, weights are not calculable for random walk samples; instead, the samples are analyzed as if they were equal probability. Bauer (2014) shows that this assumption is not correct by simulating all possible random routes using standard procedures within a German city and finds substantial deviation from equal probability. These results apply even when interviewers perfectly implement the routing instructions, which is unlikely given the limited ability to conduct in-field supervision of random walk selection and strong (though understandable) incentives for interviewers to select respondents who are willing to participate (Alt et al 1991). Several other studies have also shown that data collected via random walk do not match the population on basic demographics such as age, sex, education, household size, and marital status (Bien 1997, Hoffmeyer-Zlotnik 2003, Blohm 2006, Eckman & Koch 2016).
In the context of Mogadishu, a household listing was too dangerous and costly, a random walk too biased, and no household or person register existed. Therefore the researchers explored several alternative methods using a combination of satellite maps and area-based sampling. As satellite technology has improved in quality and become more readily available, it has been increasingly used for research in the developing world. Barry and Rüther (2001) and Turkstra and Raithelhuber (2004) use satellite imagery to study informal urban settlements in South Africa and Kenya, respectively. Aminipouri et al (2009) use samples from high resolution satellite imagery to estimate slum populations in Dar-es-Salaam, Tanzania. Afzal et al (2015) incorporated satellite data into poverty prediction modeling for Pakistan and Sri Lanka, and concluded that its inclusion can lead to substantial improvements in modeled estimates of poverty.
Specifically related to sampling, the literature is more limited and mainly found in the public health literature. Dreiling et al (2009) tested the use of satellite images for household selection in rural counties of South Dakota, and found that, while less time consuming, the method did a poor job of identifying the inhabited dwellings. Grais et al (2007) used a random point selection methodology in their study of vaccination rates in urban Niger, and compared the results to a random walk. They do not find statistically significant differences between the methods, though the sample size was limited, but conclude that interviewers found the random point selection methods more straightforward to implement than the random walk. Lowther et al (2009) used satellite imagery to map more than 16,000 households in urban Zambia to select young children for a measles prevalence survey. They find the method easy to implement, but do not do a formal comparison with alternatives. Kondo et al (2014) use a point selection mechanism in the city of Santiago Atitlán similar to our proposed Qibla method, but assume the method to be equal probability. A similar method was used by Kumar (2007), in which satellite maps overlaid with remote sensing data were used to create stratification for an air pollution study in India, and then selected random points. Kolbe and Hutson (2006) use a similar method to select households in Port-au-Prince, Haiti. Their study incorporates the probability of selection for randomly selecting from nearby structures when the selected points do not fall on the roof of a dwelling, but these probabilities would be distorted if the household density were unevenly distributed or for households close to the boundary.
Other public health studies, such as the World Health Organization Expanded Program on Immunization studies, use the "spin the pen" method to choose a starting household and then interview a tight cluster of households, though this method has been shown to be nonprobability (Bennett et al 1994, Grais et al 2007). Outside the field of public health, Himelein et al (2014) used circles generated around random points to survey pastoralist populations in eastern Ethiopia, with the stratification developed from satellite imagery. A variation of this method was considered for the High Frequency Survey Pilot, but the methodology is likely unsuited to a dense metropolitan area, because it involves surveying all households within the selected circles. The uncertainty over the final cluster size was also an issue in Mogadishu as clusters with too few households increased costs, while clusters with too many households increased the time in the field and raised security concerns. This paper brings together alternatives developed from this literature and applies them to a conflict environment. We take a rigorous approach using simulations and careful estimation of weights to compare the methods across a variety of potential field conditions. The results offer general guidelines for practitioners developing implementation plans for conflict settings.

Data
To explore the challenges of the random walk and the four proposed alternatives, we simulated the use of each method in three example PSUs from Mogadishu, Somalia. We purposefully chose three census enumeration areas as the PSUs for this exercise to illustrate the variation in physical layout present in Mogadishu. Maps of the three example PSUs are shown in the appendix. The first is in Dharkinley district, a comparatively wealthy section of southwestern Mogadishu where the households are laid out in relatively uniform gridded streets. This PSU has 68 total structures, and a total area of 24,390 square meters according to the December 2013 Google Earth imagery, which was the most current at the time of the initial analysis (in January 2015). The second PSU is on the eastern edge of Heliwa district in the northeast of the city. This area is more irregular in layout with larger gaps between buildings, and has a total of 309 structures in an area of 42,615 square meters based on imagery from March 2014. The third selected was in the more central Hodon district. It is densely populated with very irregularly laid out structures, has 353 total structures, and a total area of 345,157 square meters. This is also based on imagery from March 2014.
We explore the impact of each sampling method on estimates of household consumption. To construct the data set for the simulations, we drew consumption totals from the data collected by the MHFS. The survey covered both households in neighborhoods and those in internally displaced persons camps, but for the purposes of this simulation, we use only the neighborhood sample as within the camps there is little variation in consumption, due to reliance on food aid; furthermore, there are no camps within our three PSUs. Data were collected from the selected households on a limited range of food and non-food items which we sum to calculate a consumption measure (see Mistiaen and Pape, forthcoming, for further details on these calculations). There were 624 cases outside the IDP camps with non-missing values on the two consumption measures. The distribution of consumption across these cases is shown in Figure 1. The values follow a log-normal distribution and the underlying normal distribution has mean 40.0 and standard deviation 27.5.[3]
To simulate the variety of situations that may be found in the field, we use three different mechanisms for assigning consumption values to households in the three example PSUs. In the first, values are randomly assigned across the households in each PSU. In the second, the same values are reassigned to households to create a moderate degree of spatial clustering. In the third assignment mechanism, the spatial clustering of consumption values is more extreme. We study the ability of each of the proposed methods to estimate consumption under these three conditions. While these distributions may not mimic actual conditions in these PSUs, they are illustrative of the different situations encountered in the field.
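The three assignment mechanisms can be illustrated with a short sketch. The text does not specify how the spatial clustering was generated, so the approach below (ranking households by an x coordinate plus Gaussian noise and handing out the sorted values in that order) is purely an assumption for illustration, as are the function name `assign_consumption`, the toy coordinates, and the log-normal parameters.

```python
import random

def assign_consumption(coords, values, noise_sd=None, seed=0):
    """Assign consumption values to households (illustrative only).

    coords   -- list of (x, y) household locations (hypothetical input)
    values   -- consumption values, same length as coords
    noise_sd -- None gives purely random assignment; otherwise households are
                ranked by x plus Gaussian noise and receive the sorted values
                in that order (0.0 gives the most extreme clustering)
    """
    rng = random.Random(seed)
    if noise_sd is None:
        shuffled = list(values)
        rng.shuffle(shuffled)
        return shuffled
    keys = [x + rng.gauss(0.0, noise_sd) for x, _ in coords]
    order = sorted(range(len(coords)), key=lambda i: keys[i])
    sorted_values = sorted(values)
    assigned = [None] * len(coords)
    for rank, i in enumerate(order):
        assigned[i] = sorted_values[rank]
    return assigned

# Toy PSU: 68 households on a 10-column layout, with hypothetical log-normal values.
coords = [(float(i % 10), float(i // 10)) for i in range(68)]
value_rng = random.Random(1)
values = [value_rng.lognormvariate(3.5, 0.6) for _ in range(68)]

random_assignment = assign_consumption(coords, values)
moderate_clustering = assign_consumption(coords, values, noise_sd=3.0)
extreme_clustering = assign_consumption(coords, values, noise_sd=0.0)
```

In every case the same set of values is redistributed, so the PSU mean is unchanged and only the spatial pattern differs.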

Sampling Methodologies
For each of the methods (satellite mapping, segmentation, grid squares, Qibla method, and random walk) and assignment mechanisms (random assignment, spatial clustering, and extreme spatial clustering), 10,000 simulated samples were drawn and relevant probability weights calculated. Each sample consisted of ten structures per PSU. For the cases in which the household sample was selected in two stages, two segments of five households were chosen within each PSU for segmentation, and two grid boxes of up to five households were randomly chosen in the grid square selection. This section provides further detail on the selection methods and describes the weight calculations necessary to achieve unbiased results.
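The simulation logic can be sketched as follows for the simplest case, simple random sampling of ten structures from a PSU. This is a minimal sketch rather than the authors' code: the function names, the stand-in population, and the reduced number of replications (2,000 instead of 10,000) are assumptions made to keep the example short.

```python
import random
import statistics

def weighted_mean(values, weights):
    """Weighted (Hajek-type) estimator of the mean."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

def simulate_srs(population, n, reps, seed=0):
    """Draw `reps` simple random samples of size n; return (bias, variance)."""
    rng = random.Random(seed)
    N = len(population)
    estimates = []
    for _ in range(reps):
        sample = rng.sample(population, n)
        weights = [N / n] * n          # inverse of the selection probability n/N
        estimates.append(weighted_mean(sample, weights))
    true_mean = statistics.fmean(population)
    bias = statistics.fmean(estimates) - true_mean
    variance = statistics.pvariance(estimates)
    return bias, variance

# Stand-in population of 68 structures with arbitrary values 0..67.
population = [float(v) for v in range(68)]
bias, variance = simulate_srs(population, n=10, reps=2000)
```

For the other designs, only the sampling step and the weights would change; the bias and variance summaries are computed in the same way.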

Satellite mapping
A full mapping of the PSU entails using satellite maps to identify the outline of each structure (see appendix). In this case, we used maps publicly available on Google Earth and maps of the EA boundaries provided by the Somali Directorate of Statistics. From these maps, the structures inside each PSU can be assigned numbers (either by hand or digitally) and selected easily in the office with simple random sampling. The coordinates of the selected households can be loaded onto GPS devices to assist interviewers' locating efforts.
This approach is the closest of the proposed study methods to the gold standard of a well-implemented full household listing. The main differences are that in a field listing, enumerators can exclude ineligible structures, such as uninhabited and commercial buildings, and include information not available from satellite maps, such as the identification of individual units within multi-household structures. In addition, whereas listing is always done just before the data collection, the satellite maps may be out of date, leading to under-coverage of newly constructed units and/or selection of units that no longer exist. As noted in the previous section, the maps used for this paper were about one year old at the time of the initial analysis. The historical imagery from Google Earth indicates these maps are generally updated at least once a year, but this may vary substantially depending on the specific location and year. Selection from a satellite mapping therefore requires an additional set of field protocols for addressing and documenting the above issues.
The calculation of the probability of selection, and by extension the survey weight, is straightforward. The probability is simply $n/N$, where $n$ is the number of structures selected and $N$ is the total number of structures, with any necessary adjustment for multi-household structures (for example, one unit from a three unit building) encountered in the field.
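As an illustration, the office-based selection and the resulting weight might be computed as below. This is a sketch under stated assumptions: the function names are hypothetical, structures are numbered 1 to 68 as in the Dharkinley example, and the multi-household adjustment is taken to be one unit chosen at random within the structure.

```python
import random

def select_structures(structure_ids, n, seed=None):
    """Simple random sample of n structures without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(structure_ids), n)

def srs_weight(n, N, units_in_structure=1):
    """Inverse of the selection probability (n/N) * (1/units_in_structure)."""
    prob = (n / N) * (1.0 / units_in_structure)
    return 1.0 / prob

sample = select_structures(range(1, 69), n=10, seed=1)
base_weight = srs_weight(10, 68)                         # single-household structure
multi_weight = srs_weight(10, 68, units_in_structure=3)  # one unit of three selected
```

With ten of 68 structures selected, a single-household structure carries a weight of 6.8, and a household chosen from a three-unit building carries three times that.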

Segmenting
Segmenting is a standard field procedure of subdividing large PSUs into smaller units, approximately equal sized in terms of number of households, for listing and selection purposes. The individual segments are then selected with simple random sampling, listed by field enumerators, and households selected from these lists. Segmenting is less time consuming than a full mapping exercise in terms of office preparation, but still requires the manual demarcation of segment boundaries. When creating segments, best practice is to use clearly discernible landmarks to draw boundaries, but these can change over time or not be correctly identified by the interviewers. If the interviewer incorrectly identifies the segment, it may be necessary to exclude the resulting data as they cannot be properly weighted. Properly implemented, this method is able to produce unbiased estimates, and is not as dangerous or costly as a full listing, as listing only the selected segments would involve substantially less time in the field.
[3] [...] estimating the true mean in the population, it should not affect the means when compared between methods. First, as this bias is associated with a selected household's decision to participate, it would impact all methods equally. Secondly, as non-response attenuates any differences, they would appear smaller in magnitude in the simulations than in the true population. To address this, we ran a large number of simulations to generate narrow confidence intervals on the results.

Figure 2: Example of Grid Sampling Method
The calculation of the probabilities of selection is also straightforward: it is the product of the probability of selection of the segment and the probability of selection of the household within the segment. The additional clustering introduced by this method, however, could decrease the precision of estimates due to the design effects. The magnitude of the decrease in precision would depend on the number of segments selected, the number of households selected per segment, and the degree of homogeneity within PSUs for the study variables. The impact would be largest if all ten households were selected from the same segment, and would decrease as more segments were selected. At the other extreme, if one household were selected from each of ten segments, segmentation would produce more precise estimates than simple random sampling as the segmentation prevents a chance geographic concentration of selected households. As a balance between efficiency and practicality, two segments and five households per segment were selected for the simulations.
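The two-stage selection and its weights could be sketched as follows. The segment contents here are invented for illustration (seven segments of eight to ten households), and `select_two_stage` is a hypothetical helper, not part of the survey software.

```python
import random

def select_two_stage(segments, m, n_per_segment, seed=None):
    """Select m segments by SRS, then households by SRS within each.

    segments -- dict mapping segment id -> list of household ids (from the field listing)
    Returns a list of (segment id, household id, weight) tuples.
    """
    rng = random.Random(seed)
    M = len(segments)
    chosen = rng.sample(sorted(segments), m)
    selected = []
    for seg in chosen:
        households = segments[seg]
        n_j = min(n_per_segment, len(households))
        # P(household) = P(segment selected) * P(household | segment selected)
        prob = (m / M) * (n_j / len(households))
        for hh in rng.sample(households, n_j):
            selected.append((seg, hh, 1.0 / prob))
    return selected

# Hypothetical PSU: seven segments with 8, 9, or 10 households each.
segments = {j: [f"{j}-{h}" for h in range(8 + j % 3)] for j in range(7)}
two_stage_sample = select_two_stage(segments, m=2, n_per_segment=5, seed=3)
```

With two of seven segments and five households per segment, a household in a ten-household segment has probability (2/7)(5/10) = 1/7 and therefore a weight of 7.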

Grid Squares
To implement the grid method, a uniform grid of squares (or other uniform shape) is overlaid on the PSU map. Figure 2 shows an example using 50 x 50 meter squares for the Dharkinley PSU. The area of a grid square includes all of the area that lies both within the grid square and within the PSU boundaries. For example, in grid square 17 in figure 2, the majority of the structures inside the square would not be eligible for the survey, as they lie outside of the PSU boundaries. Only the structures which lie in the bottom left corner are both within the grid and PSU boundaries.
One or more squares can then be selected with simple random sampling from the set of all squares that overlap the selected PSU. Depending on the survey protocols, a structure may be defined as eligible if all or part of it lies within the grid space. The more common protocol, including the structure if the majority lies within the grid square, has the benefit of simplifying the weight calculations, but the risk of subjective decisions made by interviewers in the field about where the majority of the building lies, which could lead to some buildings having no chance of selection. Since the options for supervision and field re-verification were limited in this survey, it was decided to consider the structure as eligible if any portion of the structure lay within the grid boundaries, to ensure that all units had a positive probability of selection.
To select a sample of households within the selected squares, a common approach would be for interviews to be conducted with all eligible respondents within the grid square. This could lead, however, to issues with verification as well as decreasing control over the final total sample size. Therefore the protocol used in Mogadishu had interviewers list all households within the selected squares and use the application on their smart phones to select a fixed number of households for the survey.
This variation of the grid method has the advantage that it requires less preparation time compared to mapping or segmenting. There are considerable drawbacks, however, in the ease of implementation and additional work to accurately calculate the selection probabilities. Since the grid squares do not follow visible landmarks, the boundaries must therefore be programmed into the GPS and identified by the interviewers. As it is unlikely that they will be able to walk straight along the boundary, additional training may be required to correctly identify eligible structures.

Source: Authors' diagram based on PSU boundaries and Google Earth images
This approach also still requires some listing work, which may have security implications depending on the size of the squares in the grid. The size can vary depending on the physical size of the PSU and the density of the population. Larger grid squares may be necessary in sparsely populated areas, but increase listing time and interviewer exposure. Smaller squares require less listing work, but also mean that more buildings will lie on the boundaries between squares. Those selected structures which lie on boundary lines require either an arbitrary and unverifiable decision by field staff as to whether the majority of the structure lies within the grid square, or additional time for field implementation, as discussed below.
Let $s$ be the number of squares selected in the PSU and $S$ be the number of squares that are partially or completely contained within the PSU. For households that are entirely contained within square $j$, the probability of selection, given that the PSU was selected, is:

$$P(i \mid \text{PSU}) = \frac{s}{S} \cdot \frac{n_j}{N_j} \qquad (1)$$

where $n_j$ is the number of structures selected from square $j$ and $N_j$ is the total number of eligible structures in the square. The term $s/S$ is the probability of selection of the square when a simple random sample of size $s$ is selected from the $S$ squares in the PSU.
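If household $i$ lies in both squares $j$ and $j'$, it can be selected through either square, and the probability of selection is:

$$P(i \mid \text{PSU}) = \frac{s}{S}\,\frac{n_j}{N_j} + \frac{s}{S}\,\frac{n_{j'}}{N_{j'}} - \frac{s(s-1)}{S(S-1)}\,\frac{n_j}{N_j}\,\frac{n_{j'}}{N_{j'}} \qquad (2)$$

where the final term removes the double counting of the case in which both squares are selected and the household is drawn in each. For a structure overlapping more than two grid squares, there would be additional terms in equation (2), up to the extreme case lying on a four way intersection. Interviewers would also have to spend significant time on additional listing, which greatly increases exposure in the field and provides disincentives to interviewers to report such households.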
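The selection probabilities for the grid method can be computed with a short sketch. It assumes squares are drawn by simple random sampling without replacement and that the within-square draws are independent, so that a household overlapping two squares has the probability of the union of the two selection events. The function names and the example counts are hypothetical.

```python
def prob_single_square(s, S, n_j, N_j):
    """Household wholly inside square j: P(square) * P(household | square)."""
    return (s / S) * (n_j / N_j)

def prob_two_squares(s, S, n_j, N_j, n_k, N_k):
    """Household overlapping squares j and k (inclusion-exclusion).

    Assumes s of S squares are drawn without replacement and that the
    household draws within the two squares are independent.
    """
    p_square = s / S
    p_both = (s * (s - 1)) / (S * (S - 1)) if S > 1 else 0.0
    f_j = n_j / N_j
    f_k = n_k / N_k
    return p_square * f_j + p_square * f_k - p_both * f_j * f_k

# Hypothetical PSU with 20 squares, 2 selected, up to 5 households per square.
p_inside = prob_single_square(2, 20, 5, 12)
p_boundary = prob_two_squares(2, 20, 5, 12, 5, 8)
weight_inside = 1.0 / p_inside
weight_boundary = 1.0 / p_boundary
```

The boundary household needs the listing count of the neighboring square as well, which is the source of the additional field time noted above.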

Qibla Method
This sampling approach involves selecting multiple random locations within each PSU and traveling from each selected point in a common fixed direction until a structure is found. If the first structure the interviewer encounters is a household, the interview is done with the household. In Somalia, the consulting firm suggested using the Qibla (the direction in which Muslims pray) since it is common for interviewers to have an app on their cell phones which indicates this direction. Figure 3 gives a stylized example of this method. Household 510 will be interviewed whenever any of the points in the shaded region are selected. This region includes the area of the dwelling itself (its roof) and all points in its "shadow", that is, all land inside the PSU that lies in the direction opposite the Qibla, excluding points that lead to the selection of other buildings. See figure A4 in the appendix for an example at the PSU level.
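The selection rule can be made concrete with a simplified geometric sketch. It assumes a local coordinate frame rotated so that the positive y axis points toward the Qibla, and it treats building footprints as non-overlapping axis-aligned rectangles; both are simplifications for illustration, and `first_structure_hit` is a hypothetical helper.

```python
def first_structure_hit(point, structures):
    """Return the id of the structure selected from `point`, or None.

    point      -- (x, y) in a frame where +y is the walking (Qibla) direction
    structures -- dict id -> (xmin, ymin, xmax, ymax) rectangular footprints
    """
    px, py = point
    best_id, best_dist = None, None
    for sid, (xmin, ymin, xmax, ymax) in structures.items():
        if not (xmin <= px <= xmax):
            continue                 # the walking line never crosses this footprint
        if ymin <= py <= ymax:
            return sid               # the point falls on the roof itself
        if ymin > py:                # the structure lies ahead of the interviewer
            dist = ymin - py
            if best_dist is None or dist < best_dist:
                best_id, best_dist = sid, dist
    return best_id                   # None: no structure is reached

structures = {
    "A": (0, 10, 10, 20),
    "B": (0, 30, 10, 40),
    "C": (20, 5, 30, 15),
}
hit_from_shadow = first_structure_hit((5, 0), structures)    # in the shadow of A
hit_between = first_structure_hit((5, 25), structures)       # between A and B
hit_none = first_structure_hit((15, 0), structures)          # no structure ahead
```

The set of points that return a given id is that structure's roof plus its shadow, which is the region shaded in the stylized figure.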
Despite its seeming ease of use, this approach presents many challenges. For one, it is not clear how nonresidential structures should be handled. The interviewer could walk around businesses and vacant housing units, continuing in the Qibla direction until she finds a residential unit. This approach would work in theory, but in addition to the difficulties in remote verification it would create, it would also complicate the calculation of probabilities of selection (discussed below). Therefore we do not suggest it. Instead, we suggest coding points that lead to non-household selections as out-of-scope, and selecting additional points to replace them.
Perhaps the biggest challenge with this method is the collection of the information needed to calculate probabilities of selection of the selected households. Figure 3 shows Household 510 and, in the shaded area, the set of all points that lead to the selection of this household. Each household, $i$, in the PSU has an associated selection region: call this region $A_i$. The probability of selection of household $i$ (conditional on selection of the PSU), if $c$ points in the PSU are selected, is one minus the probability that all $c$ selected points are not in $A_i$:

$$P(i \mid \text{PSU}) = 1 - \left(1 - \frac{\text{area}(A_i)}{\text{area}(\text{PSU})}\right)^{c} \qquad (3)$$

(based on Särndal et al. 1992, p. 50). This approach is essentially probability proportional to size selection with replacement, where the measure of size is the area of $A_i$. The weight is then the inverse, $w_i = 1 / P(i \mid \text{PSU})$.
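A minimal sketch of this calculation is given below, assuming the c points are drawn independently and uniformly over the PSU area. The function names and the 400 square meter selection region are hypothetical; the PSU area is the Dharkinley figure reported in the data section.

```python
def qibla_selection_prob(area_ai, area_psu, c):
    """One minus the probability that none of c uniform points falls in A_i."""
    if not 0 <= area_ai <= area_psu:
        raise ValueError("area_ai must lie between 0 and area_psu")
    return 1.0 - (1.0 - area_ai / area_psu) ** c

def qibla_weight(area_ai, area_psu, c):
    """Sampling weight: inverse of the selection probability."""
    return 1.0 / qibla_selection_prob(area_ai, area_psu, c)

# Hypothetical selection region of 400 square meters in a 24,390 square meter PSU,
# with ten points selected.
p_i = qibla_selection_prob(400.0, 24390.0, c=10)
w_i = qibla_weight(400.0, 24390.0, c=10)
```

Households with a large roof or a long unobstructed shadow receive a higher selection probability and therefore a smaller weight.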

From Equation 3, the most difficult quantity to calculate is the area of $A_i$. For the purposes of this paper, we manually delineated the building footprints of individual structures from relatively recent Google Earth maps to calculate the $A_i$ region for each selected household. Though requiring no additional software expertise, this method was time consuming in terms of preparation. For the three PSUs used in the paper, it took about one minute per household to construct a digital outline. If the PSUs contain approximately 250 structures (the ones used here contain 68, 309, and 353 structures, respectively), mapping the 106 PSUs selected for the full Mogadishu High Frequency Survey Pilot would have required more than 50 work days. It may be possible to automate the process for larger mapping efforts by using GIS-based algorithms for feature extraction that were not used here due to the limited number of PSUs.
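The preparation time estimate can be reproduced with simple arithmetic. The eight-hour work day is an assumption made here; the other figures are those quoted in the text.

```python
PSUS = 106                     # PSUs selected for the full survey
STRUCTURES_PER_PSU = 250       # approximate average number of structures
MINUTES_PER_STRUCTURE = 1.0    # observed digitizing time per structure
HOURS_PER_WORK_DAY = 8.0       # assumption, not stated in the text

total_minutes = PSUS * STRUCTURES_PER_PSU * MINUTES_PER_STRUCTURE
work_days = total_minutes / 60.0 / HOURS_PER_WORK_DAY
```

Under these assumptions the mapping comes to roughly 55 work days, consistent with the figure of more than 50.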
Calculating $A_i$ would be much harder, if not impossible, if high quality and recent satellite maps are not available. Any structures added since the imagery was captured would not appear, and the resulting areas of neighboring structures would be incorrectly included in $A_i$. We therefore also consider three methods of approximating $A_i$ as defined in Equation 3. The first is the distance to the next structure in the opposite direction of the Qibla multiplied by the actual width of the dwelling (proxy weights 1). This is l x w in figure 4. The second is the measured distance to the next structure multiplied by a categorical shadow width variable as defined by the interviewer (proxy weights 2), and the third ignores the weights completely. Though theoretically biased, the variations have the benefit of not requiring digitized maps and being more flexible in accounting for new construction, and under certain conditions they may be a good alternative for researchers who find themselves in second or third best scenarios.
The first alternative requires additional information from the field teams. Way points (latitude and longitude coordinates) must be captured with the GPS at the selected point and when the interviewer arrives at the structure so that the distance can be calculated. Then the actual width of the structure perpendicular to the Qibla must be measured, which may be complicated if the dwelling has an irregular outline or if the perpendicular runs diagonally through the building. This may be done by asking interviewers to record a track as they walk the perimeter of the structure, though this requires additional processing from the team following data collection. The second variation is simpler to implement in that it does not involve any additional measurements, beyond recording the waypoints for the selected point and structure edge, though it does introduce additional elements of subjectivity into the weight measurements. Ignoring the weights completely would introduce bias; an unweighted estimate would be approximately unbiased only if dwellings were identical in size and equidistant.
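The first proxy could be computed from field data along the following lines. This is a sketch only: it assumes the distance between the two recorded waypoints, computed with the haversine formula on a spherical Earth, is used as the length l, and that the width w has already been measured in meters. The coordinates and function names are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def proxy_area_1(point_wp, structure_wp, width_m):
    """Approximate the selection region as length (waypoint distance) x width."""
    length_m = haversine_m(point_wp[0], point_wp[1], structure_wp[0], structure_wp[1])
    return length_m * width_m

# Hypothetical waypoints (latitude, longitude) and a measured width of 12 meters.
selected_point = (2.04000, 45.34000)
structure_edge = (2.04020, 45.34000)
approx_area = proxy_area_1(selected_point, structure_edge, width_m=12.0)
```

The resulting area would stand in for the area of the selection region when computing the probability of selection; a point that falls directly on a roof yields a length of zero, so a protocol for that case would be needed.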
In addition to the above concerns about weight calculations, another potential issue with this group of methods is that there are points in the PSU that would not lead to the selection of any households. Consider the shaded area of figure 5. If any of these points were selected, the interviewer would not find any household before she left the boundaries of the PSU. This issue raises questions for the field protocols. Should interviewers stop at the PSU boundary, or should they continue and select housing units outside of the selected PSU? If the former, how would the interviewer know where the PSU boundaries are? If the latter, the probabilities should be adjusted for the fact that the $A_i$ region extends outside of the PSU, which is not straightforward. Additional structures outside of the boundaries of the PSU would need to be mapped, requiring additional preparation time. For the purposes of this paper, we mapped all households in a 50 meter buffer zone around the PSU boundaries. This increased the number of structures required from 68 to 207, 309 to 408, and 353 to 724 in the three PSUs, respectively, nearly doubling the required mapping time if manual delineation is used. A third option would be to allow interviewers to travel outside of the PSU in search of a selected household, but then remove these interviewed households outside the selected PSUs from the data set, because their probabilities of selection are too complex to calculate. This approach preserves the probabilities of selection and is easy for the interviewer to implement, but deleting data is inefficient in terms of cost.

Random Walk
There are many different implementations of the random walk procedure, each of which involves choosing a starting point within the selected area and then proceeding along a path, selecting every kth household. The methods differ in how the path is defined. In this paper, we follow the method used by the Afrobarometer survey.

Source: Authors' diagram based on PSU boundaries and Google Earth images
"Fieldworker 3…. Walking in their designated direction away from the SSP, they will select the fifth household for their first interview, counting houses on both the right and the left (and starting with those on the right if they are opposite each other). Once they leave their first interview, they will continue on in the same direction, and select the tenth household (i.e., counting off an interval of ten more households), again counting houses on both the right and the left. If the settlement comes to an end and there are no more houses, the Fieldworker should turn at right angles to the right and keep walking, continuing to count until finding the tenth dwelling" (Afrobarometer, pg. 35).
To simulate the random walk in the Mogadishu context, we replicated the Afrobarometer protocols to the extent possible. First, we selected a random starting point; since it is not possible to identify landmarks with the level of detail available on the maps, we simply used a random point as the start of the path. To simulate the direction of the sun, a random angle was chosen and the directions of the interviewers' paths were assigned at 90-degree intervals. For example, if 13 degrees from due north was selected, then the four paths would be at 13 degrees, 103 degrees, 193 degrees, and 283 degrees. Along these lines, it was assumed that every dwelling within five meters on either side of the direction of walking was within the interviewer's line of sight. These dwellings were sequentially numbered and every fifth dwelling was selected. If the interviewer reached the PSU boundary before selecting the requisite number of households, the path made a 90-degree turn and continued. If each of the four interviewers selected three households, the total cluster size would be twelve. To ensure comparability with the other methods, each of which aimed to select ten households, we dropped the last two selected households. 4
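The direction assignment and systematic selection described above can be sketched as follows. This is a simplified illustration rather than the simulation code used for the paper: it assumes the dwellings along each path have already been listed in walking order, it omits the geometry of the five-meter line of sight and the 90-degree turns at the PSU boundary, and it interprets "the last two selected households" as the last two in path order. The function names and dwelling labels are hypothetical.

```python
import random

def bearings_from(angle):
    """Four walking directions at 90-degree intervals, in degrees from due north."""
    return [(angle + 90 * i) % 360 for i in range(4)]

def select_every_kth(dwellings, k=5, per_walker=3):
    """Select every k-th dwelling in walking order, up to per_walker selections."""
    return dwellings[k - 1::k][:per_walker]

# The worked example from the text: a random angle of 13 degrees.
bearings = bearings_from(13)          # [13, 103, 193, 283]

# In a simulation the angle would be drawn at random, for example:
rng = random.Random(0)
random_bearings = bearings_from(rng.uniform(0, 360))

# Hypothetical lists of dwellings seen along each path, in walking order.
paths = {b: [f"path{b}-dwelling{i}" for i in range(1, 21)] for b in bearings}

selected = [d for b in bearings for d in select_every_kth(paths[b])]
cluster = selected[:10]               # drop the last two of the twelve selections
```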

Simulations
For each of the sampling methods discussed above and each of the three methods of allocating consumption values to households (random, some spatial clustering, extreme clustering), we simulated 10,000 samples and calculated the mean of each one. We report in table A2 in the appendix the mean, standard deviation, 5 5th percentile, and 95th percentile of the distribution across all 10,000 samples and evaluate the different sampling approaches in terms of their bias and variance. If a sampling method is unbiased, the expected value of the sample means should be 40, which is by design the true mean consumption in each simulated PSU.
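The logic of this evaluation can be illustrated with a minimal sketch for the simplest case, a simple random sample of ten households drawn from a complete listing. The population below is hypothetical (300 households with skewed consumption values rescaled to a mean of 40) and stands in for the simulated PSUs; the function names, seeds, and distributional choice are assumptions made for illustration only.

```python
import random
import statistics

def make_population(size=300, true_mean=40.0, seed=1):
    """Hypothetical PSU: skewed consumption values rescaled to mean true_mean."""
    rng = random.Random(seed)
    raw = [rng.lognormvariate(0, 0.5) for _ in range(size)]
    scale = true_mean / statistics.fmean(raw)
    return [v * scale for v in raw]

def simulate_sample_means(population, n=10, reps=10_000, seed=2):
    """Draw reps simple random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]

def summarize(means, true_mean=40.0):
    """Mean, standard deviation, 5th/95th percentiles, and relative bias."""
    cuts = statistics.quantiles(means, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    avg = statistics.fmean(means)
    return {
        "mean": avg,
        "sd": statistics.stdev(means),
        "p5": cuts[0],
        "p95": cuts[-1],
        "bias_pct": 100 * (avg - true_mean) / true_mean,
    }

population = make_population()
summary = summarize(simulate_sample_means(population))
print(summary)
```

For an unbiased design such as this one, the mean of the simulated sample means should be close to 40 and the relative bias close to zero; a biased selection rule substituted for `rng.sample` would shift both.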
While generally it was possible to implement all of the methods in our simulations, there were notable challenges with three of the designs. In simulating the Qibla method, certain selected points did not lead to a selection within the EA. The impact was largely negligible in Heliwa or Hodon, where only 0.4 percent and 1.4 percent, respectively, of the total area led to no selection, but in Dharkinley, the smallest and most regular of the PSUs tested, 13 percent of the area led to no selection within the PSU, substantially decreasing the efficiency of that method. Then, in the implementation of the grid selection method, there was little control over the number of households in each grid square. In some cases, grid squares were empty or did not have the minimum number of structures to achieve the expected sample size. In the most extreme case of the large and sparsely populated PSU of Heliwa, when 50 x 50 m grid squares were used, 42 of the 169 grid squares contained no structures. Of those remaining, a further 90 had fewer than the necessary five structures. Therefore the grid squares were combined into 100 x 100 m squares. After combination, there were 51 grid squares, 7 of which were empty, but 16 continued to contain fewer than the minimum number of structures. For the simulations, we dropped grid squares without households, though this would likely not be possible in true field implementation, leading to cost inefficiencies.

Figure 6: Heliwa PSU with 50 meter grid square overlay

Source: Authors' diagram based on PSU boundaries and Google Earth images
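The tabulation of empty and under-filled grid squares described above can be sketched as follows. The sketch assumes a rectangular bounding area and structures represented by single points in projected coordinates (meters from the south-west corner); the function name `grid_summary` and the example points are hypothetical and do not reproduce the Heliwa counts.

```python
import math
from collections import Counter

def grid_summary(points, cell_size, width, height, minimum=5):
    """Count grid squares that are empty or hold fewer than `minimum` structures.

    Assumes every point lies inside the width x height bounding rectangle.
    """
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    n_cells = math.ceil(width / cell_size) * math.ceil(height / cell_size)
    return {
        "cells": n_cells,
        "empty": n_cells - len(counts),
        "below_minimum": sum(1 for c in counts.values() if c < minimum),
    }

# Hypothetical structure locations in a 200 m x 200 m area.
points = [(10, 10), (20, 30), (30, 20), (40, 40), (45, 5),
          (60, 10), (120, 130), (160, 170), (170, 160)]

fine = grid_summary(points, 50, 200, 200)     # 50 x 50 m squares
coarse = grid_summary(points, 100, 200, 200)  # combined 100 x 100 m squares
```

In this toy example, combining squares reduces the number of non-empty squares below the minimum from three to one, mirroring the qualitative pattern reported for Heliwa.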
Finally, there are several documented problems with random walk methods, as we discussed in Section 2. One difficulty not previously discussed in the literature but encountered in the simulated implementation was that the protocols above fail in certain situations. As shown by figure A7 in the appendix, depending on the start point and direction, it may not be possible to turn right and remain within the boundaries of the PSU. The interviewer would need to violate the protocols or seek advice from a supervisor to continue implementation.

Bias and Variance
The mean, standard deviation, and coefficient of variation are shown in figures 7 to 10 and in table A2 in the appendix for the eight methods under the three different consumption values, for each PSU as well as overall. 6 From this table, we can evaluate how well each method worked in terms of bias and variance. Against a true mean of 40, it was unsurprising that the full listing / satellite mapping method showed the most consistently efficient and unbiased results. Segmentation also showed consistently unbiased results but had higher variance for higher degrees of clustering in the underlying distribution, due to homogeneity within the segments. The Qibla method with the full weights yielded unbiased results but with wide confidence intervals, though these are likely artificially wide in the simulations. (See note in the technical appendix for more detail.) In addition, the wide confidence intervals are partially driven by a few outlier values. The values of the 5th and 95th percentiles of the distribution for this method are similar to those in the segmentation method when clustering is applied. The two methods of estimating the measure of size for the Qibla method showed a small amount of bias, ranging between 1.5 and 6.5 percent depending on the degree of clustering. There is also evidence of the trade-off between bias and variance. The weights for the proxy methods are based on where the random point is selected, which is necessarily shorter than the full shadow width, truncating the values of the weights. While this introduces a bias into the measures, it also limits the possibility of having large weights for outlier values, which increase the variance. The width of the confidence interval was almost unaffected when clustering was introduced.
There was also little difference in terms of bias and variance between proxy weights 1 and proxy weights 2, indicating there is little information lost if the categorical method is implemented correctly. The unweighted version consistently underestimated the true mean, though it showed narrow confidence intervals because it avoids the weighting loss, that is, the increase in variance resulting from the application of weights (see Eckman and West, forthcoming, and Kish, 1965, section 11.2C, for further discussion). The final two sampling methods both overestimate the means, with a bias of up to ten percent for the clustered distributions. For the grid method, this is most likely due to grid squares which do not have the required number of dwellings, so that the final sample size did not reach 10. The random walk, as noted above, is not theoretically unbiased, and this is reflected in the simulation results.
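The weighting loss mentioned above is often summarized with the approximation commonly attributed to Kish, in which the variance inflation from unequal weights is n Σw² / (Σw)², or equivalently one plus the squared coefficient of variation of the weights. The sketch below applies that approximation to hypothetical weights; it assumes the weights are unrelated to the survey variable, and it is not the variance calculation used in the paper's simulations.

```python
def kish_weighting_deff(weights):
    """Approximate variance inflation from unequal weights: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

def effective_sample_size(weights):
    """Nominal sample size divided by the approximate weighting design effect."""
    return len(weights) / kish_weighting_deff(weights)

equal = [1.0] * 10             # unweighted case: no weighting loss
one_outlier = [1, 1, 1, 5]     # hypothetical weights with one large value

print(kish_weighting_deff(equal))        # 1.0
print(kish_weighting_deff(one_outlier))  # 1.75
```

A single large weight is enough to inflate the approximate variance substantially, which is consistent with the wide confidence intervals observed for the fully weighted Qibla method when a few outlier weights occur.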
Across the three PSUs, there are also some important differences in the methods, as shown in the violin graphs in figures A8 and A9 in the appendix. Dharkinley, despite being the most regular in terms of layout, was problematic for many of the methods. Satellite mapping, segmentation, and random walk all showed a second bulge in the distribution about 20 percent above the true mean, as compared to an expected smooth normal distribution. The full weighting scenarios for the Qibla method also had the most difficulties in Dharkinley, generating a small number of outlier estimates over 1000 compared to a true mean of 40. In contrast, in the chaotic Hodon, there were no issues with satellite mapping, and the full weight Qibla method estimates were on par in precision with segmentation for the clustered distributions. The Qibla proxy methods, however, showed substantial bias when clustering was introduced, with only a slight decrease in variance. Random walk also had substantial difficulties in Hodon, showing both high levels of bias and variance. The bias is caused by the interviewer instructions, which predefine the path an interviewer has to take. Even though the starting location is randomly selected, interviewers tend to reach certain areas with a higher likelihood. Estimates for variables which are correlated with these unevenly distributed selection probabilities are biased.

Ease of Implementation and Remote Supervision
In conflict and capacity-constrained environments, such as Mogadishu, the ease of implementation and options for remote supervision were also necessary considerations in the selection of the final methodology. Satellite mapping requires little specialized training beyond the use of a GPS device for navigation, as target households are selected in the office. The Qibla and random walk methods similarly require the ability to navigate with a GPS to the selected point, but also require additional training for interviewers to correctly implement household selection protocols, which are substantially more complex with random walk. The proxy weight versions of the Qibla method also require interviewers to be trained in using GPS for field measurements. Segmentation and grid methods are the most difficult to implement in the field as they require interviewers to identify the boundaries of sub-sections, which in the case of the grid method may not follow landmarks and may cross through structures.
For the purposes of remote verification, the two main GPS-based tools available for supervision are waypoints and tracks. Waypoints record the latitude and longitude coordinates of a given location, while tracks record the path of the GPS from the time it was activated. Satellite mapping can be effectively supervised remotely with waypoints. The point recorded by the interviewer when they arrive at the household can be compared to the coordinates of the selected household to ensure they correspond to the same structure. This would be most effective in sparsely populated spaces with little overhead obstruction. Verification would be more difficult in dense urban areas where the minimum of 15 feet (5 meters) accuracy of the GPS could lead to multiple possible structures, or if heavy overhead cover of metal roofs blocks GPS signals. The Qibla and random walk methods would both use a waypoint to identify the starting point and then the track to confirm the path taken. Grid and segmentation can both use waypoints to confirm points are within selected areas, and it may also be possible to use tracks to confirm the listing process if strict protocols are used (i.e., start in the NE corner and continue clockwise), though the intersection of the interviewers' paths may make results less clear.
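The waypoint check for satellite mapping, including the ambiguity that arises in dense areas, can be sketched as follows. The sketch assumes that coordinates have been projected to meters, that each structure is represented by a single reference point, and that a five-meter radius stands in for GPS accuracy; the function names, structure identifiers, and coordinates are hypothetical.

```python
import math

def candidates_within(waypoint, structures, accuracy_m=5.0):
    """Ids of structures whose reference point lies within accuracy_m of the waypoint."""
    wx, wy = waypoint
    return [sid for sid, (x, y) in structures.items()
            if math.hypot(x - wx, y - wy) <= accuracy_m]

def verify(waypoint, selected_id, structures, accuracy_m=5.0):
    """Classify an interview waypoint as confirmed, ambiguous, or mismatch."""
    near = candidates_within(waypoint, structures, accuracy_m)
    if selected_id not in near:
        return "mismatch"
    return "confirmed" if len(near) == 1 else "ambiguous"

# Hypothetical mapped structures (projected coordinates in meters).
structures = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (50.0, 50.0)}

print(verify((1.0, 0.0), "A", structures))    # ambiguous: A and B are both in range
print(verify((49.0, 50.0), "C", structures))  # confirmed
print(verify((49.0, 50.0), "A", structures))  # mismatch
```

Cases flagged as ambiguous or mismatched would still need follow-up by a supervisor, since the check cannot distinguish interviewer error from GPS error.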

Replacements
Due to high transportation costs, most surveys in the developing world use replacements for nonresponse due to refusals or out-of-sample selections. This is done either by selecting additional households from the PSU listing exercise, as is recommended in the World Bank's Living Standards Measurement Study (Grosh and Munoz, 1996), or by selecting a neighboring structure based on field protocols, such as selecting the dwelling immediately to the right (Lowther et al., 2009). While replacing out-of-sample selections with new random points does not introduce bias, it is inefficient and increases costs. Non-response due to refusal is likely to be non-random, and therefore replacements will create at least some degree of bias in the data. The reason and method for the replacement may influence the degree. If refusals tend to come from households in the highest and lowest wealth quintiles, as the opportunity cost of their time is high, and replacements come from the main part of the distribution, the use of replacements will attenuate the variation in the sample. This may cause the results to underestimate measures such as inequality that depend on accurately capturing the extremes of the distribution. When using a replacement method that uses near neighbors, if the structures being replaced are abandoned or commercial buildings, the households living adjacent to them may be systematically different from the remainder of the PSU. In addition, households near the boundary of the PSU would have a lower probability of selection, since there are fewer households near them that would lead to them being selected as replacements.
Of the methods discussed above, segmenting and gridding require a short listing exercise at which time non-eligible structures can be excluded. Satellite mapping and the Qibla method rely on maps that cannot differentiate based on eligibility, and are therefore more vulnerable to issues with out of sample selections. In addition, regardless of method, the survey protocols should address procedures for the inevitable refusals, which may be more likely in conflict areas.

Discussion
Ultimately, the most appropriate method for second-stage sampling in any survey depends on a trade-off between cost, necessary precision, and tolerable bias. In conflict zones, these decisions are further complicated by time pressures, available back-office resources, and security concerns. Satellite mapping, segmentation, and the Qibla method with full area weights are all probability methods for which it is possible (though not necessarily easy) to calculate weights, and thus all produce unbiased estimates of the population mean. Of these options, the simulations demonstrated that satellite mapping yielded the most consistently unbiased and efficient design, under the assumption that recent maps are available and potential issues with out-of-sample buildings can be adequately addressed. The Qibla method provides promising results in the simulations but has yet to be tried in the field. The proxy weight variations of the Qibla method also show promise, as they remove the requirement of updated satellite maps and greatly reduce the calculation burden for the weights, but they do show substantial bias in certain circumstances. The non-probability methods, random walk and the unweighted Qibla method, do not produce unbiased results. Random walk, in particular, did not perform well in the simulations despite being common practice for many surveys.
The simulations also showed that the implications of bias in the estimates can be substantial for the policy conclusions drawn from the data. In this study, the main indicator was household consumption, which underpins poverty calculations in much of the developing world. For a hypothetical poverty line set at the bottom 40 percent of the population, the bias resulting from using a random walk rather than satellite mapping would lead to an underestimation of the poverty rate by five percentage points. Given the expanding availability of satellite maps and decreasing costs of GPS technology, much of which is integrated into the phones and tablets used by interviewers, alternative methods based on probability sampling may reduce bias with little impact on cost or complexity of implementation.
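Expressed in relative terms, and assuming the true headcount equals the 40 percent implied by the hypothetical poverty line so that the estimated headcount is 35 percent, the five-percentage-point shortfall corresponds to:

```latex
\[
\frac{40\% - 35\%}{40\%} = 12.5\%
\]
```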
Beyond the simulated results, a number of questions remain that can only be addressed by field testing. For example, it is not possible to discuss the cost considerations of the choice of method nor to comparatively discuss the implications on interviewer safety. Also, the simulations assume perfect implementation and further research is needed on the implications of human error or of outdated maps.
In the case specifically discussed here, the Mogadishu High Frequency Survey Pilot, the team opted to use segmentation as a compromise between preparation time, ease of implementation, and the time and complexity necessary for the weight calculations. The implementation was generally successful despite a number of difficulties in the field. Teams occasionally encountered high-level security threats and exploitative rent-seeking from local leadership. The complexity of the survey protocols, including the sampling design, slowed the implementation of the survey. Also, a substantial number of observations had to be discarded because the interviewed points did not fall within the boundaries of the selected segments or because interviewed households did not appear on segment listing forms. Regardless of these challenges, however, it was possible to implement a complex and yet rapid, high-quality survey in one of the most challenging urban contexts known to date.