Prospects of Estimating Poverty with Phone Surveys: Experimental Results from Serbia

Telephone surveys enable us to collect data in a cost-effective and timely manner, but may not be conducive for collecting detailed consumption or income data for measuring poverty due to the required length of the interview and complexity of the questions. Combining telephone surveys with a survey-to-survey imputation technique may be a solution, as this technique can produce reliable poverty estimates from only 10 to 20 simple questions. However, this approach may lead to biased results if the interview mode, that is, face-to-face versus telephone interviews, affects how households respond to questions. By conducting the first survey experiment to examine potential differences in poverty estimates between interview modes, this study finds that the reporting patterns changed very little between the two interview modes, and the bias in poverty estimates due to interview mode is statistically insignificant. These findings suggest that poverty monitoring via telephone surveys is promising, but additional experiments in other country contexts are encouraged.


1 Introduction and Background
Monitoring progress in poverty reduction is constrained by the limited availability of household expenditure or income surveys. According to Serajuddin et al. (2015), around 60 countries had either zero or only one household survey for monitoring poverty in the 10-year period between 2002 and 2011, and an additional 20 countries had two surveys in that period, but the surveys were not carried out regularly. According to the International Monetary Fund's General Data Dissemination System (GDDS), a country needs to have national poverty estimates every three to five years. In this sense, nearly 80 countries, or half the countries included in the World Bank's database, suffered to some degree from limited availability of household survey data for estimating poverty indicators. In the Sub-Saharan Africa region, the situation is even more serious: more than 80 percent of countries in the region did not have regular household income or expenditure surveys to produce national poverty estimates every five years.
The limited availability of poverty data restricts our ability to monitor progress toward the goals of ending extreme poverty and promoting shared prosperity, the goals that the World Bank Group has committed itself to achieving. This is particularly the case for monitoring the latter. The shared prosperity indicator is the growth rate of the mean income or expenditure of the poorest 40 percent of the population over a period of roughly five years. Therefore, if a country does not have a household expenditure or income survey at least every five years, it is difficult to estimate the shared prosperity indicator regularly.
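To make the data requirement concrete, the following is a minimal Python sketch of the shared prosperity indicator as described above. It rests on several assumptions that are not taken from the source: the function names are hypothetical, observations are unweighted (a real survey would use population weights), and the growth rate is annualized.

```python
import math

def bottom40_mean(per_capita_values):
    """Mean per capita income/expenditure of the poorest 40 percent (unweighted)."""
    ordered = sorted(per_capita_values)
    cutoff = math.floor(0.4 * len(ordered))  # observations in the poorest 40 percent
    if cutoff == 0:
        raise ValueError("need enough observations to form the bottom 40 percent")
    return sum(ordered[:cutoff]) / cutoff

def shared_prosperity(values_start, values_end, years):
    """Annualized growth rate of the bottom-40 mean between two survey years.

    Annualization is an assumption of this sketch.
    """
    mean_start = bottom40_mean(values_start)
    mean_end = bottom40_mean(values_end)
    return (mean_end / mean_start) ** (1.0 / years) - 1.0
```

The sketch needs two comparable surveys, one at each end of the period, which is why infrequent surveys make the indicator hard to produce.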
One of the main reasons why data are so limited is that collecting household expenditure or income data is costly and time-consuming if the data are collected via traditional face-to-face interviews. To estimate the prevalence of poverty, a household survey needs to cover at least 50 food items, each of which needs to include different sources of consumption, such as purchases, gifts, and own production. Completing a household survey interview often requires more than one hour and, worse, collecting such data often requires multiple visits by enumerators. As a result, it is not rare for a country to spend more than two million US dollars on carrying out a household income or expenditure survey.
A recent surge in telephone coverage, even in low-income countries, gives us hope that telephone interviews can significantly reduce the time and cost of data collection. Telephone interviews eliminate transportation and lodging costs, which usually comprise a large share of the cost of face-to-face data collection. The cost-effectiveness of telephone interviews can be further enhanced using automatic data collection via Short Message Service (SMS) or Interactive Voice Response (IVR). However, such telephone data collection is not flawless. Telephone surveys often have sampling problems. Since ownership of landlines or even cell phones is skewed toward the richer segment of the population, responses to telephone interviews are often not nationally representative. Ensuring the representativeness of the sample is therefore a challenge for telephone surveys. Another challenge is that it is difficult to conduct a long interview by telephone, but, as mentioned above, collecting consumption or income data easily takes one hour or more. These might be reasons why, despite high expectations for telephone surveys to fill data gaps, no country (as of April 2016) uses telephone interviews to collect data for estimating official poverty and inequality statistics.
This paper proposes a new procedure for telephone surveys that can address the aforementioned issues, and shows its performance in a pilot that was conducted jointly by the Statistical Office of the Republic of Serbia (SORS) and the World Bank. One of the most distinctive features of the new approach is the use of a formula for estimating poverty. Collecting consumption or income data directly is time-consuming and labor-intensive. Instead, our proposal is to ask 10 to 15 simple questions and project household expenditure or income from them using a formula. This approach is often called a "survey-to-survey imputation approach," or S2S.
It saves on interview time significantly, and the literature shows that S2S can project poverty rates reasonably well under certain circumstances. So far, however, no study has tested whether S2S works well in telephone surveys. This paper is one of the first studies to analyze systematically whether S2S can be used for telephone surveys.
The rest of this paper is organized as follows. Section 2 provides a literature review. Section 3 introduces a new approach for telephone data collection. Section 4 explains the design of this pilot in detail. Section 5 presents the results of this pilot, and Section 6 concludes.

S2S Method
Given that it is difficult to collect consumption or income data directly via telephone interviews, a survey-to-survey (S2S) imputation technique is a good alternative for monitoring monetary poverty via telephone surveys. This subsection reviews the literature on the S2S method.
The first instance in which S2S played a key role in poverty estimation can be traced to Deaton and Dreze (2002) and Kijima and Lanjouw (2003). The Government of India made a slight change in the recall period in part of the food consumption module for its National Sample Survey Organization (NSSO) survey of 1999-2000, and it was widely argued that the resulting data likely overestimated consumption expenditure. Since consumption data for 1999-2000 using the recall period consistent with the previous rounds were unavailable, S2S was needed to estimate changes in poverty. Deaton and Dreze (2002) created a model, using the portion of the consumption module in which the recall period remained unchanged, to project total household expenditure. Kijima and Lanjouw (2003) applied the same approach, but using non-consumption data. Whether using a portion of consumption or non-consumption characteristics leads to more accurate predictions remains open to debate. Subsequent analysis using consumption data from so-called thin rounds of household surveys, which are collected between years with a large household survey but are not used to estimate official poverty rates, suggests that the finding of Kijima and Lanjouw (2003) seems more plausible. However, no systematic test to examine the reliability of either approach was conducted.
Stifel and Christiaensen (2007) made one of the first attempts to conduct S2S across different household surveys. They created consumption models based on the 1997 Kenya Welfare Monitoring Survey (WMS) and applied the models to three consecutive rounds of the Demographic and Health Survey (DHS) between 1993 and 2003. The 1997 WMS has both consumption and non-consumption data, while the DHS has only non-consumption data.
In that paper, Stifel and Christiaensen (2007) provide theoretical guidance regarding the choice of variables to be included in imputation models, so as to maintain comparability and reliability of imputed poverty data. They recommend including variables that change over time, but call for the exclusion of variables whose rates of return are likely to change markedly in the face of evolving economic conditions. This argument makes sense in theory, but it is difficult to identify which variables would satisfy these conditions. For example, Stifel and Christiaensen (2007) include ownership of several consumer durables in their imputation models, but Harttgen, Klasen and Vollmer (2012) criticize this decision because of the so-called "asset drift" effect, where the pace of improvement in asset ownership is much faster than that of income growth. Which variables satisfy the recommendations of Stifel and Christiaensen (2007) is ultimately an empirical question; unfortunately, this could not be tested in the Kenya data because only one round of consumption data was available.
Christiaensen et al. (2012) moved one step further by conducting an experiment using a series of past household budget surveys available in the Russian Federation, Vietnam, Kenya, and rural China. Their empirical strategy is first to create projection models using one round of a household budget survey, then to impute household expenditure data into other rounds of the household budget survey, and finally to compare poverty rates projected from the imputed expenditure with those estimated directly from actual consumption data. The comparison between projected and directly estimated poverty rates gives a sense of how reliable S2S is. The results provide suggestive support for the notion that imputation models constructed from past rounds can predict household consumption data of future rounds.
Such results are encouraging and give us a certain level of confidence in the stability of imputation model coefficients over time. Douidich et al. (2013) is the first attempt to test the reliability of S2S between different surveys. Using data from Morocco, they created consumption models in one round of the Household Expenditure Survey (HES), imputed consumption data into the more frequent Labor Force Surveys (LFS), and estimated poverty rates using the imputed data. To examine the accuracy of the projected poverty statistics, they compared them with poverty rates directly estimated from consumption data of the HES conducted in the same year as the LFS. The results are very encouraging. Projected poverty rates are very close to directly estimated poverty rates, irrespective of which round of the HES is used to create the consumption models. The question now is whether this is also the case for other countries and other periods. The methodology was examined further in Newhouse et al. (2014) and Dang et al. (2014).
As shown above, the S2S method has been tested in many different settings, but whether it works well with telephone interviews remains an open question. S2S is attractive for telephone interviews because it needs only 10 to 15 simple questions to estimate poverty rates. But telephone interviews are not flawless, and some of their potential shortcomings might severely bias poverty estimates produced by the S2S method. The next subsection reviews the literature on telephone interviews.

Telephone Data Collection
The literature on telephone data collection suggests several potential pitfalls of telephone interviews. They include (i) high attrition and non-response rates, and lack of representativeness; (ii) lack of welfare indicators necessary to monitor monetary poverty; and (iii) response bias. This subsection summarizes findings on the three key issues.

(i) High attrition and non-response rates, and lack of representativeness
The "Listening to LAC" report (World Bank 2013), hereafter L2L, reported that in one of its pilots, nearly 60 percent of households that agreed to participate in a subsequent telephone survey did not respond to the telephone interviews. Furthermore, it found that the non-responses are not random. Croke et al. (2012) also show that in their Tanzania survey, on average, only 62 percent of households in the baseline survey and 75 percent of households with telephone access responded to telephone interviews. Leo et al. (2015), who used IVR to collect data in four countries, report that the connection rate (the number of respondents over the number of telephone calls) ranged between 15 percent and 31 percent.
Such low response rates make it difficult to create a sample that is representative of the population of interest. Croke et al. (2012) found that the distribution of a wealth indicator differs substantially between the baseline survey and the subsequent telephone survey. Leo et al. (2015) show that although their surveys are designed to be nationally representative, the distributions of key demographic variables differ significantly from those of a Demographic and Health Survey, which is nationally representative.
There are several ways to reduce the biases. For example, even if non-response or attrition is not random, as long as it is associated with observable characteristics of respondents, the bias can be addressed by reweighting the remaining respondents by the inverse of the probability of responding (see more details of this argument in Croke et al. 2012). Leo et al. (2015) follow an iterative proportional fitting algorithm developed by Bergmann (2011). A challenge for this type of adjustment is that it is difficult to find a weight adjustment that makes all variables representative, because each variable shows a different degree of disparity from the nationally representative numbers. As a result, making one variable consistent with its nationally representative value can make others inconsistent.
(ii) Lack of welfare indicators necessary to monitor monetary poverty
It is often argued that telephone surveys can be used to collect poverty data in a cost-effective and timely manner. However, we did not find any telephone survey that collects the consumption or income data necessary to estimate monetary poverty indicators. Croke et al. (2012) collected information on asset ownership and constructed a wealth indicator, but they did not collect consumption or income data. L2L also collected poverty correlates but did not collect consumption or income data directly. Demombynes et al. (2013) also did not collect consumption or income data via telephone surveys, although they linked the telephone surveys with the latest Household Budget Survey so that the poverty status of a sampled household is known.
Most telephone surveys did not collect consumption or income data, likely because collecting such data is time-consuming. To estimate a good-quality monetary poverty indicator, a survey usually needs at least 50 consumption items. Further, food items need to include own production and gifts in addition to purchases. As a result, collecting full-fledged consumption data requires one hour or more. However, according to L2L, it appears difficult to continue telephone interviews for more than 15 minutes. This is likely a reason most telephone surveys do not collect consumption or income data.
(iii) Response bias
Given that collecting consumption or income data via telephone interviews is difficult, S2S is very attractive since it requires only 10 to 15 simple questions. But there might be a problem if responses to telephone interviews differ significantly from those to face-to-face interviews and this response bias in turn biases poverty estimates obtained with an S2S approach. An S2S formula is usually developed from a household survey that was collected via face-to-face interviews. If a household responds to telephone surveys differently than to face-to-face interviews, projections of household expenditure or poverty rates would be biased.
Some studies indeed show evidence of such response bias. L2L conducted comprehensive and well-designed experiments on response bias due to interview modes. It compared four different interview modes: a face-to-face interview, IVR, SMS, and Computer Assisted Telephone Interview (CATI). The study found that responses to CATI are consistent with those to face-to-face interviews, while responses to IVR or SMS are significantly different from those to face-to-face interviews. Croke et al. (2012) found similar results. Dillman et al. (2009) compared CATI with IVR, web-based data collection, and data collection via mail. They found that respondents to CATI and IVR tend to choose more optimistic or socially desirable answers than respondents to web-based or mail data collection. In terms of the comparison between CATI and IVR, there is no clear tendency, but they found that responses to CATI are significantly different from those to IVR. Mu (1998) found that respondents to IVR are less likely to select "10" and more likely to select "9" than respondents to CATI. This might be because respondents find it more cumbersome to type two-digit numbers on a telephone keypad. Tourangeau et al. (2002) found that answers to IVR are slightly more positive than those to CATI.
There are some features of this proposed approach that are worth noting. First of all, this approach does not collect consumption or income data directly. As mentioned, the L2L study recommends that a telephone survey should not have too many questions; otherwise, a telephone interview is likely to be left unfinished due to a sudden loss of connection or the interviewee's refusal to answer more questions. This is problematic if consumption or income data need to be collected, because such an interview often requires one hour or more.
Instead, the above procedure uses an S2S approach. This approach does not collect consumption or income data, but collects simple questions, like ownership of assets, housing conditions, employment and education level of household members, and household size, from which each household's consumption or income data are projected using a formula. The formula is developed by running regressions using the latest round of a household survey that includes both consumption/income and other questions. Literature suggests that the number of questions needed for the S2S imputation approach is as few as 10 to 20; as a result, collection of these variables usually takes only 5 to 10 minutes. Therefore, even telephone interviews can collect enough information necessary to estimate household expenditure or income.
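The projection step described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the model is fitted by plain ordinary least squares on log per capita consumption, and predictions are deterministic (a full S2S imputation would also simulate the error term).

```python
import math

def fit_ols(X, y):
    """Least-squares coefficients for y = b0 + b1*x1 + ... via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the constant term
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict_log_consumption(coefs, x):
    """Projected log per capita consumption for one household."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))

def imputed_poverty_rate(coefs, X_new, poverty_line):
    """Share of households whose projected consumption falls below the line."""
    projected = [math.exp(predict_log_consumption(coefs, x)) for x in X_new]
    return sum(c < poverty_line for c in projected) / len(projected)
```

In use, `fit_ols` would be run on the household survey that contains both consumption and the simple questions, and `imputed_poverty_rate` on the answers collected by telephone.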
However, there are several challenges for the S2S approach. First, the formula needs to be stable over time. This can be a strong assumption because people might change consumption patterns. Therefore, when we developed the formula in the first step, we conducted several tests to check whether the formula is stable over time using the past data.
Second, reporting patterns can differ across interview modes. As discussed in the literature review, for the S2S approach to work, households need to respond to questions in exactly the same way as they responded to the household survey used for developing the formula. But this is not necessarily the case. To minimize the reporting bias, we use CATI for telephone interviews because L2L and Croke et al. (2012) show that the reporting bias of CATI appears much smaller than that of SMS or IVR. Therefore, we select CATI in the fourth step.
The second step is also important to ensure the national representativeness of data collected by telephone interviews. Telephone ownership is still skewed toward the richer segment of the population in many developing countries. If we simply call households randomly, the likelihood of selecting richer households is much higher than their share of the population. If most of the poor do not have telephones, we might not get any information from the poor. As a result, any statistics from the telephone survey might not be representative at the national level. To avoid this, prior to the telephone interviews, we conduct a careful sampling exercise, select a nationally representative sample, and carry out a baseline survey with the selected households to collect their telephone numbers.
To reduce the high non-response rate, the third step is introduced based on recommendations from the staff of the Statistical Office of the Republic of Serbia (SORS). According to pilots in the L2L study, non-response rates could reach more than 50 percent, and non-response may not be random. As a result, even with post-enumeration adjustments, it is difficult to produce poverty and inequality statistics that are representative at the national level. However, surveys in the L2L study were carried out by the private sector, not a national statistics office, and households do not have any obligation to respond to such surveys.
The SORS argued that if a survey were conducted without an official letter of request to the selected households, the non-response rate would be very high in Serbia. For this reason, the SORS staff prepared and delivered an official letter explaining that this survey is part of the official data collection.
To test the performance of this proposed approach, a pilot was conducted in Serbia. This pilot focuses on (i) whether the S2S method works well with telephone interviews, or more specifically, whether responses to a questionnaire can differ between face-to-face interviews (F2F) and CATI, and (ii) how cost-effective the telephone data collection is. Serbia was chosen for this pilot because the SORS had an established telephone call center and extensive experience collecting data via telephone interviews, in particular for the Labor Force Survey, and the SORS showed interest in participating in this experiment.

Design of the Pilot
This section explains how the pilot was designed. It begins with how the formula was developed, followed by a description of how samples were selected for the F2F and CATI groups. It also describes the training of enumerators and other logistical arrangements. The results of this pilot are discussed in the next section.

Model Selection
In this pilot, consumption data are not collected, but are projected from 10 to 15 simple questions using a formula that is developed using an S2S imputation approach. The S2S approach assumes the following relationship between household expenditure and non-monetary indicators:
$$\ln y_{ch} = \alpha + X_{ch}'\beta + Z_c'\gamma + u_{ch},$$
where the dependent variable $\ln y_{ch}$ is (log) per capita consumption of household $h$ in location $c$, $X_{ch}$ is a vector of household explanatory variables, $Z_c$ is a vector of location-specific variables, $\beta$ and $\gamma$ are vectors of coefficients, and $\alpha$ is a constant. The stochastic error term $u_{ch}$ can be divided into a location-specific effect $\eta_c$ and a household-specific effect $\varepsilon_{ch}$ such that $u_{ch} = \eta_c + \varepsilon_{ch}$. Heteroskedasticity of the household-specific effect is allowed in this specification.
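As a hedged illustration of how a poverty rate can follow from such a model, the block below writes out one common simulation-based estimator. The notation is an assumption of this sketch rather than the paper's own: $\ln y_{ch}$ denotes log per capita consumption of household $h$ in location $c$, $X_{ch}$ and $Z_c$ the household and location variables, $\eta_c$ and $\varepsilon_{ch}$ the location and household error components, $z$ the poverty line, $w_{ch}$ a population weight, and $R$ the number of simulation draws.

```latex
% Imputed log consumption in simulation draw r, using estimated coefficients
% and simulated error components (tildes denote random draws):
\ln \hat{y}^{(r)}_{ch} = \hat{\alpha} + X_{ch}'\hat{\beta} + Z_c'\hat{\gamma}
                         + \tilde{\eta}^{(r)}_c + \tilde{\varepsilon}^{(r)}_{ch}

% Poverty headcount: weighted share below the poverty line z,
% averaged over the R draws:
\hat{P}_0 = \frac{1}{R} \sum_{r=1}^{R}
            \frac{\sum_{c,h} w_{ch}\, \mathbf{1}\!\left[\hat{y}^{(r)}_{ch} < z\right]}
                 {\sum_{c,h} w_{ch}}
```

Averaging over draws, rather than using a single deterministic prediction, keeps the spread of the error term in the imputed distribution, which matters because the headcount depends on the lower tail.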
The model selection involved an iterative process. At first, we tried the variables that are commonly available in the S2S imputation and poverty map literature. We included four categories for candidate explanatory variables: (1) demographics and education, (2) durable asset ownership, (3) current economic status (employment status, main income source, etc.), and (4) location variables (stratum fixed effects). However, the model with the four kinds of variables could not predict the change in poverty rates across time.
Next, we included variables that are more sensitive to changing economic conditions and are therefore candidates for picking up short-term changes in poverty rates. First, we added variables on recent purchases, that is, indicators of whether any household member purchased particular items, such as clothes and shoes, during the last three months. Second, we included subjective perception questions, such as whether monthly income satisfies the household's monthly needs and how the household's current situation compares with one year ago. Lastly, we included more employment-related variables based on all economically active household members, in addition to the household head's employment-related variables, in order to be able to explain changes in the economic situation. In the end, about 40 variables are used in the model, and statistically significant variables are selected using backward stepwise selection. The list of variables and coefficients from the identified prediction model using the 2009 Serbia Household Budget Survey (HBS) is shown in Table 1. Note that although around 40 variables are needed for the imputation model, as few as 21 questions are needed to construct these variables because several variables can be constructed from one question, as with location dummies.
The models were evaluated for their ability to impute poverty accurately. Each model was used to impute poverty in other HBS years, and the results were compared against the direct poverty estimates using the consumption data. Table 2 shows the official poverty rates observed from the data in the national, urban and rural areas. The national poverty rates increased from 2008 and 2009 to 2010, and the increase is statistically significant although the size of the increase is not large. Although urban poverty rates remained relatively stable over the three years, rural poverty increased noticeably, from 7.5% in 2008 to 13.6% in 2010.
The poverty rates that are within the 95% confidence interval of the official poverty rates are bolded.
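The evaluation criterion mentioned here, whether an imputed poverty rate falls inside the 95% confidence interval of the official rate, can be sketched as follows. The function name and the normal-approximation interval (1.96 standard errors on each side) are assumptions of this sketch; the source does not state how the intervals were computed.

```python
def within_official_ci(imputed_rate, official_rate, official_se, z=1.96):
    """True if the imputed rate lies inside the official rate's confidence interval.

    Uses a normal-approximation interval: official_rate +/- z * official_se.
    """
    lower = official_rate - z * official_se
    upper = official_rate + z * official_se
    return lower <= imputed_rate <= upper
```

A model whose imputed rates pass this check across several survey years offers some evidence that its coefficients are stable over time.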

Questionnaire Design and a Baseline Survey
The questionnaire used for this experiment was based on the variables in the final imputation models discussed above. The main questions that are needed to impute poverty are designed to be exactly the same for both the face-to-face interview (F2F) and CATI groups. For the CATI group, an additional survey, a baseline survey, was needed to collect telephone numbers and household identification information.
The baseline survey for the CATI group collects as many phone numbers (both landline and cell phone) as possible from the household to increase the chances of completing the follow-up interview.
The questionnaire included only three questions: (1) telephone number for either a land line or mobile phone, (2) name of the main user of the telephone number, and (3) relationship to respondent.
The main questionnaire administered to both the F2F and CATI groups comprised 21 questions covering the household roster (name, occupancy during last 12 months, sex, age, educational attainment, predominant activity, months worked during last 12 months), main source of household income, economic status and satisfaction, housing characteristics (water supply, sewerage, electricity, district heating, telephone, number of rooms), ownership of 12 different durable goods, and purchases of clothing and footwear in the last 3 months. The wording of the questions was kept the same as the original questions in the HBS to the extent possible. The F2F questionnaire includes the main questionnaire and the CATI pre-interview questionnaire. Screenshots of the main questionnaire are available in the Appendix.

Sampling and Balance Checks
Sample selection was conducted to yield two comparable groups, and the samples were subsequently checked for balance. A total of 120 enumeration blocks in Belgrade were selected.
In the first stage, 60 enumeration blocks, which can be called Primary Sampling Units (PSUs), were drawn with probability proportional to size (PPS) for urban and rural areas separately. In the second stage, 10 households were randomly selected from each PSU and assigned to the F2F and CATI groups. Also, an additional 10 households were selected randomly from each PSU as replacement. The large number of candidates for replacement was due to the relatively high non-response rate observed in other surveys in Serbia.
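The two-stage selection can be sketched in Python as below. Several details are assumptions of this sketch and not taken from the source: the function names are hypothetical, the first stage uses systematic PPS (the source does not say which PPS procedure was used), and the 10 selected households per PSU are split evenly between the F2F and CATI groups.

```python
import random
from itertools import accumulate

def pps_systematic(sizes, n_select, rng):
    """Select n_select unit indices with probability proportional to size.

    Systematic PPS: a random start, then equally spaced points along the
    cumulated sizes. A unit larger than the sampling step can be hit twice.
    """
    total = sum(sizes)
    step = total / n_select
    start = rng.random() * step  # random start in [0, step)
    cumulative = list(accumulate(sizes))
    selected, idx = [], 0
    for k in range(n_select):
        point = start + k * step
        while idx < len(cumulative) - 1 and cumulative[idx] <= point:
            idx += 1
        selected.append(idx)
    return selected

def assign_households(household_ids, rng, per_psu=10):
    """Randomly pick per_psu households in a PSU and split them into two groups."""
    chosen = rng.sample(household_ids, per_psu)  # already in random order
    half = per_psu // 2
    return {"F2F": chosen[:half], "CATI": chosen[half:]}

# Illustrative run with invented block sizes and household IDs
rng = random.Random(2016)
block_sizes = [50, 120, 80, 200, 150, 60]
psus = pps_systematic(block_sizes, 3, rng)
groups = assign_households(list(range(1, 51)), rng)
```

Because both groups are drawn from the same PSUs by the same random mechanism, differences between them should reflect the interview mode rather than where the households live.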
As an extra precaution, we checked the balance of the two randomly selected groups using the corresponding 2011 census data. Three household characteristics were checked for balance: (1) average household size, (2) highest education level (percentage of households), and (3) home ownership tenure status (percentage of households). If the groups were not balanced, the sampling would be redone until the differences in the distribution of household characteristics were statistically insignificant. Table 3 shows that the F2F and CATI samples were balanced ex ante in terms of the household size and highest education variables in the census. Table 4 shows the number of non-responses and replacements during the F2F survey and the CATI survey. Note that the CATI survey has two stages: a baseline survey and a telephone interview. According to the table, this pilot experienced relatively high non-response rates for both the F2F and CATI groups. The non-response rates in F2F and CATI (baseline survey) were 39.7 and 41.8 percent in urban areas, respectively, and 32.4 and 34 percent in rural areas, respectively (see Table 4). Households with non-responses were replaced with households from the replacement list to make sure that both the F2F and CATI groups would reach the planned sample size of 600 households.
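A balance check of this kind can be sketched as a two-sample comparison of means. The source does not state which test was used, so the following is only an assumption: a Welch-type statistic with a normal approximation, ignoring the clustering of households within PSUs. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def balance_test(group_a, group_b):
    """Difference in means, z statistic, and two-sided p-value.

    Normal approximation with unequal variances; reasonable for large
    samples, and it ignores the clustered survey design.
    """
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p_value
```

For a characteristic such as household size, a large p-value would be read as no evidence of imbalance between the F2F and CATI groups; percentages can be tested the same way by coding each household as 0 or 1.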
For the CATI survey, there is another round of data collection, a telephone survey, after the replacement of the sample. Around 10 percent of households that agreed to participate in the telephone survey did not respond to it. Given that we did not know the telephone numbers of households remaining on the replacement list, we could not replace the non-respondents at this stage.
These non-response rates are certainly not negligible, but it is worth noting that they are typical for Serbia and similar to rates observed for other household surveys, such as the HBS and LFS, where non-responses were also replaced. Further, the attrition rates in the telephone survey are also typical of Serbia's LFS telephone surveys, and much lower than those in the pilots of the L2L study.
In the L2L study, more than 60 percent of households in Peru and around 40 percent of households in Honduras did not respond to telephone interviews after they had agreed at the baseline survey to participate in the follow-up telephone surveys. More important, as long as the F2F and CATI samples are balanced after the replacement and attrition, this pilot can serve its objective. Indeed, Table 5 shows that these two samples were still balanced ex post. Although the F2F and CATI samples remained balanced, a comparison between the ex ante check (Table 3) and the ex post check (Table 5) shows that after the replacement and attrition, the summary statistics did not change much.

Assignment of enumerators and interviewer effects
Enumerators specialized in each interview mode conducted the interviews in their mode of expertise. SORS maintains separate groups of enumerators: one for F2F interviews, as in the HBS and the first round of the LFS, and one for CATI surveys, as in the second and later rounds of the LFS. In this way, each group can build its skills and experience in one interview mode.
A potential downside of this approach for this pilot was that differences in enumerators' skillsets could compromise the comparability of data between the F2F and CATI surveys. In other words, differences in the data could be attributed to differences in enumerators' skillsets. As an alternative, we considered pooling the F2F and CATI experts and assigning them to the F2F and CATI surveys at random. With random assignment, any difference in the resulting data between the F2F and CATI surveys could not be attributed to the enumerators' skillsets and experience.
However, the SORS team argued that using an enumerator for an interview mode with which he or she is not familiar can reduce the data quality significantly. This is an important consideration since the sample size is limited. Furthermore, in reality, SORS will never use experts of F2F surveys for CATI surveys, and vice versa. Therefore, the right comparison is between the F2F survey collected by F2F experts and the CATI survey collected by CATI experts. After all these considerations, the SORS and the World Bank teams agreed to assign the F2F experts to the F2F survey and the CATI experts to the CATI survey.

Training of enumerators
Interviewer training consisted of three blocks: (1) theoretical training for all interviewers, (2) practical training for field interviewers, and (3) practical training for telephone interviewers. The theoretical training explained the study objectives, how to apply the different questionnaires, and how to identify respondents. The Labor Force Survey (LFS) team explained administrative procedures and the questionnaire contents. The HBS team helped answer the interviewers' questions. The theoretical training ended with the completion of a demonstration vignette.
During the training for field interviewers, participants did a role-playing exercise, which consisted of asking the questions to another participant and filling in the questionnaire with the answers. Participants were then given a vignette and asked to fill in the questionnaire. The questionnaires were collected, marked and graded. Finally, participants were asked to fill in a participant feedback form to evaluate the training.
A separate training session was held for telephone interviewers. First, the LFS team presented the electronic version of the questionnaire, developed in the Blaise platform, with which participants were already familiar. Second, participants did a role-playing exercise, completed an evaluation, and filled in the participant feedback form.

Advance notification of interview visits
To maximize participation, this experiment followed the same procedure as other SORS surveys. In particular, SORS sent letters in advance notifying households of upcoming visits by enumerators. World Bank (2013) attributed the difference in attrition rates between Peru and Honduras to people's perception of the survey company. We therefore expected that this letter would create the impression that this experiment was part of the government's official data collection. As shown below, both non-response and attrition rates in this experiment were comparable to those of the Serbia LFS and HBS.

Survey logistics in F2F and CATI surveys
In the F2F survey and the baseline of the CATI survey, enumerators recorded all responses on paper questionnaires and sent them to the headquarters, where all data were entered into the computer. By contrast, in the telephone interview of the CATI survey, operators recorded all responses directly into a data entry program during the call. The same data entry software (Blaise) was used for entering data collected from both the F2F and CATI surveys. The data collection took place from June to July 2013.

Comparison of Variable Means
This subsection examines whether data collected by the F2F and CATI modes are similar. To do this, the means of the variables necessary for consumption projections are calculated in both the F2F and CATI data, and the differences in the means are tested for statistical significance using Pearson chi-square tests. Note that since the household size and education variables were balanced in the sample according to the census, any difference found here can be considered to be due to the difference in the survey modes. The results are shown in table 6.
Among the two variables that were balanced ex ante, we did not find a difference in household size, but we found a difference in the education variable between the two modes, especially in rural areas, although the difference is not very large. The mean household size is slightly larger in CATI than in F2F, but the difference is statistically insignificant. The educational attainments of household heads are also similar between the CATI and F2F data sets, and all but vocational school attainment in rural areas record statistically insignificant differences.
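For a binary variable such as ownership of a durable good, the test of equal proportions across the two modes can be sketched as a Pearson chi-square test on a 2x2 table. The sketch below is illustrative only: the function name is ours, the counts are hypothetical rather than taken from table 6, and no continuity correction is applied.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table[r][c] holds counts: rows are interview modes (F2F, CATI),
    columns are whether the household reports the item (yes, no).
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n
            stat += (table[r][c] - expected) ** 2 / expected
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 30 of 100 F2F households and 50 of 100 CATI
# households report owning the item.
stat, p = chi_square_2x2([[30, 70], [50, 50]])
```

With these made-up counts the statistic is about 8.33, which would be significant at the 1 percent level; identical proportions in the two rows give a statistic of zero.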
In labor and income variables, we did not find any statistically significant difference between the two surveys except for months worked. In demographic and purchase variables, we did not find any statistically significant difference. For dwelling and durable good ownership variables, we found statistically significant differences in five and two out of 18 items in urban and rural areas, respectively. 7 In urban areas, more CATI households than F2F households have district or local heating, freezers, and washing machines, but we found the opposite for satellite dishes. It is therefore not easy to find a pattern in the responses in this category. Lastly, for subjective variables, CATI households are more likely to feel worse off than F2F households in rural areas, but the difference is small.
The results of these tests need to be interpreted carefully because of the multiple comparison problem (Benjamini and Hochberg, 1995; Glennerster and Takavarasha, 2013, p366). For 30 variables, we conducted equality tests (e.g., t-test) 30 times using conventional statistical significance levels such as 1 and 5 percent. This approach is problematic because the more outcomes we test, the higher the probability of a false positive (i.e., a type I error, where the null hypothesis is incorrectly rejected).
To guard against such false positives when conducting multiple tests, we apply both the Bonferroni correction and the Benjamini-Hochberg procedure in addition to the standard t-test or Pearson's chi-squared test.
With the Bonferroni correction, the significance threshold for the tests is adjusted to α/m, where α is the significance level and m is the number of comparisons. As we make 30 comparisons, the threshold is 0.05 / 30 = 0.00167. Note that with this correction the likelihood of type I errors decreases at the expense of a greater likelihood of type II errors.
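As a small illustration of the Bonferroni threshold, the sketch below reproduces the 0.05 / 30 calculation and flags which of a set of p-values would survive the correction. The function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / m


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that is at or below alpha / m."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# The case in the text: 30 comparisons at the 5 percent level.
threshold = bonferroni_threshold(0.05, 30)

# Hypothetical p-values for three comparisons (threshold = 0.05 / 3).
flags = bonferroni_significant([0.001, 0.02, 0.3])
```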
The Benjamini-Hochberg procedure helps to control the false discovery rate. It involves ordering the p-values of the standard tests from lowest to highest and comparing each with the corresponding value of i * α/m, where i is the rank of the comparison and α and m are as above. Comparisons in which the p-value is less than i * α/m are considered statistically significant. More precisely, the procedure finds the largest i for which the p-value is at most i * α/m and treats that comparison and all comparisons with smaller p-values as significant. In this study, a false discovery rate of 5 percent is used.
Such adjustments are useful for this experiment because poverty projections are a sum of the regression variables weighted by regression coefficients. Even if one variable is collected differently between the F2F and CATI modes, that might not affect the projection results much. However, if many variables are systematically different, then the projection results can also differ. Table 7 summarizes the results. Column (1) shows the p-value for the difference between the mean response in the face-to-face interview and that in the telephone interview. The table is sorted by p-value in ascending order. In both the urban and rural sections, column (2) shows the Bonferroni-corrected threshold (α/m), column (3) shows the Benjamini-Hochberg index (iα/m), and column (4) shows whether the difference is significant under the Benjamini-Hochberg procedure. Since variables are sorted by p-value in ascending order, if the difference for one variable is statistically significant, then the differences for all variables listed above it, which have smaller p-values, are also statistically significant. The results show that, except for "district or local heating" and "the number of satellite dish" in the urban area, differences in all other variables as a group are not statistically significant. In other words, the responses from the two modes are not systematically different in general.
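The ordering-and-comparison step of the Benjamini-Hochberg procedure can be sketched as follows. This is a generic implementation of the standard step-up rule, not the code used for table 7, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans (in input order) marking the comparisons that are
    significant under the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the comparisons, ordered from lowest to highest p-value.
    order = sorted(range(m), key=lambda k: p_values[k])
    # Largest rank i (1-based) whose p-value is at most i * alpha / m.
    cutoff_rank = 0
    for rank, k in enumerate(order, start=1):
        if p_values[k] <= rank * alpha / m:
            cutoff_rank = rank
    # Every comparison ranked at or before the cutoff is significant.
    significant = [False] * m
    for rank, k in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[k] = True
    return significant


# Hypothetical p-values; with m = 4 the thresholds are
# 0.0125, 0.025, 0.0375 and 0.05.
flags = benjamini_hochberg([0.001, 0.03, 0.035, 0.5])
```

In this example the second-ranked p-value (0.03) exceeds its own threshold (0.025) but is still flagged, because the third-ranked p-value (0.035) falls below its threshold (0.0375). This is the sense in which significance at one rank carries over to all smaller p-values.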

Comparison of Predicted Poverty Rates
In this subsection, we examine whether data collected by the two interview modes show differences in predicted poverty rates. To do this, we used two different poverty lines: the official poverty line and 150,000 RSD in 2009 prices. The second line is significantly higher than the official poverty line, which allows us to check whether the results are robust to different levels of the poverty line. We conducted t-tests to check whether the differences between the poverty rates predicted by the two modes are statistically significant. All poverty rates were estimated using PovMap2 software, which follows Elbers, Lanjouw and Lanjouw's (2003) methodology.
The predicted poverty rates using the two modes are very close and the difference is statistically insignificant. Table 8 shows the predicted poverty rates using the data from the F2F and CATI modes in urban and rural areas. With the official poverty line, the predicted poverty rates are 4.5 percent by F2F and 3.3 percent by CATI in urban areas, and 7.4 percent by F2F and 8.2 percent by CATI in rural areas. In both cases, the differences in the predicted poverty rates between the interview modes are very small and statistically insignificant. Therefore, we can conclude that the two survey modes do not make a significant difference in terms of the predicted poverty rates.
The similarity in predictions between the two survey modes is not due to the low level of the poverty rate under the national poverty line, since the difference is still small and statistically insignificant with the second, higher poverty line. As shown in the table, the predicted poverty rates are 18.6 and 18.2 percent by F2F and CATI in urban areas and 28.7 and 31.5 percent in rural areas, respectively. The percentage point differences between the two survey modes remain very small relative to the level of poverty rates and are statistically insignificant. Thus, no difference in poverty estimates between the F2F and CATI surveys can be observed even with the higher poverty line.
To check that the results are not specific to the selected poverty lines, we compare poverty rates measured at different levels of the poverty line (figure 1). It is clear that poverty rates estimated from the F2F data are close to those from the CATI data for both urban and rural areas at any level of the poverty line.
In summary, even if differences in some variables between the F2F and CATI surveys are statistically significant, poverty rates estimated from both datasets are very close at any level of the poverty line. Therefore, we can conclude that interview modes did not cause any bias in poverty estimation based on an S2S method.
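To make the idea of a predicted poverty rate concrete, the sketch below imputes log consumption from household covariates and a fitted linear model, adds a simulated error term, and averages the share of households below the poverty line over many simulations. It is a deliberately simplified stand-in for the Elbers, Lanjouw and Lanjouw approach, which also draws the model coefficients and models location effects and heteroskedasticity. It is not what PovMap2 does internally, and all names and numbers are hypothetical.

```python
import math
import random


def predicted_poverty_rate(X, beta, sigma, poverty_line, n_sims=100, seed=0):
    """Average simulated headcount ratio.

    X: list of covariate rows (first element 1.0 for the intercept).
    beta: coefficients of a log-consumption regression (assumed given).
    sigma: standard deviation of the household-level error term.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sims):
        poor = 0
        for row in X:
            # Linear prediction of log consumption plus a simulated error.
            log_cons = sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma)
            if math.exp(log_cons) < poverty_line:
                poor += 1
        rates.append(poor / len(X))
    return sum(rates) / n_sims


# Hypothetical example: two households with predicted consumption of about
# 100 and 200 and a poverty line of 150; with sigma = 0 exactly one is poor.
X_demo = [[1.0, 0.0], [1.0, 1.0]]
beta_demo = [math.log(100.0), math.log(2.0)]
rate_demo = predicted_poverty_rate(X_demo, beta_demo, sigma=0.0, poverty_line=150.0)
```

Running the same function on the F2F and CATI covariates with a common set of coefficients, and comparing the two rates, mirrors the comparison reported in this subsection.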

Resource Considerations
The cost of implementing the survey experiment was relatively low. A total of $28,400 was used for the survey implementation, including recruitment and training of field enumerators and telephone operators, sample selection, piloting the questionnaire, programming of the data entry software for CATI, etc. An additional $30,000 was allocated for technical assistance to supervise the experiment, including consultant fees for a survey expert and trip costs.
In terms of time required to complete the interviews, the average length of the interviews ranged from 5 to 12 minutes. The median times for the interviews by area and interview mode are shown in table 9. For the F2F survey and CATI baseline, interviewers manually recorded the start and end times of the interviews. By contrast, the times for the telephone interview were recorded automatically by the CATI computer program. As each interview would require an introduction to explain the purpose of the survey and to identify the appropriate respondent, much of the CATI baseline and F2F survey interview time would be for these activities and not just for asking questions and recording responses. It should be noted that the sum of the median times for the CATI pre-interview and interview stages is similar to the time for the F2F interviews.
It is also worth noting that the total time needed for the survey implementation was short. The survey was in the field between late June and early July 2013. The final data was delivered by the end of July 2013. In total, this entire pilot was prepared and implemented in five months.

Caveats
While this experiment provides some evidence that differences between F2F and CATI are likely to be negligible, several other challenges exist for S2S imputation methods to succeed. One of the central assumptions with this approach is that the imputation models are stable across time; so one must be alert to potential shifts in the model coefficients. Models might change over time, particularly during a crisis and as the period between the model calibration and prediction year widens.
Furthermore, this approach is not a substitute for the collection of high quality consumption or income data. The full multi-module household surveys allow for more in-depth analysis of poverty and distributional analysis and are vital to expanding the evidence base to inform policy decisions. Also, to develop reliable and accurate imputation models, we need to have rich and reliable multi-topic household surveys. Therefore, it is critical for the S2S approach that high quality multi-topic household surveys are implemented every few years.
As for the role of telephone surveys, they can offer several advantages in terms of time, costs, and flexibility, but they also pose several challenges for sampling to obtain nationally representative estimates, particularly in the context of developing countries. First, telephone coverage is not 100 percent in most developing countries. If telephone signals do not reach some areas, populations in those areas will not be included in telephone surveys; as a result, the surveys are not fully nationally representative. This concern is particularly serious in rural areas of developing countries. To overcome this, we might need to carry out a household survey with mixed interview modes: F2F for areas without telephone signals and CATI for areas with telephone signals. In this way, we can maintain the representativeness of the data while taking advantage of the cost-effectiveness of telephone data collection. Needless to say, the findings of this paper are critical for this mixed interview approach.
Second, even in areas covered by telephone networks, poor households tend to be the last group to own telephones. Given that our main interest is in poverty estimation, this is a potentially serious pitfall. If poor households that do not own telephones are selected, it is necessary to provide them with cell phones to maintain the representativeness of the data.

Conclusion
The results of this experiment suggest that conducting phone interviews to collect non-consumption data to predict poverty using an S2S method would not lead to any systematic bias. Indeed, we found that differences in interview mode are not likely to lead to large differences in responses and, in turn, to bias in poverty projections. Although these results do not completely rule out the possibility of different response patterns between face-to-face and telephone interview modes, they provide some evidence that the combination of CATI with the S2S approach was successful.
Attrition rates are significantly lower than in previous telephone surveys. In urban areas, around 9.7 percent of sample households did not respond to telephone interviews after agreeing to participate in the telephone survey, while in rural areas, around 9 percent of sample households did not. These numbers are consistent with the official telephone surveys in Serbia, but much lower than the corresponding numbers reported in the L2L study. Such low rates are likely to be attributable to SORS's long experience of telephone data collection.
The sampling for this pilot was conducted very carefully and the non-response rates did not affect comparability of treatment and control groups much. Also, the key characteristics of households did not change much between the census data and the data collected by the pilot.
The cost of this experiment was relatively low, at a total of about $28,000. This included the cost of visiting 1,200 households in the field and conducting 600 phone interviews. The unit cost for this experiment was around $23 per interview (similar to L2L). If telephone numbers of households were already available, the cost of conducting the phone survey would be substantially lower.
As for implementation, the interviews were completed quickly, with an average interview time of about 5 minutes. Also, sending official letters in advance to notify selected households of a potential upcoming visit seemed to help contain non-response and attrition rates.
Other challenges still remain. Telephone coverage in a developing country is often still too limited to obtain nationally representative results, although it is rapidly expanding. Mixed mode data collection may be one possible solution in such contexts. The mixed mode data collection approach has proved to be effective in increasing response rates but is found to be vulnerable to reporting biases (see more in Dillman 2007). Further research will be necessary.
Finally, it is important to note that this pilot was conducted by SORS, which has long experience of telephone data collection and a well-established infrastructure for it, along with respondents who are used to it. There may well be unobserved attributes of SORS that helped this pilot succeed but that this paper failed to identify and describe properly, and these would matter when introducing this system in countries without such experience or infrastructure. Therefore, if this telephone data collection were to be carried out in other countries, it would be useful to review all steps carefully and consult SORS staff to make sure all fundamentals for the telephone data collection are properly satisfied.
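The unit cost can be checked with simple arithmetic. The sketch below uses the rounded total stated in this section and shows two possible denominators; reading the figure of around $23 as cost per household visited is our interpretation, not something the text states explicitly.

```python
total_cost_usd = 28_000      # rounded total stated in the text
households_visited = 1_200   # field visits (assumed: F2F plus CATI baseline)
phone_interviews = 600       # telephone interviews

# Cost per household visited in the field.
cost_per_household_visited = total_cost_usd / households_visited

# Cost per interview if field visits and phone interviews are counted together.
cost_per_interview_all = total_cost_usd / (households_visited + phone_interviews)
```

Dividing by the 1,200 households visited gives about $23.3, which matches the reported unit cost; counting all 1,800 interviews would give about $15.6.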