Missing(ness) in Action: Selectivity Bias in GPS-Based Land Area Measurements

Land area is a fundamental component of agricultural statistics, and of analyses undertaken by agricultural economists. While household surveys in developing countries have traditionally relied on farmers' own, potentially error-prone, land area assessments, the availability of affordable and reliable Global Positioning System (GPS) units has made GPS-based area measurement a practical alternative. Nonetheless, in an attempt to reduce costs, keep interview durations within reasonable limits, and avoid the difficulty of asking respondents to accompany interviewers to distant plots, survey implementing agencies typically require interviewers to record GPS-based area measurements only for plots within a given radius of dwelling locations. It is, therefore, common for as much as a third of the sample plots not to be measured, and research has not shed light on the possible selection bias in analyses relying on partial data due to gaps in GPS-based area measures. This paper explores the patterns of missingness in GPS-based plot areas, and investigates their implications for land productivity estimates and the inverse scale-land productivity relationship. Using Multiple Imputation (MI) to predict missing GPS-based plot areas in nationally-representative survey data from Uganda and Tanzania, the paper highlights the potential of MI in reliably simulating the missing data, and confirms the existence of an inverse scale-land productivity relationship, which is strengthened by using the complete, multiply-imputed dataset. The study demonstrates the usefulness of judiciously reconstructed GPS-based areas in alleviating concerns over potential measurement error in farmer-reported areas, and with regards to systematic bias in plot selection for GPS-based area measurement.


Policy Research Working Paper 6490
This paper is a product of the Poverty and Inequality Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at tkilic@worldbank.org.

INTRODUCTION
In Ancient Egypt, dating back to the Old Kingdom (2686 BC-2181 BC), the parcels along the Nile were allocated by the King to individuals, who were in turn recorded in the land administration system and were taxed for their land. However, the shape and the boundaries of the land on the banks of the Great River were often subject to change as a result of the annual flooding. This dynamic required the deployment of land surveyors, known as rope stretchers, in order to re-measure the land, replace the inscriptions that marked the boundaries, and resolve potential disputes between the neighbors so that the taxes could be determined accurately. The measuring devices used by these surveyors included an A-frame shaped level with a plumb bob, and a rope knotted at intervals (Lyons, 1927; Berger, 1934).
Land area measurement remains a fundamental component of agricultural statistics,² and of analyses undertaken by agricultural economists, including the large body of policy-relevant empirical work testing the existence of an inverse scale-land productivity relationship (see Carletto et al., 2013 and the references cited therein). The traditional tools of land surveyors, such as compass and rope, dating back hundreds of years, have remained in the choice set of agricultural statisticians for measuring land areas accurately. These conventional approaches are, however, neither time- nor cost-effective in the context of large data collection efforts, which often record only farmers' own assessments of land areas. The availability of affordable, portable and relatively reliable Global Positioning System (GPS) devices over the last two decades has made GPS-based land area measurement a practical alternative that is increasingly being applied in surveys worldwide.³ While the use of GPS technology in area measurement is the way of the future and there is increasing evidence of the benefits of utilizing GPS units as part of household survey operations in Africa (Dorward and Chirwa, 2010; Kelly and Donovan, 2008), it is important, in these relatively early stages of its application, to assess the advantages it can provide over other methods as well as the potential drawbacks associated with its use. This matters particularly in low-income countries, where the numeracy of respondents is limited, "non-standard" measurement units of land are commonplace, cadastral records are nonexistent, and the financial and human resources of national statistical offices are severely limited.
A key issue on which little analysis has been conducted is the potential selection bias associated with GPS-based plot area estimates. For a variety of reasons connected to (i) reducing survey costs, (ii) keeping household interview durations within reasonable limits, and (iii) the difficulty of asking respondents to accompany enumerators to agricultural plots that are situated far from dwelling locations, survey fieldwork protocols often require enumerators to record GPS-based area measurements only for plots that are within a given radius of the dwelling location. Plots that are located within an acceptable distance from the dwelling may still not be measured due to respondent refusal or lack of physical access at the time of the interview. Consequently, GPS-based plot area measurements, which are still recorded alongside farmers' own assessments in agricultural and household surveys, present large gaps and a score of missing land area values that can be catalogued as "item non-response." It is not uncommon to have missing GPS-based area values for as much as a third of the plots owned and/or cultivated by sample households, thus limiting the analytical value of the GPS-based data collection effort, to the point that one could even question whether it is worth collecting such data. The potential selection bias in GPS-based plot area estimates is most recently recognized but not addressed by Carletto et al. (2013), who explore the empirical implications associated with the choice of farm size measures based on farmer-reported vs. GPS-based plot areas, and document a stronger inverse farm size-land productivity relationship when using the GPS-based farm size measure.
The gaps in GPS-based plot area data are reminiscent of similar patterns of item non-response in other statistical series, such as income data in OECD countries. In the US, for instance, the problem is common in some of the Census Bureau series (Scheuren, 2005) and in the National Health Interview Survey (NHIS; Schenker et al., 2006), but also in agricultural and forestry surveys such as the Agricultural Resource Management Survey (ARMS; Ahearn et al., 2011) and the National Survey on Recreation and the Environment (Zarnoch et al., 2010). Other examples include the Labor Force Survey of the Municipality of Florence in Italy (Giusti and Little, 2011), and the Labor Force Survey in South Africa (Vermaak, 2012). What several of these experiences have in common is the systematic approach to dealing with missing data as datasets are released in the public domain and analyzed for research and policy purposes. Increasingly, the technique of choice for dealing with missing data in public use datasets is "multiple imputation" (Rubin, 1987), which, provided that the assumptions regarding the missing data mechanism hold, allows for valid inference on a restored "complete" dataset, while taking into account the uncertainty associated with the imputation process itself. As the drive towards open data advances in the developing world, methodical approaches to reliably address structural gaps in critical data, such as GPS-based land areas, are likely to be of increasing interest to national statistical agencies, their international partners, and data analysts.
Our study (i) documents the way in which the process of selecting plots for GPS-based land area measurement in accordance with an arbitrary distance criterion results in a non-random sample of plots with GPS-based area measures, (ii) multiply imputes missing GPS-based plot areas in two recent, public use household survey datasets from Uganda and Tanzania, and (iii) presents evidence, in the context of land productivity estimates and the inverse scale-land productivity relationship, on why dealing with missing data in a statistically valid fashion matters at the analysis stage. We find that the missing GPS-based plot areas can be reliably simulated by MI, and that the effects of the imputation on key agricultural statistics are non-trivial. The empirical application of the method to the estimation of the scale-productivity relationship confirms and strengthens the presence of such a relationship in the two African countries of interest. Overall, our study demonstrates the usefulness of judiciously reconstructed GPS-based areas in alleviating concerns over potential measurement error in farmer-reported area assessments, and with regard to systematic bias in plot selection for GPS-based area measurement.
The paper is organized as follows. Section 2 provides a literature review on (i) the use of GPS-based land area measurement in agricultural productivity analysis, and (ii) multiple imputation. Section 3 describes the data and provides summary statistics. Section 4 details the multiple imputation approach, followed by an empirical application focused on the inverse scale-land productivity relationship presented in Section 5. Section 6 concludes.

LITERATURE REVIEW
This paper falls at the intersection of two different strands of literature. The first strand relates to the improvements in methodological tools available to agricultural statisticians in the area of land measurement. The second concerns the advances in statistical procedures to deal with missing data.
Land area measurement is one of the fundamental components of agricultural statistics, which have traditionally been fraught with large measurement errors. It is therefore not surprising that the interest of agricultural statisticians in applying technological innovations such as satellite imagery and GPS technology to land area measurement is growing rapidly with the increasing affordability, precision and applications of these tools. Kelly et al. (1995) have long identified the use of GPS technology as having the potential to render land area measurement a significantly less costly and time-consuming exercise than traditional methods such as rope and compass. An early study mentioning a comparison of GPS-based and farmer-reported plot area measures is Goldstein and Udry (1999). The authors touch upon the issue as part of a study on agrarian innovation in the Eastern Region of Ghana, reporting a correlation coefficient of only 0.15 between the two variables. They attempt to explain the finding by noting that traditional regional land area measures are based only on length (i.e. ropes), and that farmers struggle to think in terms of a two-dimensional measure (i.e. hectares) with increasing land scarcity. Keita and Carfagna (2009) provide a discussion on the performance of different GPS devices in comparison to the rope and compass alternative, which they assume to be the 'gold standard' for land area measurement. The authors document that while GPS technology allows for the measurement of 80 percent of the sample plot areas with negligible error, it tends, on average, to underestimate plot areas marginally with respect to the gold standard.⁴ Possibly the most important problem with relying on GPS-based area measures, however, relates to the inability to obtain these measures for all agricultural plots owned and/or cultivated by sample households. This inability typically stems from field logistics- and cost-related considerations.
Some plots may be distant from the place of the interview (usually the sample household's dwelling), and respondents may not have the time or willingness to accompany the enumerators to these locations. Even if this constraint does not hold, the operational costs, in terms of time and financial resources, required to record GPS-based area measures for all sample plots are often perceived as prohibitive in most survey operations. This raises analytical concerns, in the form of biased estimates, if the plots that are not measured are systematically different from the ones whose areas are captured with GPS technology. Carletto et al. (2013) use the data from the Uganda National Household Survey (UNHS) 2005/06 to document (i) the determinants of the discrepancy in land areas when using GPS-based versus farmer-reported assessments, and (ii) how the discrepancy varies systematically with plot and farm size.⁵ Although the primary focus of their study is to document whether the use of farm size measures based on GPS-based versus farmer-reported plot areas weakens, reverses or strengthens the inverse farm size-productivity relationship, they are not able to address the possible selection bias stemming from restricting their sample to households/farms that do not have any plots with missing GPS-based area measures (i.e. discarding approximately 50 percent of the agricultural household sample). Our study explicitly deals with this problem by employing approaches developed over the years by statisticians to overcome issues that are conceptually and practically similar in nature.
Since most datasets feature missing values for at least some variables, it is not surprising that finding new ways of dealing with missing data has attracted the attention of statisticians and applied economists for a long time, and in a number of different contexts. Unless the appropriate assumption regarding why the data are missing in the first place holds, any imputation method runs the risk of understating the true variance in the data, and leading to biased hypothesis testing and parameter estimates. The appropriateness of an imputation approach to fill in missing data, therefore, depends on (i) the nature of the missing data mechanism, (ii) whether the missing data mechanism could be ignored, and (iii) under what conditions.
The data are missing completely at random (MCAR) if the missingness depends on neither observed nor unobserved variables. The missing data on a given variable thus constitute a simple random sample of that variable. By ignoring missing data, as in casewise deletion, researchers implicitly assume that the data are MCAR. If the assumption holds and the researcher opts not to address missingness, the resulting parameter estimates are not biased but have larger standard errors stemming from the smaller sample size. If the MCAR assumption is not tenable, however, the available data are not representative of the population of interest and the resulting parameters are biased.
If the missingness depends on observed but not on unobserved factors, the data are missing at random (MAR), and could be predicted based on observed data underlying the missingness. If the MAR assumption holds and researchers undertake casewise deletion, the resulting parameter estimates are associated with larger standard errors and bias. If the MAR assumption is tenable but analysts conduct random or conditional mean imputation within an arbitrary imputation class that does not appropriately represent the observable attributes underlying the missingness, the subsequent parameter estimates are biased with understated standard errors (Robbins and White, 2011).
Finally, if the missingness depends on both observed and unobserved data, the data are missing not at random (MNAR). Since unobserved data that predict missingness are non-ignorable under MNAR, relying only on observable covariates to simulate missing data yields biased inferences on parameters of interest. The resulting bias is not tractable unless additional information from outside the survey can be used to take into account unobserved heterogeneity that predicts missingness (Scheuren, 2005; Giusti and Little, 2011). Graham et al. (1997) argue that sound ways of dealing partially with missing data are better than doing nothing, and that the impact of the non-random component of the data on statistical inference is often smaller than is commonly thought. In practice, it is close to impossible to determine whether missing data belong to the MNAR or MAR type, precisely because the missing data are, by definition, not observed. Scheuren (2005) provides an interesting discussion on the prevalence of the different types of missing data in practice, and on the advantages and disadvantages of different approaches to dealing (or not dealing) with them. In empirical analyses, imputation procedures are often applied based on the assumption that the data are MAR, sometimes accompanied by sensitivity analysis aimed at gauging the impact that departure from the MAR hypothesis may have on the analysis (see Giusti and Little (2011) for an example).
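The practical difference between MCAR and MAR can be illustrated with a small simulation. The sketch below uses purely synthetic data and is not drawn from the surveys analyzed in this paper: the variable names, the assumed positive link between plot area and distance, and the missingness probabilities are all illustrative assumptions. It compares the mean plot area under casewise deletion when missingness is unrelated to anything (MCAR) and when it rises with an observed covariate (MAR), and then shows that a prediction based on that observed covariate recovers the full-sample mean in the MAR case.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic plots: distance is observed for every plot, area only for some.
# The positive link between distance and area is an assumption for illustration.
plots = []
for _ in range(20000):
    distance = rng.uniform(0.0, 10.0)
    area = 1.0 + 0.3 * distance + rng.gauss(0.0, 0.5)
    plots.append((distance, area))

true_mean = statistics.fmean(a for _, a in plots)


def observed_subset(p_missing):
    """Keep each plot with probability 1 - p_missing(distance)."""
    return [(d, a) for d, a in plots if rng.random() >= p_missing(d)]


def ols_fit(x, y):
    """Closed-form simple OLS; returns (intercept, slope)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


mcar = observed_subset(lambda d: 0.30)      # MCAR: flat 30 percent missing
mar = observed_subset(lambda d: 0.06 * d)   # MAR: missingness rises with distance

# Casewise deletion: unbiased under MCAR, biased under MAR.
mcar_mean = statistics.fmean(a for _, a in mcar)
mar_mean = statistics.fmean(a for _, a in mar)

# Conditioning on the observed driver of missingness removes the bias under MAR.
intercept, slope = ols_fit([d for d, _ in mar], [a for _, a in mar])
mar_adjusted_mean = statistics.fmean(intercept + slope * d for d, _ in plots)

print(round(true_mean, 3), round(mcar_mean, 3),
      round(mar_mean, 3), round(mar_adjusted_mean, 3))
```

Under these assumptions the casewise-deletion mean is close to the full-sample mean in the MCAR case and noticeably lower in the MAR case, because distant (and here larger) plots are under-represented among the observed ones. The same correction would not be available under MNAR, where the driver of missingness is itself unobserved.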
The conditions under which valid inferences can be obtained from missing data are laid out in Rubin's (1987) seminal work on multiple imputation (MI), which is a Monte Carlo technique that replaces missing values for a given variable by m>1 simulated alternatives. In repeated imputation inference, each of the m imputed datasets is analyzed separately, and the results are combined so that the uncertainty associated with missing data is incorporated into the computation of estimates and confidence intervals. While MI had originally been deemed a viable strategy to disseminate complete datasets from sample surveys and censuses, it has evolved to be part of the toolkit of a diverse array of researchers from biomedical, social and behavioral sciences, whose analyses may otherwise be hindered by missing data.
MI assumes the missing data to be MAR, and consists of three steps: (i) m imputations (i.e. m complete datasets) are generated based on an imputation model that encompasses a vector of observable covariates that predict the missingness in a given variable, (ii) an analytical model is estimated separately with each of the m complete datasets, and (iii) the results obtained from m complete data analyses are combined into a single set of multiply-imputed parameter estimates and standard errors, in accordance with Rubin (1987). Multiply-imputed values are chosen to represent "both uncertainty about which values to impute assuming the reasons for nonresponse are known and uncertainty about the reasons for nonresponse" (Rubin, 1988: 79). MI also carries the uncertainty associated with the imputation process itself into the analysis, by reflecting the variability across imputations in the statistical inference.
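Step (iii) can be made concrete with a minimal sketch of the standard combining rules from Rubin (1987): the pooled estimate is the average of the m completed-data estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the numerical inputs below are hypothetical and serve only to illustrate the arithmetic; they are not results from this paper.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between   # total variance
    return q_bar, math.sqrt(total)


# Hypothetical coefficients and standard errors from m = 5 completed datasets.
coefs = [-0.30, -0.34, -0.28, -0.33, -0.25]
ses = [0.05, 0.05, 0.06, 0.05, 0.06]

pooled_coef, pooled_se = pool_rubin(coefs, [s ** 2 for s in ses])
print(pooled_coef, pooled_se)
```

In this illustration the pooled standard error exceeds each of the completed-data standard errors, because the between-imputation term adds the uncertainty due to the missing values on top of ordinary sampling variability.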
A typical example in the literature that is relevant for our discussion of missing GPS-based land areas is missing income data in household surveys. Respondents often refuse to report personal/household income, and the likelihood of refusal tends to increase with wealth. The analogy with GPS-based land areas is clear: the larger the variable of interest (i.e. income, land area), the higher the probability that the data may be missing. Recent empirical work relying on MI to deal with missingness in income data clearly shows that the mean estimate and the associated standard error for the imputed variable are affected by the choice of the imputation method, particularly when the proportion of missing values is large (Schenker et al., 2006; Zarnoch et al., 2010; Giusti and Little, 2011; Ahearn et al., 2011; Vermaak, 2012). These studies also face a nonresponse rate similar in magnitude to ours, at around 30 percent. In what follows, we take a similar approach, and multiply impute missing GPS-based plot area measures in recent household survey data from Uganda and Tanzania. A further desirable feature of the survey data from both settings is that we have the farmer-reported plot area for all plots, which is an alternative measure of the variable we will be imputing and a luxury that no other study featuring MI to simulate missing data has enjoyed. Following MI, we utilize the multiply-imputed dataset in the study of the relationship between agricultural productivity and cultivated plot area, and present the resulting parameter estimates alongside those that are obtained under casewise deletion.
The use of the multiply-imputed dataset in this empirical application is particularly pertinent since it has been claimed that the long-standing debate on the inverse relationship between farm size and land productivity may rest on a statistical artefact tied to measurement error in land area estimates (Lamb, 2003). Barrett et al. (2010) provide a similar explanation after failing to explain the inverse relationship otherwise. Carletto et al. (2013), on the other hand, demonstrate that the inverse relationship is strengthened by using the aggregated farm size measure based on GPS-based vis-à-vis farmer-reported plot areas. Their GPS-based farm size measure, however, suffers heavily from missingness, as they exclude all farming households that had a missing GPS-based area for any of the parcels reported to be cultivated. Their inferences could therefore be inefficient and biased. Our study contributes to this debate by testing the robustness of the observed relationship in an analysis of the complete dataset featuring multiply-imputed GPS-based plot areas.

DATA
In terms of questionnaire design, the Tanzania National Panel Survey (TZNPS) and the Uganda National Panel Survey (UNPS) share a number of similar features. In each setting, all sample households were administered a multi-topic household questionnaire. The households that were involved in agricultural activities (through ownership and/or cultivation of land and/or ownership of livestock) were also given an agriculture questionnaire. On annual/temporary crop production, the same set of modules was administered separately for the long and short rainy seasons in a single visit in Tanzania, and for the two main agricultural seasons in two different visits in the case of Uganda. At the plot level, the questionnaires solicited detailed information on land area, physical characteristics, labor and non-labor input use, cultivation, and production.
Both surveys relied on mobile survey teams, each of which was headed by a team leader and was composed of 3-4 enumerators and a data entry operator. In Tanzania, as long as respondent refusal or physical inaccessibility (due to inclement weather or unfavorable road conditions) were non-issues, the field protocol required enumerators to record GPS-based area measurements for agricultural plots that were owned and/or cultivated by sample farming households and situated within one hour of travel from the dwelling unit, regardless of mode of transportation. Similarly, the field protocol in Uganda required enumerators to take GPS-based plot area measurements as long as the plots were confirmed to be located within the sample enumeration area.
In both settings, the GPS-based plot area measurements were taken following the interview, with the help of respondents who accompanied enumerators to plot locations and identified the plot boundaries. The enumerators were instructed to report non-measured plots and the associated reasons to their team supervisors. Each enumerator was assigned a handheld GPS unit (Garmin eTrex HC in Tanzania, and Garmin 12 in Uganda) and was trained extensively on GPS-based area measurement during the field staff training prior to the start of the field work. For a given plot, the enumerator was supposed to walk the perimeter clockwise with the GPS unit active and at a reasonably slow speed, stopping for 10 seconds at every point where the plot outline changed direction. The area of each plot was calculated directly by the GPS unit in acres.
Our field experience and interactions with the survey teams in both settings, as well as the quantitative evidence on the reasons for not obtaining GPS-based plot areas, suggest that the missing data are overwhelmingly due to the aforementioned distance criteria for the selection of plots for GPS-based area measurement. Refusal and physical inaccessibility constitute a near-negligible share of reasons for the lack of GPS-based area measures in each setting, in line with the high response rates observed in rural East African settings.
Tables 1 and 2 present summary statistics based on the UNPS and TZNPS, respectively, and report the results from the tests of average differences by GPS-based plot area measurement status. Given the focus of the empirical application on agricultural productivity, we focus on cultivated plots in both samples. A number of noteworthy findings emerge. First, the number of plots lacking a GPS-based area measure is relatively large, with 1,519 and 759 missing data points accounting for 35 and 18 percent of the plot samples in Uganda and Tanzania, respectively. Second, the GPS-based and farmer-reported estimates of average plot area are similar, at just over 2 acres in both countries. Average farmer-reported area is slightly larger for plots that were not measured, but the differences are not statistically significant. Third, several important plot- and household-level characteristics, which are expected to be associated with productivity-related outcomes, display statistically significant differences by GPS-based plot area measurement status. Taken together, these observations highlight the importance of systematically addressing missingness in GPS-based plot areas, if such GPS data are to be used in a robust fashion.
Plots without a GPS-based area measure are clearly non-random picks and they differ from their GPS-measured counterparts in similar ways in both Uganda and Tanzania. In line with the expectations informed by the field work protocols, non-measured plots tend to be further away from the household location. They are additionally more likely to be rented and to lack desirable features in terms of soil quality and slope. Plots with a GPS-based area measure originate, on average, from households that are headed by older individuals who are more likely to report agriculture as their primary occupation. In Uganda, heads of households associated with plots with a GPS-based area measure are more likely to be female, while their counterparts in Tanzania have, on average, fewer years of education. Lastly, non-measured plots in both settings are more likely to stem from (intact) mover original and split-off households, as opposed to non-mover households re-interviewed in the TZNPS 2010/11 and the UNPS 2009/10. Given the additional workload associated with tracking these households in their new locations, and the possibility that households moving into distant areas might still keep their plots in their original locations, which in turn may not be GPS-measured in accordance with the survey protocol, this finding is also in line with our expectations.
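Comparisons of this kind, between plots with and without a GPS-based area measure, can be illustrated with a two-sample test of means. The sketch below is not the procedure behind Tables 1 and 2, which may account for the complex survey design and sampling weights; it computes an unweighted Welch t-statistic on hypothetical values, and the function name and inputs are assumptions made for illustration.

```python
import math
import statistics


def welch_t(measured, unmeasured):
    """Unweighted Welch t-statistic and approximate degrees of freedom."""
    v1 = statistics.variance(measured) / len(measured)
    v2 = statistics.variance(unmeasured) / len(unmeasured)
    diff = statistics.fmean(measured) - statistics.fmean(unmeasured)
    t = diff / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (
        v1 ** 2 / (len(measured) - 1) + v2 ** 2 / (len(unmeasured) - 1)
    )
    return t, df


# Hypothetical distances from the dwelling for measured and non-measured plots.
near = [1, 2, 3, 4, 5]
far = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(near, far)
print(t_stat, dof)
```

A negative statistic here indicates that the measured plots in this toy example lie closer to the dwelling on average than the non-measured ones.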
The descriptive findings⁸ reported thus far confirm our initial suspicion that the GPS-based plot area measures cannot be considered MCAR. They indicate that there are distinct observable attributes associated with missingness, and that the results of statistical analyses based on the measured plots alone may be not only inefficient (due to the smaller sample size) but also biased (due to the non-random missing pattern). Combining these findings with the field insights that the missing data are overwhelmingly driven by the survey protocols, which excluded plots from GPS-based area measurement based on distance criteria, we hypothesize that the data are to a large extent MAR, and that MI could be applied in a statistically valid fashion to predict missing GPS-based plot areas based on observables.
Support for this hypothesis is also rooted in (i) the high degree of correlation between farmer-reported and GPS-based plot areas, which will be central to the specification of the imputation model discussed in the subsequent section, and (ii) the fact that even if the data are partly MNAR, MI will still reduce the bias in comparison to casewise deletion or conditional mean imputation within an arbitrary imputation class (Graham et al., 1997).

MULTIPLE IMPUTATION OF GPS-BASED PLOT AREAS⁹
In building the imputation model, the literature (Rubin, 1996; van Buuren et al., 1999) advises including as explanatory variables: (i) the variables appearing in the analysis model that features the multiply-imputed variable(s), (ii) the variables that are known to have influenced the occurrence of missing data, and other variables for which the distributions differ between the response and non-response groups, (iii) the variables that explain a considerable amount of variance of the multiply-imputed variable(s) and that help to reduce the uncertainty of the imputations, and (iv) the variables with information on the features of the complex survey design, including stratum and cluster identifiers, and sampling weights.
Our imputation model follows the aforementioned guidelines, and includes the variables that relate to missingness, and all variables featured in the empirical application presented in Section 5, including sampling weights, and enumerator and district fixed effects. A key covariate that is included in the imputation model and that is both a powerful predictor and an alternative measure of the GPS-based plot area is the farmer-reported plot area. The availability of this variable distinguishes our study from other studies that have employed MI to tackle item non-response. The availability of the farmer-reported plot area allows us to overcome what could otherwise be a limitation of this application, namely our inability to rely on a panel of plots. In the literature, the multiple imputation procedure is most often applied to longitudinal data, where variables from different survey rounds can be strong predictors of missing observations, and it is thought to be at "its weakest in cross-section surveys. In cross-section surveys, seldom are there strong predictors present" (Scheuren, 2005: 318). We posit that the presence of the self-reported land area makes our dataset an exception to this rule.¹⁰ Other independent variables in the imputation model include the distance of the plot from the household in minutes (Uganda) or kilometers (Tanzania). As noted earlier, these covariates are major predictors of missingness in accordance with the survey protocols. The variables on plot tenure status and farmer-assessed plot slope and soil quality are incorporated since, besides being control variables in the analytical model, they are possible predictors of missingness as well. The imputation model also includes variables on household characteristics related to (i) demographics (household size, dependency ratio, and the age, sex and education of the head of household), and (ii) whether the household is an (intact) mover original or a split-off household.
The inclusion of the latter set of variables attempts to accommodate the possibility that the additional workload associated with tracking shifted households and individuals in their new locations may increase the likelihood of missing data. The demographics are deemed to influence the accuracy of the self-reporting, as well as being likely control variables for the analysis. Furthermore, the inclusion of enumerator fixed effects in the imputation model accounts for unobserved heterogeneity across enumerators, such as differences in effort, which might be correlated with the missingness in GPS-based plot area measures. Lastly, we use information from earlier survey rounds on household wealth to accommodate the possibility of higher likelihood of missing plot area measurements among richer households with higher opportunity costs of time.
The first step in our multiple imputation effort is to fit a plot-level Ordinary Least Squares (OLS) regression model with the GPS-based plot area as the dependent variable, and then to obtain linear predictions for all plots in the dataset. Under the partially parametric method of predictive mean matching (PMM) (Rubin, 1987; Little, 1988; Schenker and Taylor, 1996), we use the linear prediction as a distance measure to form a set of 5 nearest neighbors chosen from the plot sample with GPS-based area measures, and randomly pick one of the neighbors, whose observed GPS-based plot area value replaces the missing value for the incomplete case at hand.¹¹ The imputation is carried out 50 times to reduce the potential sampling error due to imputation, and 50 complete datasets are generated.¹² The posterior estimates of the model parameters are obtained using sampling with replacement, which is standard practice when the asymptotic normality of parameter estimates is suspect.¹³ By drawing from the observed data, PMM preserves the distribution of observed values in the missing part of the data, which makes it more robust than the fully parametric linear regression approach.

¹⁰ We cannot apply a panel data model because of issues with the survey design that do not allow matching plots across survey years. It is likely that future surveys in the LSMS-ISA program will allow us to overcome this limitation. Another aforementioned limitation in the case of the TZNPS is that only 25% of the plots were measured in the most recent survey round prior to the 2010/11 wave that informs our analysis.

¹¹ The results are robust to using linear regression, as opposed to PMM. The number of nearest neighbors in the PMM framework is inversely related to the correlation among imputations. While high correlation may increase the variability in MI point estimates, low correlation may increase the bias in MI point estimates. The literature does not provide definitive guidance on the decision regarding the number of nearest neighbors, but the results are robust to the specification of ten nearest neighbors, with or without bootstrapping.

¹² The results are robust to carrying out 100, 150, 200, or 250 imputations.

¹³ The results are robust to sampling estimates from the posterior distribution of model parameters, as opposed to bootstrapping.

For the analysis of the multiply-imputed data, we follow Rubin (1987). Suppose that there are $m$ imputations, and therefore $m$ completed datasets, and let $Q$ denote a scalar population parameter of interest. Estimation of the analysis model with the complete data from multiply-imputed dataset $i$ yields the point estimate $\tilde{Q}_i$ and its estimated variance $U_i$, where $i = 1, 2, \ldots, m$.

The overall multiple imputation estimate is defined as:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \tilde{Q}_i$$

The variance of $\bar{Q}$ has two components. The average within-imputation variance is:

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i$$

The between-imputation variance is:

$$B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\tilde{Q}_i - \bar{Q}\right)^2$$

The total variance is then specified as:

$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$$

The Student t approximation is used for constructing confidence intervals and significance tests, where:

$$\frac{Q - \bar{Q}}{\sqrt{T}} \sim t_{\nu}$$

with degrees of freedom:

$$\nu = (m - 1) \left[1 + \frac{\bar{U}}{\left(1 + \frac{1}{m}\right) B}\right]^2$$

Finally, the fraction of information about $Q$ that is missing due to nonresponse is:

$$\lambda = \frac{r + 2/(\nu + 3)}{r + 1}, \qquad r = \frac{\left(1 + \frac{1}{m}\right) B}{\bar{U}}$$

The results from an illustrative OLS regression that would have underlain MI in the absence of the bootstrapping option for the imputation model are reported in Tables 3 and 4. The models perform well in explaining the variance in the GPS-based plot area, with an R-squared of 0.658 in Uganda and 0.688 in Tanzania. The coefficients on the farmer-reported plot area are highly significant and large, with a similar order of magnitude (0.945 in Uganda and 0.866 in Tanzania). It is worth emphasizing that the imputation model neither intends to provide a parsimonious description of the data nor attempts to portray structural relationships among variables (Schafer and Graham, 2002). Instead, it attempts to be as comprehensive as possible in order to minimize any bias that could stem from omitting variables that might be relevant to the pattern of missingness or the subsequent analysis. "The possible lost precision when including unimportant predictors is usually viewed as a relatively small price to pay for the general validity of analyses of the resultant multiply-imputed database." (Rubin, 1996: 479)

Table 5 provides, separately for the UNPS 2009/10 and the TZNPS 2010/11, the summary statistics on (i) GPS-based and farmer-reported plot area for plots with an observed GPS-based area measure, and (ii) multiply-imputed GPS-based and farmer-reported plot area for the entire plot sample. The mean and the associated standard error for the multiply-imputed GPS-based plot area are computed in accordance with Rubin's rules.
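As an illustration of how Rubin's (1987) combining rules operate, the following minimal Python sketch pools a set of completed-data point estimates and their variances. The function name `pool_rubin`, the dictionary keys, and the three example values are hypothetical and are not taken from the survey data; the sketch also ignores the complex survey design.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their variances (Rubin, 1987)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)        # overall MI point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    if b == 0:
        # Identical estimates across imputations: no information is lost.
        df, fmi = math.inf, 0.0
    else:
        r = (1 + 1 / m) * b / u_bar          # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2      # degrees of freedom of the t reference
        fmi = (r + 2 / (df + 3)) / (r + 1)   # fraction of missing information
    return {"estimate": q_bar, "se": math.sqrt(t), "total_var": t,
            "df": df, "fmi": fmi}


# Hypothetical example: three imputations with equal within-imputation variance.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With only a few imputations and estimates that differ markedly across them, the between-imputation component dominates the total variance, which is one reason a larger number of imputations (50 in this paper) is attractive.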
Looking at the overall multiply-imputed mean GPS-based area and the associated standard error for the entire plot sample in Uganda and Tanzania, we observe that they are in line with the mean GPS-based area and the associated standard error for measured plots. While MI, in comparison to casewise deletion, leads to marginally lower mean and standard error estimates for GPS-based plot area, a simpler method of handling missing data, such as conditional mean imputation, would have understated the true variance in the variable of interest (Schafer and Graham, 2002).¹⁴ Moreover, the mean value and associated standard error for our land productivity measure, namely plot-level gross value of output per acre in local currency, exhibit more drastic changes under complete case analysis following MI in comparison to the incomplete case analysis based on plots with an observed GPS-based area measure. The next section delves deeper into the empirical implications of incomplete vs. multiply-imputed complete case analysis of GPS-based plot areas in the context of the inverse scale-land productivity relationship.
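The contrast between predictive mean matching and conditional mean imputation can be sketched on synthetic data. The Python example below is a simplified illustration, not the routine used for the paper: the data are simulated, missingness is generated completely at random for brevity, a single bootstrap resample stands in for the draw of model parameters, and the names `pmm_impute` and `conditional_mean_impute` are hypothetical.

```python
import numpy as np


def pmm_impute(y, X, k=5, rng=None):
    """Fill NaNs in y with observed values of donors matched on the OLS prediction."""
    rng = np.random.default_rng() if rng is None else rng
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    # Bootstrap the observed cases so that parameter uncertainty is reflected.
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
    pred = Xd @ beta                             # linear prediction for all plots
    y_obs, pred_obs = y[obs], pred[obs]
    filled = y.copy()
    for i in np.flatnonzero(~obs):
        # k observed plots whose predictions are closest to that of plot i
        nearest = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        filled[i] = y_obs[rng.choice(nearest)]   # copy one donor's observed value
    return filled


def conditional_mean_impute(y, X):
    """Fill NaNs in y with the OLS prediction itself (no residual variation)."""
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    filled = y.copy()
    filled[~obs] = (Xd @ beta)[~obs]
    return filled


# Synthetic stand-ins for farmer-reported and GPS-based plot areas.
rng = np.random.default_rng(0)
n = 2000
reported = rng.lognormal(mean=0.0, sigma=0.5, size=n)
gps = 0.9 * reported + rng.normal(0.0, 0.4, size=n)
missing = rng.random(n) < 0.35                   # about 35 percent unmeasured
y = gps.copy()
y[missing] = np.nan

pmm = pmm_impute(y, reported, k=5, rng=rng)
cmi = conditional_mean_impute(y, reported)
var_pmm = np.var(pmm[missing])                   # spread of PMM-imputed values
var_cmi = np.var(cmi[missing])                   # spread of mean-imputed values
```

Because conditional mean imputation places every imputed value on the regression line, the spread of its imputed values omits the residual component, whereas PMM borrows actual observed values and retains it. Repeating `pmm_impute` with different random draws would yield the multiple completed datasets.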

EMPIRICAL APPLICATION: INVERSE SCALE-LAND PRODUCTIVITY RELATIONSHIP
Starting with the seminal work of Sen (1962, 1966), who observed an inverse relationship (IR) between farm size and output per hectare in Indian agriculture, a large number of empirical studies have presented evidence that appears to corroborate that hypothesis (Lau and Yotopoulos, 1971; Yotopoulos and Lau, 1973; Berry and Cline, 1979; Carter, 1984; Eswaran and Kotwal, 1985, 1986; Barrett, 1996; Benjamin and Brandt, 2002; Larson et al., 2012). A smaller set of empirical studies, however, does not find evidence of such a relationship (Hill, 1972; Kevane, 1996; Zaibet and Dunn, 1998). Binswanger et al. (1995) and Eastwood et al. (2010) provide careful discussions of both the theory and the empirics of the IR debate, a full review of which is beyond the scope of this paper.
Following Barrett et al. (2010), an inverse relation between farm size and land productivity may have three main explanations: (i) imperfect factor markets, (ii) omitted variables and, in particular, omitted controls for land quality, and (iii) statistical issues related to the measurement of plot size. Imperfect factor markets (i.e. labor, land, insurance) are linked to differences in the shadow price of production factors that in turn lead to differences in the application of inputs per unit of land, in ways that are correlated with farm size.
Many of the earlier contributions to the IR debate focused on testing this type of explanation. Assunçao and Ghatak (2003) demonstrate how unobserved heterogeneity in farmer ability may theoretically explain the observed differences in productivity. On the one hand, Bhalla and Roy (1988) and Benjamin (1995) have challenged the existence of the IR based on the observation that when land quality controls are introduced in the analysis, the strength of the IR often diminishes substantially or vanishes altogether. On the other hand, Barrett et al. (2010) utilize a dataset that includes laboratory measures of soil testing and conclude that only a very limited proportion of the IR can be explained by differences in land quality. Similarly, Nkonya et al. (2004) find a strong negative effect of farm size on plot output value after controlling for plot size, labor input, equipment and proxies for land quality, suggesting that not only land productivity but also total factor productivity is inversely related to scale. Carter (1984) and Heltberg (1998) control for village and household fixed effects, respectively, and still do not find that the strength of the IR diminishes. Taking these studies together, Eastwood et al. (2010: 3351) consequently posit that "the IR is immune to the land-quality objection."

Lastly, it has been suggested that the existence of the IR may be a statistical artefact stemming from measurement error in land data (Lamb, 2003). A similar explanation is provided by Barrett et al. (2010), after failing to explain the observed IR otherwise. However, the data from the UNHS 2005/06, which collected both GPS-based and farmer-reported measures of plot areas, have been used by Carletto et al. (2013) to show that the estimates of the IR are robust to potential measurement error introduced by farmers' self-reporting and that the IR is strengthened when a more accurate, GPS-based measure of farm size is used in the analysis.
Our study builds on the work of Carletto et al. (2013), who exclude all farming households for whom GPS-based farm size cannot be computed due to one or more missing GPS-based plot areas. As a result, approximately 50 percent of the agricultural household sample is discarded in their analysis, even though the share of non-measured plots stands at 35 percent. In what follows, we investigate the robustness of the inverse scale-land productivity relationship to complete case analysis made possible by MI. We estimate a plot-level production function to test for the existence of the IR, based on the one originally proposed by Binswanger, Deininger and Feder (1995), and not dissimilar from the approach taken by Barrett et al. (2010) and Carletto et al. (2013):

$$\log\left(\frac{Y_{ih}}{A_{ih}}\right) = \alpha + \beta \log A_{ih} + \theta' P_{ih} + \delta' H_{h} + D + E + \varepsilon_{ih} \qquad (1)$$

where i and h denote plot and household, respectively; Y represents value of agricultural output; A denotes the plot area in acres; P is a vector of plot-level variables spanning the logarithm of the value of input use per acre, distance from the household, tenure status, and farmer-reported soil quality and slope; H is a vector of household-level attributes that might influence agricultural production, including household size, dependency ratio, and basic characteristics of the head of household; D and E are district and enumerator fixed effects, respectively; and ε is the error term.¹⁵ The coefficient β on A is the parameter of interest while testing for the existence of the IR. A negative β indicates the presence of the IR: the more negative the coefficient, the stronger the IR. We estimate Equation 1 separately with (i) the incomplete dataset using the observed GPS-based plot areas, and (ii) the complete dataset using the multiply-imputed GPS-based plot areas. In the latter case, the regression is estimated m (i.e. 50) times, once with each of the m complete datasets, and the coefficient estimates and associated standard errors are combined in accordance with the combining rules of Rubin (1987) presented in the preceding section.
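To make the test for the IR concrete, the sketch below estimates β by OLS on simulated plot-level data. It assumes a log-log specification with a single plot-level control, uses classical (unweighted, non-clustered) standard errors, and sets a hypothetical true β of -0.3; none of these values come from the UNPS or TZNPS, and the helper `ols` is an illustrative name.

```python
import numpy as np


def ols(y, X):
    """OLS with an intercept; returns coefficients and classical standard errors."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (len(y) - Xd.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    return beta, np.sqrt(np.diag(cov))


# Simulated plots: log area, log input value per acre, log output value per acre.
rng = np.random.default_rng(1)
n = 5000
log_area = rng.normal(0.0, 0.8, size=n)
log_inputs = rng.normal(0.0, 1.0, size=n)
true_beta = -0.3                                      # hypothetical IR elasticity
log_yield = (10.0 + true_beta * log_area + 0.2 * log_inputs
             + rng.normal(0.0, 0.5, size=n))

coef, se = ols(log_yield, np.column_stack([log_area, log_inputs]))
beta_area, se_area = coef[1], se[1]                   # coefficient on log area
```

In the multiply-imputed case, this regression would be run once per completed dataset, and the resulting values of `beta_area` and `se_area ** 2` would then be pooled across datasets with Rubin's rules.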
The regression results are presented in Tables 6 and 7 for the UNPS 2009/10 and the TZNPS 2010/11, respectively.¹⁶ ¹⁷ The results from the incomplete case analysis and the multiply-imputed complete case analysis are presented in Columns 1 and 2, respectively. The coefficient on the plot area is negative and statistically significant at the 1 percent level across all columns reported in Tables 6 and 7. However, with respect to the β obtained under the incomplete case analysis, the β associated with multiply-imputed GPS-based plot area is 33 percent and 9 percent larger in absolute value in Uganda and Tanzania, respectively, revealing how the selection bias introduced in the GPS-based plot area measurement process carries all the way to the analysis stage. It should also be noted that working with the full sample has implications not only for the IR: the coefficients and standard errors of several other explanatory variables of interest are also observed to change when the complete dataset is restored. Notable in this respect are the changes associated with the distance variables in Tanzania, the rented/other plot tenure category in Uganda, and the household size and the number of plots in the holding in both settings.

¹⁵ Plot value of agricultural output in Ugandan/Tanzanian Shillings (USH, TZS) is computed by (i) expressing crop-condition-non-standard measurement unit combinations in kilogram-equivalent terms and the reference crop condition, (ii) multiplying the kilogram-equivalent quantity of production for each crop on a given plot by the median crop sales value per kilogram at the enumeration area-level, and (iii) summing all imputed values of crop production on that plot. The median crop sales value per kilogram is computed at the enumeration area-level only if at least 10 unit sale values are available at that level of aggregation. Otherwise, the median crop unit sale value is computed at a higher level, in the order of parish, sub-county, county, district, and region in the case of Uganda, and in the order of ward, district, and region in the case of Tanzania. Plot value of inputs in USH/TZS includes (i) expenditures on hired labor and (ii) the imputed values of non-labor inputs, including organic fertilizer, inorganic fertilizer, pesticides/herbicides, based on median costs per kilogram computed at the enumeration area-level only if at least 10 unit cost observations are available for a specific class of non-labor input at that level of aggregation. Otherwise, the median unit costs are computed at a higher level, in the order of parish, sub-county, county, district, and region in the case of Uganda, and in the order of ward, district, and region in the case of Tanzania. Our results are robust to (i) using net revenue (i.e. gross value of output net of gross value of non-labor input) per acre as the dependent variable, and/or (ii) including enumeration area, as opposed to district, fixed effects in the imputation and analysis models.

¹⁶ Using the sample of plots with observed GPS-based areas, we also estimated Equation 1 using the farmer-reported plot areas in both settings. Just as in Carletto et al. (2013), the resulting β is negative and its absolute value is lower than the β obtained by using the observed GPS-based plot areas. These results are available from the authors upon request, but are not reported as they are not the focus of this paper.

¹⁷ The combined MI results reported in Tables 6 and 7 are underlain by complex survey regressions that correctly take into account clustering and stratification in weighted regressions. The results are robust to not weighting our regressions.
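The rule used to value crop output, namely taking the median unit sale value at the enumeration area level when at least 10 observations exist and otherwise moving up the administrative hierarchy, can be sketched as follows. The record layout, the level codes, the function name `median_unit_value`, and the example sales are all hypothetical; the behavior when even the coarsest level has fewer than 10 observations is not described in the paper, so returning `None` is an assumption of this sketch.

```python
from statistics import median

# Hypothetical level names for Uganda, ordered from finest to coarsest.
UGANDA_LEVELS = ["ea", "parish", "subcounty", "county", "district", "region"]


def median_unit_value(sales, crop, plot_location, levels, min_obs=10):
    """Median unit value at the finest level with at least min_obs sales."""
    for level in levels:
        values = [s["unit_value"] for s in sales
                  if s["crop"] == crop
                  and s["location"][level] == plot_location[level]]
        if len(values) >= min_obs:
            return median(values), level
    return None, None   # assumption: no valuation if every level is too thin


def location(ea):
    """Helper for the example: two enumeration areas within the same parish."""
    return {"ea": ea, "parish": "p1", "subcounty": "s1",
            "county": "c1", "district": "d1", "region": "r1"}


# Four sales in the plot's own enumeration area, eight more elsewhere in the parish.
sales = ([{"crop": "maize", "location": location("ea1"), "unit_value": v}
          for v in (100, 110, 120, 130)]
         + [{"crop": "maize", "location": location("ea2"), "unit_value": 200}
            for _ in range(8)])

value, level_used = median_unit_value(sales, "maize", location("ea1"), UGANDA_LEVELS)
```

In this example the enumeration area has only four sales, so the median is taken over the twelve sales recorded at the parish level. The same function would apply to Tanzania with the shorter list of levels (enumeration area, ward, district, region) and to unit costs of non-labor inputs.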

CONCLUSION
This paper attempts to address three interrelated questions concerning the use of the GPS technology for land area measurement in household surveys in developing countries: (i) Are the land area statistics informed by partial GPS-based land areas subject to selection bias?, (ii) Is it possible to fill the gaps in GPS-based land area measurements by using statistically valid techniques and other information available in the survey?, and (iii) Does having complete GPS-based land areas matter for the analysis of policy relevant issues, such as the inverse scale-land productivity relationship? Our analysis confirms that the missingness in GPS-based plot area measures is a valid concern, not only because it is pervasive, but also because plots that do not get measured are not random picks among plots that households own or cultivate. This may be of particular concern in the context of large, nationally-representative household surveys, which suggests that the prevailing survey protocols for GPS-based plot area measurement and its supervision should be revisited to minimize the extent to which GPS-based plot area measurements are not taken.
However, it is important to recognize that no matter how effective the protocols and the field supervision could be, a non-negligible degree of missingness in GPS-based plot area data is bound to remain, as completely eliminating it is infeasible without incurring prohibitive costs. This paper has, therefore, sought not only to advance our understanding of the potential drawbacks of current GPS-based land area data collected as part of household surveys, but also to explore ways to remedy some of these shortcomings through sound statistical techniques. The good news is that the missingness is largely driven by observable variables. The better news is that the available, yet imprecise, farmer-reported plot areas are powerful predictors of the observed GPS-based plot areas, and that these self-reported measures can be effectively used in generating reliable simulations of missing GPS-based plot areas using multiple imputation. Advances in computing power and the availability of multiple imputation routines in an increasing number of statistical software packages make the routine adoption and application of these techniques feasible even in developing country settings.
This is very timely as policies enabling open access to data are spreading quickly across the world, and GPS technology is becoming a standard feature in survey work. Having data producers, as opposed to data analysts, deal with handling missing GPS data has the advantage of (i) reducing the duplication of work, (ii) increasing the comparability between the final data used to perform analyses, and (iii) allowing more analysts (who do not necessarily have the time or ability to deal with the missing data problem correctly) to work with complete case data. Regardless, it is essential to document the imputation approach as part of the survey documentation and provide data users with both sets of data.
Obtaining complete case GPS measures at the fieldwork stage is overly costly, as some plots will always be located at a prohibitive distance from the location of the interview. On the other hand, GPS-based plot area measures exhibiting up to 35 percent of missing values would be of limited value to many data analysts. By combining good fieldwork training, strict fieldwork quality control, sound imputation methods, and emphasis on always soliciting farmer-reported plot areas as possible predictors for missing GPS-based counterparts, it is possible to obtain reliable simulations of GPS-based plot areas while keeping survey costs in check. One limitation of our results is that they are based on two cases from East Africa. Further applications in different geographical, cultural, and socioeconomic contexts are needed to establish the extent to which our findings are generalizable on a broader scale. Also, more research is needed on the trade-off between the benefits of reducing the rate of missing GPS-based plot area information and the cost of increasing the coverage of plots for GPS-based area measurement.
Last but not least, our findings show how the collection of GPS-based land areas matters for advancing the debate on key policy-related research questions, such as the longstanding debate on the existence of an inverse relationship (IR) between farm size and land productivity. When (complete) GPS-based plot area information is utilized, the presence of the IR is confirmed and strengthened, and MI allows us to overcome the limitations to the power of that analysis and the concerns over potential selection bias rooted in the available GPS-based measures.

Note (summary statistics, TZNPS 2010/11): *** p<0.01, ** p<0.05, * p<0.1. † indicates a dummy variable. The statistics are weighted through the use of household sampling weights, and are based on TZNPS 2010/11 data, unless otherwise stated.

Note (imputation model, UNPS 2009/10; R-squared = 0.658): *** p<0.01, ** p<0.05, * p<0.1. Constant is estimated but not reported. District and enumerator fixed effects as well as sampling weights are included as covariates but not reported. † denotes a dummy variable. The comparison categories are (i) less than 15 minutes, (ii) owned with a title, (iii) good, and (iv) non-mover original household for A, B, C, and D, respectively. All variables based on the UNPS 2009/10 data, unless otherwise stated.

Note (imputation model, TZNPS 2010/11; R-squared = 0.688): *** p<0.01, ** p<0.05, * p<0.1. Constant is estimated but not reported. District and enumerator fixed effects as well as sampling weights are included as covariates but not reported. † denotes a dummy variable. The comparison categories are (i) owned with a title, (ii) good, (iii) flat, and (iv) non-mover original household for A, B, C, and D, respectively. All variables based on the TZNPS 2010/11 data, unless otherwise stated.