Policy Research Working Paper 10976

A Longitudinal Cross-Country Dataset on Agricultural Productivity and Welfare in Sub-Saharan Africa

Thomas Bentze
Philip Wollburg

Development Economics
Development Data Group
November 2024

Abstract

Since 2008, the World Bank’s Living Standards Measurement Study–Integrated Surveys on Agriculture (LSMS-ISA) program has supported the collection of nationally representative, longitudinal, multi-topic household survey data to inform researchers and policy makers of living standards in Sub-Saharan Africa. The surveys maintain a distinct focus on the agricultural sector, collecting detailed plot-level data and information about agricultural activities, while measuring socioeconomic conditions of thousands of smallholder farmers and households across multiple countries. This paper presents a harmonized panel dataset (HP) from LSMS-ISA surveys conducted between 2008 and 2021 in seven Sub-Saharan African countries: Ethiopia, Malawi, Mali, Niger, Nigeria, Tanzania, and Uganda. It includes more than 200,000 agricultural plot observations, more than 400,000 individuals, and about 59,000 households. The HP allows for in-depth analysis of farm, household, and individual dynamics over time and across countries. It is ideal for researchers interested in studying the dynamics between agriculture, economic development, and welfare outcomes in Sub-Saharan Africa.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at tbentze@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues.
An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Thomas Bentze¹, Philip Wollburg¹,²

JEL codes: O12, I32, P36

Keywords: Agricultural Productivity; Household Welfare; Panel Data; Household Surveys; Economic Development

¹ The World Bank, Living Standards Measurement Study, Development Economics Data Group, Rome, 00184, Italy.
² Wageningen University, Development Economics Group, Wageningen, 6706KN, The Netherlands.

1) Background and Summary

The agricultural sector accounts for a large share of the labor force and economic output in many low-income countries. In Sub-Saharan Africa, the world’s poorest region, the agricultural sector accounts for about 50% of the labor force (World Bank, 2021). The agricultural sector, and consequently domestic food production, is dominated by smallholder farmers in the region. Many of the world’s extreme poor rely on the agricultural sector for their income and food security.
The importance of the agricultural sector for poverty alleviation, food security, structural transformation, and economic development more broadly has been widely recognized (Dercon and Gollin, 2014; Gollin, 2010). As a result, studying the sector has been of considerable interest to development research, policy interventions, and investment decisions (Adamopoulos and Restuccia, 2020; Gollin et al., 2014, 2013; Gollin and Udry, 2021; Restuccia and Rogerson, 2013). However, the data landscape was long characterized by a persistent lack of high-quality data, requisite to informing research and policy interventions (Carletto, 2021). The World Bank’s Living Standards Measurement Study – Integrated Surveys on Agriculture (LSMS-ISA) program was designed to address this gap. The LSMS-ISA program consists of nationally and sub-nationally representative, longitudinal and multi-topic household surveys, with a distinct focus on agricultural production. Starting in 2008, the surveys were implemented by National Statistical Offices (NSOs) with the support of the World Bank’s Living Standards Measurement Study (LSMS) team. LSMS-ISA data is widely considered the premier source of survey micro-data on agricultural production and productivity in Sub-Saharan Africa – and its relationship to livelihoods, household income, poverty, and food security (Wollburg et al., 2024a, 2024b). In this paper, we present a harmonized panel of LSMS-ISA surveys from seven Sub-Saharan African countries, namely Ethiopia, Malawi, Mali, Niger, Nigeria, Tanzania and Uganda. This harmonized panel dataset (hereafter HP) covers over 200,000 agricultural plot observations, over 400,000 individuals and 58,000 households, over the time period of 2008 to 2021. The data are nationally representative of these countries, which comprise 39% of the population and close to a third of the poor in Sub-Saharan Africa (Azevedo, 2011; World Bank, 2022). 
The data are also representative of the household and smallholder agriculture sector of the study countries. The longitudinal nature of the data allows tracking households, farms, and individuals across time. The HP provides longitudinal records for close to 58,000 agricultural and non-agricultural households and just over 400,000 individuals, allowing researchers to analyze patterns, dynamics, and trajectories over time.

The HP datasets are compiled from the public data releases of the LSMS-ISA surveys, which can be downloaded from the World Bank’s Microdata Library (see Table 1). The HP consists of four datasets, one containing records of households, one of individuals, one of agricultural plots, and one of crops on each plot. The preparation of the HP consisted of constructing, cleaning, and harmonizing close to 150 agricultural, household, and individual indicators with the objective of creating a data asset that is ready for analysis. The datasets are fully mergeable with the publicly available raw LSMS-ISA datasets, so that users can add variables from the LSMS-ISA surveys and tailor the dataset to their research needs. Household, community, and farm locations are georeferenced, so that the datasets can be enriched with geospatial information.

The HP datasets allow for highly disaggregated analyses at the household, individual, plot and plot-crop levels. This enables a nuanced and granular understanding of intra-plot, intra-farm, and intra-household dynamics. The multi-topic nature of the data facilitates studying complex problems integrating information from different domains. The HP can be used to generate a wide range of analyses, from analyzing the drivers and impacts of agricultural productivity to measuring livelihood and welfare trends across multiple sectors and countries.
This data descriptor outlines the data collection and processing steps that led to the creation of the HP, describes its structure, and gives usage recommendations.

2) Methods

Survey data collection

The HP consists of a total of 29 waves of nationally representative, longitudinal, multi-topic household surveys across seven countries that were supported by the World Bank’s LSMS-ISA project. The surveys were implemented between 2008 and 2021, with varied timeframes in different countries (Figure 1). The HP includes five waves of data from Ethiopia (2011-2022); four waves from Malawi (2010-2019); two waves from Mali (2014-2017); two waves from Niger (2011-2014); four waves from Nigeria (2010-2019); five waves from Tanzania (2008-2019); and seven waves from Uganda (2008-2019). The raw data files of these surveys are publicly available and can be downloaded from the World Bank’s Microdata Library (see Table 1).

Figure 1. Timeline of LSMS-ISA data collection.
Note: the length of the boxes in the timeline corresponds to the timeframe of the fieldwork. Tanzania wave 4 comprises the refreshed sample of households, and wave 5 corresponds to the NPS 2019/2020 with sex-disaggregated data.

The LSMS-ISA surveys are designed with a distinct focus on agriculture as a key source of livelihood in Sub-Saharan Africa. To better capture data on agricultural production, the surveys were carried out in two in-person visits based on the agricultural production cycle, one after planting and one after harvest. The exceptions are Tanzania, where a single in-person interview covered the entire season, and Uganda, where the bimodal seasonal pattern meant that each of the two visits covered a different season.

The surveys administered a household questionnaire, an individual questionnaire, an agriculture questionnaire, and a community questionnaire.
The HP dataset, with its focus on agricultural production and productivity, includes variables predominantly from the agriculture, household, and individual questionnaires (see ‘Survey instruments’ below). The survey questionnaires were administered in face-to-face interviews and recorded as Computer Assisted Personal Interviews (CAPI) using the Survey Solutions platform in Ethiopia (waves 4 and 5), Malawi (waves 3 and 4), Mali (wave 2), Nigeria (waves 3 and 4), Tanzania (waves 3, 4 and 5) and Uganda (waves 7 and 8). In Uganda, CAPI software CWEST was used in wave 2, and SurveyBe in waves 3 to 5. In all other rounds of the surveys, Paper Assisted Personal Interviews (PAPI) or Computer Assisted Field Interviews (CAFE) were used.

Outside of the survey questionnaires, additional variables were recorded and are provided as part of the publicly available datasets. First, land areas of agricultural plots and/or parcels were in many cases measured using handheld GPS devices, since respondent-reported land areas were shown to be unreliable (Carletto et al., 2013). Furthermore, households, agricultural plots, and communities were georeferenced via GPS during the survey operations, which enables the integration of survey data with geospatial data.

Sampling, tracking, and survey weights

The surveys supported by the LSMS-ISA are designed to be nationally representative of the country’s population of households. To achieve this, a stratified two-stage probability sampling approach is employed, with census Enumeration Areas (EAs) selected as primary sampling units with probability proportional to size. Survey strata consist of urban/rural levels and administrative areas. In each selected EA, all households are listed and then randomly selected from the complete list. Population and housing censuses are used as sampling frames.
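To make the two-stage design concrete, the sketch below computes a textbook selection probability and the corresponding inverse-probability (design) weight for one household. It is an illustration only: the function name, arguments, and numbers are hypothetical, and the published survey weights involve further adjustments documented in each survey's Basic Information Document.

```python
def design_weight(n_eas_selected, ea_size, stratum_size, hh_listed, hh_selected):
    """Inverse-probability weight under a textbook two-stage PPS design.

    All arguments are hypothetical inputs, not variables of the HP.
    """
    # Stage 1: the EA is drawn with probability proportional to its size
    # (e.g. census household count) within its stratum.
    p_ea = n_eas_selected * ea_size / stratum_size
    # Stage 2: households are drawn at random from the complete EA listing.
    p_hh = hh_selected / hh_listed
    # The design weight is the inverse of the overall selection probability.
    return 1.0 / (p_ea * p_hh)


# Illustrative numbers only: 10 EAs drawn from a stratum of 50,000 households,
# an EA with 200 census households, 180 households listed, 12 interviewed.
w = design_weight(n_eas_selected=10, ea_size=200, stratum_size=50_000,
                  hh_listed=180, hh_selected=12)
```

In this stylized case each sampled household stands in for roughly `w` households in the population; attrition adjustments and post-stratification are not modelled here.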
Once households have been sampled and interviewed, sampling weights are constructed and provided to data users to allow for the calculation of nationally and sub-nationally representative estimates. These weights reflect a given household’s inverse probability of selection into the sample. For each survey, that is for any country-wave, the household weights approximately sum to the total population of households in the country. Weights were adjusted to account for attrition and the inclusion of split-off households (i.e. new households that form by splitting from an existing household), and were post-stratified to ensure that they sum to known population totals.

The surveys included in the HP are panel surveys that follow different units from round to round across time. Households are tracked through time in Ethiopia, Niger, and Nigeria; individuals are tracked in Malawi, Uganda, and Tanzania. Where individuals are tracked, the households they reside in are also tracked. Enumeration Areas (EAs) are tracked in Mali, with resampling of households within EAs in each wave. Agricultural plots cannot be tracked through waves (or even across seasons, in the case of Uganda), but parcels are tracked across waves in Ethiopia and Malawi, and across seasons within waves in Uganda. Parcels are defined by the World Program for the Census of Agriculture as a “piece of land of one land tenure type entirely surrounded by other land, water (…) or other features not forming part of the holding”, whereas plots are a section of the parcel “on which a specific crop or crop mixture is cultivated” (Food and Agriculture Organization of the United Nations, 2015). In Malawi, we assume that “gardens” are equivalent to parcels. In Tanzania, “plots” have a definition that is akin to parcels, and parcels are not explicitly tracked.
In some cases, such as in Malawi or Tanzania, individuals that split from a sampled household to form a new household are followed, and we had to choose which household inherited the ID from the previous wave. For example, split-off households in Malawi inherit IDs if they are within 200 m of their previous location. If multiple split-offs satisfy this condition, the previous household head is tracked. If no prior head is found, the household with the most tracked individuals is selected to inherit the household ID. If none of these conditions are met, no household inherits the ID. A similar approach was adopted in Tanzania, but without the distance requirement. In general, split-off households inherit the EA and stratum IDs of the household they originated from.

In the following, we discuss in more detail the survey waves included in each country and their respective survey design aspects:

• In Ethiopia, data from the Ethiopia Socioeconomic Survey (ESS) were assembled across five survey periods: 2010/2011, 2012/2013, 2014/2015, 2017/2018 and 2021/2022. The panel was fully refreshed in wave 4, and households are therefore not tracked across more than three waves. Split-off households (see Section SI.I) were not tracked in Ethiopia. The first wave of the panel survey (ESS 2010/2011) is only designed to be representative of rural areas and small towns, and the sample was expanded to urban areas from wave 2 onwards (ESS 2012/2013). Furthermore, waves 1 to 3 of the sample were designed to be representative of the most populous regions of the country (Central Statistical Agency and Living Standards Measurement Study (LSMS), World Bank, 2021). In wave 5, the survey was renamed to the Ethiopia Socioeconomic Panel Survey (ESPS).

• In Malawi, data from the Integrated Household Panel Survey (IHPS) were assembled across four periods: 2009/2010, 2012/2013, 2015/2016 and 2018/2019. All split-off households were tracked in Malawi.
A random half of EAs were dropped from the sample in wave 3 due to budgetary constraints (National Statistical Office, 2020). This panel runs alongside the Integrated Household Survey (IHS), a cross-sectional survey program. The original households sampled into the IHPS are a subset of the IHS 2010/2011.

• In Mali, data from the Enquête Agricole de Conjoncture Intégrée (EACI) were assembled from two periods: 2014 and 2017. The smallest tracking unit in Mali is the EA, and households are therefore not followed through time. The survey covers all regions and urban/rural areas, except Kidal (Ministry of Rural Development, 2019).

• In Niger, data were drawn from the Enquête National sur les Conditions de Vie des Ménages et Agriculture (ECVM/A) across two periods: 2011 and 2014. Households, including split-off households, were tracked across these waves (Ministry of Finance and National Institute of Statistics, 2016).

• In Nigeria, data were assembled from the General Household Survey (GHS) across four periods: 2010/2011, 2012/2013, 2015/2016 and 2018/2019. A partial refresh of the panel was undertaken in wave 4. Split-off households were not tracked in Nigeria. While the survey is representative at the regional and urban/rural levels, some areas could not be visited in wave 4 (2018/2019) due to security concerns, and this wave is therefore only representative of areas that were accessible (National Bureau of Statistics, 2021a).

• In Tanzania, data were assembled from the National Panel Survey (NPS) across five periods: 2008/2009, 2010/2011, 2012/2013, 2014/2015 and 2019/2021. Split-off households were tracked in Tanzania (National Bureau of Statistics, 2021b). A “refresh” sample was added to the panel in 2014/2015 and interviewed again in 2020/2021 as part of wave 5.
Also in 2014/2015, a representative sub-sample of the panel which started in 2008 was selected to form the “extended” sample, which was re-interviewed in 2019/2020 as part of a survey with sex-disaggregated data (NPS-SDD). For practical purposes, we refer to the NPS-SDD, together with the 2020/2021 survey of refresh-sample households, as “wave 5”.

• In Uganda, data were assembled from the Uganda National Panel Survey (UNPS) across seven periods: 2009/2010, 2010/2011, 2011/2012, 2013/2014, 2015/2016, 2018/2019, 2019/2020. In wave 4 (2013/2014), one third of the sample was refreshed. Split-off households were tracked for a subset of the total sample, consisting of 20% of the baseline households. At the time of writing, wave 6 was not available to the public and is therefore absent from the HP.

More detailed survey-specific information, including logistical and operational details and step-by-step descriptions of the calculations of sampling weights for each survey, can be found in the Basic Information Documents for each survey, which are publicly available on the World Bank’s Microdata Library. We provide links to these documents in Table 1.
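The ID-inheritance rule described earlier in this section for Malawi split-off households can be sketched as a small decision function. This is one reading of the prose, not the authors' code: the class, field names, and the treatment of ties (no household inherits the ID) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SplitOff:
    hh_key: str           # hypothetical label for a candidate household
    distance_m: float     # distance from the previous wave's location
    has_prior_head: bool  # True if the previous household head lives here
    n_tracked: int        # number of tracked individuals in the household


def inherit_id(candidates, max_distance_m=200.0):
    """Return the key of the household that inherits the ID, or None."""
    # Step 1: keep only households within the distance threshold.
    near = [c for c in candidates if c.distance_m <= max_distance_m]
    if not near:
        return None
    if len(near) == 1:
        return near[0].hh_key
    # Step 2: among several candidates, follow the previous household head.
    with_head = [c for c in near if c.has_prior_head]
    if len(with_head) == 1:
        return with_head[0].hh_key
    # Step 3: otherwise pick the household with the most tracked individuals.
    top = max(c.n_tracked for c in near)
    largest = [c for c in near if c.n_tracked == top]
    # Assumption: an unresolved tie means that no household inherits the ID.
    return largest[0].hh_key if len(largest) == 1 else None


a = SplitOff("A", 150.0, False, 2)
b = SplitOff("B", 80.0, True, 1)
c = SplitOff("C", 500.0, True, 5)
winner = inherit_id([a, b, c])
```

Passing `max_distance_m=float("inf")` mimics the variant described for Tanzania, where no distance requirement applies.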
Table 1: Links to raw data and supporting documents

Country | Wave | Link to the raw data | Link to supporting documents
Ethiopia | 1 | https://microdata.worldbank.org/index.php/catalog/2053/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2053/related-materials
Ethiopia | 2 | https://microdata.worldbank.org/index.php/catalog/2247/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2247/related-materials
Ethiopia | 3 | https://microdata.worldbank.org/index.php/catalog/2783/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2783/related-materials
Ethiopia | 4 | https://microdata.worldbank.org/index.php/catalog/3823/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3823/related-materials
Ethiopia | 5 | https://microdata.worldbank.org/index.php/catalog/6161/data-dictionary | https://microdata.worldbank.org/index.php/catalog/6161/related-materials
Malawi | 1 and 2 | https://microdata.worldbank.org/index.php/catalog/2248/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2248/related-materials
Malawi | 3 and 4 | https://microdata.worldbank.org/index.php/catalog/3819/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3819/related-materials
Mali | 1 | https://microdata.worldbank.org/index.php/catalog/2583/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2583/related-materials
Mali | 2 | https://microdata.worldbank.org/index.php/catalog/3409/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3409/related-materials
Niger | 1 | https://microdata.worldbank.org/index.php/catalog/2050/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2050/related-materials
Niger | 2 | https://microdata.worldbank.org/index.php/catalog/2676/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2676/related-materials
Nigeria | 1 | https://microdata.worldbank.org/index.php/catalog/1002/data-dictionary | https://microdata.worldbank.org/index.php/catalog/1002/related-materials
Nigeria | 2 | https://microdata.worldbank.org/index.php/catalog/1952/data-dictionary | https://microdata.worldbank.org/index.php/catalog/1952/related-materials
Nigeria | 3 | https://microdata.worldbank.org/index.php/catalog/2734/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2734/related-materials
Nigeria | 4 | https://microdata.worldbank.org/index.php/catalog/3557/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3557/related-materials
Tanzania | 1 | https://microdata.worldbank.org/index.php/catalog/76/data-dictionary | https://microdata.worldbank.org/index.php/catalog/76/related-materials
Tanzania | 2 | https://microdata.worldbank.org/index.php/catalog/1050/data-dictionary | https://microdata.worldbank.org/index.php/catalog/1050/related-materials
Tanzania | 3 | https://microdata.worldbank.org/index.php/catalog/2252/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2252/related-materials
Tanzania | 4 | https://microdata.worldbank.org/index.php/catalog/3455/data-dictionary and https://microdata.worldbank.org/index.php/catalog/2862/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3455/related-materials and https://microdata.worldbank.org/index.php/catalog/2862/related-materials
Tanzania | 5 | https://microdata.worldbank.org/index.php/catalog/5639/data-dictionary and https://microdata.worldbank.org/index.php/catalog/3885/data-dictionary | https://microdata.worldbank.org/index.php/catalog/5639/related-materials and https://microdata.worldbank.org/index.php/catalog/3885/related-materials
Uganda | 1 | https://microdata.worldbank.org/index.php/catalog/1001/data-dictionary | https://microdata.worldbank.org/index.php/catalog/1001/related-materials
Uganda | 2 | https://microdata.worldbank.org/index.php/catalog/2166/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2166/related-materials
Uganda | 3 | https://microdata.worldbank.org/index.php/catalog/2059/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2059/related-materials
Uganda | 4 | https://microdata.worldbank.org/index.php/catalog/2663/data-dictionary | https://microdata.worldbank.org/index.php/catalog/2663/related-materials
Uganda | 5 | https://microdata.worldbank.org/index.php/catalog/3460/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3460/related-materials
Uganda | 7 | https://microdata.worldbank.org/index.php/catalog/3795/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3795/related-materials
Uganda | 8 | https://microdata.worldbank.org/index.php/catalog/3902/data-dictionary | https://microdata.worldbank.org/index.php/catalog/3902/related-materials

Survey instruments

The survey questionnaires of LSMS-ISA supported surveys typically consist of:

• An agricultural questionnaire, which is most often divided between post-planting and post-harvest questionnaires. These questionnaires usually elicit information on land ownership and use; land use and agriculture income tax; farm labor; use of inputs; GPS land area measurement and coordinates of household fields; agricultural capital; irrigation; and crop harvest and utilization. An additional livestock questionnaire usually collects information on animal holdings and costs, production, and costs and sales of livestock by products. While base units of observation may change between modules and surveys, these data are typically observed at the parcel, plot, or plot-crop levels, or at the livestock type level in the case of the livestock modules.

• A household questionnaire, which contains information about the household (household modules) and the individual household members (individual modules). The household questionnaire typically starts with a roster of all members of the household. Information about the household members, collected at individual level includes demographics; education; health (including anthropometric measurement for children); financial inclusion; labor.
Information about the household, collected at the household level, includes food and non-food expenditure; consumption; household nonfarm activities; food security and shocks; safety nets; housing conditions; assets; information and communication technology; credit, tax, and income; and other sources of household income.

• A community questionnaire, which contains information on infrastructure; community organizations; resource management; changes in the community; key events; community needs, actions and achievements; and local retail prices. While this information is not included in the HP, data users can, in most cases, merge in community-level data using EA (enumeration area) identifiers.

In addition to the survey questionnaires, a set of geospatial variables is derived by the data producers at the household and plot levels and provided to data users as part of the raw data files. These typically include measures of distance, climatology, soil and terrain. A selected set of these geospatial variables is included in the HP’s plot and household-level datasets.

Household GPS locations are also provided as part of the public data releases on which the HP data is based. Note that the household geolocations are slightly offset from the true locations, in order to preserve the anonymity of households and survey villages. An offset range of 0-2 km is used for urban areas, while a range of 0-5 km is used in rural areas, where communities are more dispersed and the risk of disclosure may be higher. An additional 0-10 km offset for 1% of rural clusters (10% in Mali) effectively increases the known range for all rural points to 10 km while introducing only a small amount of noise (Central Statistical Agency and Living Standards Measurement Study (LSMS), World Bank, 2021; Ministry of Finance and National Institute of Statistics, 2016; Ministry of Rural Development, 2019; National Bureau of Statistics, 2021a, 2021b; National Statistical Office, 2020).
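One practical consequence of the coordinate offsets is that any distance computed from the public coordinates is only known up to the offset range. The sketch below, which is an illustration and not part of the HP, applies the triangle inequality to bound the true distance; the function name and the use of the 2 km and 10 km maxima as uncertainty radii are assumptions based on the ranges stated above.

```python
def true_distance_bounds_km(reported_km: float, is_urban: bool):
    """Bound the true distance to a feature given the offset coordinates.

    reported_km is a distance measured from the public (offset) location.
    """
    # Maximum displacement: 2 km in urban areas; 10 km in rural areas,
    # because a small share of rural clusters is offset by up to 10 km.
    radius = 2.0 if is_urban else 10.0
    # Triangle inequality: the true distance lies within +/- radius.
    lower = max(0.0, reported_km - radius)
    upper = reported_km + radius
    return lower, upper


rural_bounds = true_distance_bounds_km(7.5, is_urban=False)
urban_bounds = true_distance_bounds_km(7.5, is_urban=True)
```

Users who overlay their own geospatial layers may prefer buffer-based summaries (for example, averages within the uncertainty radius) over point extractions for this reason.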
The publicly available raw data is typically structured according to questionnaire modules (e.g. agriculture questionnaire, module 1).

Harmonized Panel datasets and variable creation

The survey data described above was processed and harmonized to create the HP. The HP consists of four datasets recorded at different units of observation.

• Plot-crop-level dataset (based on the agriculture questionnaire);
• Plot-level dataset (based on the agriculture questionnaire);
• Household-level dataset (based on the household and individual questionnaires/modules);
• Individual-level dataset (based on the individual questionnaire/modules).

We group variables in the HP into four categories: (i) unit and linking identifiers; (ii) crop production and agricultural variables that are defined at the plot-crop, plot or household levels; (iii) socio-economic variables that are at the household or individual level; and (iv) geospatial variables that are integrated using geocoordinates. A more comprehensive description of each variable can be found in the variable directories provided alongside the dataset (see Supplementary Tables 1 to 5). The variable directories also flag cases where entire variables are missing in some country-years due to missing data or absent questions in the raw data files, and describe additional data cleaning operations for each variable.

Unit and linking identifiers

The HP provides two sets of key identifiers: unit identifiers and linking identifiers. Unit identifiers are created during data processing and consist of numeric identifiers which can be used to track units longitudinally. All unit identifiers have variable names that end with id_obs (such as plot_id_obs or hh_id_obs). Linking identifiers are included to allow merging with the raw data, so they are the same as in the raw LSMS-ISA files that can be downloaded from the World Bank’s Microdata Library.
All linking identifiers have variable names that end in id_merge (such as plot_id_merge or hh_id_merge). To find the names of the specific variables in the raw data that can be merged with the linking identifiers, users may consult the following entries of the variable directories (Supplementary Tables 1 to 5): plot_id_merge, hh_id_merge, indiv_id_merge, plot_manager_id_merge, respondent_id_merge, ea_id_merge.

Moreover, we include a common set of identifier and geographic variables across all four datasets of the HP. They are the EA and strata IDs, household weights, urban/rural status, geocoordinates and administrative-level codes and names, where applicable. It is important to note that the identification of administrative levels is sometimes only possible via numeric codes, which are not necessarily tracked through time (these cases are flagged in the variable directories).

Construction of the harmonized plot-crop and plot datasets

This section describes the plot-crop and plot-level datasets in the HP, which mostly contain data deriving from the agricultural questionnaires (see Figure 2). They consist of plot-crop and plot-level data from households which report cultivating crops in the main agricultural season, and therefore do not contain the entire population of households interviewed in each survey. Users can refer to the household or individual-level datasets (described below) for the entire sample of households.

The HP plot-crop dataset consists of a limited set of agricultural variables that were recorded at the plot and crop level of observation in the raw data, i.e. for each cultivated crop on each of the household’s agricultural plots. This dataset records production by crop on each plot, which allows for more detailed analysis of production than the aggregated plot-level dataset.
The variables included are harvest output quantity and value, seed input quantity and value, harvest and planting months, use of improved seeds, pesticide use and crop shocks. The latter include a catch-all crop shock variable, along with more specific shock types: drought, pests, rains and floods. Since crop types differ across countries and there is no unified list of codes, crop types are provided in text format, without a numerical classification. All these variables are obtained from farmer self-reports. Harvest and seed values are converted from local currency units (LCUs) to US dollars (USD) using exchange rates and a deflator obtained from a World Bank database of indicators that is accessed through Stata (Azevedo, 2011).

Figure 2. Chart summarizing the creation of the HP from the LSMS-ISA raw data.
Note: Arrows indicate the processing of raw data variables into the harmonized indicators that form the HP. The numbered arrows denote 1) the processing of household and asset-level data from the raw household questionnaire into an agricultural asset index and 2) the processing of individual-level data from the raw household questionnaire into variables that capture plot manager and respondent characteristics. This chart presents a simplification of a complex coding process which can be viewed on GitHub (Bentze, 2024).

The HP plot-level dataset is the main dataset for analysis of agricultural production, bringing together crop production output and inputs. It contains harvest quantity and value; yield in quantity and value terms; seed input quantity and value; an indicator for whether improved seed was used on at least one of the crops on the plot; and an indicator for whether pesticides were used on any crop on the plot. These variables were recorded at the plot-crop level and aggregated to the plot level.
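The aggregation from the plot-crop level to the plot level can be illustrated with a minimal sketch: quantities and values are summed across crops, and binary indicators are set to one if any crop on the plot has them. The records and field names below are hypothetical and are not the HP's actual variable names; the authors' own code is the reference for the exact procedure.

```python
from collections import defaultdict

# Hypothetical plot-crop records (illustrative field names and values).
plot_crop = [
    {"plot": "p1", "crop": "maize", "harvest_kg": 400.0,
     "harvest_value": 120.0, "improved_seed": 1, "pesticide": 0},
    {"plot": "p1", "crop": "beans", "harvest_kg": 100.0,
     "harvest_value": 80.0, "improved_seed": 0, "pesticide": 0},
    {"plot": "p2", "crop": "teff", "harvest_kg": 250.0,
     "harvest_value": 200.0, "improved_seed": 0, "pesticide": 1},
]


def aggregate_to_plot(rows):
    """Collapse plot-crop rows into one record per plot."""
    plots = defaultdict(lambda: {"harvest_kg": 0.0, "harvest_value": 0.0,
                                 "improved_seed": 0, "pesticide": 0})
    for r in rows:
        p = plots[r["plot"]]
        # Quantities and values are summed over the crops on the plot.
        p["harvest_kg"] += r["harvest_kg"]
        p["harvest_value"] += r["harvest_value"]
        # Indicators equal 1 if at least one crop on the plot has them.
        p["improved_seed"] = max(p["improved_seed"], r["improved_seed"])
        p["pesticide"] = max(p["pesticide"], r["pesticide"])
    return dict(plots)


plot_level = aggregate_to_plot(plot_crop)
```

Yields would then follow by dividing the aggregated harvest by the cultivated plot area, which is not modelled in this sketch.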
Other inputs, recorded at the plot level, include land input (cultivated plot and farm area in hectares); labor inputs, including labor days spent on the plot by family members, total labor days spent on the plot (including hired labor and free labor from non-family members), and hired labor value; use of organic and inorganic fertilizer; inorganic fertilizer quantity in nitrogen equivalent; and inorganic fertilizer value. We also include the number of seasonal crops grown on the plot, the “main crop” (i.e. the crop type with the highest value of production on the plot), and the share of output attributed to the “main crop” type.

Agricultural asset indices are computed using a principal component analysis (PCA), quantifying asset ownership in single dimensions drawn from inventories of household agricultural assets (from either the agriculture or the household questionnaire). A regression method is used to predict factor scores (Rencher, 2002).

Variables on the production environment and practices include crop shocks; irrigation status of the plot; use of erosion protection techniques; use of a tractor on the plot; plot ownership modalities, including plot ownership and possession of a formal title for the plot; livestock ownership; perennial crop production; number of plots under management (cultivated or not); number of plots left fallow by the household in the current agricultural season; and the beginning and end dates of the current agricultural season.

We draw on the household questionnaire to create variables describing plot manager characteristics (the plot manager is the household member mainly responsible for the cultivation of the plot), which include plot manager age, education, and gender. All variables described above are based on farmer self-reports.

In addition, we include key geospatial variables, which are provided as part of the publicly available data files.
These include agro-ecological zones, distance of the household from the nearest market and population center as well as distance of the plot from the household (in kilometers), plot slope, elevation, a wetness index, and soil quality indicators. The soil quality indicators contain binary variables for nutrient availability, nutrient retention, rooting conditions, oxygen availability, excess salts, toxicity, and workability. These variables are equal to one if there is no or only a slight constraint to the soil in each dimension. To summarize these variables into a single dimension, we compute a soil fertility index using a principal component analysis (PCA), which also employs a regression method to predict factor scores (Rencher, 2002). All the geospatial variables outlined above are obtained from remote sensing products and integrated via the geocoordinates of the interviews (without offset). They are provided alongside the raw survey data, and their integration was therefore conducted prior to the creation of the HP. More information, along with their origin, can be found in the surveys’ online documentation (see Table 1).

A few crucial processing steps, undertaken while creating the HP, must be highlighted:

• A fraction of plot areas is self-reported, and not measured with handheld GPS devices. Because plot area is needed to calculate key agricultural variables such as yields, and because self-reported plot areas are known to be inaccurate, we impute missing GPS area values. We employ a predictive mean matching method with five nearest neighbors, using self-reported areas along with administrative divisions to obtain linear predictions which serve as our distance variable (Little, 1988).

• For the agricultural variables, we focus on the “main” (or long) growing season of each country, that is, the datasets contain agricultural inputs and outputs that were used and produced during the main agricultural season.
As such, the minor agricultural seasons, for which the surveys sometimes record agricultural production data, are not covered in the HP. The exception is Uganda, which has a bimodal rainfall pattern in many areas. Here, two growing seasons are observed in each survey wave.

• The plot and plot-crop datasets contain only plot observations for years in which (i) they are cultivated in the long or main growing season and (ii) they contain a recorded harvest quantity for at least one crop. Perennial crops such as fruit or tree crops are excluded from the dataset. A harvest quantity of zero is only included if there is recorded evidence of an exogenous shock on the plot, such as a natural disaster. Other cases, such as non-response, are dropped from the dataset. As a result, the plot and plot-crop samples contain a subset of all the plots listed in the raw data.

• A set of key agricultural input and output variables is valued both in LCUs and in US dollars, to facilitate cross-country analysis. The valuation consists of calculating the sale or purchase price of the various categories of each input and output variable in each wave. Precise crop variety categories such as maize or teff are used for the calculation of output values. Categories of fertilizer types are used to calculate inorganic fertilizer prices, categories of men, women and children workers are used to calculate hired labor wages, and categories of crop types and improved status of seeds are used to calculate seed values. To account for the highly noisy nature of the price estimates, median EA-level input and output prices are calculated within each category (e.g. median EA-level maize price per kg, median hourly wage of male agricultural workers). If the number of price observations is below a threshold of 10, median prices are calculated across the next larger administrative unit.
If the number of observations in that larger administrative unit is also below 10, the next larger unit is chosen, and this process is repeated up to the national level. The prices are then multiplied by the output and input quantities recorded on the plot, which have themselves been converted to standardized quantity units such as kilograms. The resulting variables are in local currency units (LCUs). We provide another set of variables which are converted into USD using the exchange rate of the survey year and deflated to 2020 dollar values. The CPI and exchange rate data are drawn from World Bank collections of development indicators and consist of yearly time series, made available through the World Bank Open Data Initiative.

• “Main crop” types are defined to facilitate inter-country comparisons with the plot-level dataset. Each crop is classified into the following categories: barley, wheat, rice, sorghum, maize, millet, perennials (including fruit and tree crops), legumes, root crops, nuts, and a catch-all “other” category. In case users want to define their own categories, the exact names of the crops (i.e., as they appear in the raw data) are maintained in the plot-crop dataset.

• The plot-level dataset in Uganda is effectively at the parcel level, as GPS measurements were obtained at this level of granularity. This also allows users to merge across seasons, since parcels can be tracked across seasons, unlike plots. This is not the case for the plot-crop-level dataset, which is at the plot and crop level.

• In both Mali and Niger, perennial crops were provided at the household and crop level. In order to retain this information to some degree, and for merging purposes, each household-crop pair was associated with a bespoke ID. These lines can be identified by cases where the plot link identifier begins with “missing_”. To limit confusion in the plot dataset, however, these observations were not taken into account.
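The price valuation step in the list above (median prices within a category, with a fallback to larger administrative units when fewer than 10 observations are available) can be sketched as follows. This is an illustrative sketch rather than the Stata implementation in the repository: the function name, the data layout, and the level names are hypothetical, and unit codes are assumed to be unique across parent units.

```python
from statistics import median

MIN_OBS = 10  # threshold stated in the text


def fallback_median_price(observations, location, levels):
    """Return (level, median price) from the smallest unit with enough data.

    observations: list of (location_dict, price) pairs for ONE category and wave.
    location: dict giving the unit codes of the plot to be valued.
    levels: unit names ordered from smallest to largest, e.g. ["ea", "admin_1"].
    """
    for level in levels:
        prices = [p for loc, p in observations if loc[level] == location[level]]
        if len(prices) >= MIN_OBS:
            return level, median(prices)
    # Last resort: the national median over all observations in the category.
    return "national", median(p for _, p in observations)


# Hypothetical maize price observations: 4 in EA "A" and 8 in EA "B",
# both located in the same first-level administrative unit "N".
obs = [({"ea": "A", "admin_1": "N"}, 1.0)] * 4 + [({"ea": "B", "admin_1": "N"}, 2.0)] * 8
target = {"ea": "A", "admin_1": "N"}
level_used, price = fallback_median_price(obs, target, ["ea", "admin_1"])
```

In this example EA "A" has only four observations, so the median is taken over the twelve observations in the enclosing unit. The resulting price would then be multiplied by the standardized quantity recorded on the plot to obtain a value in LCU.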
Construction of the harmonized household and individual datasets

This section describes the HP household and individual datasets, which mostly contain variables deriving from the household questionnaires (see Figure 2). An important aspect of these datasets is that they include all households that are listed in the household questionnaire cover module, regardless of their agricultural activities. They therefore include close to 30,000 households in the seven countries which are not in the plot-crop or plot-level datasets.

The household-level dataset consists of variables from the household questionnaire that are key to analyzing livelihood and welfare outcomes. Included are access to electricity, household dependency ratios (defined as the ratio of individuals younger than 15 or older than 65 to the other members of the household), whether the household operates a nonfarm enterprise, and household size. Critically, the dataset also contains information on household consumption. The variables included are the total consumption value of the household in LCU and in USD and the consumption quintile of the household (where the quintiles are defined within country-waves). The consumption aggregates are provided in LCU by the data producers and are part of the publicly available survey data. More information on the steps taken to calculate the consumption aggregate can be found in the surveys’ online documentation (see Table 1). We also compute a household asset index using a principal component analysis (PCA), quantifying asset ownership in single dimensions drawn from inventories of household assets. A regression method is used to predict factor scores (Rencher, 2002). We further compute the household dietary diversity index, following guidelines from the Food and Agriculture Organization (FAO). It consists of the number of distinct food groups consumed by the household over the past 7 days (Kennedy et al., 2011).
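The dependency ratio and the dietary diversity index defined above are simple enough to illustrate directly. The sketch below follows the definitions in the text; the function names, the example household, and the item-to-food-group mapping are hypothetical and do not reproduce the FAO food group list or the authors' code.

```python
def dependency_ratio(ages):
    """Members younger than 15 or older than 65, divided by all other members."""
    dependents = sum(1 for age in ages if age < 15 or age > 65)
    others = len(ages) - dependents
    # The ratio is undefined when the household has no other members.
    return dependents / others if others else None


def dietary_diversity(food_group_by_item, items_consumed):
    """Count the distinct food groups among items consumed in the past 7 days."""
    return len({food_group_by_item[item] for item in items_consumed})


# Hypothetical household: two working-age adults, two children, one elderly member.
ratio = dependency_ratio([34, 31, 8, 4, 70])

# Hypothetical mapping from consumed items to food groups (assumption).
groups = {"maize flour": "cereals", "rice": "cereals", "beans": "pulses", "milk": "dairy"}
score = dietary_diversity(groups, ["maize flour", "rice", "beans"])
```

Here the household has three dependents and two other members, and the three consumed items fall into two distinct food groups.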
Finally, we include information on whether the household suffered any shocks in the 12 months prior to the interview. These variables are all drawn from self-reported data. In addition, a geospatial variable capturing population density is included in this dataset.

The individual-level dataset includes information about the household members, which is derived from the household questionnaire. A first set of demographic variables includes age, sex, and educational attainment, including indicators for any formal and primary education. A variable which identifies the members’ relationship to the household head has been included in text format, due to the absence of a unified set of codes. On time use and work/employment, we have indicators capturing any work done during the reference period, and the number of hours worked overall, on a farm, in a small enterprise, or in a wage job over a reference seven-day period (either the last seven days, or a typical seven days in the year). Additional variables were added to identify the industry of members’ wage jobs, if applicable. The industries include agriculture, fishing, manufacturing, mining, construction, and services. These variables are all based on respondent reporting. Moreover, anthropometric data were collected by measuring children’s height and weight during survey fieldwork. Using this information, a series of variables were computed for children aged 0 to 5 years: length/height-for-age, weight-for-height, BMI-for-age, and weight-for-age Z-scores. These estimates are calculated using the 2006 WHO child growth standards, and an indicator for wasting was created for children whose weight-for-height Z-score is below -2 relative to the WHO child growth standards (Onis et al., 2006).

Coding steps

The construction of HP datasets was done in Stata.
It consists of a series of structured Stata scripts (“do-files”), made available in a dedicated public GitHub repository (Bentze, 2024). A consistent processing protocol was executed in each wave. As a first step, plot-crop, plot, individual, and household frames were created, using the crop harvest modules (for the plot-crop and plot frames), the cover information of the household questionnaire (for the household frame), and the household roster (for the individual frame). Following a modularized approach, each variable in the HP was then cleaned and processed separately. This approach guarantees a stable sample size and facilitates reuse by data users, who can download the code and include new variables as they see fit (see “Usage Notes” below).

The coding operations described above are undertaken separately for each country-wave pair, in do files which are given the following naming structure: country ISO code, survey name, and wave number. For example, the do file for the first wave of the Ethiopia ESS is named ETH_ESS1.do. For each of the plot-crop, plot, household, and individual-level datasets, survey waves were then appended (i.e. combined) for each country, and panel unit IDs were created. During this process, variables were merged to their appropriate frames. Finally, variables were converted from LCU to USD, where relevant. These steps were undertaken in a series of do files named according to the following structure: ISO code followed by “append” (e.g. ETH_append.do). Finally, country-level datasets were appended to obtain the final cross-country HP datasets. The flow of Stata do-files is summarized in Figure 3.

Figure 3. Overview of the do file execution sequence for the creation of the HP. Note: Arrows indicate the do file execution flow. Two do files are created for each wave in Uganda, one for each season. Moreover, do files for the Tanzania NPS are divided across the refresh and extended samples for waves 4 and 5.
The dashed arrow and file illustrate an example in which a user adds custom-made variables assembled from raw UNPS data to the HP. This graph is intended to be illustrative, and does not present an exhaustive listing of the do files which need to be run in order to create the HP. All relevant files are listed in the Master.do file, which, if executed, will allow users to create the datasets from scratch.

3) Data records

As explained above, we provide detailed information at the levels of the household, plot, and specific crop on each plot. Each unit of observation can be uniquely identified by its ID, the survey ID, which marks records belonging to the different survey waves, and the season ID, which distinguishes between growing seasons within years in Uganda and is equal to one in all other countries. The HP is composed of four datasets, each at a different level of aggregation/unit of observation:

• A plot and crop-level dataset in which units are uniquely identified by the crop name variable, plot unit ID, survey wave ID and season ID.

• A plot-level dataset in which units are uniquely identified by plot ID and season ID.

• A household-level dataset in which units are uniquely identified by household ID, survey wave ID and season ID.

• An individual-level dataset, in which units are uniquely identified by individual ID, survey wave ID and season ID.

All datasets include a household ID. Individual IDs as well as plot IDs are nested within household IDs (see Usage Notes). All datasets are provided as Stata data files, and code is publicly available as Stata do-files on GitHub (Bentze, 2024).

4) Technical Validation

The LSMS-ISA surveys have undergone rigorous technical validation and methodological innovation throughout their lifetime.
For example, different approaches to land area measurement and labor time use have been tested in the context of rigorous field experiments, and their insights drawn upon to ensure state-of-the-art data collection methods (see, for example, Carletto et al., 2013). Moreover, prior to publishing the raw LSMS-ISA data, substantial efforts were carried out by the World Bank LSMS-ISA team to clean and correct errors stemming from data collection, editing and entry. Errors were often corrected at various stages of the fieldwork. In PAPI and CAFE interviews, Stata and CSPro software were used to identify potential errors such as out-of-range values and inconsistent skip patterns. When deemed necessary, paper questionnaires were reviewed, and data were revised. In some cases, data entry was performed multiple times to better identify data entry errors. The use of CAPI software such as Survey Solutions allowed for error and inconsistency flags in the field, and for real-time data availability which enabled high-frequency quality checks.

The harmonized variables created for and included in the HP were also validated. This involved inspecting key summary statistics of each variable (mean, median, minimum, maximum, and number of missing values) to detect any problems or irregularities. Suspicious values were further inspected and corrected if necessary. Each correction is noted in the directories of variables (Supplementary Tables 1 to 5).

5) Usage Notes

Using the datasets together

Each dataset contains both unit and linking identifiers. They can be used to merge the four datasets which comprise the HP. Observations can be uniquely identified by combining identifiers with the survey wave variable and the season indicator. The latter distinguishes observations between the two seasons in Uganda.
Household IDs are present in all datasets, and users can therefore perform one-to-many joins between the household dataset (using wave and household_id_obs as keys) and any other dataset. Moreover, users can perform a one-to-many merge between the plot dataset and plot-crop dataset (using wave and plot_id_obs as keys). These relationships are summarized in the chart below.

Figure 4. Framework for joining datasets in the HP. Note: The ‘key’ variables refer to the variables that uniquely identify observations.

Linking the Harmonized Panel with other datasets

Linking identifiers can be used to merge with the publicly available LSMS-ISA data files. The exact names of the variables with which the HP can be merged differ across the raw datasets, but all are listed in the variable directories (Supplementary Tables 1 to 5). The public LSMS-ISA data releases can be downloaded from the World Bank’s Microdata Library (see Table 1). To merge raw data files with the HP, users should download the raw data files in Stata format. The raw data package contains various datasets that correspond to the questionnaire modules of the four questionnaires (Figure 2). A list of the available datasets and variables can be found on the ‘Data Description’ site of the World Bank’s Microdata Library.

The variables lat_modified and lon_modified contain the anonymized location of households recorded in the HP datasets. These can be used to merge with other spatially referenced data, including GIS data, for example on climate, infrastructure, and others. By the same token, geographic identifier variables, such as Administrative Divisions, can be used to bring in geographic information recorded at these levels.

Using and altering the Stata codes

All Stata code used to process and format the raw data and create the HP is available on GitHub. A Master.do file allows users to view the sequence in which the files are executed to create the HP.
Executing the Master.do file will allow users to automatically set directories, download required community-contributed Stata commands, and run the sequence of do files to create the harmonized datasets. To successfully execute the Master.do file (which will thus sequentially execute all other do files), users need to specify their file path under the ‘global pathways’ section of the file. The folder structure for input data, temporary data and final data is already provided. However, users must populate the input data folders with raw data files downloaded from the World Bank’s Microdata Library (see Table 1 for links to the data).

The do files are designed to allow users to add variables according to their needs. As explained above (see “Coding Steps”), variables are created in discrete modules within each country-wave pair. They can then be stored in a user-defined location that is dedicated to temporary files. Users can therefore add their own variables by writing their own scripts to pre-process the data, and incorporate them into the final dataset by plugging their custom-made variables into the relevant dataset and country (see Figure 3 for an example). The only requirement for merging in new data is that the linking identifiers match. Users may refer to the variable directories to identify the variable name of the linking identifier in the raw data (Supplementary Tables 1 to 5). The LSMS-ISA raw data, which are publicly available (see Table 1), are typically structured according to questionnaire modules (e.g. module 1 from the household questionnaire). Users need to consult the survey questionnaires to identify which modules contain the variables of interest and pull the variables from the corresponding raw data file.
Once a user-made variable has been created from the raw LSMS-ISA data, and it is uniquely identified by the linking identifier, it can be merged into the appropriate dataset (either plot-crop, plot, household, or individual), within one of the do files prefixed with “Append_”. Users may follow the example of how other variables are merged into the HP data frames to do this. Once the new variable is successfully merged into the dataset, the remaining do files can be executed without further modification.

Performing analysis with the HP

The LSMS-ISA surveys are designed to be nationally representative. To obtain nationally representative estimates, users need to make use of the sampling weights, EA identifiers, and strata identifiers. As mentioned above, these three variables are present in all four datasets of the HP. Statistical software typically provides packages that enable use of these variables to estimate nationally representative indicators, such as Stata’s svy set of commands, which allow the running of statistical models with complex survey data. Users must use the sample weights (pw) as probability weights to obtain valid point estimates, and the EA identifiers (ea_id_obs) and strata identifiers (strataid) to obtain valid standard errors.

As noted above, variables that identify panel units end with the _obs suffix. Users are advised to use these variables when performing panel analysis. Panel identifiers exist for EAs (ea_id_obs), parcels (parcel_id_obs), households (hh_id_obs), individuals (indiv_id_obs), plot managers (plot_manager_id_obs), and respondents (respondent_id_obs). Moreover, while EA and stratum variables remain fixed, households may move during the lifetime of the panel. A variable named geocoords_id is provided to identify each unique set of geocoordinates. The time dimension is identified by country, wave (denoting the survey round), and season variables. The latter is necessary to distinguish between seasons in Uganda.
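To make the role of the sampling weights concrete, the following minimal sketch computes a probability-weighted mean using the pw variable. It covers point estimates only: valid standard errors additionally require the EA and strata identifiers and survey-aware routines such as Stata's svy commands, which this sketch does not attempt to replicate. The function name, the `consumption` field, and the example records are hypothetical.

```python
def weighted_mean(rows, var, weight="pw"):
    """Probability-weighted mean of `var`, skipping rows where `var` is missing."""
    valid = [r for r in rows if r[var] is not None]
    total_weight = sum(r[weight] for r in valid)
    return sum(r[weight] * r[var] for r in valid) / total_weight


# Hypothetical household records: each household stands in for `pw` households
# in the population, so the second household counts three times as much.
households = [
    {"pw": 100.0, "consumption": 1.0},
    {"pw": 300.0, "consumption": 3.0},
    {"pw": 50.0, "consumption": None},  # missing values are left out, not imputed
]
estimate = weighted_mean(households, "consumption")
```

The unweighted mean of the two observed values would be 2.0, whereas the weighted estimate is pulled toward the household that represents more of the population.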
Variables identifying the administrative area codes of the households’ residences were also added to the dataset (admin_1, admin_2, admin_3). However, codes may have changed throughout the lifetime of the panel (for instance, when administrative boundaries were redrawn), such that we cannot guarantee persistent identifiers. When possible, administrative area names were also included (admin_1_name, admin_2_name, admin_3_name).

We have deliberately refrained from extensive outlier correction, as this may depend on users’ analytical preferences. Users are therefore advised to check for extreme and outlier values, and winsorize or trim variables as they see fit. Moreover, missing values were not imputed, except for GPS plot areas (see the Methods section above). Users may wish to implement imputation methods that suit their research needs.

Bibliography

Adamopoulos, T., Restuccia, D., 2020. Land Reform and Productivity: A Quantitative Analysis with Micro Data. Am. Econ. J. Macroecon. 12, 1–39. https://doi.org/10.1257/mac.20150222
Azevedo, J.P., 2011. WBOPENDATA: Stata module to access World Bank databases.
Bentze, T., 2024. Github repository: LSMS-ISA-harmonised-dataset-on-agricultural-productivity-and-welfare: https://github.com/lsms-worldbank/LSMS-ISA-harmonised-dataset-on-agricultural-productivity-and-welfare.
Carletto, C., 2021. Better data, higher impact: improving agricultural data systems for societal change. Eur. Rev. Agric. Econ. 48, 719–740. https://doi.org/10.1093/erae/jbab030
Carletto, C., Savastano, S., Zezza, A., 2013. Fact or artifact: The impact of measurement errors on the farm size–productivity relationship. J. Dev. Econ. 103, 254–261. https://doi.org/10.1016/j.jdeveco.2013.03.004
Central Statistical Agency, Living Standards Measurement Study (LSMS), World Bank, 2021. Ethiopia Socioeconomic Survey (ESS), ESS Panel II, 2018/2019.
Dercon, S., Gollin, D., 2014. Agriculture in African Development: Theories and Strategies. Annu. Rev. Resour. Econ.
6, 471–492. https://doi.org/10.1146/annurev-resource-100913-012706
Food and Agriculture Organization of the United Nations, 2015. World programme for the census of agriculture 2020.
Gollin, D., 2010. Chapter 73 Agricultural Productivity and Economic Growth, in: Handbook of Agricultural Economics. Elsevier, pp. 3825–3866. https://doi.org/10.1016/S1574-0072(09)04073-0
Gollin, D., Lagakos, D., Waugh, M.E., 2014. Agricultural Productivity Differences across Countries. Am. Econ. Rev. 104, 165–170. https://doi.org/10.1257/aer.104.5.165
Gollin, D., Lagakos, D., Waugh, M.E., 2013. The agricultural productivity gap. Q. J. Econ. 129, 939–993.
Gollin, D., Udry, C., 2021. Heterogeneity, Measurement Error, and Misallocation: Evidence from African Agriculture. J. Polit. Econ. 129, 1–80. https://doi.org/10.1086/711369
Kennedy, G., Ballard, T., Dop, M.C., 2011. Guidelines for measuring household and individual dietary diversity. Food and Agriculture Organization of the United Nations, Rome.
Little, R.J.A., 1988. Missing-Data Adjustments in Large Surveys. J. Bus. Econ. Stat. 6, 287–296. https://doi.org/10.1080/07350015.1988.10509663
Ministry of Finance, National Institute of Statistics, 2016. Basic Information Document: 2014 National Survey on Household Living Conditions and Agriculture (ECVM/A-2014).
Ministry of Rural Development, 2019. Basic Information Document: Enquete Agricole de Conjoncture Intégrée aux conditions de vie des ménages (EAC-I 2017).
National Bureau of Statistics, 2021a. Basic Information Document: Nigeria General Household Survey–Panel 2018/19.
National Bureau of Statistics, 2021b. Basic Information Document: National Panel Survey (NPS 2019-2020), Extended Panel with Sex-Disaggregated Data.
National Statistical Office, 2020. Basic Information Document: Malawi Integrated Household Panel Survey (IHPS) 2019.
Onis, M. de, Elaine Borghi, Onyango, A., Amani Siyam, Alain Pinol, 2006.
Length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age; methods and development, WHO child growth standards. WHO Press, Geneva.
Rencher, A.C., 2002. Methods of multivariate analysis, 2nd ed, Wiley series in probability and mathematical statistics. J. Wiley, New York.
Restuccia, D., Rogerson, R., 2013. Misallocation and productivity. Rev. Econ. Dyn. 16, 1–10. https://doi.org/10.1016/j.red.2012.11.003
Wollburg, P., Bentze, T., Lu, Y., Udry, C., Gollin, D., 2024a. Crop yields fail to rise in smallholder farming systems in sub-Saharan Africa. Proc. Natl. Acad. Sci. 121, e2312519121. https://doi.org/10.1073/pnas.2312519121
Wollburg, P., Markhof, Y., Bentze, T., Ponzini, G., 2024b. Substantial impacts of climate shocks in African smallholder agriculture. Nat. Sustain. https://doi.org/10.1038/s41893-024-01411-w
World Bank, 2022. Poverty and Inequality Platform (database).
World Bank, 2021. World Bank Open Data.

Acknowledgments

We are grateful to the World Bank LSMS team and the National Statistical Offices for their efforts to collect and publish the LSMS-ISA data and to support this data harmonization effort. We would like to thank, in particular, Alemayehu Ambel, Asmelash Haile Tsegay, Manex Bule Yonis, Wondu Yemanebirhan Kassa (Ethiopia), Heather Moylan, Wilbert Drazi Wondru (Malawi), Marco Tiberti, Ismael Yacoubou Djima (Mali, Niger), Akiko Sagesaka, Ivette Contreras, Gbemisola Oseni, Kevin McGee (Nigeria), Akuffo Amankwah, Amparo Palacios Lopez (Tanzania), Giulia Ponzini (Uganda). We would also like to thank Siobhan Murray for geospatial support, and Gero Carletto and Talip Kilic for their guidance and comments. This work was supported by the World Bank Research Support Budget grant “On agricultural productivity measurement in Sub-Saharan Africa”.

Supplementary information

Supplementary Table 1.
Directory of identifiers (present in all datasets)

Columns: Variable name; Label; Data processing notes; Additional notes.

country
  Label: Country name

wave
  Label: Wave number

household_id_obs
  Label: Household ID (panel identificator)
  Data processing notes: Variable created by assigning a numeric code to each unique household. The first number of the code determines the country.

household_id_merge
  Label: Household ID (to merge with raw data)
  Data processing notes: ID to merge with raw LSMS data
  Additional notes: Can be merged with the following variables in the raw data (concatenated variables are separated by hyphens):
    ETH: household_id (wave 1), household_id2 (wave 2), household_id2 (wave 3), household_id (wave 4), household_id (wave 5)
    MWI: case_id (wave 1), y2_hhid (wave 2), y3_hhid (wave 3), y4_hhid (wave 4)
    MLI: concatenation of grappe and menage (wave 1), concatenation of grappe and exploitation (wave 2)
    NER: hid (wave 1), concatenation of GRAPPE, MENAGE and EXTENSION (wave 2)
    NGA: hhid (wave 1), hhid (wave 2), hhid (wave 3), hhid (wave 4)
    TZA: hhid (wave 1), y2_hhid (wave 2), y3_hhid (wave 3), y4_hhid (wave 4), sdd_hid (wave 5)
    UGA: Hhid (wave 1), HHID (wave 2), HHID (wave 3), HHID (wave 4), HHID (wave 5), hhid (wave 7), hhid (wave 8)

season
  Label: Season ID (UGA)
  Data processing notes: Variable created to distinguish between seasons in UGA, which has a bimodal seasonal pattern, and households are therefore observed twice within each wave
  Additional notes: This variable is set to 1 by default in all other countries.

ea_id_obs
  Label: EA ID
  Data processing notes: Variable created by assigning a numeric code to each unique EA. The first number of the code determines the country.
  Additional notes: Longitudinal ID to track units through time

ea_id_merge
  Data processing notes: ID to merge with raw LSMS data
  Additional notes: Can be merged with the following variables:
    ETH: ea_id (wave 1), ea_id2 (wave 2), ea_id2 (wave 3), ea_id (wave 4), ea_id (wave 5)
    MWI: ea_id (wave 1), ea_id (wave 2), ea_id (wave 3), ea_id (wave 4)
    MLI: not uniquely identified by ea, ea_id in ag questionnaire (wave 1), no community questionnaire, ea_id in ag questionnaire (wave 2)
    NER: not uniquely identified by ea, concatenation of ms00q10 ms00q11 ms00q12 ms00q14 (wave 1), not uniquely identified by ea, concatenation of MS00Q10 MS00Q11 MS00Q12 MS00Q14 (wave 2)
    NGA: concatenation of lga and ea (wave 1), concatenation of lga and ea (wave 2), concatenation of lga and ea (wave 3), concatenation of lga and ea (wave 4)
    TZA: concatenation of region, district, ward and ea (wave 1), carried over from wave 1 for waves in extended panel. Concatenation of hh_a01_1 hh_a02_1 hh_a03_1 hh_a04_1 (wave 4, refresh), carried over for wave 5 refresh.
    UGA: comm (wave 1), carried over from wave 1 (waves 2, 3, 4)

pw
  Label: Household weight

strataid
  Label: Strata ID
  Data processing notes: Variable created by assigning a numeric code to each unique stratum. The first number of the code determines the country.

urban
  Label: Is this an urban EA?

admin_1
  Label: Administrative level 1
  Additional notes: Source: https://data.humdata.org

admin_2
  Label: Administrative level 2
  Additional notes: Source: https://data.humdata.org

admin_3
  Label: Administrative level 3
  Additional notes: Source: https://data.humdata.org

admin_1_name
  Label: Name of administrative level 1

admin_2_name
  Label: Name of administrative level 2
  Additional notes: ETH: not available. MLI: not available

admin_3_name
  Label: Name of administrative level 3
  Additional notes: ETH: not available. MLI: not available

lat_modified, lon_modified
  Label: EA Longitude (WGS84) Modified; EA Latitude (WGS84) Modified
  Additional notes: “Modified” refers to the offset of the coordinates.
    TZA: in waves 4 (extended panel only) and 5, only households who have not moved could be assigned geocoordinates.
    UGA: missing in waves 4, 5, 7, 8

geocoords_id
  Label: Geocoordinate ID
  Data processing notes: Variable created by assigning a numeric code to each unique pair of geocoordinates. The first number of the code determines the country.
  Additional notes:
    TZA: in waves 4 (extended panel only) and 5, only households who have not moved could be assigned geocoordinates.
    UGA: missing in waves 4, 5, 7, 8

Supplementary Table 2. Directory of variables in the plot-crop-level dataset

Columns: Variable name; Label; Data processing notes; Additional notes.

crop_name
  Label: Crop name
  Data processing notes: In the absence of a harmonised list of codes for crops, this variable is coded as a string.

plot_id_obs
  Label: Plot ID (panel identificator)
  Data processing notes: Variable created by assigning a numeric code to each unique plot. The first number of the code determines the country. UGA: plot is equal to the parcel ID in the Plot-level dataset

plot_id_merge
  Label: Plot ID (to merge with raw data)
  Data processing notes: ID to merge with raw LSMS data
  Additional notes: Can be merged with the following variables in the raw data (concatenated variables are separated by hyphens):
    ETH: concatenation of holder_id, parcel_id and field_id (wave 1), concatenation of holder_id, parcel_id and field_id (wave 2), concatenation of holder_id, parcel_id and field_id (wave 3), concatenation of holder_id, parcel_id and field_id (wave 4), concatenation of holder_id, parcel_id and field_id, OR concatenation of holder_id, and field_id for households answering section 12c (wave 5)
    MWI: concatenation of case_id and plot ID (wave 1), concatenation of y2_hhid and plot ID (wave 2), concatenation of y3_hhid, garden ID and plot ID (wave 3), concatenation of y4_hhid, garden ID and plot ID (wave 4)
    MLI: concatenation of grappe menage, bloc and parcelle (wave 1), concatenation of grappe menage, bloc and parcelle (wave 2)
    NER: concatenation of hid, field number and parcel number (wave 1), concatenation of GRAPPE, MENAGE, EXTENSION, champ and parcelle (wave 2)
    NGA: concatenation of hhid and plot ID (wave 1), concatenation of hhid and plot ID (wave 2), concatenation of hhid and plot ID (wave 3), concatenation of hhid and plot ID (wave 4)
    TZA: concatenation of hhid and plot number (wave 1), concatenation of y2_hhid and plot number (wave 2), concatenation of y3_hhid and plot number (wave 3), concatenation of y4_hhid and plot number (wave 4), concatenation of sdd_hhid and plot number (wave 5)
    UGA: concatenation of Hhid, parcel ID and plot ID (wave 1), HHID, parcel ID and plot ID (wave 2), HHID, parcel ID and plot ID (wave 3), HHID, parcel ID and plot ID (wave 4), HHID, parcel ID and plot ID (wave 5), hhid, parcel ID and plot ID (wave 7), hhid, parcel ID and plot ID (wave 8)

parcel_id_obs
  Label: Parcel ID (panel identificator)
  Data processing notes: Variable created by assigning a numeric code to each unique parcel. The first number of the code determines the country
  Additional notes: Only tracked through waves in Ethiopia and Malawi.
    NGA, TZA, MWI w1: absent
    UGA: plot is equal to the parcel ID in the Plot-level dataset. Parcel ID is set to missing to prevent confusion

parcel_id_merge
  Label: Parcel ID (to merge with raw data)
  Data processing notes: ID to merge with raw LSMS data. Can be merged with the following variables in the raw data (concatenated variables are separated by hyphens):
    ETH: concatenation of holder_id, and parcel_id (waves 1 to 5), absent for households answering section 12c (wave 5)
    MWI: concatenation of y2_hhid and garden ID (wave 2), concatenation of y3_hhid, garden ID (wave 3), concatenation of y4_hhid, garden ID (wave 4)
    MLI: concatenation of grappe menage, bloc (wave 1), concatenation of grappe menage, bloc (wave 2)
    NER: concatenation of hid, field number (wave 1), concatenation of GRAPPE, MENAGE, EXTENSION, champ (wave 2)
    UGA: concatenation of Hhid, parcel ID (wave 1), HHID, parcel ID (wave 2), HHID, parcel ID (wave 3), HHID, parcel ID (wave 4), HHID, parcel ID (wave 5), hhid, parcel ID (wave 7), hhid, parcel ID (wave 8)
  Additional notes:
    NGA, TZA, MWI w1: absent
    UGA: plot is equal to the parcel ID in the Plot-level dataset. Parcel ID is set to missing to prevent confusion

harvest_end_month
  Label: Harvest end month
  Data processing notes: If different crops harvested at various times, latest date is chosen. ETH: Dates had to be converted, so there is a one-week discrepancy between the months in the data and the months in our Gregorian calendar.
  Additional notes: UGA: absent wave 1

planting_month
  Label: Month of planting
  Data processing notes: If different crops planted at various times, earliest date is chosen. ETH: Dates had to be converted, so there is a one-week discrepancy between the months in the data and the months in our Gregorian calendar.
  Additional notes: TZA: absent. UGA: absent wave 1, wave 2

harvest_kg
  Label: Total harvest in kg
  Data processing notes:
    ETH: using conversion factor database, or enumerator conversion if not available.
    NGA: In w4 and w3, full harvest quantity was imputed based on realized harvest at the time of the interview + expected harvest after the interview.
    MLI: full harvest quantity imputed based on a “% area left to harvest” variable.
    MWI: Post planting harvest may have been reported in post-harvest period.
  Additional notes: Summed up to the plot level in the plot dataset

harvest_value_LCU
  Label: Value of plot harvest, in LCU
  Data processing notes: Harvest valued using median prices in a defined geographical area.
  Additional notes: Summed up to the plot level in the plot dataset
The area in question is the EA (or admin 1 in some surveys) under the condition that there are more than 10 observed sales in the area. If there are less than 10 sales, prices are calculated at a higher geographical level. Prices are calculated independently for each crop type. UGA: sale prices multiplied by 100 in wave 5 and divided by 100 in wave 1, season1. harvest_value_USD Value of plot Harvest values Summed up to the plot level in the harvest, in (harvest_value_LCU) are plot dataset USD converted to current USD and deflated to 2020 dollar values, using exchange rates and a deflator extracted from World 26 Bank databases (Azevedo et al., 2011) seed_kg Quantity of NER: Seeds are not at the plot Summed up to the plot level in the seeds (in kg) level, they are at the plot dataset household level. We divide TZA: Absent in wave 1 and wave 2 seeds among plots based on UGA: Absent in wave 1, wave 2, the distribution of plot areas. wave 3 seed_value_LCU Value of Seeds valued using median Summed up to the plot level in the seeds in LCU prices per in a defined plot dataset geographical area. The area in question is the EA (or admin 1 in some surveys) under the condition that there are more than 10 observed sales in the area. If there are less than 10 sales, prices are calculated at a higher geographical level. Prices are calculated independently for each crop type and separately for improved and traditional seeds, where possible TZA: No question asking for the quantity of seeds used, in w1 and w2, only the value of seeds bought. Median prices could therefore not be computed, and seed values are taken as they are reported. Seeds only taken into account if they were bought. MLI: all seeds are assumed to be purchased (no question asking for the quantity purchased). This might lead to an underestimation of seed prices as a portion may be obtained for free as gifts or leftover from the previous season. 
  UGA: all seeds are assumed to be purchased (no question asking for the quantity purchased). This might lead to an underestimation of seed prices as a portion may be obtained for free as gifts or leftover from the previous season. In w1, w2, w3 no question asking for the quantity of seeds used, only value. Median prices could therefore not be computed, and seed values are taken as they are reported. Seeds only taken into account if they were bought.

Variable name: seed_value_USD
Label: Value of seeds in USD
Data processing notes: Seed values (seed_value_LCU) are converted to current USD and deflated to 2020 dollar values, using exchange rates and a deflator extracted from World Bank databases (Azevedo et al., 2011)

Variable name: improved
Label: Does this plot contain improved seeds? (default: traditional)
Additional notes:
  NGA: absent in w1, w2
  TZA: absent in w1, w2
  UGA: absent in w1, w2, w3

Variable name: used_pesticides
Label: Were pesticides used on this plot?

Variable name: crop_shock
Label: Did a shock affect crops in the current season?
Additional notes:
  UGA: crop shocks are missing in w7, w5

Variable names: pests_shock, rain_shock, drought_shock, flood_shock
Labels:
  pests_shock: Were crops affected by pests in the current agricultural season?
  rain_shock: Were crops affected by rains in the current agricultural season?
  drought_shock: Were crops affected by drought in the current agricultural season?
  flood_shock: Were crops affected by floods in the current agricultural season?
Additional notes:
  Flood shock is absent in MWI w1, MLI all waves
  Rain shock is absent in NER all waves, UGA all waves
  Rain and flood shocks could not be distinguished in TZA
  Shocks only reported on plots with full losses in NGA waves 1 to 3, and UGA
  UGA: crop shocks are missing in w7, w5

Supplementary Table 3.
Directory of variables in the plot-level dataset
Columns: Variable name; Label; Data processing notes; Additional notes

Variable name: main_crop
Label: Crop with the highest value on the plot
Data processing notes: Crops have been categorized in 10 groups for comparability:
  - Barley
  - Legumes
  - Maize
  - Millet
  - Nuts
  - Rice
  - Sorghum
  - Tubers
  - Wheat
  - Other

Variable name: plot_id_obs
Label: Plot ID (panel identifier)
Data processing notes: (see above)

Variable name: plot_id_merge
Label: Plot ID (to merge with raw data)
Data processing notes: (see above)

Variable name: parcel_id_obs
Label: Parcel ID (panel identifier)
Data processing notes: (see above)

Variable name: parcel_id_merge
Label: Parcel ID (to merge with raw data)
Data processing notes: (see above)

Variable name: plot_area_GPS
Label: Area (in hectares)
Data processing notes: When plot areas are absent, they have been imputed using a multiple imputation approach which relies on self-reported areas and administrative areas. We employ a predictive mean matching method with five nearest neighbours, using self-reported areas along with administrative divisions to obtain linear predictions which serve as our distance variable (Little, 1988).
Additional notes:
  UGA: 80% of plots missing in wave 7

Variable name: yield_kg
Label: Yield quantity (harvest in kg/ha)
Data processing notes: Ratio between plot-level harvest_kg and plot_area_GPS

Variable name: yield_value_LCU
Label: Yield value (harvest value/ha), in LCU
Data processing notes: Ratio between plot-level harvest_value_LCU and plot_area_GPS

Variable name: yield_value_USD
Label: Yield value (harvest value/ha), in USD
Data processing notes: Ratio between plot-level harvest_value_USD and plot_area_GPS

Variable name: farm_size
Label: Farm size (ha)
Data processing notes: Sum of the size of all plots (regardless of whether they are cultivated; could include plots with perennial crops, or fallowed plots)

Variable name: plot_manager_id_obs
Label: Unique plot manager ID
Data processing notes: Only first manager considered when multiple are cited. This ID can be merged with indiv_obs_id in the individual-level dataset
Additional notes:
  UGA: absent when there are multiple managers in wave 7

Variable name: plot_manager_id_merge
Data processing notes: Only first manager considered when multiple are cited. This ID can be merged with indiv_obs_merge in the individual-level dataset

Variable name: formal_education_manager
Label: Does the plot manager possess any formal education?
Variable name: primary_education_manager
Label: Did the plot manager complete primary school?
Additional notes: “Primary education” consists of:
  - 6 years in NGA
  - 6 years in MLI
  - 7 years in TZA
  - 8 years in ETH
  - 6 years in NER
  - 8 years in MWI
  - 7 years in UGA
  NGA: not asked for old members who are not in school (w1, w2)

Variable name: age_manager
Label: Age (in years) of the plot manager
Data processing notes: Capped at 100

Variable name: female_manager
Label: Is the plot manager female?
Additional notes:
  NER: plot manager = hh manager if managed jointly

Variable name: married_manager
Label: Is the plot manager married?
Additional notes:
  UGA: w7 has no ID information on plots managed jointly. The resulting sample could be biased.

Variable name: respondent_id_obs
Data processing notes: This ID can be merged with indiv_obs_id in the individual-level dataset
Additional notes:
  TZA: absent in wave 5
  UGA: absent in wave 1, wave 2

Variable name: respondent_id_merge
Data processing notes: This ID can be merged with indiv_obs_merge in the individual-level dataset
Additional notes:
  TZA: absent in wave 5
  UGA: absent in wave 1, wave 2

Variable name: formal_education_respondent
Label: Does the plot respondent possess any formal education?
Additional notes:
  TZA: absent in wave 5
  UGA: absent in wave 1, wave 2

Variable name: primary_education_respondent
Label: Did the plot respondent complete primary school?
Additional notes:
  TZA: absent in wave 5
  NGA: not asked for old members who are not in school (w1, w2)
  UGA: absent in wave 1, wave 2
  “Primary education” consists of:
  - 6 years in NGA
  - 6 years in MLI
  - 7 years in TZA
  - 8 years in ETH
  - 6 years in NER
  - 8 years in MWI
  - 7 years in UGA

Variable name: age_respondent
Label: What is the age of the respondent?
Data processing notes: Capped at 100
Additional notes:
  TZA: absent in wave 5
  UGA: absent in wave 1, wave 2

Variable name: female_respondent
Label: Is the respondent female?
Additional notes:
  TZA: absent in wave 5
  UGA: absent in wave 1, wave 2

Variable name: married_respondent
Label: Is the respondent married?
Additional notes:
  TZA: absent in wave 5
  UGA: absent in wave 1, wave 2

Variable name: nb_seasonal_crop
Label: Number of seasonal crops grown on plot

Variable name: intercropped
Label: Is any crop intercropped?
Variable names: contains_BARLEY, contains_LEGUMES, contains_MAIZE, contains_MILLET, contains_OTHER, contains_RICE, contains_SORGHUM, contains_TUBERS, contains_WHEAT, contains_NUTS
Labels (in the same order): Plot contains BARLEY; Plot contains LEGUMES; Plot contains MAIZE; Plot contains MILLET; Plot contains OTHER; Plot contains RICE; Plot contains SORGHUM; Plot contains TUBERS; Plot contains WHEAT; Plot contains NUTS

Variable names: share_value_BARLEY, share_value_LEGUMES, share_value_MAIZE, share_value_MILLET, share_value_NUTS, share_value_OTHER, share_value_RICE, share_value_SORGHUM, share_value_TUBERS, share_value_WHEAT
Labels (in the same order): Share of plot value derived from BARLEY; from LEGUMES; from MAIZE; from MILLET; from NUTS; from OTHER; from RICE; from SORGHUM; from TUBERS; from WHEAT
Data processing notes: Shares do not take perennial crops into account.

Variable name: maincrop_valueshare
Label: Share of plot value attributed to main crop

Variable names: inorganic_fertilizer, organic_fertilizer
Labels:
  inorganic_fertilizer: Has at least one inorganic fertilizer been used on this plot?
  organic_fertilizer: Has at least one organic fertilizer been used on this plot?
Additional notes:
  NGA: no dedicated section to organic fertilizers before w3

Variable name: nitrogen_kg
Label: Nitrogen equivalent of applied inorganic fertilizer (kg)
Additional notes:
  ETH: UREA (46-0-0), DAP (18-46-0), and NPS (19-46-7).
  NGA: UREA and NPK (20-10-10 appears to be the most prevalent blend).
  MLI: UREA, DAP, NPK and cereal complex (15-15-15) are included. A few less prevalent inorganic fertilizers are left out.
  TZA: CAN and SA also used.

Variable name: inorganic_fertilizer_value_LCU
Label: Inorganic fertilizer value, in LCU
Data processing notes: Inorganic fertilizers valued using median prices within a defined geographical area. The area in question is the EA (or admin 1 in some surveys) under the condition that there are more than 10 observed sales in the area. If there are fewer than 10 sales, prices are calculated at a higher geographical level. When possible, separate prices were calculated for units that couldn’t be converted into kilograms.
Additional notes:
  ETH (w1), TZA (all waves): all fertilizers are assumed to be purchased (no question asking for the quantity purchased). This might lead to an underestimation of fertilizer prices as a portion may be obtained for free as gifts or leftover from the previous season. Moreover, purchased fertilizers are at the plot holder level in ETH (wave 1), and purchases are assumed to be equally distributed among plots.

Variable name: inorganic_fertilizer_value_USD
Label: Inorganic fertilizer value, in USD
Data processing notes: Fertilizer values (inorganic_fertilizer_value_LCU) are converted to current USD and deflated to 2020 dollar values, using exchange rates and a deflator extracted from World Bank databases (Azevedo et al., 2011)

Variable name: total_family_labor_days
Label: Total family labor days on the plot
Additional notes:
  UGA: Absent in wave 7.

Variable name: total_hired_labor_days
Label: Total hired labor days on the plot
Additional notes:
  UGA: Absent in wave 7, season 2.

Variable name: hired_labor_value_LCU
Label: Value of hired labor, in LCU
Data processing notes: Hired labor valued using median wages within a defined geographical area. The area in question is the EA (or admin 1 in some surveys) under the condition that there are more than 10 observed wages in the area. If there are fewer than 10 wages, wages are calculated at a higher geographical level. When possible, separate wages were calculated for men, women and children.
Additional notes:
  TZA: overall median wages calculated, then applied to total number of labor days (no man/child/woman wage distinction). Moreover, “free” labor is not distinguished from paid labor, and wages may therefore appear lower than reality.
Variable name: hired_labor_value_USD
Label: Value of hired labor, in USD
Data processing notes: Hired labor values (hired_labor_value_LCU) are converted to current USD and deflated to 2020 dollar values, using exchange rates and a deflator extracted from World Bank databases (Azevedo et al., 2011)

Variable name: total_labor_days
Label: Total labor days on the plot
Data processing notes: Family labor days + hired labor days + “other” labor days (non-hired and non-family)
Additional notes:
  NGA: no PP harvest in wave 1. No “other” labor until wave 4.
  TZA: no “other” labor.
  UGA: no family labor in wave 7. No “other” labor after wave 3

Variable name: irrigated
Label: Is the plot irrigated?

Variable name: tractor
Label: Did the household use a tractor in this season?
Additional notes:
  ETH: absent in wave 1
  UGA: absent in waves 1 and 2

Variable name: plot_owned
Label: Farmer declares owning the plot
Additional notes: Owned = purchased, inherited, granted by local leaders…

Variable name: plot_certificate
Label: Possession of a certificate for the plot
Additional notes:
  MWI: absent in wave 1 and wave 4

Variable name: erosion_protection
Label: Is the plot protected from erosion?
Additional notes:
  NGA: absent in w1, w2. Only for plots with erosion problems in wave 3.
  UGA: only for plots with erosion problems from wave 4 onwards

Variable name: livestock
Label: Is the respondent engaged in livestock activities?

Variable name: ag_asset_index
Label: Agricultural assets index
Data processing notes: Index based on a PCA of all agricultural assets owned by the household
Additional notes:
  UGA: missing in wave 1

Variable name: perennial_crops_hh
Label: Does this household grow perennial crops?

Variable name: nb_fallow_plots
Label: Number of fallow plots under household management

Variable name: nb_plots
Label: Number of plots under household management

Variable name: share_kg_sold
Label: Share of harvest output (in kg) sold
Data processing notes: Set to missing if reported share is above 100%. Set to missing if household does not harvest any crops.
Variable name: agro_ecological_zone
Label: Agro-ecological Zones
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above

Variable name: dist_popcenter
Label: HH Distance in (KMs) to Nearest Population Center with +20,000
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above

Variable name: dist_market
Label: HH Distance in (KMs) to Nearest Market
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above. Also missing in MWI, MLI, NER, and all of TZA and UGA

Variable name: elevation
Label: Elevation (m)
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above

Variable name: twi
Label: Potential wetness index
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above

Variable names: nutrient_availability, nutrient_retention, rooting_conditions, oxygen_availability, excess_salts, toxicity, workability
Labels (in the same order): Nutrient Availability; Nutrient Retention; Rooting conditions; Oxygen availability; Excess salts; Toxicity; Workability
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above

Variable name: soil_fertility_index
Label: Soil fertility index
Data processing notes: PCA of soil variables above

Variable name: plot_slope
Label: Plot Slope (percent)
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above

Variable name: plot_dist_household
Label: Plot Distance in (KMs) to HH
Additional notes: From the geovariables files, which are missing in NER w2, TZA w4 and w5, UGA wave 4 and above. Only present in ETH, MWI.

Supplementary Table 4. Directory of variables in the household-level dataset
Columns: Variable name; Label; Data processing notes; Additional notes

Variable name: hh_size
Label: Household size

Variable name: hh_shock
Label: Was the household negatively impacted by a shock over the past 12 months?
Additional notes:
  MLI: 3 year recall period.
  TZA: 2 year recall period in w4, w5.

Variable name: hh_electricity_access
Label: Does the household have access to electricity?
Additional notes: Houses which have a generator/solar panels but aren’t connected to a grid are also recorded as having access to electricity

Variable name: hh_dependency_ratio
Label: Household dependency ratio
Additional notes: Individuals that are younger than 15 or older than 65 are classified as “dependents”

Variable name: hh_formal_education
Label: Does anyone in the household possess any formal education?

Variable name: nonfarm_enterprise
Label: Someone in household owns a non-farm enterprise

Variable name: totcons_LCU
Label: Consumption aggregate per capita, in LCU
Additional notes:
  MWI: absent in w3, w4
  MLI: absent in w2

Variable name: totcons_USD
Label: Consumption aggregate per capita, in USD
Data processing notes: Consumption aggregates (totcons_LCU) are converted to current USD and deflated to 2020 dollar values, using exchange rates and a deflator extracted from World Bank databases (Azevedo et al., 2011)
Additional notes:
  TZA: consumption aggregates for the refresh sample, wave 5, include rent and durables, which is not the case elsewhere.

Variable name: cons_quint
Label: Household consumption quintile
Data processing notes: Quintile of totcons. This amounts to a within-wave quintile

Variable name: hh_asset_index
Label: Household asset index
Data processing notes: Index based on a PCA of all assets owned by the household

Variable name: HDDS
Label: Household dietary diversity index
Data processing notes: Index derived from a household consumption module. See the manuscript for more information
Additional notes:
  MLI: absent in w2

Supplementary Table 5. Directory of variables in the individual-level dataset
Columns: Variable name; Label; Data processing notes; Additional notes

Variable name: indiv_id_obs
Label: Individual ID (panel identifier)
Data processing notes: Variable created by assigning a numeric code to each unique individual. The first number of the code determines the country.
Variable name: indiv_id_merge
Label: Individual ID (to merge with raw data)
Data processing notes: ID to merge with raw LSMS data
Additional notes: Can be merged with the following variables in the raw data (concatenated variables are separated by hyphens):
  ETH: individual_id (wave 1), individual_id2 (wave 2), individual_id2 (wave 3), concatenation of household_id and individual_id (wave 4), concatenation of household_id and individual_id (wave 5)
  MWI: PID (waves 1 to 4)
  MLI: concatenation of grappe menage, and numero d’ordre (wave 1), concatenation of grappe exploitation and numero d’ordre (wave 2)
  NER: concatenation of hhid and numero d’ordre (wave 1), concatenation of GRAPPE, MENAGE, EXTENSION and numero d’ordre (wave 2)
  NGA: concatenation of hhid and id (waves 1 to 4)
  TZA: concatenation of hhid and member number (wave 1), concatenation of y2_hhid and indidy2 (wave 2), concatenation of y3_hhid and indidy3 (wave 3), concatenation of y4_hhid and indidy4 (wave 4), concatenation of sdd_hhid and sdd_indid (wave 5)
  UGA: concatenation of Hhid and PID (wave 1), HHID and PID (waves 2 to 5), hhid and PID (waves 7 and 8)

Variable name: relationship_head
Label: HH Member Relationship to HH Head
Data processing notes: In the absence of unified codes, this variable is coded as a string

Variable name: age
Label: Age (in years)
Data processing notes: Capped at 100

Variable name: married
Label: Is the individual married?

Variable name: female
Label: Is the individual a female?

Variable name: formal_education
Label: Any formal education?

Variable name: primary_education
Label: Completed primary education?
Data processing notes: “Primary education” consists of:
  - 6 years in NGA
  - 6 years in MLI
  - 7 years in TZA
  - 8 years in ETH
  - 6 years in NER
  - 8 years in MWI
  - 7 years in UGA
  NGA: not asked for old members who are not in school (w1, w2)

Variable name: height
Label: Individual height
Additional notes: Absent in MLI and NER

Variable name: weight
Label: Individual weight
Additional notes: Absent in MLI and NER

Variable name: wasting
Label: Child with wasting
Additional notes: Absent in MLI and NER, TZA wave 4, wave 5

Variable name: haz06
Label: Height-for-age Z-score
Data processing notes: These estimates are calculated using the 2006 WHO child growth standards
Additional notes: Absent in MLI and NER, TZA wave 4, wave 5

Variable name: waz06
Label: Weight-for-age Z-score
Data processing notes: These estimates are calculated using the 2006 WHO child growth standards
Additional notes: Absent in MLI and NER, TZA wave 4, wave 5

Variable name: whz06
Label: Weight-for-length/height Z-score
Data processing notes: These estimates are calculated using the 2006 WHO child growth standards
Additional notes: Absent in MLI and NER, TZA wave 4, wave 5

Variable name: bmiz06
Label: BMI-for-age Z-score
Data processing notes: These estimates are calculated using the 2006 WHO child growth standards
Additional notes: Absent in MLI and NER, TZA wave 4, wave 5

Variable name: farm_work
Label: Individual has worked in own farm in (past) 7 days
Data processing notes: in last 7 days, or typical 7 days
Additional notes: In MLI, NER and NGA (w1, w2), this information is only available for individuals’ “main job”

Variable name: SOB_work
Label: Individual has worked in own business in (past) 7 days
Data processing notes: in last 7 days, or typical 7 days
Additional notes: In MLI, NER and NGA (w1, w2), this information is only available for individuals’ “main job”

Variable name: wage_work
Label: Individual has worked for own wage in (past) 7 days
Data processing notes: in last 7 days, or typical 7 days
Additional notes: In MLI, NER and NGA (w1, w2), this information is only available for individuals’ “main job”

Variable name: farm_hrs
Label: Number of hours spent in own farm ag work in (past) 7 days
Additional notes: In MLI, NER and NGA (w1, w2), this information is only available for individuals’ “main job”

Variable name: SB_hrs
Label: Number of hours spent in own business work in (past) 7 days
Additional notes: In MLI, NER and NGA (w1, w2), this information is only available for individuals’ “main job”

Variable name: wage_hrs
Label: Number of hours spent in wage labor in (past) 7 days
Additional notes: In MLI, NER and NGA (w1, w2), this information is only available for individuals’ “main job”

Variable name: ind_ag
Label: Any wage work in agriculture

Variable name: ind_fish
Label: Any wage work in fishing

Variable name: ind_mining
Label: Any wage work in mining

Variable name: ind_manuf
Label: Any wage work in manufacturing

Variable name: ind_const
Label: Any wage work in construction

Variable name: ind_serv
Label: Any wage work in services

Variable name: working_age
Label: Working age household member
Additional notes: The cutoff changes across surveys (according to questionnaire)
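
As an illustration of how the merge identifiers in these directories might be used, the sketch below rebuilds a hyphen-separated key from raw-survey components and looks it up among harmonized records. It follows the rule stated in the directories that concatenated variables are separated by hyphens, and uses the Tanzania wave 2 components of indiv_id_merge (y2_hhid and indidy2) as the example. The helper name build_merge_key, the sample values, and the assumption that components are joined as plain strings without padding or reformatting are illustrative only and should be checked against the released files.

```python
def build_merge_key(*parts):
    """Join raw-survey ID components with hyphens.

    Assumption: components are used as plain strings, with no zero-padding
    or other reformatting. Verify this against the released data.
    """
    return "-".join(str(p).strip() for p in parts)


# Hypothetical raw roster rows (Tanzania wave 2 naming: y2_hhid, indidy2).
raw_rows = [
    {"y2_hhid": "0101014002017101", "indidy2": 1},
    {"y2_hhid": "0101014002017101", "indidy2": 2},
]

# Hypothetical harmonized records keyed by indiv_id_merge.
harmonized = {"0101014002017101-2": {"age": 34, "female": 1}}

# Flag which raw rows find a match in the harmonized records.
for row in raw_rows:
    key = build_merge_key(row["y2_hhid"], row["indidy2"])
    row["indiv_id_merge"] = key
    row["matched"] = key in harmonized
```

The same helper applies to the plot and parcel identifiers by passing the components listed for the relevant country and wave, for example holder_id, parcel_id and field_id for Ethiopia.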